The best way to learn machine learning is to actually do it.
There’s a point in every beginner’s ML journey where the concepts start to feel slippery.
You understand the definition of supervised learning. You’ve read that regression predicts continuous values. You know that features go in and a prediction comes out. But the moment you try to apply any of that to a real problem, the abstraction starts to feel hollow.
That’s not a knowledge gap. That’s a practice gap.
The fastest way to close it is to work through a concrete example – one problem, from raw data to a working model, with every decision explained as it’s made. Not a sanitized academic exercise. A real problem with real choices and real tradeoffs.
That’s exactly what this article is.
The problem: predicting how long a bike ride will take, given a set of inputs you’d know before starting. It’s simple enough to understand completely. It’s complex enough to illustrate every core concept you need. And for cyclists who track their data, it’s immediately relevant.
By the end, you’ll understand what features are and how to choose them, what a regression model is actually doing, how to evaluate whether a model is working, and what the full arc of a basic ML project looks like from start to finish.
No prior ML experience required. Every term explained when it first appears.

Step 1: Define the Problem
Before any data is collected or any algorithm is selected, the problem needs a precise definition. This is a step many beginners skip – and it almost always causes problems later.
In plain English, the goal is: given information available before a ride begins, predict how long that ride will take in minutes.
Let’s be more specific:
- Input: Information known before the ride – planned distance, planned elevation gain, rider’s recent fitness level, weather conditions
- Output: Ride duration in minutes (a number, not a category)
- Type of ML problem: Supervised regression
Supervised because we’ll train on historical ride data where the actual duration is known – that’s the label. Regression because the output is a continuous number (minutes elapsed), not a discrete category.
This is exactly the kind of problem supervised learning handles well:
- Clear, measurable output ✅
- Historical labeled examples exist ✅
- Learnable patterns in the data ✅
- Genuine value over simply guessing ✅
If any of those criteria feel unfamiliar, the article “What Makes a Problem a ‘Good’ ML Problem” covers them in depth.
Step 2: Collect and Understand the Data
For this toy example, imagine a cyclist who has been recording their rides for two years using an Apple Watch. Each ride in the dataset is one row. Each row contains information about that ride.
Here’s what the raw dataset might look like (simplified):
| Ride ID | Distance (km) | Elevation Gain (m) | Avg Temp (°C) | Fitness Score | Wind Speed (km/h) | Duration (min) |
|---|---|---|---|---|---|---|
| 001 | 42.1 | 380 | 18 | 74 | 12 | 98 |
| 002 | 28.5 | 110 | 22 | 71 | 5 | 58 |
| 003 | 65.3 | 820 | 14 | 76 | 20 | 178 |
| 004 | 33.0 | 240 | 9 | 68 | 28 | 89 |
| 005 | 19.8 | 60 | 25 | 72 | 8 | 40 |
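For readers who want to follow along in code, the sample table translates directly into a small Python structure. The field names below are my own shorthand; a real export would follow the logging app's schema:

```python
# The five sample rides from the table above, one dict per row.
# Field names are illustrative, not an actual device export format.
rides = [
    {"ride_id": "001", "distance_km": 42.1, "elevation_m": 380, "temp_c": 18, "fitness": 74, "wind_kmh": 12, "duration_min": 98},
    {"ride_id": "002", "distance_km": 28.5, "elevation_m": 110, "temp_c": 22, "fitness": 71, "wind_kmh": 5,  "duration_min": 58},
    {"ride_id": "003", "distance_km": 65.3, "elevation_m": 820, "temp_c": 14, "fitness": 76, "wind_kmh": 20, "duration_min": 178},
    {"ride_id": "004", "distance_km": 33.0, "elevation_m": 240, "temp_c": 9,  "fitness": 68, "wind_kmh": 28, "duration_min": 89},
    {"ride_id": "005", "distance_km": 19.8, "elevation_m": 60,  "temp_c": 25, "fitness": 72, "wind_kmh": 8,  "duration_min": 40},
]

# The target is duration_min; every other numeric column is a candidate feature.
target = [r["duration_min"] for r in rides]
```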
The column we’re trying to predict – Duration – is the target variable (also called the label or dependent variable). Every other column is a potential feature (also called a predictor or independent variable).
Let’s say this dataset contains 400 rides over two years. That’s not enormous, but it’s enough to demonstrate the concepts clearly – and for a relatively simple regression problem like this, it may be sufficient to build a useful model.
Step 3: Explore and Understand the Features
Before building anything, good data science practice involves exploratory data analysis (EDA) – a process of examining the data to understand distributions, relationships, and potential issues.
For our toy example, a few key questions arise immediately:
Does each feature actually relate to duration?
Intuitively:
- Distance – longer distance almost certainly means longer duration. Strong expected relationship.
- Elevation Gain – more climbing means slower average speed. Strong expected relationship.
- Temperature – performance degrades in extreme heat or cold. Moderate expected relationship.
- Fitness Score – a fitter rider covers the same distance faster. Moderate expected relationship.
- Wind Speed – headwinds slow riders down, but we don’t know wind direction here. Weaker, noisier relationship.
In a real project, you’d visualize these relationships – scatter plots, correlation matrices – to confirm whether the patterns you expect are actually present in the data. For our toy example, assume the expected relationships hold.
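Even without plots, a quick numeric check is possible: the Pearson correlation coefficient measures the strength of a linear relationship between two columns. A pure-Python sketch, run on the five sample rides from Step 2:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# The five sample rides from the table in Step 2.
distance = [42.1, 28.5, 65.3, 33.0, 19.8]
duration = [98, 58, 178, 89, 40]

r = pearson(distance, duration)  # close to 1.0: a strong linear relationship
```

A value near +1 confirms the expected relationship; in a real project you would still plot the data, since correlation alone can hide non-linear patterns.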
Are there any obvious data quality issues?
- Missing values: Do any rides have blank entries? If so, those rows need to be handled – either filled in (imputed) or removed.
- Outliers: Is there a ride listed as 8 km long but taking 4 hours? That might be a GPS error, a coffee stop, or a medical emergency – it’s probably not a useful training example.
- Consistency: Are units consistent throughout? (All distances in km, all temperatures in Celsius?)
Data cleaning is unglamorous but essential. In real-world projects, it often consumes 60-80% of total project time. Our toy dataset is pre-cleaned, but it’s worth naming this step explicitly because it never disappears in practice.
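The checks above are easy to automate. A sketch of a per-ride quality check, with illustrative (not tuned) thresholds:

```python
def quality_flags(ride):
    """Return a list of data-quality warnings for one ride record.
    The speed thresholds are illustrative, not tuned."""
    flags = []
    # Missing values: any field left as None needs imputing or dropping.
    if any(v is None for v in ride.values()):
        flags.append("missing value")
    # Outliers: an implied average speed below 5 km/h or above 60 km/h is
    # more likely a GPS error or a long stop than a real training example.
    if ride.get("distance_km") and ride.get("duration_min"):
        speed = ride["distance_km"] / (ride["duration_min"] / 60)
        if not 5 <= speed <= 60:
            flags.append(f"implausible avg speed: {speed:.1f} km/h")
    return flags

# An 8 km ride logged as 4 hours trips the outlier check (2 km/h).
suspect = {"distance_km": 8.0, "duration_min": 240}
```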
Step 4: Feature Engineering
Feature engineering is the process of transforming raw data into the input format that gives the model the best chance of learning meaningful patterns.
Sometimes raw features are sufficient. Sometimes new features derived from raw data are more informative. Here are a few examples relevant to our problem:
Elevation-to-distance ratio:
Instead of feeding elevation gain and distance as separate features, we could create a new feature: elevation per kilometer (elevation gain ÷ distance). This captures gradient – how steep the ride is – which is arguably more predictive of duration than raw elevation alone. A 500 m elevation gain over 10 km is brutal; the same gain over 80 km is modest.
Temperature deviation from optimal:
Rather than raw temperature, a feature representing how far the temperature is from an optimal range (say, 15-20°C) might be more predictive. Performance degrades as temperature deviates from that range in either direction. Raw temperature doesn’t capture this non-linear relationship as cleanly.
Rolling fitness average:
Instead of a single-point fitness score, a rolling average of the rider’s fitness scores over the previous 4 weeks might be more stable and predictive than any single measurement.
For our toy example, we’ll keep it simple and use:
- Distance (km)
- Elevation gain (m)
- Elevation per km (derived feature)
- Average temperature (°C)
- Current fitness score
- Wind speed (km/h)
That gives the model six input features to work with.
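The two derived features described above are one-liners in code. A small sketch, with the 15–20°C band treated as the assumed optimal range:

```python
def elevation_per_km(elevation_m, distance_km):
    """Gradient proxy: metres of climbing per kilometre ridden."""
    return elevation_m / distance_km

def temp_deviation(temp_c, lo=15.0, hi=20.0):
    """Degrees outside the assumed optimal 15-20 C band (0 inside it)."""
    if temp_c < lo:
        return lo - temp_c
    if temp_c > hi:
        return temp_c - hi
    return 0.0

# 500 m over 10 km is a brutal 50 m/km; over 80 km it is a modest 6.25 m/km.
steep = elevation_per_km(500, 10)   # 50.0
gentle = elevation_per_km(500, 80)  # 6.25
```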
Step 5: Split the Data
Here’s a critical concept that catches many beginners off guard: you must never evaluate your model on the same data you trained it on.
If you do, you’re not measuring how well the model predicts new, unseen rides – you’re measuring how well it memorized the training examples. A model can score 100% on training data by simply memorizing every row, while being completely useless on any new data it hasn’t seen.
The solution is to split the dataset before training:
- Training set (~80%): The data the model actually learns from. In our case, about 320 rides.
- Test set (~20%): Data held back entirely during training. Used only at the end to evaluate performance on unseen examples. About 80 rides.
The test set simulates what happens in the real world: the model receives a new ride it’s never seen before and must predict the duration based only on its learned patterns.
Some projects also use a third split – a validation set – to tune model settings during development without contaminating the test set. For our toy example, the train/test split is sufficient.
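The split itself is a few lines of code. A minimal sketch in plain Python – the fixed seed and the 80/20 fraction are conventions, not requirements:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle once (reproducibly), then hold out the last test_fraction of rows."""
    shuffled = rows[:]  # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# 400 rides -> 320 for training, 80 held back for testing.
all_rides = list(range(400))  # stand-in for 400 ride records
train, test = train_test_split(all_rides)
```

Shuffling before splitting matters: rides are logged chronologically, and taking the last 20% without shuffling would test the model only on the most recent season.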
Step 6: Choose and Train a Model
Now for the part most beginners think is the whole job – but is actually just one step among many.
For a regression problem like this, several algorithm options are available. Two natural starting points:
Linear Regression
Linear regression assumes that the relationship between each feature and the output can be approximated as a straight line. The model learns a coefficient (a weight) for each feature – essentially, how much each feature contributes to the predicted duration.
A simplified version of what linear regression is doing:
Predicted Duration = (a × Distance) + (b × Elevation) + (c × Elevation per km) + (d × Temperature) + (e × Fitness Score) + (f × Wind Speed) + intercept
Where a, b, c, d, e, f are coefficients the model learns from the training data. The intercept is the baseline prediction when all features are zero.
After training on 320 rides, the model might learn coefficients that look like this:
| Feature | Learned Coefficient | Interpretation |
|---|---|---|
| Distance (km) | +1.8 | Each additional km adds ~1.8 minutes |
| Elevation gain (m) | +0.04 | Each additional meter of climbing adds ~0.04 minutes |
| Elevation per km | +3.2 | Steeper gradients add significant time |
| Temperature deviation | +0.3 | Each degree from optimal adds ~0.3 minutes |
| Fitness score | -0.9 | Higher fitness reduces duration |
| Wind speed (km/h) | +0.25 | Stronger winds add time |
These coefficients are intuitive – they reflect what any experienced cyclist would expect. When ML produces outputs that align with domain knowledge, that’s a good sign the model has learned something real.
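To make the formula concrete, here is the coefficient table applied to ride 001 from Step 2. The intercept below is a made-up placeholder (the table doesn't include one), so treat the result as illustrative:

```python
# Coefficients from the table above; the intercept (40.0) is a hypothetical
# placeholder, since the table doesn't state one.
COEFFS = {
    "distance_km": 1.8,
    "elevation_m": 0.04,
    "elev_per_km": 3.2,
    "temp_deviation": 0.3,
    "fitness": -0.9,
    "wind_kmh": 0.25,
}
INTERCEPT = 40.0  # hypothetical

def predict_duration(features):
    """Linear regression prediction: weighted sum of features plus intercept."""
    return INTERCEPT + sum(COEFFS[name] * value for name, value in features.items())

# Ride 001 from Step 2: 42.1 km, 380 m climbing, 18 C (inside the optimal
# band, so temperature deviation is 0), fitness 74, 12 km/h wind.
ride_001 = {
    "distance_km": 42.1,
    "elevation_m": 380,
    "elev_per_km": 380 / 42.1,
    "temp_deviation": 0.0,
    "fitness": 74,
    "wind_kmh": 12,
}
estimate = predict_duration(ride_001)  # in the mid-90s, near the actual 98 min
```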
Decision Tree Regression
A decision tree takes a different approach. Instead of a weighted formula, it learns a series of if-then rules that partition the data into progressively smaller groups.
For example, the tree might learn:
- If distance > 50 km AND elevation > 500 m → predicted duration ≈ 145 min
- If distance > 50 km AND elevation ≤ 500 m → predicted duration ≈ 110 min
- If distance ≤ 50 km AND fitness score > 70 → predicted duration ≈ 65 min
Decision trees are highly interpretable – you can follow the exact path of reasoning the model used. They can also capture non-linear relationships that linear regression might miss. However, they can overfit (memorize training data too closely) if not properly constrained.
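The three example rules translate directly into code. One sketch, with the caveat that the example doesn't say what happens on a short ride with a lower fitness score, so that branch is invented:

```python
def tree_predict(distance_km, elevation_m, fitness):
    """The three if-then rules from the example tree above.
    The final branch (short ride, lower fitness) is a made-up fallback:
    the example rules don't cover that case."""
    if distance_km > 50:
        return 145 if elevation_m > 500 else 110
    if fitness > 70:
        return 65
    return 80  # hypothetical leaf for the uncovered case
```

A real decision tree learns these thresholds and leaf values from the training data rather than having them written by hand, but the prediction mechanics are exactly this.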
For our toy example, we’ll use linear regression as the primary model. It’s transparent, interpretable, and appropriate for a first project where the goal is understanding – not squeezing out maximum predictive accuracy.
Step 7: Evaluate the Model
Training is complete. Now the model is applied to the test set – 80 rides it has never seen – and its predictions are compared to the actual durations.
Three common evaluation metrics for regression models:
Mean Absolute Error (MAE)
The average absolute difference between predicted and actual duration, in the same units as the output (minutes). If MAE = 7.2, the model’s predictions are off by an average of 7.2 minutes.
Simple, intuitive, and directly meaningful. “Our predictions are off by about 7 minutes on average.”
Root Mean Squared Error (RMSE)
Similar to MAE, but larger errors are penalized more heavily. If a model is occasionally wildly wrong, RMSE will be higher than MAE. This metric is useful when large errors are particularly costly.
R² (R-squared)
A measure of how much of the variation in ride duration the model explains, usually expressed as a proportion between 0 and 1 (on test data it can even dip below 0 if the model is worse than always predicting the average).
- R² = 0: The model explains nothing. It’s no better than always predicting the average duration.
- R² = 1: The model explains all variation. Perfect predictions.
- R² = 0.85: The model explains 85% of the variation in duration.
For our toy example, let’s say the model produces:
- MAE: 8.4 minutes
- RMSE: 12.1 minutes
- R²: 0.87
How to interpret this: The model explains 87% of the variation in ride duration, and on average its predictions are off by about 8 minutes. The gap between RMSE (12.1) and MAE (8.4) suggests that a handful of rides have noticeably larger errors than the rest.
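All three metrics are simple enough to compute by hand. A minimal pure-Python sketch, evaluated on a tiny made-up test set of three rides:

```python
from math import sqrt

def mae(actual, predicted):
    """Mean absolute error: average size of the miss, in minutes."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: like MAE, but large misses are penalised more."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """Share of the variation in the actuals that the predictions explain."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Tiny made-up test set: three rides.
actual    = [100, 60, 180]
predicted = [92, 66, 175]
```

Note that RMSE ≥ MAE always holds; the two are equal only when every error has the same magnitude.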
Is that good? It depends on the use case. For planning purposes – “will this ride take roughly 90 minutes or roughly 3 hours?” – an average error of 8 minutes is genuinely useful. For race-level pacing decisions, it might not be precise enough. Context determines acceptable accuracy.
Step 8: Interpret and Interrogate the Results
A number on a scorecard isn’t the end of evaluation. Good ML practice involves understanding where the model works well and where it struggles.
A few diagnostic questions worth asking:
Does the model perform differently on different types of rides?
Perhaps it predicts flat rides accurately but consistently underestimates climbing rides. This might suggest that elevation-related features need better engineering, or that the training set contained proportionally fewer long, hilly rides.
Are there systematic biases?
If the model consistently over-predicts duration for highly fit riders and under-predicts for less fit riders, the fitness score feature isn’t being weighted correctly. This calls for more data or a different approach to encoding fitness.
What do the residuals look like?
A residual is the difference between a predicted value and the actual value. Plotting residuals reveals patterns. If residuals are randomly scattered around zero – no trend, no shape – the model is doing its job. If residuals show a curve, a funnel shape, or a systematic bias, there’s structure the model hasn’t captured.
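Computing residuals takes one line, and even without a plot, their mean is a quick first check for systematic bias:

```python
def residuals(actual, predicted):
    """Residual = actual minus predicted, one per ride."""
    return [a - p for a, p in zip(actual, predicted)]

def mean_residual(actual, predicted):
    """A mean residual far from zero signals systematic over- or
    under-prediction; near zero is necessary (not sufficient) for health."""
    res = residuals(actual, predicted)
    return sum(res) / len(res)

# A model that always over-predicts by ~10 minutes shows up immediately.
biased_actual    = [100, 60, 180, 89]
biased_predicted = [110, 72, 188, 101]
```

In practice you would also plot residuals against predicted values; a near-zero mean can still hide a curve or a funnel shape that only a plot reveals.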
These diagnostic steps often lead back to earlier stages – refining features, collecting more data, or trying a more flexible algorithm. Machine learning development is iterative, not linear.
Step 9: What Would Come Next
This toy example stops at a working, evaluated model. In a real deployment scenario, several additional steps would follow:
Cross-validation: Instead of a single train/test split, the data is split multiple times in different configurations to get a more reliable estimate of model performance.
Hyperparameter tuning: Most algorithms have settings (called hyperparameters) that control how the model learns. Tuning these can improve performance beyond default settings.
Model comparison: Linear regression is not necessarily the best model for this problem. Trying gradient boosted trees, random forests, or other algorithms – and comparing their test set performance – is standard practice.
Deployment: A trained model can be embedded in an app, a tool, or a pipeline so that it generates predictions automatically when new ride inputs are entered – before the ride even begins.
Monitoring and retraining: Rider fitness changes over time. A model trained on two-year-old data will gradually become less accurate as the rider’s physiology and training patterns evolve. Periodic retraining on fresh data maintains accuracy.
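Of the steps above, cross-validation is the easiest to sketch without a library: at its core it is just index bookkeeping. The stride-based fold assignment below is a simplification; real tools (e.g. scikit-learn's KFold) also handle shuffling and stratification:

```python
def k_fold_indices(n_rows, k=5):
    """Split row indices into k folds; each fold serves once as the test set
    while the remaining folds together form the training set."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    splits = []
    for test_fold in range(k):
        test_idx = folds[test_fold]
        train_idx = [i for f in range(k) if f != test_fold for i in folds[f]]
        splits.append((train_idx, test_idx))
    return splits

# 400 rides, 5 folds: five train/test pairs of 320/80 rides each.
# Averaging the test metric across the five pairs gives a steadier
# performance estimate than any single 80-ride test set.
splits = k_fold_indices(400, k=5)
```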
What This Example Taught
Walk back through the steps and look at what each one actually demonstrated:
| Step | Core Concept Illustrated |
|---|---|
| Define the problem | Target variable, supervised regression, problem framing |
| Collect data | Features vs. labels, dataset structure |
| Explore the data | Exploratory data analysis, data quality |
| Engineer features | Feature transformation, domain knowledge in ML |
| Split the data | Training vs. test sets, preventing data leakage |
| Train the model | Coefficients, how algorithms learn from data |
| Evaluate performance | MAE, RMSE, R², what “good” accuracy means |
| Interpret results | Residuals, bias detection, iterative improvement |
| Next steps | The full production pipeline beyond the model itself |
That’s the complete conceptual skeleton of a supervised ML project. Every production ML system in the world – fraud detection, medical imaging, recommendation engines – runs on this same skeleton. The data is bigger, the algorithms more complex, the infrastructure more elaborate. But the structure is identical.
Start Small, Think Clearly, Build Forward
Toy examples have a reputation for being too simple to matter. That’s wrong.
A toy example done well teaches something a textbook definition never can: the texture of an ML project. The messiness of real data. The judgment calls in feature engineering. The interpretive work that follows a test set result. The iterative nature of improvement.
Predicting bike ride duration isn’t changing the world. But the process of doing it carefully – defining the problem, understanding the data, making explicit modeling choices, evaluating honestly, and knowing what questions to ask of the results – that process scales to every ML problem that does.
The difference between a beginner and a practitioner isn’t knowing more algorithms. It’s having walked through enough examples to recognize the shape of a problem before the first line of code is written.
This was one example. Keep building.
