The best way to learn machine learning is to actually do it.
There’s a point in every beginner’s ML journey where the concepts start to feel slippery.
You understand the definition of supervised learning. You’ve read that regression predicts continuous values. You know that features go in and a prediction comes out. But the moment you try to apply any of that to a real problem, the abstraction starts to feel hollow.
That’s not a knowledge gap. That’s a practice gap.
The fastest way to close it is to work through a concrete example – one problem, from raw data to a working model, with every decision explained as it’s made. Not a sanitized academic exercise. A real problem with real choices and real tradeoffs.
That’s exactly what this article is.
The problem: predicting how long a bike ride will take, given a set of inputs you’d know before starting. It’s simple enough to understand completely. It’s complex enough to illustrate every core concept you need. And for cyclists who track their data, it’s immediately relevant.
By the end, you’ll understand what features are and how to choose them, what a regression model is actually doing, how to evaluate whether a model is working, and what the full arc of a basic ML project looks like from start to finish.
No prior ML experience required. Every term explained when it first appears.

Step 1: Define the Problem
Before any data is collected or any algorithm is selected, the problem needs a precise definition. This is a step many beginners skip – and it almost always causes problems later.
In plain English, the goal is: given information available before a ride begins, predict how long that ride will take in minutes.
Let’s be more specific:
- Input: Information known before the ride – planned distance, planned elevation gain, rider’s recent fitness level, weather conditions
- Output: Ride duration in minutes (a number, not a category)
- Type of ML problem: Supervised regression
Supervised because we’ll train on historical ride data where the actual duration is known – that’s the label. Regression because the output is a continuous number (minutes elapsed), not a discrete category.
This is exactly the kind of problem supervised learning handles well:
- Clear, measurable output ✅
- Historical labeled examples exist ✅
- Learnable patterns in the data ✅
- Genuine value over simply guessing ✅
If any of those criteria feel unfamiliar, the article “What Makes a Problem a ‘Good’ ML Problem” covers them in depth.
Step 2: Collect and Understand the Data
For this toy example, imagine a cyclist who has been recording their rides for two years using an Apple Watch. Each ride in the dataset is one row. Each row contains information about that ride.
Here’s what the raw dataset might look like (simplified):
| Ride ID | Distance (km) | Elevation Gain (m) | Avg Temp (°C) | Fitness Score | Wind Speed (km/h) | Duration (min) |
|---|---|---|---|---|---|---|
| 001 | 42.1 | 380 | 18 | 74 | 12 | 98 |
| 002 | 28.5 | 110 | 22 | 71 | 5 | 58 |
| 003 | 65.3 | 820 | 14 | 76 | 20 | 178 |
| 004 | 33.0 | 240 | 9 | 68 | 28 | 89 |
| 005 | 19.8 | 60 | 25 | 72 | 8 | 40 |
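For readers who want to follow along in code, the sample table translates directly into a small Python structure. The field names below are my own shorthand; a real export would follow the logging app's schema:

```python
# The five sample rides from the table above, one dict per row.
# Field names are illustrative, not an actual device export format.
rides = [
    {"ride_id": "001", "distance_km": 42.1, "elevation_m": 380, "temp_c": 18, "fitness": 74, "wind_kmh": 12, "duration_min": 98},
    {"ride_id": "002", "distance_km": 28.5, "elevation_m": 110, "temp_c": 22, "fitness": 71, "wind_kmh": 5,  "duration_min": 58},
    {"ride_id": "003", "distance_km": 65.3, "elevation_m": 820, "temp_c": 14, "fitness": 76, "wind_kmh": 20, "duration_min": 178},
    {"ride_id": "004", "distance_km": 33.0, "elevation_m": 240, "temp_c": 9,  "fitness": 68, "wind_kmh": 28, "duration_min": 89},
    {"ride_id": "005", "distance_km": 19.8, "elevation_m": 60,  "temp_c": 25, "fitness": 72, "wind_kmh": 8,  "duration_min": 40},
]

# The target is duration_min; every other numeric column is a candidate feature.
target = [r["duration_min"] for r in rides]
```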
The column we’re trying to predict – Duration – is the target variable (also called the label or dependent variable). Every other column is a potential feature (also called a predictor or independent variable).
Let’s say this dataset contains 400 rides over two years. That’s not enormous, but it’s enough to demonstrate the concepts clearly – and for a relatively simple regression problem like this, it may be sufficient to build a useful model.
Step 3: Explore and Understand the Features
Before building anything, good data science practice involves exploratory data analysis (EDA) – a process of examining the data to understand distributions, relationships, and potential issues.
For our toy example, a few key questions arise immediately:
Does each feature actually relate to duration?
Intuitively:
- Distance – longer distance almost certainly means longer duration. Strong expected relationship.
- Elevation Gain – more climbing means slower average speed. Strong expected relationship.
- Temperature – performance degrades in extreme heat or cold. Moderate expected relationship.
- Fitness Score – a fitter rider covers the same distance faster. Moderate expected relationship.
- Wind Speed – headwinds slow riders down, but we don’t know wind direction here. Weaker, noisier relationship.
In a real project, you’d visualize these relationships – scatter plots, correlation matrices – to confirm whether the patterns you expect are actually present in the data. For our toy example, assume the expected relationships hold.
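Even without plots, a quick numeric check is possible: the Pearson correlation coefficient measures the strength of a linear relationship between two columns. A pure-Python sketch, run on the five sample rides from Step 2:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# The five sample rides from the table in Step 2.
distance = [42.1, 28.5, 65.3, 33.0, 19.8]
duration = [98, 58, 178, 89, 40]

r = pearson(distance, duration)  # close to 1.0: a strong linear relationship
```

A value near +1 confirms the expected relationship; in a real project you would still plot the data, since correlation alone can hide non-linear patterns.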
Are there any obvious data quality issues?
- Missing values: Do any rides have blank entries? If so, those rows need to be handled – either filled in (imputed) or removed.
- Outliers: Is there a ride listed as 8 km long but taking 4 hours? That might be a GPS error, a coffee stop, or a medical emergency – it’s probably not a useful training example.
- Consistency: Are units consistent throughout? (All distances in km, all temperatures in Celsius?)
Data cleaning is unglamorous but essential. In real-world projects, it often consumes 60-80% of total project time. Our toy dataset is pre-cleaned, but it’s worth naming this step explicitly because it never disappears in practice.
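The checks above are easy to automate. A sketch of a per-ride quality check, with illustrative (not tuned) thresholds:

```python
def quality_flags(ride):
    """Return a list of data-quality warnings for one ride record.
    The speed thresholds are illustrative, not tuned."""
    flags = []
    # Missing values: any field left as None needs imputing or dropping.
    if any(v is None for v in ride.values()):
        flags.append("missing value")
    # Outliers: an implied average speed below 5 km/h or above 60 km/h is
    # more likely a GPS error or a long stop than a real training example.
    if ride.get("distance_km") and ride.get("duration_min"):
        speed = ride["distance_km"] / (ride["duration_min"] / 60)
        if not 5 <= speed <= 60:
            flags.append(f"implausible avg speed: {speed:.1f} km/h")
    return flags

# An 8 km ride logged as 4 hours trips the outlier check (2 km/h).
suspect = {"distance_km": 8.0, "duration_min": 240}
```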
Step 4: Feature Engineering
Feature engineering is the process of transforming raw data into the input format that gives the model the best chance of learning meaningful patterns.
Sometimes raw features are sufficient. Sometimes new features derived from raw data are more informative. Here are a few examples relevant to our problem:
Elevation-to-distance ratio:
Instead of feeding elevation gain and distance as separate features, we could create a new feature: elevation per kilometer (elevation gain ÷ distance). This captures gradient – how steep the ride is – which is arguably more predictive of duration than raw elevation alone. A 500 m elevation gain over 10 km is brutal; the same gain over 80 km is modest.
Temperature deviation from optimal:
Rather than raw temperature, a feature representing how far the temperature is from an optimal range (say, 15-20°C) might be more predictive. Performance degrades as temperature deviates from that range in either direction. Raw temperature doesn’t capture this non-linear relationship as cleanly.
Rolling fitness average:
Instead of a single-point fitness score, a rolling average of the rider’s fitness scores over the previous 4 weeks might be more stable and predictive than any single measurement.
For our toy example, we’ll keep it simple and use:
- Distance (km)
- Elevation gain (m)
- Elevation per km (derived feature)
- Average temperature (°C)
- Current fitness score
- Wind speed (km/h)
That gives the model six input features to work with.
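The two derived features described above are one-liners in code. A small sketch, with the 15–20°C band treated as the assumed optimal range:

```python
def elevation_per_km(elevation_m, distance_km):
    """Gradient proxy: metres of climbing per kilometre ridden."""
    return elevation_m / distance_km

def temp_deviation(temp_c, lo=15.0, hi=20.0):
    """Degrees outside the assumed optimal 15-20 C band (0 inside it)."""
    if temp_c < lo:
        return lo - temp_c
    if temp_c > hi:
        return temp_c - hi
    return 0.0

# 500 m over 10 km is a brutal 50 m/km; over 80 km it is a modest 6.25 m/km.
steep = elevation_per_km(500, 10)   # 50.0
gentle = elevation_per_km(500, 80)  # 6.25
```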
Step 5: Split the Data
Here’s a critical concept that catches many beginners off guard: you must never evaluate your model on the same data you trained it on.
If you do, you’re not measuring how well the model predicts new, unseen rides – you’re measuring how well it memorized the training examples. A model can score 100% on training data by simply memorizing every row, while being completely useless on any new data it hasn’t seen.
The solution is to split the dataset before training:
- Training set (~80%): The data the model actually learns from. In our case, about 320 rides.
- Test set (~20%): Data held back entirely during training. Used only at the end to evaluate performance on unseen examples. About 80 rides.
The test set simulates what happens in the real world: the model receives a new ride it’s never seen before and must predict the duration based only on its learned patterns.
Some projects also use a third split – a validation set – to tune model settings during development without contaminating the test set. For our toy example, the train/test split is sufficient.
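The split itself is a few lines of code. A minimal sketch in plain Python – the fixed seed and the 80/20 fraction are conventions, not requirements:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle once (reproducibly), then hold out the last test_fraction of rows."""
    shuffled = rows[:]  # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# 400 rides -> 320 for training, 80 held back for testing.
all_rides = list(range(400))  # stand-in for 400 ride records
train, test = train_test_split(all_rides)
```

Shuffling before splitting matters: rides are logged chronologically, and taking the last 20% without shuffling would test the model only on the most recent season.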
Step 6: Choose and Train a Model
Now for the part most beginners think is the whole job – but is actually just one step among many.
For a regression problem like this, several algorithm options are available. Two natural starting points:
Linear Regression
Linear regression assumes that the relationship between each feature and the output can be approximated as a straight line. The model learns a coefficient (a weight) for each feature – essentially, how much each feature contributes to the predicted duration.
A simplified version of what linear regression is doing:
Predicted Duration = (a × Distance) + (b × Elevation) + (c × Elevation per km) + (d × Temperature) + (e × Fitness Score) + (f × Wind Speed) + intercept
Where a, b, c, d, e, f are coefficients the model learns from the training data. The intercept is the baseline prediction when all features are zero.
After training on 320 rides, the model might learn coefficients that look like this:
| Feature | Learned Coefficient | Interpretation |
|---|---|---|
| Distance (km) | +1.8 | Each additional km adds ~1.8 minutes |
| Elevation gain (m) | +0.04 | Each additional meter of climbing adds ~0.04 minutes |
| Elevation per km | +3.2 | Steeper gradients add significant time |
| Temperature deviation | +0.3 | Each degree from optimal adds ~0.3 minutes |
| Fitness score | -0.9 | Higher fitness reduces duration |
| Wind speed (km/h) | +0.25 | Stronger winds add time |
These coefficients are intuitive – they reflect what any experienced cyclist would expect. When ML produces outputs that align with domain knowledge, that’s a good sign the model has learned something real.
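To make the formula concrete, here is the coefficient table applied to ride 001 from Step 2. The intercept below is a made-up placeholder (the table doesn't include one), so treat the result as illustrative:

```python
# Coefficients from the table above; the intercept (40.0) is a hypothetical
# placeholder, since the table doesn't state one.
COEFFS = {
    "distance_km": 1.8,
    "elevation_m": 0.04,
    "elev_per_km": 3.2,
    "temp_deviation": 0.3,
    "fitness": -0.9,
    "wind_kmh": 0.25,
}
INTERCEPT = 40.0  # hypothetical

def predict_duration(features):
    """Linear regression prediction: weighted sum of features plus intercept."""
    return INTERCEPT + sum(COEFFS[name] * value for name, value in features.items())

# Ride 001 from Step 2: 42.1 km, 380 m climbing, 18 C (inside the optimal
# band, so temperature deviation is 0), fitness 74, 12 km/h wind.
ride_001 = {
    "distance_km": 42.1,
    "elevation_m": 380,
    "elev_per_km": 380 / 42.1,
    "temp_deviation": 0.0,
    "fitness": 74,
    "wind_kmh": 12,
}
estimate = predict_duration(ride_001)  # in the mid-90s, near the actual 98 min
```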
Decision Tree Regression
A decision tree takes a different approach. Instead of a weighted formula, it learns a series of if-then rules that partition the data into progressively smaller groups.
For example, the tree might learn:
- If distance > 50 km AND elevation > 500 m → predicted duration ≈ 145 min
- If distance > 50 km AND elevation ≤ 500 m → predicted duration ≈ 110 min
- If distance ≤ 50 km AND fitness score > 70 → predicted duration ≈ 65 min
Decision trees are highly interpretable – you can follow the exact path of reasoning the model used. They can also capture non-linear relationships that linear regression might miss. However, they can overfit (memorize training data too closely) if not properly constrained.
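The three example rules translate directly into code. One sketch, with the caveat that the example doesn't say what happens on a short ride with a lower fitness score, so that branch is invented:

```python
def tree_predict(distance_km, elevation_m, fitness):
    """The three if-then rules from the example tree above.
    The final branch (short ride, lower fitness) is a made-up fallback:
    the example rules don't cover that case."""
    if distance_km > 50:
        return 145 if elevation_m > 500 else 110
    if fitness > 70:
        return 65
    return 80  # hypothetical leaf for the uncovered case
```

A real decision tree learns these thresholds and leaf values from the training data rather than having them written by hand, but the prediction mechanics are exactly this.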
For our toy example, we’ll use linear regression as the primary model. It’s transparent, interpretable, and appropriate for a first project where the goal is understanding – not squeezing out maximum predictive accuracy.
Step 7: Evaluate the Model
Training is complete. Now the model is applied to the test set – 80 rides it has never seen – and its predictions are compared to the actual durations.
Three common evaluation metrics for regression models:
Mean Absolute Error (MAE)
The average absolute difference between predicted and actual duration, in the same units as the output (minutes). If MAE = 7.2, the model’s predictions are off by an average of 7.2 minutes.
Simple, intuitive, and directly meaningful. “Our predictions are off by about 7 minutes on average.”
Root Mean Squared Error (RMSE)
Similar to MAE, but larger errors are penalized more heavily. If a model is occasionally wildly wrong, RMSE will be higher than MAE. This metric is useful when large errors are particularly costly.
R² (R-squared)
A measure of how much of the variation in ride duration the model explains, usually expressed as a proportion between 0 and 1 (on test data it can even dip below 0 if the model is worse than always predicting the average).
- R² = 0: The model explains nothing. It’s no better than always predicting the average duration.
- R² = 1: The model explains all variation. Perfect predictions.
- R² = 0.85: The model explains 85% of the variation in duration.
For our toy example, let’s say the model produces:
- MAE: 8.4 minutes
- RMSE: 12.1 minutes
- R²: 0.87
How to interpret this: The model explains 87% of the variation in ride duration, and on average its predictions are off by about 8 minutes. The gap between RMSE (12.1) and MAE (8.4) suggests that a handful of rides have noticeably larger errors than the rest.
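All three metrics are simple enough to compute by hand. A minimal pure-Python sketch, evaluated on a tiny made-up test set of three rides:

```python
from math import sqrt

def mae(actual, predicted):
    """Mean absolute error: average size of the miss, in minutes."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: like MAE, but large misses are penalised more."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """Share of the variation in the actuals that the predictions explain."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Tiny made-up test set: three rides.
actual    = [100, 60, 180]
predicted = [92, 66, 175]
```

Note that RMSE ≥ MAE always holds; the two are equal only when every error has the same magnitude.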
Is that good? It depends on the use case. For planning purposes – “will this ride take roughly 90 minutes or roughly 3 hours?” – an average error of 8 minutes is genuinely useful. For race-level pacing decisions, it might not be precise enough. Context determines acceptable accuracy.
Step 8: Interpret and Interrogate the Results
A number on a scorecard isn’t the end of evaluation. Good ML practice involves understanding where the model works well and where it struggles.
A few diagnostic questions worth asking:
Does the model perform differently on different types of rides?
Perhaps it predicts flat rides accurately but consistently underestimates climbing rides. This might suggest that elevation-related features need better engineering, or that the training set contained proportionally fewer long, hilly rides.
Are there systematic biases?
If the model consistently over-predicts duration for highly fit riders and under-predicts for less fit riders, the fitness score feature isn’t being weighted correctly. This calls for more data or a different approach to encoding fitness.
What do the residuals look like?
A residual is the difference between a predicted value and the actual value. Plotting residuals reveals patterns. If residuals are randomly scattered around zero – no trend, no shape – the model is doing its job. If residuals show a curve, a funnel shape, or a systematic bias, there’s structure the model hasn’t captured.
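Computing residuals takes one line, and even without a plot, their mean is a quick first check for systematic bias:

```python
def residuals(actual, predicted):
    """Residual = actual minus predicted, one per ride."""
    return [a - p for a, p in zip(actual, predicted)]

def mean_residual(actual, predicted):
    """A mean residual far from zero signals systematic over- or
    under-prediction; near zero is necessary (not sufficient) for health."""
    res = residuals(actual, predicted)
    return sum(res) / len(res)

# A model that always over-predicts by ~10 minutes shows up immediately.
biased_actual    = [100, 60, 180, 89]
biased_predicted = [110, 72, 188, 101]
```

In practice you would also plot residuals against predicted values; a near-zero mean can still hide a curve or a funnel shape that only a plot reveals.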
These diagnostic steps often lead back to earlier stages – refining features, collecting more data, or trying a more flexible algorithm. Machine learning development is iterative, not linear.
Step 9: What Would Come Next
This toy example stops at a working, evaluated model. In a real deployment scenario, several additional steps would follow:
Cross-validation: Instead of a single train/test split, the data is split multiple times in different configurations to get a more reliable estimate of model performance.
Hyperparameter tuning: Most algorithms have settings (called hyperparameters) that control how the model learns. Tuning these can improve performance beyond default settings.
Model comparison: Linear regression is not necessarily the best model for this problem. Trying gradient boosted trees, random forests, or other algorithms – and comparing their test set performance – is standard practice.
Deployment: A trained model can be embedded in an app, a tool, or a pipeline so that it generates predictions automatically when new ride inputs are entered – before the ride even begins.
Monitoring and retraining: Rider fitness changes over time. A model trained on two-year-old data will gradually become less accurate as the rider’s physiology and training patterns evolve. Periodic retraining on fresh data maintains accuracy.
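Of the steps above, cross-validation is the easiest to sketch without a library: at its core it is just index bookkeeping. The stride-based fold assignment below is a simplification; real tools (e.g. scikit-learn's KFold) also handle shuffling and stratification:

```python
def k_fold_indices(n_rows, k=5):
    """Split row indices into k folds; each fold serves once as the test set
    while the remaining folds together form the training set."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    splits = []
    for test_fold in range(k):
        test_idx = folds[test_fold]
        train_idx = [i for f in range(k) if f != test_fold for i in folds[f]]
        splits.append((train_idx, test_idx))
    return splits

# 400 rides, 5 folds: five train/test pairs of 320/80 rides each.
# Averaging the test metric across the five pairs gives a steadier
# performance estimate than any single 80-ride test set.
splits = k_fold_indices(400, k=5)
```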
What This Example Taught
Walk back through the steps and look at what each one actually demonstrated:
| Step | Core Concept Illustrated |
|---|---|
| Define the problem | Target variable, supervised regression, problem framing |
| Collect data | Features vs. labels, dataset structure |
| Explore the data | Exploratory data analysis, data quality |
| Engineer features | Feature transformation, domain knowledge in ML |
| Split the data | Training vs. test sets, preventing data leakage |
| Train the model | Coefficients, how algorithms learn from data |
| Evaluate performance | MAE, RMSE, R², what “good” accuracy means |
| Interpret results | Residuals, bias detection, iterative improvement |
| Next steps | The full production pipeline beyond the model itself |
That’s the complete conceptual skeleton of a supervised ML project. Every production ML system in the world – fraud detection, medical imaging, recommendation engines – runs on this same skeleton. The data is bigger, the algorithms more complex, the infrastructure more elaborate. But the structure is identical.
Start Small, Think Clearly, Build Forward
Toy examples have a reputation for being too simple to matter. That’s wrong.
A toy example done well teaches something a textbook definition never can: the texture of an ML project. The messiness of real data. The judgment calls in feature engineering. The interpretive work that follows a test set result. The iterative nature of improvement.
Predicting bike ride duration isn’t changing the world. But the process of doing it carefully – defining the problem, understanding the data, making explicit modeling choices, evaluating honestly, and knowing what questions to ask of the results – that process scales to every ML problem that does.
The difference between a beginner and a practitioner isn’t knowing more algorithms. It’s having walked through enough examples to recognize the shape of a problem before the first line of code is written.
This was one example. Keep building.
