Imagine you’ve just been handed a large, unfamiliar city and told to navigate it without a map. You could start driving immediately – picking roads at random, hoping they lead somewhere useful. Or you could climb to a high point first, get an overview of the layout, identify the major landmarks, spot the dead ends, and then start navigating with actual information.
Exploratory Data Analysis – EDA – is climbing to the high point first.
It’s the practice of examining your data before doing anything else with it. Before building a model. Before running a statistical test. Before engineering features or selecting algorithms. Just looking. Asking questions. Getting a feel for what you actually have.
It sounds almost too simple to be important. It isn’t. EDA is consistently cited by experienced data scientists as one of the highest-leverage activities in any data project – and consistently skipped by beginners who are impatient to get to the “real” work.
This article will show you what EDA actually involves, why it matters so much, and how to think about it – without requiring a single equation.

What EDA Actually Is
Exploratory Data Analysis was formally introduced as a concept by the statistician John Tukey in his 1977 book of the same name. Tukey’s core argument was radical for its time: before you test a hypothesis, before you fit a model, you should explore your data openly and without preconceptions.
His approach was visual, intuitive, and deliberately non-mathematical. He wanted analysts to look at data – to let patterns, anomalies, and structures reveal themselves before being forced into predetermined frameworks.
Decades later, his insight holds. EDA is not a formal procedure with fixed steps. It’s a mindset and a set of practices oriented around one central question:
What does my data actually contain, and does it match what I assumed it would contain?
That question sounds simple. The gap between assumption and reality, when you actually go looking, is almost never simple.
Why EDA Is Non-Negotiable
Here’s a scenario that plays out constantly in real data projects:
A team receives a dataset, assumes it’s clean and representative, and spends three weeks building a model. The model performs poorly. They try different algorithms. Still poor. They tune hyperparameters. Still poor.
Eventually someone actually looks at the raw data and discovers that 40% of the values in a key column are missing – filled in with a default value of zero that looks like valid data but isn’t. The model has been learning patterns from corrupted inputs the entire time.
Three weeks of work, undone by something that thirty minutes of EDA would have caught on day one.
This is not an unusual story. It’s standard. And it illustrates the core reason EDA matters:
You cannot make good decisions about data you haven’t examined.
Model selection, feature choices, evaluation strategy, data cleaning priorities – all of these decisions depend on understanding what your data actually contains. EDA is how you get that understanding.
Beyond catching problems, EDA also surfaces opportunities. Patterns you didn’t expect. Relationships between variables that suggest useful features. Natural groupings in the data that inform modeling strategy. Outliers that turn out to be the most interesting cases in the entire dataset.
EDA is where data surprises you – and in data science, surprises are almost always informative.
The Four Questions EDA Answers
Rather than presenting EDA as a checklist of techniques, it’s more useful to frame it as four fundamental questions you’re trying to answer about your data.
Question 1: What Do I Actually Have?
Before anything else, you need a basic inventory of your dataset.
- How many rows (samples) and columns (features) are there?
- What type of information does each column contain – numbers, categories, dates, text?
- What is each column supposed to represent?
- Are there columns you don’t understand yet?
This sounds trivially simple. It often isn’t. Real-world datasets arrive with cryptic column names, ambiguous units, mixed data types, and documentation that’s incomplete or missing entirely. You need to know what you’re working with before you can work with it.
A practical tool here is simply looking at the first few rows of your data – what data scientists call the “head” of the dataset. Does it match what you expected? Are the values in each column sensible? Does anything look immediately wrong?
What you’re looking for: A clear mental map of the dataset’s structure – its shape, its contents, and any immediate red flags.
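In pandas, this first inventory takes a few one-liners. A minimal sketch, using a tiny invented dataset (the column names here are illustrative, not from any real source):

```python
import pandas as pd

# A tiny stand-in for a real dataset; columns and values are invented
rides = pd.DataFrame({
    "duration_min": [62, 45, 120, 90],
    "avg_power_w": [210, 0, 265, 180],
    "ride_type": ["endurance", "recovery", "endurance", "tempo"],
})

print(rides.shape)    # (rows, columns)
print(rides.dtypes)   # the type of information in each column
print(rides.head())   # first rows: do the values look sensible?
```

Even this toy example surfaces a question worth asking: is an average power of zero a real measurement, or a placeholder?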
Question 2: What’s Missing, Broken, or Weird?
This is data quality assessment, and it is almost always the most important phase of EDA.
Real-world data is messy. That’s not a cliché – it’s a structural feature of how data gets generated and collected. Sensors fail. Forms get submitted with blank fields. Databases get migrated and values get corrupted. Human operators make errors. Systems record placeholder values that get mistaken for real data.
Missing values are the most common issue. But missing values are not all the same:
- A heart rate reading of zero probably isn’t a real measurement – it’s a missing value disguised as a number
- A blank field in a “secondary phone number” column might be genuinely absent – the person doesn’t have a second phone
- A missing value in a “date of discharge” column for a hospital dataset might mean the patient is still admitted – which is critical information, not missing information
Understanding why values are missing is as important as knowing that they’re missing. The pattern of missingness carries information.
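A sketch of both checks in pandas, using invented data in which a zero heart rate is a placeholder rather than a measurement:

```python
import pandas as pd
import numpy as np

# Invented data: zero heart rate is a placeholder, not a real reading
df = pd.DataFrame({
    "heart_rate": [142, 0, 155, 0, 138],
    "second_phone": [None, "555-0101", None, None, "555-0199"],
})

print(df.isna().sum())  # explicit missing values per column

# Zeros disguised as data: count them, then treat them as missing
disguised = (df["heart_rate"] == 0).sum()
print(disguised)  # 2 placeholder readings
df["heart_rate"] = df["heart_rate"].replace(0, np.nan)
print(df["heart_rate"].isna().sum())  # now counted as truly missing
```

Note the asymmetry: the blank phone numbers may be genuinely absent, so they should not be "fixed", while the zero heart rates must be converted before any summary statistic touches them.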
Outliers are the second major category. An outlier is a value that sits far outside the expected range for a variable. But here’s the critical nuance: outliers are not automatically errors.
Consider cycling power output data. If your dataset shows one ride with an average power of 850 watts when everyone else averages between 150 and 300 watts – that’s an outlier. But is it an error? It could be:
- A data entry mistake (someone typed 850 instead of 185)
- A sensor malfunction
- An actual elite-level athlete who genuinely produces that power
- A very short maximal sprint that got averaged incorrectly
You cannot know which without investigating. EDA is where you find these cases and ask the question. Models trained on uninvestigated outliers often learn the wrong lessons – the outlier either corrupts the model or gets ignored in ways that matter.
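One common rough screen is the 1.5×IQR rule: flag anything more than one and a half interquartile ranges beyond the quartiles. A sketch with invented numbers, where the 850 W ride is the suspect:

```python
import pandas as pd

power = pd.Series([180, 210, 195, 250, 165, 850, 220, 240])

# Rough rule of thumb: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = power.quantile(0.25), power.quantile(0.75)
iqr = q3 - q1
outliers = power[(power < q1 - 1.5 * iqr) | (power > q3 + 1.5 * iqr)]
print(outliers)  # flags the 850 W ride for investigation, not deletion
```

The output is a list of candidates, not a verdict: the rule finds the cases worth asking questions about.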
Duplicates are the third common issue – the same observation appearing multiple times in the dataset. Duplicates inflate your sample size artificially and can make model performance metrics look better than they are.
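In pandas the duplicate check is direct. A sketch with invented rows:

```python
import pandas as pd

df = pd.DataFrame({
    "ride_id": [101, 102, 102, 103],
    "duration_min": [60, 45, 45, 90],
})

print(df.duplicated().sum())  # fully identical rows beyond the first: 1
df = df.drop_duplicates()
print(len(df))                # unique observations remain
```

Worth deciding consciously: should "duplicate" mean identical in every column, or identical in a key column like `ride_id`? `duplicated(subset=["ride_id"])` checks the latter.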
What you’re looking for: A clear picture of data quality issues – where values are missing, where values look wrong, where observations appear to be duplicated – before any of this silently corrupts your analysis.
Question 3: What Does Each Variable Look Like on Its Own?
Once you have a basic quality picture, you want to understand each variable individually. This is called univariate analysis – looking at one variable at a time.
The primary tool here is visualization, not calculation. You want to see the shape of each variable’s distribution.
For numerical variables, the key visual is a histogram – a bar chart that shows how frequently different value ranges appear. Looking at a histogram tells you:
- Is the distribution roughly symmetric? Most values cluster around a central point with roughly equal tails on either side.
- Is it skewed? Most values cluster at the low end with a long tail stretching to the right (right-skewed), or vice versa.
- Is it bimodal? There are two distinct humps – suggesting the data might actually contain two different populations mixed together.
- Are there impossible values? Negative ages. Heart rates of zero. Distances of -5km. These are data quality issues wearing the costume of valid numbers.
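`Series.hist()` draws the picture; the same binning can also be inspected numerically with `pd.cut`. A sketch with invented ride durations:

```python
import pandas as pd

durations = pd.Series([45, 60, 55, 62, 58, 240, 50, 48])

# durations.hist() would draw the histogram; this shows the bin counts directly
binned = pd.cut(durations, bins=[0, 60, 120, 180, 300])
print(binned.value_counts().sort_index())

# And a quick screen for impossible values
print((durations <= 0).sum())  # 0 here, but worth checking every time
```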
For categorical variables, the equivalent visual is a bar chart showing how frequently each category appears. This immediately reveals:
- Class imbalance: One category appears vastly more often than others. In a fraud detection dataset, “legitimate transaction” might represent 99% of all cases. This has massive implications for how you build and evaluate your model – a point we explored in depth in our article on bias in ML systems.
- Rare categories: Some categories appear so infrequently they may not provide enough examples for a model to learn from
- Unexpected categories: Values that shouldn’t exist in the column – typos, inconsistent formatting, categories that were supposed to be merged
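`value_counts()` is the numeric counterpart of the bar chart, and it surfaces all three issues at once. A sketch with an invented, deliberately imbalanced label column:

```python
import pandas as pd

# Invented labels: heavy imbalance plus one formatting inconsistency
labels = pd.Series(["legit"] * 97 + ["fraud"] * 2 + ["Legit"] * 1)

counts = labels.value_counts()
print(counts)  # reveals imbalance and the unexpected "Legit" variant

share = counts / len(labels)
print(share["legit"])  # the dominant class share
```

Here “Legit” is not a third class: it is “legit” with inconsistent capitalization, exactly the kind of silent issue a frequency table exposes.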
What you’re looking for: The shape, range, and character of each variable individually – its distribution, its typical values, its extremes, and any anomalies that suggest data quality issues or interesting structure.
Question 4: How Do Variables Relate to Each Other?
This is where EDA becomes genuinely exciting – and where the most valuable insights for modeling typically emerge.
Bivariate analysis examines relationships between pairs of variables. The central question is: when one variable changes, does another tend to change with it?
For two numerical variables, the primary visual tool is a scatter plot – a graph where each observation is plotted as a point, with one variable on the horizontal axis and the other on the vertical axis. Scatter plots reveal:
- Positive relationships: As one variable increases, the other tends to increase too
- Negative relationships: As one variable increases, the other tends to decrease
- Non-linear relationships: The relationship exists but follows a curve rather than a straight line – critical for deciding whether a linear model is appropriate
- No relationship: The points scatter randomly with no discernible pattern
- Clusters: Groups of observations that behave distinctly differently from the rest
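A scatter plot (`plt.scatter` or `df.plot.scatter`) shows the shape; the correlation coefficient then summarizes its direction and strength in one number. A sketch with invented power and heart-rate pairs:

```python
import pandas as pd

# Invented observations with a roughly linear positive relationship
df = pd.DataFrame({
    "power":      [150, 180, 210, 240, 270, 300],
    "heart_rate": [120, 131, 139, 150, 158, 170],
})

# df.plot.scatter(x="power", y="heart_rate") would draw the picture;
# the correlation coefficient summarizes direction and strength
r = df["power"].corr(df["heart_rate"])
print(round(r, 2))  # close to 1: a strong positive relationship
```

One caution: the coefficient only captures linear association, which is why looking at the scatter plot first still matters – a curved relationship or a hidden cluster can produce a misleadingly low number.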
For a numerical variable and a categorical variable, a useful visual is a box plot – a compact summary that shows the range, median, and spread of a numerical variable separately for each category. A box plot lets you quickly see: “does this numerical variable behave differently across these categories?”
For example: does resting heart rate differ meaningfully between cyclists who train more than 10 hours per week and those who train fewer than 5? A box plot answers that question visually in seconds.
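`DataFrame.boxplot(column=..., by=...)` draws this comparison; a grouped summary shows the same contrast numerically. A sketch with invented resting heart rates for the two training groups:

```python
import pandas as pd

# Invented values for the two training-volume groups
df = pd.DataFrame({
    "hours_group": ["10+", "<5", "10+", "<5", "10+", "<5"],
    "resting_hr":  [48, 62, 52, 66, 50, 60],
})

# df.boxplot(column="resting_hr", by="hours_group") would draw it;
# the per-group medians show the same contrast in numbers
print(df.groupby("hours_group")["resting_hr"].median())
```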
For two categorical variables, a heatmap of counts shows how frequently different combinations of categories appear together.
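`pd.crosstab` produces the count table that such a heatmap would color in. A sketch with invented ride types and weekdays:

```python
import pandas as pd

# Invented categorical pairs
df = pd.DataFrame({
    "ride_type": ["endurance", "tempo", "endurance", "tempo", "endurance"],
    "weekday":   ["Sat", "Tue", "Sun", "Thu", "Sat"],
})

# The count table behind a categorical-vs-categorical heatmap
table = pd.crosstab(df["ride_type"], df["weekday"])
print(table)
```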
The relationship that matters most: features vs. target
In a supervised learning project, the most important bivariate relationships are between each feature and your target variable. These relationships are the signal your model is trying to learn. Examining them visually tells you:
- Which features appear to have meaningful relationships with the target
- Whether those relationships are linear or curved
- Whether some features show almost no relationship with the target at all – suggesting they may not be useful
This directly informs feature engineering – knowing which raw features carry signal helps you decide how to transform and combine them into inputs that a model can learn from effectively.
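One quick screen is to rank features by their absolute correlation with the target, keeping in mind that this only captures linear relationships. A sketch with invented columns (the feature names are illustrative):

```python
import pandas as pd

# Invented features and target; "shoe_size" is a deliberately weak feature
df = pd.DataFrame({
    "training_hours": [4, 6, 8, 10, 12, 14],
    "shoe_size":      [42, 44, 41, 43, 42, 44],
    "performance":    [3, 4, 5, 6, 8, 9],  # target
})

# Rank features by absolute correlation with the target
corrs = df.corr()["performance"].drop("performance").abs()
corrs = corrs.sort_values(ascending=False)
print(corrs)  # training_hours carries far more signal than shoe_size
```

A low ranking here doesn’t prove a feature is useless – a non-linear or interaction effect can hide from a correlation – but a high ranking is a strong hint about where the signal lives.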
What you’re looking for: Relationships between variables – particularly between features and the target – that reveal what drives the outcome you’re trying to predict.
The Most Important EDA Visuals (And What Each One Tells You)
| Visual | Best Used For | What It Reveals |
| --- | --- | --- |
| Histogram | Single numerical variable | Distribution shape, skew, outliers, impossible values |
| Bar chart | Single categorical variable | Category frequency, class imbalance, rare/unexpected values |
| Scatter plot | Two numerical variables | Relationships, correlations, clusters, non-linearity |
| Box plot | Numerical vs. categorical | Distribution differences across groups |
| Heatmap | Correlations across all numerical variables | Which variables move together – potential redundancy or signal |
| Line chart | Numerical variable over time | Trends, seasonality, sudden shifts, data collection gaps |
| Pair plot | Multiple numerical variables at once | Overview of all pairwise relationships simultaneously |
What Good EDA Looks Like in Practice
EDA isn’t a linear process with a defined start and end. It’s iterative and conversational – you look at something, it raises a question, you look at something else to answer that question, which raises another question.
A realistic EDA session on a cycling performance dataset might go like this:
- First look: Check the shape – 3,400 rows, 18 columns. Look at the first few rows. Something looks odd in the “average power” column – some values are zero.
- Investigate zeros: How many zeros? Filter the data – 47 rows have zero average power. These can’t be real cycling rides. Are they rest days accidentally included? Data collection errors? Flag them for removal or investigation.
- Distribution check: Plot a histogram of ride durations. Mostly between 30 and 180 minutes – makes sense. But there’s a cluster of rides under 5 minutes. Short warmups? Accidental recordings? Worth checking.
- Relationship exploration: Scatter plot of average power vs. heart rate. Expect a positive relationship – higher power should mean higher heart rate. The relationship is there, but there’s a cluster of high-heart-rate, low-power observations that don’t fit. What are those? Turns out they’re all from one specific month – possibly illness affecting the rider’s efficiency.
- Target variable analysis: Look at the distribution of the target variable – say, “performance level” scored 1-10. It’s heavily skewed toward the middle (5-7). Very few 1s and 10s. This imbalance will need to be addressed in modeling.
Each observation raises a question. Each question leads to another look. This is EDA – not a procedure, but a dialogue with your data.
Common EDA Mistakes Worth Avoiding
Skipping it entirely.
The most common mistake. “I’ll just clean obvious issues and start modeling.” This almost always costs more time than it saves.
Treating it as a checklist.
EDA done mechanically – running standard plots without actually thinking about what they show – misses the point entirely. The goal is understanding, not completion.
Ignoring what you don’t expect.
When something surprising shows up, the temptation is to rationalize it away. “That’s probably just noise.” Sometimes it is. Often it isn’t. Unexpected findings deserve investigation, not dismissal.
Confusing correlation with causation during EDA.
Two variables moving together doesn’t mean one causes the other. EDA surfaces relationships – explaining them is a separate, careful task.
Stopping too early.
The first pass of EDA rarely reveals everything important. Real insights often emerge on the second or third look, after earlier findings have reshaped what questions you’re asking.
EDA and the Bigger Picture
EDA sits at a specific and critical point in the data science workflow: after data collection, before modeling. But its influence extends throughout the entire project.
Good EDA:
- Prevents wasted weeks building models on corrupted data
- Surfaces the features most likely to be predictive
- Reveals data quality issues before they quietly corrupt model performance
- Informs preprocessing decisions – how to handle missing values, outliers, skewed distributions
- Sets realistic expectations about what the data can and cannot support
Every decision you make downstream – about features, about model complexity, about evaluation strategy – is made better by having genuinely explored your data first.
The experienced data scientist who spends two days on EDA before touching a model is not being slow. They’re being fast – in the way that matters, over the timeframe that matters.
Summary: EDA at a Glance
| Phase | Core Question | Key Tools |
| --- | --- | --- |
| Inventory | What do I actually have? | Head/tail view, shape check, column types |
| Quality check | What’s missing, broken, or weird? | Missing value counts, outlier plots, duplicate checks |
| Univariate analysis | What does each variable look like alone? | Histograms, bar charts, box plots |
| Bivariate analysis | How do variables relate to each other? | Scatter plots, box plots, heatmaps, pair plots |
What Comes Next
Once you’ve genuinely explored your data – understood its shape, its quality, its distributions, and its key relationships – you’re ready to make informed decisions about the next phase: data cleaning, feature engineering, and model selection.
Without EDA, those decisions are guesses. With it, they’re informed choices grounded in what the data actually contains.
The map has been drawn. Now you can navigate.
