What Is a Dataset, Really? Rows, Columns, Samples & Features – Explained Properly

You’ve probably seen the word “dataset” thrown around constantly – in machine learning tutorials, in news articles about AI, maybe even in your own work. But here’s a question worth pausing on: do you actually know what a dataset is, structurally?

Not just “a collection of data.” That’s the dictionary answer, and it’s not very useful.

To truly understand how machine learning works, how models learn, and why data quality matters so much, you need to understand what a dataset looks like under the hood – the rows, the columns, the samples, the features. These aren’t interchangeable words. They mean specific things. And when you understand them properly, a lot of other concepts in data science click into place automatically.

This article will walk you through all of it. No assumed knowledge. No skipped steps. By the end, you’ll look at any dataset and immediately understand its structure, its language, and what it’s actually telling you.

The Dataset: A Structured Collection of Information

Let’s start with the most honest definition:

A dataset is a structured collection of observations about a subject, organized so that patterns can be identified, analyzed, or learned from.

The word “structured” is doing a lot of work there. Raw information isn’t a dataset. A pile of notes isn’t a dataset. A dataset has shape – a deliberate arrangement that makes analysis possible.

The most common shape? A table.

Think of a simple spreadsheet. Rows going across. Columns going down. If you can picture that, you already understand the fundamental architecture of most datasets used in data science and machine learning today.

But the words we use to describe those rows and columns matter – because they carry meaning beyond just “horizontal” and “vertical.”

Rows = Samples (Also Called Observations or Data Points)

Each row in a dataset represents one complete observation. One thing that was measured, recorded, or captured.

In data science, we call these rows samples – though you’ll also hear the terms observations, instances, or data points used interchangeably.

Here’s what that looks like in practice:

|       | Age | Weight (kg) | Resting HR (bpm) | Fitness Level |
|-------|-----|-------------|------------------|---------------|
| Row 1 | 34  | 72          | 58               | High          |
| Row 2 | 51  | 88          | 74               | Medium        |
| Row 3 | 28  | 65          | 62               | High          |

Each row is one person. One sample. One complete snapshot of that individual’s recorded information.

If your dataset has 10,000 rows, you have 10,000 samples – 10,000 individual observations. This is often what people mean when they talk about dataset “size.” More samples generally means more information for a model to learn from, which is why large datasets are considered valuable.

Quick analogy: Think of a dataset like a doctor’s patient files. Each file folder is one patient – one sample. Open any folder and you’ll find the same categories of information: age, weight, blood pressure, diagnosis. Every patient is a row. Every category is a column.
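Here is a minimal sketch in pandas, recreating the small health table above, that makes the row = sample idea concrete:

```python
import pandas as pd

# Recreate the example table: each row is one person (one sample)
df = pd.DataFrame({
    "Age": [34, 51, 28],
    "Weight (kg)": [72, 88, 65],
    "Resting HR (bpm)": [58, 74, 62],
    "Fitness Level": ["High", "Medium", "High"],
})

# The number of rows is the number of samples
n_samples = len(df)
print(n_samples)  # 3
```

A 10,000-row dataset would simply make `len(df)` return 10,000 – the structure is identical.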

Columns = Features (Also Called Variables, Attributes, or Predictors)

If rows are the who or what being observed, columns are the what we measured.

Each column represents one specific type of information recorded across all samples. In data science, we call these columns features – though you’ll also encounter the terms variables, attributes, predictors, or dimensions.

Using the same table:

  • Age is a feature
  • Weight is a feature
  • Resting Heart Rate is a feature
  • Fitness Level is a feature

Every sample has a value for each feature. That intersection – one row, one column – is called a cell, and it holds a single data point: one observation’s measurement of one variable.

Features are the raw material that machine learning algorithms actually work with. They’re the inputs. The signals. The information a model uses to find patterns or make predictions.
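Continuing the same sketch, a column is one feature shared across all samples, and a single cell is one sample's value for one feature:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [34, 51, 28],
    "Weight (kg)": [72, 88, 65],
    "Resting HR (bpm)": [58, 74, 62],
    "Fitness Level": ["High", "Medium", "High"],
})

# A column is one feature, recorded for every sample
ages = df["Age"]

# A cell is the intersection of one row and one column
first_age = df.loc[0, "Age"]
print(list(ages), first_age)  # [34, 51, 28] 34
```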

The Target Variable: The Column That’s Different

In most supervised machine learning tasks, one column plays a special role: the target variable (also called the label, output, or dependent variable).

This is the thing you’re trying to predict or explain.

In our example above, Fitness Level might be the target – the thing we want a model to predict, based on the other features (age, weight, resting heart rate).

The remaining features – the inputs – are often called independent variables or simply the features. The target is what you’re aiming at. Everything else is what you’re using to aim.

| Type            | Also Called                       | What It Does                       |
|-----------------|-----------------------------------|------------------------------------|
| Feature columns | Predictors, inputs, attributes    | What the model learns from         |
| Target column   | Label, output, dependent variable | What the model tries to predict    |

This distinction – features vs. target – is fundamental to understanding how supervised learning works. More on that in our article on [machine learning].
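In code, this split is usually made explicit. A common convention (a sketch, not the only way to do it) is to put the features in a variable called `X` and the target in `y`:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [34, 51, 28],
    "Weight (kg)": [72, 88, 65],
    "Resting HR (bpm)": [58, 74, 62],
    "Fitness Level": ["High", "Medium", "High"],
})

# X holds the input features; y holds the target (label)
X = df.drop(columns=["Fitness Level"])
y = df["Fitness Level"]

print(X.shape)  # (3, 3) -> 3 samples, 3 features
print(y.name)   # Fitness Level
```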

Dimensions: What “High-Dimensional” Actually Means

Here’s a term you’ll hear constantly in data science: dimensionality.

The number of features (columns) in your dataset is its dimensionality. A dataset with 5 columns has 5 dimensions. A dataset with 500 columns has 500 dimensions.

When data scientists talk about the curse of dimensionality or high-dimensional data, they’re talking about datasets with many, many features. This creates real problems – more on that in a dedicated article – but for now, just know that dimensions = number of features = number of columns (excluding the target).

Analogy: Imagine you’re trying to describe a person. If you record just height and weight, you’re working in 2 dimensions – easy to visualize, easy to work with. Add age, fitness level, sleep hours, daily steps, stress score, blood pressure… suddenly you’re in 8 dimensions. You can’t picture it, but a machine learning algorithm can navigate it mathematically.

The Shape of a Dataset

The first thing data scientists almost always check about a dataset is its shape. Shape is simply:

(number of rows) × (number of columns)

Or: (samples) × (features)

A dataset with 1,500 people and 12 measured variables has a shape of 1,500 × 12. This single expression tells you an enormous amount about what you’re working with before you’ve looked at a single value.

In Python’s pandas library – the most widely used data manipulation tool in data science – you can retrieve this instantly with:

df.shape

# Output: (1500, 12)

It’s one of the first things any data scientist checks when encountering a new dataset.

Types of Features: Not All Columns Are Equal

Here’s where it gets slightly more nuanced – and, importantly, more useful.

Features (columns) don’t all contain the same type of information, and this matters for how you process and analyze them.

Numerical Features

These contain numbers that represent measurable quantities.

  • Continuous: Can take any value within a range. Examples: weight (72.4 kg), heart rate (64 bpm), distance (42.2 km)
  • Discrete: Whole numbers only. Examples: number of training sessions per week, number of children

Categorical Features

These represent groups or categories, not quantities.

  • Nominal: Categories with no inherent order. Examples: blood type (A, B, O, AB), country, color
  • Ordinal: Categories with a meaningful order, but no precise numerical spacing. Examples: fitness level (Low, Medium, High), satisfaction rating (Poor, Fair, Good, Excellent)

Why Does This Matter?

Because most machine learning algorithms expect numbers – and they expect you to treat numerical and categorical data differently. Feeding a model the word “High” when it expects a number will cause errors. Understanding your feature types is step one of any data preprocessing workflow.
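A short sketch of what this looks like in practice, using pandas (the mapping for ordinal values and one-hot encoding for nominal ones are two common approaches, not the only ones):

```python
import pandas as pd

df = pd.DataFrame({
    "Weight (kg)": [72.4, 88.0, 65.2],            # continuous numerical
    "Sessions/week": [3, 1, 5],                    # discrete numerical
    "Blood Type": ["A", "O", "B"],                 # nominal categorical
    "Fitness Level": ["High", "Medium", "High"],   # ordinal categorical
})

# Step one of preprocessing: inspect the feature types
print(df.dtypes)

# Ordinal: map the ordered categories to numbers that preserve the order
order = {"Low": 0, "Medium": 1, "High": 2}
df["Fitness Level"] = df["Fitness Level"].map(order)

# Nominal: one-hot encode, since the categories have no inherent order
df = pd.get_dummies(df, columns=["Blood Type"])
print(sorted(df.columns))
```

After these two steps, every column is numeric and safe to hand to an algorithm that expects numbers.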

A Real-World Example: Cycling Performance Dataset

Let’s ground all of this in something concrete.

Imagine you’re building a dataset to analyze cycling performance – something very relevant if you use tools like our Apple Health Cycling Analyzer.

| Rider ID | Age | VO2 Estimate | Avg Power (W) | Ride Duration (min) | Elevation Gain (m) | HR Drift (%) | Performance Level |
|----------|-----|--------------|---------------|---------------------|--------------------|--------------|-------------------|
| 001      | 38  | 52.1         | 187           | 94                  | 820                | 4.2          | Trained           |
| 002      | 45  | 44.8         | 152           | 112                 | 430                | 9.7          | Recreational      |
| 003      | 31  | 61.3         | 224           | 78                  | 1,140              | 2.1          | Elite             |

Here’s what you can now identify immediately:

  • Samples (rows): Each row is one cyclist – one observation
  • Features (columns): Age, VO2 estimate, average power, ride duration, elevation gain, HR drift
  • Target variable: Performance Level (what we might want to predict)
  • Feature types: All numerical except Performance Level, which is ordinal-categorical
  • Dataset shape: 3 × 7 (tiny example – real datasets have thousands of rows)

This is exactly how [training data] is constructed before it’s fed to a machine learning model. The structure you’ve just learned is the foundation everything else is built on.
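Putting the whole vocabulary together, here is a sketch of the cycling table above loaded into pandas and split the way it would be before training (dropping the identifier is an assumption about how you'd model this, but a standard one):

```python
import pandas as pd

# The cycling table from above; Rider ID is an identifier, not a feature
df = pd.DataFrame({
    "Rider ID": ["001", "002", "003"],
    "Age": [38, 45, 31],
    "VO2 Estimate": [52.1, 44.8, 61.3],
    "Avg Power (W)": [187, 152, 224],
    "Ride Duration (min)": [94, 112, 78],
    "Elevation Gain (m)": [820, 430, 1140],
    "HR Drift (%)": [4.2, 9.7, 2.1],
    "Performance Level": ["Trained", "Recreational", "Elite"],
})

print(df.shape)  # (3, 8) -> 3 samples, 8 columns including ID and target

# Split into inputs (X) and target (y), dropping the identifier
X = df.drop(columns=["Rider ID", "Performance Level"])
y = df["Performance Level"]
print(X.shape)  # (3, 6) -> 6 input features
```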

Common Misconceptions Worth Clearing Up

“More data is always better.”
More samples helps – up to a point. But more features can actually hurt performance if those features are irrelevant, redundant, or noisy. Quality and relevance matter as much as quantity.

“A dataset is just a spreadsheet.”
A spreadsheet is one representation of a dataset. Datasets can also be stored in databases, JSON files, image folders, audio libraries, or text corpora. The table structure is the most common format in structured data – but datasets take many forms.

“Rows and columns can be used interchangeably depending on the software.”
Almost never true in practice. The row = sample, column = feature convention is nearly universal in data science. Flipping it causes real problems when feeding data into algorithms.

“The target variable is always in the last column.”
It often is, by convention – but not always. Always check the documentation for any dataset you’re working with.

Why This Matters: The Foundation of Everything

Understanding the anatomy of a dataset isn’t just academic housekeeping. It’s the lens through which every other concept in data science becomes legible.

  • Model training is the process of learning patterns across samples and features
  • Overfitting often happens when there are too few samples relative to features
  • Feature engineering is the art of transforming and creating columns to improve model performance
  • Data cleaning is largely about ensuring each cell contains the right type of value for its feature

When a data scientist says “we need more data,” they usually mean more samples. When they say “we need better features,” they mean better columns. When they say the dataset is “too wide,” they mean too many features relative to samples.

This vocabulary – samples, features, shape, dimensionality, target – is the shared language of data science. Now you speak it.

Summary: The Core Concepts at a Glance

| Term            | Also Known As                   | What It Means                              |
|-----------------|---------------------------------|--------------------------------------------|
| Dataset         | Data table, training set        | Structured collection of observations      |
| Row             | Sample, observation, instance   | One complete data point                    |
| Column          | Feature, variable, attribute    | One measured property across all samples   |
| Target variable | Label, output                   | The column you’re trying to predict        |
| Shape           | Dimensions                      | (rows × columns) = (samples × features)    |
| Dimensionality  | Number of dimensions            | Number of features in the dataset          |

What Comes Next?

Now that you understand what a dataset actually is, the natural next step is understanding what kind of dataset you’re working with – structured vs. unstructured, labeled vs. unlabeled, balanced vs. imbalanced. These distinctions determine which tools and algorithms are appropriate.

We also recommend exploring how datasets connect to the full machine learning pipeline: from raw data collection, through cleaning and feature engineering, all the way to model evaluation. Each stage builds directly on the foundation you’ve just established.

Understanding the dataset is step one. Everything else in data science and machine learning follows from here.
