Using Pandas and NumPy for Machine Learning

In machine learning, the quality of your model's predictions is tightly linked to the quality of the data you feed it. Pandas and NumPy are the workhorses that make that preparation possible. They don't train models or tune hyperparameters, but they shape, clean, and transform raw information into something a model can actually learn from.

NumPy handles the heavy lifting for numerical computation: fast, memory-efficient array operations, vectorized math, and tools for generating or transforming data in bulk. Pandas builds on that foundation, adding labeled data structures, flexible indexing, and a rich toolkit for cleaning and reshaping datasets.

In this article, we'll explore how to use Pandas and NumPy together to prepare data for machine learning. You'll learn how to clean and transform datasets, engineer meaningful features, avoid common pitfalls like data leakage, and connect seamlessly with libraries like scikit-learn. By the end, you'll have a set of practical techniques you can apply to your own projects, from quick experiments to production-bound pipelines.

Foundations: NumPy and Pandas, Side by Side

Before diving into preprocessing or feature engineering, it's worth getting comfortable with the two libraries at the core of most ML data pipelines. NumPy gives you the raw computational power, while Pandas adds structure and convenience. Used together, they cover almost every stage of in-memory data handling.

NumPy in Practice

NumPy arrays (ndarray) are dense, homogeneous collections of values. They're designed for speed, using contiguous memory layouts that allow vectorized operations, meaning you operate on whole arrays without writing explicit Python loops.

import numpy as np
 
# Create a 2D NumPy array
X = np.array([[1, 2, 3],
              [4, 5, 6]])
 
# Simple vectorized operation
X_scaled = X * 10  # multiplies every element by 10
 
# Broadcasting example: adding a row vector to each row
bias = np.array([0.1, 0.2, 0.3])
X_biased = X + bias

Gotcha to watch: Shapes matter. Operations that work on (n_samples, n_features) may break if your array becomes 1D (shape == (n,)). Always check X.shape before passing data to an ML model.
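
For example, selecting a single column from a 2D array silently drops it to 1D. A minimal sketch of the check and fix (the variable names are just illustrative):

import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])

col = X[:, 0]                 # selecting one column yields a 1D array
print(col.shape)              # (2,)  -- many estimators expect 2D input

col_2d = col.reshape(-1, 1)   # restore the (n_samples, 1) shape
print(col_2d.shape)           # (2, 1)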

Pandas in Practice

Pandas introduces the Series (a labeled 1D array) and the DataFrame (a labeled 2D table). These structures keep track of row and column labels, align data automatically during operations, and make data inspection easier.

import pandas as pd
 
# Create a simple DataFrame
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [50000, 64000, 120000],
    "plan": ["basic", "premium", "premium"]
})
 
# Vectorized filtering
high_income = df[df["income"] > 60000]
 
# Adding a derived column
df["age_group"] = np.where(df["age"] < 30, "young", "adult")

Gotcha to watch: Pandas' automatic index alignment can save you from mismatched data, but it can also introduce subtle bugs if you unintentionally merge or align on the wrong index.
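
A small illustration of how silent alignment can surprise you (the Series here are made up):

import pandas as pd

a = pd.Series([1, 2, 3], index=[0, 1, 2])
b = pd.Series([10, 20, 30], index=[1, 2, 3])

print(a + b)   # NaN at labels 0 and 3 -- values only combine where the indexes match

# If you really want positional addition, align the indexes explicitly first
print(a.reset_index(drop=True) + b.reset_index(drop=True))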

Rule of thumb: NumPy is best for raw numerical operations and matrix math, Pandas for labeled data manipulation and cleaning. In practice, you'll often move between them using .to_numpy() and pd.DataFrame() when needed.
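
A quick round trip between the two might look like this (the log transform is just an example operation on made-up data):

import numpy as np
import pandas as pd

df_small = pd.DataFrame({"age": [25, 32, 47], "income": [50000, 64000, 120000]})

arr = df_small.to_numpy(dtype=float)   # DataFrame -> NumPy for raw math
arr[:, 1] = np.log1p(arr[:, 1])        # e.g. log-transform the income column

df_back = pd.DataFrame(arr, columns=df_small.columns)   # NumPy -> DataFrame to restore labels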

Data Preprocessing with Pandas and NumPy

Once you've loaded your data, the first major job is to get it into a state your model can use. In most ML projects, preprocessing takes far more time than model training, and it's where Pandas and NumPy shine.

Data Cleaning

Messy data will sink a model before it even starts learning. Missing values, duplicates, and outliers can skew training results or cause outright errors.

# Option 1: drop rows with any missing values
df = df.dropna()
 
# Option 2: fill missing values with the median instead of dropping
df["age"] = df["age"].fillna(df["age"].median())
 
# Remove duplicates
df = df.drop_duplicates()
 
# Cap outliers using quantiles
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(lower=low, upper=high)

Pro tip: If you're using scikit-learn, consider SimpleImputer for missing values so the same logic applies during training and prediction.
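
A minimal sketch of that idea, using a small made-up frame with missing values (the column names are illustrative):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"age": [25.0, np.nan, 47.0], "income": [50000.0, 64000.0, np.nan]})
new_rows = pd.DataFrame({"age": [np.nan], "income": [70000.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                  # learn the medians from training data only
print(imputer.transform(new_rows))  # the same medians are reused at prediction time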

Types That Matter

ML libraries expect consistent, machine-friendly types. Pandas' flexible dtypes are great for analysis but can cause trouble during model training.

# Convert strings to datetime
df["signup_date"] = pd.to_datetime(df["signup_date"], utc=True)
 
# Make categorical columns memory-efficient
df["plan"] = df["plan"].astype("category")
 
# Ensure numeric columns are actually numeric
df["income"] = pd.to_numeric(df["income"], errors="coerce")

Watch out: Mixed dtypes in a single column (object with both numbers and strings) will often break downstream code.
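
Here's what that can look like, and one way to coerce the column back to a single numeric dtype:

import pandas as pd

s = pd.Series([100, "200", "N/A", 350])      # object dtype with numbers and strings mixed
print(s.dtype)                               # object

cleaned = pd.to_numeric(s, errors="coerce")  # unparseable entries become NaN
print(cleaned.dtype)                         # float64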

Feature Engineering

Raw columns rarely capture all the patterns in your data. Feature engineering transforms them into something more informative.

# One-hot encode categorical data
df = pd.get_dummies(df, columns=["plan"], drop_first=True)
 
# Normalize numeric columns with NumPy
df["income_norm"] = (df["income"] - df["income"].mean()) / df["income"].std()
 
# Create features from dates
df["signup_month"] = df["signup_date"].dt.month
 
# Conditional features with np.where
df["is_high_income"] = np.where(df["income"] > 80000, 1, 0)

The goal isn't just to create more features; it's to create better ones that capture useful patterns without leaking future information into your training set.
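
For example, normalization statistics should come from the training rows only. A sketch of the leakage-safe version of the scaling above, on synthetic data with a hypothetical split:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({"income": rng.normal(60000, 15000, size=1000)})

train, test = data.iloc[:800], data.iloc[800:]   # hypothetical train/test split

# Fit the statistics on the training rows only...
mean, std = train["income"].mean(), train["income"].std()

# ...then apply them to both splits, so nothing about the test rows leaks in
train_norm = (train["income"] - mean) / std
test_norm = (test["income"] - mean) / std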

Performance Tips

Poorly written preprocessing can become a bottleneck, especially in iterative ML workflows.

# Avoid apply for element-wise operations (assumes a string column "name")
df["name_len"] = df["name"].str.len()  # Vectorized
# Bad: df["name_len"] = df["name"].apply(len)
 
# Work in NumPy when possible for large-scale math
arr = df["income"].to_numpy()
arr_scaled = (arr - arr.mean()) / arr.std()

  • Favor vectorized operations over loops or .apply (see the timing sketch below).
  • Minimize unnecessary copies of large DataFrames.
  • Drop unused columns early to save memory.
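
To see why the vectorized path matters, here's a rough micro-benchmark sketch (timings will vary by machine and data size):

import time
import numpy as np
import pandas as pd

big = pd.DataFrame({"income": np.random.default_rng(0).integers(20_000, 200_000, 1_000_000)})

start = time.perf_counter()
slow = big["income"].apply(lambda x: (x - 80_000) / 40_000)   # Python-level loop under the hood
apply_s = time.perf_counter() - start

start = time.perf_counter()
fast = (big["income"] - 80_000) / 40_000                      # vectorized in C
vec_s = time.perf_counter() - start

print(f"apply: {apply_s:.3f}s  vectorized: {vec_s:.3f}s")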

Mini Case: Preprocessing the Titanic Dataset

Let's put the cleaning and feature engineering ideas into action using the classic Titanic dataset.

import pandas as pd
import numpy as np
 
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
 
# Fill missing Age with median
df["Age"] = df["Age"].fillna(df["Age"].median())
# Fill missing Embarked with mode
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
# Drop Cabin entirely
df = df.drop(columns=["Cabin"])
 
# Fix types
df["Sex"] = df["Sex"].astype("category")
df["Embarked"] = df["Embarked"].astype("category")
 
# Create features
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = np.where(df["FamilySize"] == 1, 1, 0)
df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.")
df["Title"] = df["Title"].astype("category")
 
# Encode categories
df = pd.get_dummies(df, columns=["Sex", "Embarked", "Title"], drop_first=True)
 
# Scale numeric features
# (done on the full dataset here for brevity; in practice, fit scaling statistics
# on the training split only -- the Pipeline in the validation section handles this)
for col in ["Age", "Fare", "FamilySize"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

By working through this example, you've taken raw, inconsistent CSV data and turned it into a clean, well-typed dataset with thoughtfully engineered features. Along the way, you saw how Pandas makes it easy to fix missing values, reshape columns, and encode categories, while NumPy handles the quick, vectorized math that keeps transformations fast. These same techniques apply to almost any ML dataset you'll work with. Now that the data is structured and consistent, the natural next step is to explore it, looking for patterns, relationships, and potential signals that will guide how you model it later.

Exploratory Data Analysis (EDA) with Pandas

With the dataset cleaned and features engineered, it's time to get familiar with its structure and uncover patterns that might influence modeling decisions.

# Summary stats for numeric columns
print(df.describe())
 
# Category balance
print(df["IsAlone"].value_counts(normalize=True))
 
# Correlation with target
print(df.corr(numeric_only=True)["Survived"].sort_values(ascending=False))

import matplotlib.pyplot as plt
 
# Age histogram (Age was standardized above, so units are standard deviations)
df["Age"].hist(bins=20)
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Passenger Age Distribution")
plt.show()
 
# Survival rate by class
df.groupby("Pclass")["Survived"].mean().plot(kind="bar")
plt.ylabel("Survival Rate")
plt.title("Survival Rate by Class")
plt.show()

# Survival rate by class and sex (Sex_male is the dummy column from the encoding step)
survival_table = df.groupby(["Pclass", "Sex_male"])["Survived"].mean()
print(survival_table)
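
If you prefer a grid view of the same relationship, pd.crosstab is one way to pivot it into a table (this continues from the df above):

# Mean survival rate for each class/sex combination
print(pd.crosstab(df["Pclass"], df["Sex_male"], values=df["Survived"], aggfunc="mean"))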

Evaluation & Validation with Pandas, NumPy, and Scikit-learn

Exploration tells you what's in the data, but evaluation and validation ensure your model will generalize beyond it.

from sklearn.model_selection import train_test_split
 
X = df.drop(columns=["Survived"])
y = df["Survived"]
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
 
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Mean AUC: {scores.mean():.3f}")

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
 
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000))
])

from sklearn.metrics import roc_auc_score
 
pipe.fit(X_train, y_train)
y_pred_proba = pipe.predict_proba(X_test)[:, 1]
print(f"Test ROC-AUC: {roc_auc_score(y_test, y_pred_proba):.3f}")

Integrating Pandas and NumPy with ML Libraries

Most machine learning libraries work natively with NumPy arrays. Pandas DataFrames add labeling and convenience but often need to be converted.

# From DataFrame to NumPy
X_np = X_train.to_numpy()
y_np = y_train.to_numpy()
 
X_single = X_train[["Fare"]].to_numpy()  # Keeps 2D shape
numeric_cols = X_train.select_dtypes(include=["number"]).columns
X_numeric = X_train[numeric_cols].to_numpy()
# Boolean masking
mask = X_np[:, numeric_cols.get_loc("Age")] > 0
X_np_filtered = X_np[mask]
 
# Fancy indexing
cols_to_keep = [numeric_cols.get_loc(c) for c in ["Age", "Fare"]]
X_reduced = X_np[:, cols_to_keep]

from sklearn.preprocessing import OneHotEncoder

# Embarked was already dummy-encoded earlier, so use Pclass here -- it's still
# a raw, category-like column present in X_train
encoder = OneHotEncoder(sparse_output=True, handle_unknown="ignore")
encoded = encoder.fit_transform(X_train[["Pclass"]])
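
To inspect the encoded output with labels, one option is to densify it back into a small DataFrame (fine at this data size; this continues from the snippet above):

encoded_df = pd.DataFrame(
    encoded.toarray(),                        # sparse matrix -> dense array
    columns=encoder.get_feature_names_out(),  # e.g. Pclass_1, Pclass_2, Pclass_3
    index=X_train.index,
)
print(encoded_df.head())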

Light MLOps touches: make your work repeatable

A single successful notebook is less valuable than a process you can repeat next week on a new sample of data.

import numpy as np, random
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

from pathlib import Path
import json, hashlib
 
run_dir = Path("runs").joinpath("titanic_run_01")
run_dir.mkdir(parents=True, exist_ok=True)
 
Xy = X.copy()
Xy["Survived"] = y
Xy.to_parquet(run_dir / "titanic_model_table.parquet", index=False)  # requires pyarrow or fastparquet
 
h = hashlib.sha256((run_dir / "titanic_model_table.parquet").read_bytes()).hexdigest()
with open(run_dir / "data_fingerprint.json", "w") as f:
    json.dump({"sha256": h, "rows": len(Xy), "cols": Xy.shape[1]}, f, indent=2)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import joblib
 
# In a from-scratch pipeline, "cat" would point at raw categorical columns;
# here it picks up the boolean dummy columns created earlier, which is harmless
num = X.select_dtypes(include=["number"]).columns.tolist()
cat = X.select_dtypes(exclude=["number"]).columns.tolist()
 
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=True))]), cat)
])
 
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000, random_state=SEED))])
pipe.fit(X_train, y_train)
 
joblib.dump(pipe, run_dir / "pipeline.joblib")

import sys, platform, sklearn, pandas as pd
 
env = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "scikit_learn": sklearn.__version__,
}
with open(run_dir / "env.json", "w") as f:
    json.dump(env, f, indent=2)

from sklearn.metrics import roc_auc_score
 
y_pred_proba = pipe.predict_proba(X_test)[:, 1]
metrics = {
    "metric": "roc_auc",
    "value": float(roc_auc_score(y_test, y_pred_proba)),
    "rows_train": int(len(X_train)),
    "rows_test": int(len(X_test)),
    "seed": SEED,
    "data_sha256": h,
}
with open(run_dir / "metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

# Reload the saved pipeline and sanity-check it on a few held-out rows
loaded = joblib.load(run_dir / "pipeline.joblib")
sample = X_test.iloc[:5].copy()
print(loaded.predict(sample))

Bringing it all together

You started with raw concepts and ended with a repeatable workflow: loading data, cleaning it, engineering features, exploring patterns, validating without leakage, integrating with modeling libraries, and keeping your work reproducible. Along the way, Pandas gave you expressive tools for handling structured data, NumPy handled the fast math, and scikit-learn tied it all together. This isn't just about running code once; it's about creating a process you can trust and reuse. Whether your next dataset is another CSV or a feed from a production system, you now have a framework to get it model-ready with confidence.
