Feature Engineering Basics
Feature engineering is one of the most important steps in building effective machine learning models. It's the craft of turning raw, often messy data into meaningful inputs, called features, that give algorithms something substantial to learn from. Even the most sophisticated models can falter when the inputs are weak or irrelevant, while simpler models can outperform expectations when given the right set of features.
Think of it as the difference between cooking with whatever happens to be in the fridge and shopping thoughtfully for ingredients that suit the recipe. In the same way, the features you choose and refine can determine whether your model produces bland guesses or well-seasoned, accurate predictions. In this section, we'll explore what features are, the different types you'll encounter, and why these decisions matter. To make it concrete, we'll draw examples from the Titanic dataset.
Understanding features
In machine learning, features are the individual measurable properties of your data: the signals your model listens to when it tries to make sense of the world. In the Titanic dataset, a passenger's age, sex, and ticket class are all features.
Types of features
| Type | Description | Example (Titanic dataset) |
| --- | --- | --- |
| Categorical | Distinct groups or labels | Sex: male or female |
| Numerical | Measurable quantities | Age: passenger's age in years |
| Ordinal | Ranked categories where order matters | Pclass: passenger class (1, 2, 3) |
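To see how these types show up in practice, a quick sketch like the one below lists each column's dtype and the distinct values of the categorical ones. The titanic.csv path is an assumption; point it at wherever your copy of the data lives.
# Peek at how the feature types appear in pandas
import pandas as pd
df = pd.read_csv("titanic.csv")  # assumed path to your copy of the Titanic data
print(df.dtypes)                    # numerical columns vs. object (string) columns
print(df["Sex"].value_counts())     # categorical: a handful of distinct labels
print(df["Pclass"].value_counts())  # ordinal: 1, 2, 3 with a meaningful order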
From raw data to features
Here's an example from the Titanic dataset. At this point in your workflow, you've loaded and explored your data and are starting to create new features. The goal is to take a column that seems purely descriptive and make it predictive.
# Extract the passenger's title (e.g. "Mr", "Mrs", "Miss") from the Name column
titles = df["Name"].str.extract(r",\s*(?P<Title>[^.]+)\.")
df = df.join(titles)
Splitting out a passenger's title can reveal correlations with survival rates, while surnames can help identify family groups.
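If you also want the surname, a similar sketch works, still assuming the same df; the Surname and FamilyGroupSize columns here are purely illustrative names.
# Illustrative sketch: the surname is the text before the comma in Name
df["Surname"] = df["Name"].str.split(",").str[0].str.strip()
# Count how many passengers share each surname as a rough family-group signal
df["FamilyGroupSize"] = df.groupby("Surname")["Surname"].transform("count")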
Why feature choice matters
Once you have a collection of features, you'll want to check if some are redundant. Highly correlated features can confuse certain models and add unnecessary complexity.
# Correlation of each numeric feature with the Survived target
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix["Survived"].sort_values(ascending=False))
This quick correlation check can guide early pruning before you invest in further engineering.
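One simple follow-up, sketched below, is to flag pairs of numeric features whose absolute correlation exceeds some threshold and consider dropping one of each pair; the 0.9 cutoff is an arbitrary choice for illustration.
# Illustrative sketch: find one member of each highly correlated pair (|r| > 0.9)
import numpy as np
corr = df.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # candidate columns to prune before further engineering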
A practical process that actually works
Now you're entering the “systematic cleanup and creation” stage. The examples from here on switch to a housing dataset, where you're refining raw columns like YrSold, YearBuilt, and LotFrontage into forms that will actually help your model predict SalePrice.
# Create a new Age feature based on the year sold
housing["Age"] = housing["YrSold"] - housing["YearBuilt"]
# Fill missing LotFrontage values using neighborhood-specific medians
housing["LotFrontage"] = housing.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))
The Age feature is more directly related to property value than YearBuilt. Imputing LotFrontage with medians keeps the feature useful while respecting neighborhood differences.
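One edge case worth a quick check: if a neighborhood has no recorded LotFrontage values at all, its group median is itself NaN, so a global fallback like the sketch below keeps the column fully imputed.
# Sketch: fall back to the overall median for neighborhoods with no LotFrontage data
housing["LotFrontage"] = housing["LotFrontage"].fillna(housing["LotFrontage"].median())
print(housing["LotFrontage"].isna().sum())  # should now be 0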
Techniques for transforming and creating features
Here you've moved into the transformation stage, reshaping your features so algorithms can interpret them effectively.
# Create ratio of living area to lot size
housing["AreaRatio"] = housing["GrLivArea"] / housing["LotArea"]
# Standardize numerical features to have mean 0 and variance 1
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
housing[["LotArea", "GrLivArea"]] = scaler.fit_transform(housing[["LotArea", "GrLivArea"]])
# Apply log transform to reduce skew in SalePrice
import numpy as np
housing["SalePriceLog"] = np.log1p(housing["SalePrice"])
# Encode categorical values numerically
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
housing["Neighborhood_encoded"] = le.fit_transform(housing["Neighborhood"])
These transformations can make patterns more apparent and ensure different scales don't bias the model.
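One caveat: LabelEncoder assigns arbitrary integer codes, which implies an ordering that Neighborhood doesn't actually have. If that matters for your model, a one-hot encoding sketch like the one below is a common alternative; the generated column names are whatever pandas produces.
# Sketch: one-hot encode Neighborhood instead of assigning arbitrary integer codes
import pandas as pd
neighborhood_dummies = pd.get_dummies(housing["Neighborhood"], prefix="Neighborhood")
housing = pd.concat([housing.drop(columns=["Neighborhood"]), neighborhood_dummies], axis=1)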
Feature selection and dimensionality reduction
At this point, you've engineered and transformed features. Now it's time to decide which ones to keep and whether to reduce dimensionality.
# Model-based feature importance (X and y are assumed to be your prepared feature matrix and target)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X, y)
importances = model.feature_importances_
# Permutation importance
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
# Principal Component Analysis for dimensionality reduction
from sklearn.decomposition import PCA
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
These tools help you focus on the features that actually move the needle and cut down the rest, which can improve speed and generalization.
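The choice of n_components=10 above is arbitrary; a quick way to sanity-check it is to look at how much variance the components retain, as sketched below.
# Sketch: check how much variance the 10 principal components retain
print(pca.explained_variance_ratio_)        # per-component share of variance
print(pca.explained_variance_ratio_.sum())  # total variance kept by the projection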
Evaluating feature importance
You're now in the analysis phase: understanding what's driving your model's predictions. This helps with trust, refinement, and communication with stakeholders.
# Tree-based model feature importance
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor()
model.fit(X, y)
print(model.feature_importances_)
Quickly see which features the model leans on most.
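The raw array can be hard to read on its own; a small sketch like this, assuming X is a DataFrame with named columns, pairs each score with its feature name.
# Sketch: pair each importance score with its column name and sort
import pandas as pd
importance_series = pd.Series(model.feature_importances_, index=X.columns)
print(importance_series.sort_values(ascending=False).head(10))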
# SHAP values for model-agnostic explanation
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
SHAP shows not just how much a feature matters, but the direction of its influence.
# Permutation importance
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
print(result.importances_mean)
This approach tests each feature's contribution by seeing how performance changes when it's shuffled.
The takeaway: importance scores are clues, not final judgments. Combine them with your domain knowledge and validation results before making changes.
Bringing it all together
By now, you've seen how feature engineering is less about blindly applying transformations and more about making thoughtful choices at each stage of the journey. You start by understanding what features are and how they behave, then move on to cleaning and crafting them, transforming them into model-friendly shapes, and finally deciding which ones deserve a place in the final dataset.
Every dataset is different, but the mindset remains the same: stay curious, test your assumptions, and let both your domain knowledge and your metrics guide you. With practice, you'll start seeing feature engineering not as a chore, but as the creative backbone of your machine learning work.