Feature engineering is one of the most important steps in building effective machine learning models. It's the craft of turning raw, often messy data into meaningful inputs (features) that give algorithms something substantial to learn from. Even the most sophisticated models can falter if the inputs are weak or irrelevant, while simpler models can often outperform expectations when given the right set of features.
Think of it as the difference between cooking with whatever happens to be in the fridge and shopping thoughtfully for ingredients that suit the recipe. In the same way, the features you choose and refine can determine whether your model produces bland guesses or well-seasoned, accurate predictions. In this section, we'll explore what features are, the different types you'll encounter, and why these decisions matter. To make it concrete, we'll draw examples from the Titanic dataset and, in the later sections, from a housing-price dataset.
Understanding features
In machine learning, features are the measurable signals your model listens to when it tries to make sense of the world.
Types of features
| Type | Description | Example (Titanic dataset) |
|---|---|---|
| Categorical | Distinct groups or labels | Sex, Embarked |
| Numerical | Measurable quantities | Age, Fare |
| Ordinal | Ranked categories where order matters | Pclass |
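As a rough illustration, the three types map onto pandas dtypes. The sketch below uses a tiny hand-made frame with Titanic-style column names (Sex, Fare, Pclass); the values are made up for illustration and are not rows from the real dataset.

```python
import pandas as pd

# Toy rows with Titanic-style column names; values are illustrative only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.28, 8.05],
    "Pclass": [3, 1, 3],
})

# Categorical: unordered labels.
toy["Sex"] = toy["Sex"].astype("category")

# Ordinal: categories with an explicit order (3rd < 2nd < 1st class).
pclass_order = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
toy["Pclass"] = toy["Pclass"].astype(pclass_order)

# Numerical: Fare stays a float column.
print(toy.dtypes)
print(toy["Pclass"].max())  # highest-ranked class present in the toy data
```

Declaring the order explicitly matters because comparisons and min/max on an ordered categorical follow the ranking you gave, not the raw integer values.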
From raw data to features
Here's an example from the Titanic dataset. At this point in your workflow, you've loaded and explored your data and are starting to create new features. The goal is to take a column that seems purely descriptive and make it predictive.
```python
# Extract passenger title from Titanic Name
# (names look like "Surname, Title. Given names")
titles = df["Name"].str.extract(r",\s*(?P<Title>[^.]+)\.")
df = df.join(titles)
```

Splitting out a passenger's title can reveal correlations with survival rates, while surnames (the text before the comma) can help identify family groups.
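The surname idea can be sketched the same way. The example below is a minimal illustration on made-up names in the Titanic format; the SurnameCount column is a hypothetical helper feature, and a shared surname is only a hint of a family group, not proof.

```python
import pandas as pd

# Made-up rows in the Titanic "Surname, Title. Given names" format.
df = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Smith, Mrs. Jane",
        "Doe, Miss. Anna",
    ]
})

# Surname is everything before the first comma.
df["Surname"] = df["Name"].str.split(",", n=1).str[0].str.strip()

# Title sits between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical helper feature: how many passengers share each surname.
df["SurnameCount"] = df.groupby("Surname")["Surname"].transform("count")

print(df[["Surname", "Title", "SurnameCount"]])
```

On real data you would probably combine the surname with other columns, such as ticket number or fare, before treating passengers as one family.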
Why feature choice matters
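Once you have a collection of features, you'll want to check whether some are redundant. Highly correlated features can destabilize linear models (multicollinearity makes coefficients hard to interpret) and add unnecessary complexity. The example below switches to a housing-price dataset, loaded as housing, whose target column is SalePrice.

```python
# Rank numeric columns by their correlation with the target
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["SalePrice"].sort_values(ascending=False))
```

This quick check ranks features by their correlation with the target. Scanning the rest of corr_matrix for feature pairs that are highly correlated with each other can guide early pruning before you invest in further engineering.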
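Redundancy is about pairs of features rather than about the target, so it can help to list the pairs directly. The following is a minimal sketch on synthetic data; the helper name correlated_pairs and the 0.9 threshold are assumptions chosen for illustration, not a standard.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9):
    """Return (feature_a, feature_b, |r|) for numeric pairs above threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]


# Synthetic example: "b" is a scaled copy of "a", "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": rng.normal(size=200)})

print(correlated_pairs(demo))
```

Which member of a flagged pair to drop is a judgment call; keeping the one that is easier to explain or cheaper to collect is a common choice.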
A practical process that actually works
Now you're entering the “systematic cleanup and creation” stage. Here you're refining raw columns into forms that will actually help your model.
```python
# Create a new Age feature based on the year sold
housing["Age"] = housing["YrSold"] - housing["YearBuilt"]

# Fill missing LotFrontage values using neighborhood-specific medians
housing["LotFrontage"] = housing.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median())
)
```

The Age feature is more directly related to property value than YearBuilt. Imputing LotFrontage with medians keeps the feature useful while respecting neighborhood differences.
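One edge case worth knowing about: if a neighborhood has no observed LotFrontage at all, its median is also missing and the group-wise fill leaves those rows empty. The sketch below shows one possible fallback on made-up rows; using the overall median as the second step is an assumption, and other fallbacks are equally reasonable.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the housing example.
housing = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "LotFrontage": [60.0, np.nan, 80.0, 40.0, np.nan, np.nan],
})

# Step 1: fill from the neighborhood median where one exists.
group_median = housing.groupby("Neighborhood")["LotFrontage"].transform("median")
filled = housing["LotFrontage"].fillna(group_median)

# Step 2: neighborhood "C" has no observed values, so its median is NaN.
# Fall back to the overall median of the observed values.
filled = filled.fillna(housing["LotFrontage"].median())

housing["LotFrontage"] = filled
print(housing)
```

Checking for remaining missing values after imputation is a cheap way to catch this case before it reaches the model.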
Techniques for transforming and creating features
Here you've moved into the transformation stage: reshaping your features so algorithms can interpret them effectively.
```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Create ratio of living area to lot size
housing["AreaRatio"] = housing["GrLivArea"] / housing["LotArea"]

# Standardize numerical features to have mean 0 and variance 1
scaler = StandardScaler()
housing[["LotArea", "GrLivArea"]] = scaler.fit_transform(housing[["LotArea", "GrLivArea"]])

# Apply log transform to reduce skew in SalePrice
housing["SalePriceLog"] = np.log1p(housing["SalePrice"])

# Encode categorical values numerically
# (integer codes imply an order: usually fine for tree models, risky for linear ones)
le = LabelEncoder()
housing["Neighborhood_encoded"] = le.fit_transform(housing["Neighborhood"])
```

These transformations can make patterns more apparent and keep features with large numeric ranges from dominating scale-sensitive models.
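Two caveats apply to the transformations above. Fitting a scaler on the full dataset lets information from the test rows leak into training, and integer codes give a nominal column like Neighborhood an order it does not have. The sketch below shows one common alternative using scikit-learn's ColumnTransformer with one-hot encoding, fitted on the training split only. The data is synthetic and the split sizes are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data; values are random, not real.
rng = np.random.default_rng(0)
housing = pd.DataFrame({
    "LotArea": rng.uniform(2000, 20000, size=100),
    "GrLivArea": rng.uniform(500, 4000, size=100),
    "Neighborhood": rng.choice(["A", "B", "C"], size=100),
})

train, test = train_test_split(housing, test_size=0.25, random_state=0)

preprocess = ColumnTransformer(
    [
        # Scale numeric columns using statistics from the training split only.
        ("num", StandardScaler(), ["LotArea", "GrLivArea"]),
        # One-hot encode the nominal column; unseen categories become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
    ],
    sparse_threshold=0,  # always return a dense array for easy inspection
)

X_train = preprocess.fit_transform(train)
X_test = preprocess.transform(test)  # reuses the fitted statistics

print(X_train.shape, X_test.shape)
```

Wrapping the preprocessing in a single object also makes it easy to drop into a scikit-learn Pipeline, so cross-validation refits it on each fold.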
Feature selection and dimensionality reduction
At this point, you've engineered and transformed features. Now it's time to decide which ones to keep and whether to reduce dimensionality.
```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# X is assumed to be the numeric feature matrix and y the target

# Model-based feature importance
model = RandomForestRegressor()
model.fit(X, y)
importances = model.feature_importances_

# Permutation importance
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)

# Principal Component Analysis for dimensionality reduction
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)
```

These tools help you focus on the features that actually move the needle and cut down the rest, which can improve speed and generalization.
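A fixed component count such as 10 is a guess; an alternative is to let the explained variance decide. The sketch below is an illustration on synthetic data, assuming a 95% variance target (an arbitrary choice). It also standardizes first, since PCA is sensitive to feature scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 columns that are noisy mixes of 3 hidden factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(300, 3))
X = factors @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(300, 10))

# A float n_components asks PCA to keep enough components to explain
# that fraction of the variance (here 95%).
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = reducer.fit_transform(X)

pca = reducer.named_steps["pca"]
print(X_reduced.shape)
print(pca.explained_variance_ratio_.cumsum())
```

The cumulative ratios show how quickly the variance is captured, which is often more informative than the final component count alone.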
Evaluating feature importance
You're now in the analysis phase: understanding what's driving your model's predictions. This helps with trust, refinement, and communication with stakeholders.
```python
# Tree-based model feature importance
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor()
model.fit(X, y)
print(model.feature_importances_)
```

Quickly see which features the model leans on most.
```python
# SHAP values for per-feature explanations
# (TreeExplainer is specific to tree-based models such as the one above)
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
```

SHAP shows not just how much a feature matters, but the direction of its influence.
```python
# Permutation importance
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
print(result.importances_mean)
```

This approach tests each feature's contribution by seeing how performance changes when its values are shuffled.
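Permutation importance tends to be more trustworthy when it is measured on data the model has not seen, because scores on the training set can reward features the model merely memorized. The sketch below illustrates this on synthetic data with a held-out validation split; the coefficients and split are assumptions made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: y depends strongly on column 0, moderately on column 1,
# and not at all on column 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column of the validation set and measure the drop in R^2.
result = permutation_importance(
    model, X_val, y_val, n_repeats=10, random_state=0
)
for idx in result.importances_mean.argsort()[::-1]:
    print(
        f"feature {idx}: {result.importances_mean[idx]:.3f} "
        f"+/- {result.importances_std[idx]:.3f}"
    )
```

Reporting the standard deviation across repeats alongside the mean helps separate features that matter from ones whose score is within the noise.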
The takeaway: importance scores are clues, not final judgments. Combine them with your domain knowledge and validation results before making changes.
Bringing it all together
By now, you've seen how feature engineering is less about blindly applying transformations and more about making thoughtful choices at each stage of the journey. You start by understanding what features are and how they behave, then move on to cleaning and crafting them, transforming them into model-friendly shapes, and finally deciding which ones deserve a place in the final dataset.
Every dataset is different, but the mindset remains the same: stay curious, test your assumptions, and let both your domain knowledge and your metrics guide you. With practice, you'll start seeing feature engineering not as a chore, but as the creative backbone of your machine learning work.