Feature Engineering for Materials Science: From Experimental Logs to ML-Ready Data

2025-06-26·1 min read·Machine Learning

The Real Bottleneck

Most materials ML tutorials start with a clean CSV and jump straight to model fitting. In reality, your data lives in lab notebooks, instrument exports, and Excel sheets. Feature engineering is where 80% of the work happens.


import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

From Raw Data to Feature Matrix

Typical raw data from a biomaterials experiment:

Sample	Fiber_dia(nm)	Orientation(deg)	Crosslink(%)	Treatment	Youngs_mod(GPa)
--------	--------------	-----------------	-------------	-----------	-----------------
A1	45	12	2.5	methanol	2.8
A2	52	22	1.8	ethanol	1.9


df = pd.DataFrame({
    "fiber_diameter_nm": [45, 52, 48, 55, 50, 47],
    "orientation_std_deg": [12, 22, 8, 18, 15, 20],
    "crosslink_density_pct": [2.5, 1.8, 3.2, 2.0, 4.1, 1.5],
    "treatment": ["methanol", "ethanol", "methanol", "water", "methanol", "ethanol"],
    "youngs_modulus_GPa": [2.8, 1.9, 3.5, 2.3, 4.0, 1.7],
})

Encoding Categorical Variables


encoder = OneHotEncoder(sparse_output=False, drop="first")
treatment_encoded = encoder.fit_transform(df[["treatment"]])
treatment_cols = encoder.get_feature_names_out(["treatment"])

df_encoded = pd.concat([
    df.drop("treatment", axis=1),
    pd.DataFrame(treatment_encoded, columns=treatment_cols)
], axis=1)

Domain-Driven Interaction Features

This is where materials knowledge adds value:


def engineer_materials_features(df):
    df = df.copy()
    # Crosslink x fiber diameter synergy
    df["crosslink_x_diameter"] = (
        df["crosslink_density_pct"] * df["fiber_diameter_nm"]
    )
    # Orientation disorder vs diameter ratio
    df["orient_diameter_ratio"] = (
        df["orientation_std_deg"] / (df["fiber_diameter_nm"] + 1e-6)
    )
    # Log-transform skewed features
    for col in ["crosslink_density_pct", "fiber_diameter_nm"]:
        if col in df.columns:
            df[f"log_{col}"] = np.log1p(df[col])
    return df

Handling Missing Values

For small datasets, avoid blind mean imputation:


def smart_impute(df, group_col="treatment"):
    df = df.copy()
    numeric_cols = df.select_dtypes(include=np.number).columns
    for col in numeric_cols:
        if df[col].isna().sum() == 0: continue
        if group_col in df.columns:
            df[col] = df.groupby(group_col)[col].transform(
                lambda x: x.fillna(x.median())
            )
        df[col] = df[col].fillna(df[col].median())
    return df

Feature Selection for Small Datasets

With <100 samples, avoid automated selection:

Correlation analysis: drop |r| > 0.95
Variance threshold: drop near-constant features
Domain curation: only physically justified features


corr_matrix = df_feat[feature_cols].corr().abs()
upper_tri = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)
to_drop = [col for col in upper_tri.columns 
           if any(upper_tri[col] > 0.95)]

Checklist

All features numeric
No target leakage
Missing values handled
Categoricals encoded
Interaction features have physical justification

References

Kuhn, M. & Johnson, K. (2019). Feature Engineering and Selection. CRC Press.
Ramprasad, R. et al. (2017). Machine learning in materials informatics. npj Computational Materials, 3, 54.