Learning Outcomes
This article explains how cross-validation, regularization, and feature selection are applied in machine learning and simulation for CFA Level 2 quantitative analysis. It covers:
- The purpose of cross-validation in estimating out-of-sample performance, guiding model selection, and identifying overfitting in portfolio risk, return forecasting, and classification tasks.
- How common regularization techniques (Ridge, Lasso, and Elastic Net) control model complexity, shrink or eliminate coefficients, and improve the robustness of regression and classification models used in financial contexts.
- The distinctions among filter, wrapper, and embedded feature selection methods, emphasizing how they reduce dimensionality, improve interpretability, and interact with cross-validation to avoid information leakage.
- How feature engineering for numerical and text-based data can strengthen predictive power while maintaining a disciplined validation process.
- The exam-relevant logic for comparing candidate models, interpreting penalty terms, and recognizing when a model is likely overfit or underfit.
- Common exam pitfalls, such as data snooping, hyperparameter over-tuning, and incorrect use of validation or test sets.
CFA Level 2 Syllabus
For the CFA Level 2 exam, you are required to understand how machine learning models are validated and improved, with a focus on the following syllabus points:
- Describing the role of cross-validation in assessing model performance and preventing overfitting
- Explaining regularization techniques that limit model complexity and stabilize estimates
- Applying feature selection and engineering methods for structured and text-based financial data
- Evaluating the impact of these methods on model accuracy and robustness in financial modeling tasks
Test Your Knowledge
Attempt these questions before reading this article. If you find some difficult or cannot remember the answers, remember to look more closely at that area during your revision.
A quant analyst at an asset manager is building several machine learning models to predict next-month excess returns on a universe of 800 stocks. She has 10 years of monthly data and wants to avoid overfitting while selecting hyperparameters for models that include penalized regression and tree-based methods.
1. To obtain an unbiased estimate of the final model’s out-of-sample performance, the analyst should most appropriately:
- Use k-fold cross-validation on the full dataset and report the average cross-validation error.
- Split the data into training, validation, and test sets, using cross-validation only within the training set for tuning.
- Tune hyperparameters by minimizing error on the test set and then refit the model on all data.
- Randomly shuffle all observations and select the model with the highest in-sample R².
2. When tuning the Lasso penalty parameter λ with cross-validation, the analyst is primarily trading off:
- Higher in-sample accuracy versus lower interpretability.
- Bias versus variance in out-of-sample predictions.
- Linear versus non-linear relationships in the features.
- Underfitting versus multicollinearity in the training data.
3. Suppose the analyst runs several models and notices that one specification has extremely low training error but much higher validation error than the others. The most appropriate conclusion is that the model:
- Is underfit and needs more features.
- Has data leakage from the test set into the training process.
- Is overfit and will likely perform poorly on new data.
- Has been regularized too strongly by a large penalty term.
A credit-risk team is developing a bankruptcy prediction model using both structured financial ratios and text features from annual reports. They start with 400 candidate variables, including hundreds of text tokens, and want a parsimonious model that generalizes well.
4. If the team first removes variables with near-zero variance and then drops tokens with very low correlation to default, they are primarily using:
- Wrapper methods for feature selection.
- Embedded methods implemented inside the classifier.
- Filter methods applied before model training.
- Dimensionality reduction using principal components.
5. The team then estimates a logistic regression with an Elastic Net penalty. A key advantage of Elastic Net in this setting is that it:
- Ensures all coefficients are exactly zero or one, improving interpretability.
- Uses only L1 regularization, which is ideal when predictors are highly correlated.
- Combines L1 and L2 penalties, handling groups of correlated predictors better than pure Lasso.
- Guarantees the model has the lowest possible training error.
6. During development, the team mistakenly runs Lasso on the full dataset to select variables and then evaluates model performance using cross-validation on that same data. The main issue with this workflow is that it:
- Introduces multicollinearity into the selected model.
- Underestimates out-of-sample error due to information leakage.
- Over-penalizes text-based features relative to numeric features.
- Violates the assumptions needed to apply logistic regression.
Introduction
Machine learning models often process large, complex data sets with many potential predictive variables. Without care, models may simply memorize the training data (“overfitting”), leading to weak performance on new data. This article covers core techniques that test, constrain, and refine machine learning models for reliable predictions in the CFA exam context: cross-validation, regularization, and feature selection.
In the Level 2 curriculum, these tools appear both in the dedicated machine learning reading and in readings on backtesting and simulation. In practice, they underpin applications such as:
- Forecasting equity or bond returns
- Classifying borrowers as likely defaulters or non-defaulters
- Estimating portfolio risk measures, including those derived from Monte Carlo simulations
The common goal is to obtain models that generalize: they should perform well not only in-sample (on the data used to fit them) but also out-of-sample (on genuinely new data). Cross-validation provides an estimate of this generalization performance, regularization controls model complexity, and feature selection ensures that only informative variables are used.
Key Term: overfitting
Overfitting is a modeling error in which a model learns non-generalizable patterns (noise) from the training data, leading to poor out-of-sample predictions.
Key Term: underfitting
Underfitting occurs when a model is too simple to capture important relationships in the data, resulting in high error on both training and validation sets.
Cross-Validation
Most machine learning algorithms can fit the historical training data extremely well. However, if a model has too many parameters relative to the amount of useful variation in the data, it may learn random noise or idiosyncrasies in the sample. This reduces the model’s accuracy for new cases (out-of-sample generalization).
Key Term: cross-validation
Cross-validation is a procedure where the available data is repeatedly split into training and validation sets to assess out-of-sample model performance and select hyperparameters.
Cross-validation is essential for:
- Assessing how well a model is likely to predict new, unseen data
- Selecting model hyperparameters and tuning complexity
- Avoiding selection of models that are “lucky fits” to the training sample
The most common form is k-fold cross-validation: the data are split into k equal parts (folds); for each of k rounds, the model is trained on k − 1 parts and validated on the remaining part. The validation results are then averaged across the k rounds.
Key Term: k-fold cross-validation
k-fold cross-validation partitions the data into k folds, trains on k − 1 folds, validates on the remaining fold, and repeats this process k times, averaging the performance.
Key Term: hyperparameter
A hyperparameter is a model setting chosen before training (such as the penalty strength in regularization) that is not directly estimated from the data and is typically tuned using cross-validation.
Cross-validation is closely related to the familiar finance ideas of using an out-of-sample period in a backtest or holding out a “test window” in a historical simulation. The key difference is that cross-validation systematically rotates the validation role among subsets of the data to obtain a more stable estimate of generalization error.
Key Term: train–validation–test split
A train–validation–test split divides data into a training set used to fit models, a validation set used for tuning and model selection, and a final test set used once for unbiased performance evaluation.
In practice, an effective workflow is:
- Split the data into training and test sets.
- Within the training set, use k-fold cross-validation to choose model type and hyperparameters.
- After selecting the final model, evaluate it once on the test set to obtain an unbiased out-of-sample performance measure.
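The following minimal sketch (using scikit-learn and simulated placeholder data, not an official curriculum example) shows this workflow: a held-out test set, 5-fold cross-validation inside the training set to tune a Lasso penalty, and a single final evaluation on the test set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Simulated stand-in for a return-forecasting dataset (hypothetical)
X, y = make_regression(n_samples=1200, n_features=50, noise=10.0, random_state=0)

# Step 1: hold out a test set that is touched only once, at the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: tune the penalty strength with 5-fold CV inside the training set only
search = GridSearchCV(
    estimator=Lasso(max_iter=10_000),
    param_grid={"alpha": np.logspace(-3, 1, 20)},  # candidate penalty values
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# Step 3: evaluate the selected model once on the untouched test set
print("Selected penalty:", search.best_params_["alpha"])
print("Test-set R^2:", search.best_estimator_.score(X_test, y_test))
```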
Worked Example 1.1
A financial analyst is building a default prediction model for loans with a dataset of 5,000 observations. How should they use cross-validation to choose among several machine learning models?
Answer:
The analyst could set k = 5 for 5-fold cross-validation. Each model is trained on 4,000 loans and validated on the remaining 1,000, repeating this for each of the five folds. The model with the best average validation performance (e.g., lowest misclassification rate or highest area under the ROC curve) is selected, minimizing overfitting risk.
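A minimal sketch of this comparison is shown below, assuming scikit-learn and a simulated loan dataset in place of the analyst's real data; the two candidate models are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Simulated stand-in for 5,000 loans with a relatively rare default outcome
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=1
)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "boosted trees": GradientBoostingClassifier(random_state=1),
}
for name, model in candidates.items():
    # 5-fold cross-validated area under the ROC curve for each candidate
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean CV AUC = {auc.mean():.3f}")
```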
Cross-validation, bias–variance, and model complexity
Key Term: model complexity
Model complexity refers to the flexibility of a model, often related to the number of parameters or the functional form, which affects its ability to fit fine-grained patterns.
More complex models:
- Typically achieve lower training error (they can fit more patterns)
- Are at higher risk of overfitting, leading to higher validation and test errors
Simpler models:
- May have higher training error
- Often generalize better if the data set is limited or noisy
Cross-validation helps locate the “sweet spot” that balances this bias–variance trade-off by choosing a model (including its hyperparameters) that minimizes validation error rather than training error.
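The sketch below illustrates this trade-off on simulated data (not a curriculum example): training error keeps falling as polynomial degree rises, while validation error stops improving once the model becomes more complex than the underlying (quadratic) data-generating process.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = 1.5 * X[:, 0] - 0.5 * X[:, 0] ** 2 + rng.normal(scale=1.0, size=200)

# Model complexity is varied through the polynomial degree
model = make_pipeline(PolynomialFeatures(), LinearRegression())
degrees = np.arange(1, 10)
train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    cv=5,
    scoring="neg_mean_squared_error",
)
for d, tr, va in zip(degrees, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"degree {d}: training MSE {tr:.2f}, validation MSE {va:.2f}")
```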
Time-series considerations
For time-series models (e.g., forecasting returns or macro variables), random shuffling of data for k-fold cross-validation can introduce look-ahead bias because future observations may influence the model used to predict the past.
In such cases, more appropriate variants include:
- Rolling (walk-forward) validation: Train on an initial window (e.g., years 1–5), validate on the next period (year 6), then roll the window forward.
- Expanding window: Train on data up to time t, validate on period t + 1, then expand the training set as time progresses.
These methods respect the time ordering and mimic real-world forecasting, where only past data are available.
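A minimal sketch of walk-forward validation, assuming scikit-learn's TimeSeriesSplit and a simulated monthly dataset, is shown below; each fold trains only on observations that precede the validation period.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 5))                     # e.g., 120 months of 5 predictors
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=120)

# Expanding-window splits: training indices always precede validation indices
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X), start=1):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    mse = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
    print(f"fold {fold}: train through obs {train_idx[-1]}, validation MSE {mse:.3f}")
```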
Information leakage and data snooping
Key Term: information leakage
Information leakage occurs when data from the validation or test sets are inadvertently used during model training, leading to overly optimistic performance estimates.
Key Term: data snooping
Data snooping is the improper reuse of the same dataset to both develop and evaluate models or trading strategies, inflating apparent performance due to chance discoveries.
Common sources of leakage include:
- Performing feature selection or scaling using the full dataset, then applying cross-validation
- Engineering variables (e.g., z-scores) using means and standard deviations computed on all observations
- Using future information (such as realized returns) to construct predictors in a historical backtest
Exam Warning
A frequent error is accidentally “leaking” information between training and validation sets. Never use validation (or test) data to fit the model, select features, or engineer variables—this causes overoptimistic accuracy estimates and is conceptually similar to data-snooped trading rules in backtesting.
Regularization
Adding more features or parameters typically increases model complexity. More complex models can always improve the fit to training data, but this does not guarantee good predictions for new data. Regularization forces the model to be simpler by penalizing large coefficients.
Key Term: regularization
Regularization is a technique used to constrain or shrink the parameter estimates of a model by adding a penalty for complexity to the loss function, reducing overfitting.
Key Term: penalized regression
Penalized regression refers to regression models (linear or logistic) that include a penalty term in the loss function for large or numerous coefficients, such as Lasso (L1), Ridge (L2), or Elastic Net.
In a standard linear regression, we minimize the sum of squared errors (SSE). With regularization, we instead minimize:
SSE + λ × Penalty(β)
where λ is a hyperparameter controlling the strength of the penalty and β denotes the vector of regression coefficients.
Key Term: regularization parameter (λ)
The regularization parameter λ scales the penalty term in a regularized model; a higher λ imposes stronger shrinkage on the coefficients, increasing bias but reducing variance.
The main forms are:
- Ridge regression (L2 penalty): Penalizes the sum of squared coefficients, shrinking them towards zero but rarely reaching exactly zero.
- Lasso regression (L1 penalty): Penalizes the sum of absolute coefficients, setting many of them exactly to zero (automatic feature selection).
- Elastic Net: Combines both L1 and L2 penalties.
Key Term: Ridge regression
Ridge regression minimizes SSE plus λΣβ² (an L2 penalty), shrinking coefficients toward zero and stabilizing estimates, particularly when predictors are correlated.
Key Term: Lasso regression
Lasso regression minimizes SSE plus λΣ|β| (an L1 penalty), shrinking some coefficients exactly to zero and thus performing variable selection.
Key Term: Elastic Net
Elastic Net uses a weighted combination of L1 and L2 penalties, balancing variable selection (L1) with coefficient stability for groups of correlated predictors (L2).
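The sketch below, using simulated data in which only 3 of 30 predictors matter, illustrates the practical difference: Ridge keeps every coefficient (shrunken), while Lasso and Elastic Net typically zero out most of the uninformative ones. The penalty strengths shown are illustrative assumptions, not tuned values.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 30))
true_beta = np.zeros(30)
true_beta[:3] = [2.0, -1.5, 1.0]          # only 3 of 30 predictors are informative
y = X @ true_beta + rng.normal(scale=1.0, size=500)

for name, model in [
    ("Ridge (L2)", Ridge(alpha=1.0)),
    ("Lasso (L1)", Lasso(alpha=0.1)),
    ("Elastic Net (L1 + L2)", ElasticNet(alpha=0.1, l1_ratio=0.5)),
]:
    coefs = model.fit(X, y).coef_
    print(f"{name}: {int(np.sum(np.abs(coefs) > 1e-6))} non-zero coefficients")
```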
Choosing λ with cross-validation
The regularization parameter λ is not known ex ante. Too small a λ yields a model close to ordinary least squares (risking overfitting), while too large a λ yields a very simple, underfit model. Cross-validation is used to:
- Evaluate a grid of candidate λ values
- Compute the validation error for each λ
- Select the λ that minimizes average validation error (or a slightly larger λ that gives a simpler model with similar performance)
This process is directly tested in the curriculum: candidates are expected to interpret why a particular λ was chosen and how it affects the number and magnitude of non-zero coefficients.
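A minimal sketch of this tuning loop, assuming scikit-learn's LassoCV (whose alpha argument corresponds to λ in the reading) and simulated data, is shown below.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Simulated data: 50 candidate predictors, only 5 genuinely informative
X, y = make_regression(
    n_samples=400, n_features=50, n_informative=5, noise=5.0, random_state=3
)

# Evaluate a grid of candidate penalty values with 5-fold cross-validation
lasso_cv = LassoCV(alphas=np.logspace(-3, 1, 50), cv=5, max_iter=10_000).fit(X, y)

print("Selected penalty (lambda):", lasso_cv.alpha_)
print("Non-zero coefficients:", int(np.sum(lasso_cv.coef_ != 0)), "of", X.shape[1])
```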
Worked Example 1.2
Suppose a model includes 50 economic indicators, but only a small subset are truly predictive of returns. Which regularization technique helps identify and remove unnecessary features?
Answer:
Lasso regression (L1 regularization) will shrink some coefficients to zero, automatically selecting only the most relevant features and excluding the unimportant ones during model fitting. In practice, the analyst would use cross-validation to choose λ so that the number of selected indicators balances interpretability and validation performance.
Regularization and classical model selection criteria
Even before machine learning, econometricians used criteria that penalize model complexity:
Key Term: adjusted R²
Adjusted R² modifies R² by penalizing the inclusion of additional predictors, increasing only if a new variable improves model fit more than expected by chance.
Key Term: Akaike information criterion (AIC)
AIC is a model selection statistic that balances goodness of fit against the number of parameters, with lower AIC indicating a preferred model for forecasting.
Key Term: Bayesian information criterion (BIC)
BIC is similar to AIC but applies a stronger penalty for additional parameters, favoring more parsimonious models when the goal is explaining the data.
In the curriculum, you may be asked to compare models using adjusted R², AIC, or BIC, or to explain why a model with a slightly lower R² but a lower AIC/BIC might be preferred. Conceptually, these criteria play a similar role to regularization: they discourage over-complex models by penalizing unnecessary parameters.
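The sketch below, using statsmodels and simulated data with hypothetical variable names, shows how two nested specifications can be compared on adjusted R², AIC, and BIC; the larger model adds a pure-noise predictor that these criteria will typically penalize.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 300
x1, x2, x3_noise = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.8 * x1 + 0.3 * x2 + rng.normal(scale=1.0, size=n)

# Two nested specifications: the larger one adds a pure-noise predictor
small = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
large = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x3_noise]))).fit()

for label, res in [("2-variable model", small), ("3-variable model", large)]:
    print(f"{label}: adj R^2 {res.rsquared_adj:.4f}, "
          f"AIC {res.aic:.1f}, BIC {res.bic:.1f}")
```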
Feature Selection and Engineering
Feature selection improves model generalization by keeping only variables that provide real predictive value. Reducing the feature count also decreases computational demands and improves interpretability.
Key Term: feature selection
Feature selection is the process of identifying and retaining only the most informative inputs (features) from the available dataset for model training, based on their relationship to the target variable.
Key Term: feature engineering
Feature engineering consists of activities performed to create, transform, or extract new attributes from raw data to improve model performance.
Feature selection methods include:
- Filter methods: Use statistical criteria to test the relevance of each feature to the target (e.g., correlation, chi-squared test).
- Wrapper methods: Evaluate subsets of features by training models and comparing results (e.g., recursive feature elimination).
- Embedded methods: Integrate feature selection into the model training process itself (e.g., Lasso regression).
Key Term: filter method
A filter method ranks or screens features using simple statistics (such as correlation or mutual information with the target) before any model is trained.
Key Term: wrapper method
A wrapper method searches over subsets of features by repeatedly fitting a model and evaluating performance, treating feature selection as a search problem around the learning algorithm.
Key Term: embedded method
An embedded method performs feature selection as part of model training, where the learning algorithm itself decides which features to retain (for example, through regularization or tree-based split criteria).
Filter methods are computationally cheap and useful for initial screening (e.g., removing variables with near-zero variance or extremely low association with the target). Wrapper methods are more expensive and vulnerable to overfitting if they do not use cross-validation, but can explore complex interactions among features. Embedded methods, such as Lasso and Elastic Net, tend to scale better to high-dimensional data and are widely used in financial machine learning.
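The sketch below applies one method from each family to the same simulated credit dataset (hypothetical data and settings), using scikit-learn: a mutual-information filter, recursive feature elimination as a wrapper, and an L1-penalized logistic regression as an embedded selector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Simulated credit data: 40 candidate features, 6 genuinely informative
X, y = make_classification(n_samples=1000, n_features=40, n_informative=6, random_state=5)

# Filter: screen features by mutual information with the default flag, before modeling
filter_idx = SelectKBest(mutual_info_classif, k=10).fit(X, y).get_support(indices=True)

# Wrapper: recursive feature elimination around a logistic regression
rfe = RFE(LogisticRegression(max_iter=2000), n_features_to_select=10).fit(X, y)

# Embedded: the L1 penalty zeroes out weak predictors during training itself
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

print("Filter keeps features:", filter_idx)
print("Wrapper keeps features:", np.where(rfe.support_)[0])
print("Embedded keeps features:", np.where(l1_model.coef_[0] != 0)[0])
```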
Feature engineering for structured data
From the structured-data standpoint (financial ratios, macro variables, market data), feature engineering may involve:
- Scaling variables (e.g., standardizing to zero mean and unit variance) so that regularization treats coefficients comparably
- Creating interaction terms (e.g., leverage × volatility) if economically justified
- Transforming skewed variables (e.g., log-transforming market capitalization)
- Encoding categorical variables as binary indicators
Key Term: one-hot encoding
One-hot encoding converts a categorical feature with k categories into k binary dummy variables, each indicating the presence or absence of a particular category.
A disciplined process requires that any transformations (such as standardization or one-hot encoding) be fitted on the training data only, and then applied to the validation and test sets using the same parameters (means, standard deviations, category levels) to avoid leakage.
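A minimal leakage-safe sketch, assuming scikit-learn and a small simulated dataset with hypothetical column names, is shown below: because the scaler and encoder sit inside a Pipeline, each cross-validation fold fits them on its training portion only and merely applies them to its validation portion.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "leverage": rng.normal(size=600),
    "coverage": rng.normal(size=600),
    "sector": rng.choice(["industrial", "financial", "utility"], size=600),
})
y = (df["leverage"] + rng.normal(size=600) > 1).astype(int)   # toy default flag

# Standardize numeric ratios and one-hot encode the categorical sector variable
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["leverage", "coverage"]),
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["sector"]),
])
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# Preprocessing is refit within each training fold, so no information leaks
scores = cross_val_score(pipe, df, y, cv=5, scoring="roc_auc")
print("Mean cross-validated AUC:", round(scores.mean(), 3))
```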
Feature selection and engineering for text data
Textual financial data (analyst reports, MD&A sections, news) often enter models via a bag-of-words representation.
Key Term: bag-of-words
Bag-of-words (BOW) is a representation of text where each document is described by counts or weights of individual tokens (words or phrases), ignoring word order.
Raw BOW matrices are typically high-dimensional and sparse. Feature selection and engineering are therefore important:
- Remove stop words (very common words such as “the”, “and”) and extremely rare tokens
- Eliminate tokens that occur in both defaulting and non-defaulting firms at similar rates, as they do not help discriminate classes
- Consider multiword expressions (n-grams) such as “liquidity crisis” when phrases carry more meaning than individual words
Key Term: term frequency–inverse document frequency (TF‑IDF)
TF‑IDF is a weighting scheme for text features that increases with a token’s frequency in a document but decreases with its frequency across all documents, emphasizing discriminative words.
High-frequency words that remain (after removing stop words) and low-frequency words that convey little information can both be removed using filter methods, reducing noise and improving the signal-to-noise ratio in models such as penalized logistic regression.
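The sketch below strings these steps together with scikit-learn: TF-IDF weighting with stop-word removal and bigrams, a mutual-information filter, and a penalized logistic regression. The four toy documents and labels are invented placeholders, not real filings, and keeping every step inside a Pipeline means the vectorizer and filter would be refit on the training portion of each fold if the pipeline were cross-validated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical snippets standing in for annual-report text
documents = [
    "liquidity crisis and covenant breach reported",
    "strong cash flow and stable revenue growth",
    "going concern doubt raised by the auditor",
    "record earnings and healthy liquidity position",
]
labels = [1, 0, 1, 0]   # 1 = distressed, 0 = healthy (illustrative only)

text_model = Pipeline([
    # Bag-of-words with TF-IDF weights, English stop words removed, unigrams + bigrams
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    # Filter method: keep the tokens most associated with the distress label
    ("filter", SelectKBest(mutual_info_classif, k=10)),
    # Penalized (L2) logistic regression on the surviving text features
    ("clf", LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
])
text_model.fit(documents, labels)
print("Vocabulary size before filtering:",
      len(text_model.named_steps["tfidf"].vocabulary_))
```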
Worked Example 1.3
An analyst is modeling bankruptcy risk for listed companies using 100 financial ratios and market variables. What steps should they take before final model training?
Answer:
The analyst should use filter methods (such as removing features with very low correlation or mutual information with the bankruptcy outcome), then apply Lasso or Elastic Net regression as an embedded selection tool during model fitting, retaining only variables that improve out-of-sample prediction, as measured by cross-validation. Any scaling or one-hot encoding should be fitted within each training fold and then applied to the corresponding validation fold to prevent information leakage.
Worked Example 1.4
A team builds a sentiment score from annual report text to help predict equity returns. They start with a BOW of 20,000 tokens, remove stop words, compute TF‑IDF weights, and then keep only the top 500 tokens ranked by mutual information with next-year excess returns. They then fit a Ridge regression model using these 500 text features plus standard risk factors.
Which feature-selection methods are being used, and how do they interact with regularization?
Answer:
The initial removal of stop words and low-information tokens using mutual information is a filter method. Ridge regression does not set coefficients exactly to zero, so it is not performing variable selection but rather shrinking coefficients of all 500 text features toward zero. The combination of filter selection plus L2 regularization yields a model that is parsimonious enough to estimate (500 features instead of 20,000) yet still retains all selected features with stabilized coefficient estimates.
Revision Tip
Always perform feature selection and feature engineering based only on the training set portion in each cross-validation fold to prevent information leakage into the validation step. In an exam vignette, if feature selection is done once on the full dataset and then cross-validation is applied, you should recognize this as a biased validation procedure.
Summary
Cross-validation, regularization, and feature selection are core machine learning tools for CFA candidates. They work together to ensure model accuracy and robustness by confirming performance out-of-sample, limiting complexity, and removing irrelevant features.
In financial applications:
- Cross-validation plays a similar role to robust backtesting, providing an estimate of how a model will perform on unseen data, including under different market regimes.
- Regularization stabilizes regression and classification estimates, especially when features are numerous or correlated, and can automatically exclude weak predictors.
- Feature selection and engineering, including for text, improve generalization and interpretability by encoding economically meaningful information while keeping dimensionality manageable.
These techniques reduce overfitting risk, support defensible financial modeling, and help you interpret exam questions about model tuning, choice of penalty terms, and correct use of training, validation, and test sets.
Key Point Checklist
This article has covered the following key knowledge points:
- Cross-validation tests how a model will perform on unseen data and is essential for model selection and tuning.
- Train–validation–test splits and k-fold cross-validation should be structured to avoid look-ahead bias, especially for time-series data.
- Overfitting and underfitting can be diagnosed by comparing training and validation errors; cross-validation helps find the complexity level that balances bias and variance.
- Regularization (Lasso, Ridge, Elastic Net) penalizes excessive model complexity and, in the case of Lasso and Elastic Net, can perform variable selection.
- The regularization parameter is a hyperparameter selected by cross-validation; stronger penalties shrink coefficients more aggressively.
- Adjusted R² and information criteria such as AIC and BIC also penalize model complexity and are conceptually related to regularization.
- Feature selection removes non-informative or redundant variables, improving generalization and interpretability; filter, wrapper, and embedded methods have different trade-offs.
- Feature engineering, including one-hot encoding for categorical data and TF‑IDF for text, can materially improve predictive performance when done within a disciplined validation framework.
- Model tuning, selection, and feature engineering steps must always avoid using validation or test data during training to ensure fair out-of-sample performance estimates and to avoid data snooping.
Key Terms and Concepts
- overfitting
- underfitting
- cross-validation
- k-fold cross-validation
- hyperparameter
- train–validation–test split
- model complexity
- information leakage
- data snooping
- regularization
- penalized regression
- regularization parameter (λ)
- Ridge regression
- Lasso regression
- Elastic Net
- adjusted R²
- Akaike information criterion (AIC)
- Bayesian information criterion (BIC)
- feature selection
- feature engineering
- filter method
- wrapper method
- embedded method
- one-hot encoding
- bag-of-words
- term frequency–inverse document frequency (TF‑IDF)