Two Faces of Model Evaluation and Validation in Machine Learning


In the field of data science, or more specifically Machine Learning (ML), you might expect that certain foundational concepts and terms are well established and commonly understood. After working in this domain for years, I can point to many examples where this is not the case.

When I moved to ML from another technical domain, I was surprised by the number of new concepts and, especially, the new meanings for common terms, like “training” a model. The typical ML workflow involves collecting data, training and evaluating a model on that data, deploying the model to produce sound predictions, and monitoring the model’s performance in production. One topic that confused me as I was starting out was figuring out what people meant by model evaluation and model validation. Even within the field of practice, these terms can have different meanings.

It means one thing from the point of view of the model developer.

Model developers care about the performance of the model.

Evaluating a model during the training phase is one way a model developer can build trust in their model. This involves analyzing how the trained model responds to data it has not seen before. In doing this, the model developer is trying to answer the question: What is a robust machine learning model for this case?

This perspective focuses on model performance

Evaluation is enabled by segmenting data to build the model and run experiments

Generally speaking, there are two categories of approaches for evaluating an ML model, both of which involve segmenting the available data into sets for running training experiments. The most common method splits the data into training, validation, and test sets. The training set is used to build the initial ML model; the validation set is used to tune the model’s hyperparameters (think knobs and switches) and select the best model; and the test set is used to estimate how the final model will perform when new, unseen data is fed to it for predictions. A model developer then uses metrics to quantify the performance of the model based on the type of problem being solved. For example, is it a supervised or unsupervised ML model? If it is supervised, which techniques were applied (e.g., classification or regression)? A quick search on model evaluation and model validation surfaces informative articles and courses from stellar practitioners who provide overviews of performance measures. A few that come to mind are listed in the references below.
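The three-way split described above can be sketched in a few lines with scikit-learn. This is a minimal illustration, not the article's own code; the dataset (iris) and the 60/20/20 split ratios are assumptions chosen for the example.

```python
# Minimal sketch of a train/validation/test split using scikit-learn.
# Dataset and split ratios are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples

# First carve out a held-out test set (20% of the data).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Then split the remainder into training and validation sets
# (0.25 of the remaining 80% = 20% of the original data).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```

The model is fit on the training set, the validation set guides hyperparameter tuning and model selection, and the test set is touched only once, at the end, to estimate performance on unseen data.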

From these publications and courses you can learn about common performance metrics used for classification (e.g., accuracy, confusion matrix, log loss, AUC, F-measure) and regression (e.g., root mean squared error, mean absolute error).
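As a quick sketch of what these metrics look like in practice, the snippet below computes each of the named measures with scikit-learn. The labels, predictions, and probabilities are made-up toy values, purely for illustration.

```python
# Illustrative computation of the metrics named above (toy data).
from sklearn.metrics import (accuracy_score, confusion_matrix, log_loss,
                             roc_auc_score, f1_score,
                             mean_squared_error, mean_absolute_error)

# Classification: true labels, hard predictions, and predicted
# probabilities of the positive class (all values made up).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8]

acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions
cm = confusion_matrix(y_true, y_pred)  # counts per (true, predicted) class
ll = log_loss(y_true, y_prob)          # penalizes confident wrong probabilities
auc = roc_auc_score(y_true, y_prob)    # area under the ROC curve
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall

# Regression: root mean squared error and mean absolute error.
y_true_r = [3.0, 5.0, 2.5]
y_pred_r = [2.5, 5.0, 3.0]
rmse = mean_squared_error(y_true_r, y_pred_r) ** 0.5
mae = mean_absolute_error(y_true_r, y_pred_r)

print(f"accuracy={acc:.2f} AUC={auc:.2f} F1={f1:.2f} "
      f"RMSE={rmse:.3f} MAE={mae:.3f}")
```

Which metric matters depends on the problem: accuracy can mislead on imbalanced classes, where AUC or the F-measure is often more informative, while RMSE penalizes large regression errors more heavily than MAE.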

The model developer perspective is useful but it focuses mostly on the model, not the data or the application.

Machine learning models are just one component in a software application.

When I started out in data science, the primary users worked in research and development. AI was not yet mainstream. This first description of model evaluation and validation was my foundation for understanding these terms, until one day I started hearing them used in other, much more expansive contexts.

It means a bigger thing from the point of view of the model validator.

Model validation is concerned with the integrity of the data, model and the application, especially in regulated sectors like banking.

As I branched out to learn about AI/ML applications in specific industries, model evaluation and validation began to take on different meanings. The banking or financial services sector topped the list of practice areas with an expanded view of model evaluation and validation. Financial institutions have established Model Risk Management (MRM) practices that must be supplemented with new knowledge and insights to meet the demands of AI/ML.

Compliance plays a key role in defining model evaluation and validation in banking.

Model validation takes on a broader meaning when you consider the context of use.

Applying Machine Learning in the banking sector opens up opportunities to use models with more accurate predictive power and insights. Compared to traditional models used in the banking industry, however, ML models are less transparent and more complex. This makes it harder to meet the sector’s stringent regulatory requirements, which spell out what companies must do to ensure their models are in compliance. In this context, the definition of model validation extends beyond model performance to the upstream data and even to the downstream model documentation and usage scenarios.

The person who is responsible for model validation is not the developer

Model validators need to be independent of the model development

Another distinction lies in who is responsible for model validation. In the most basic ML workflows, data scientists or model developers engage in model evaluation and validation steps in an effort to improve the performance of the model. In banking-sector model validation, internal (or external) validators who are independent of model development assess how fit the model is for use against specified criteria, such as the regulations.

These perspectives are different but complementary.

Regulations that are concerned with the context of use for AI/ML models are the key differentiator in these perspectives.

I don’t want to paint a picture that these two perspectives are at odds; they are very complementary. In fact, you can’t get to the more stringent description of model evaluation and validation until you satisfy the model developer’s view. Both the model-developer and model-validator views of model evaluation and validation are grounded in a well-performing model. The more expansive view gets you an application that is fit for use in its designated context according to one or more regulatory or business requirements. These distinctions are useful to me now because I understand which perspective to anchor my practice in.


Biswas, P. (2021, Jun. 7). AI/ML Model Validation Framework: It’s More Than a Simple MRM Issue. Towards Data Science.

Chauhan, N. (2020, May 28). Model Evaluation Metrics in Machine Learning. KDnuggets.

Intro to Machine Learning, Lesson 4: Model Validation. Kaggle.

Khalusova, M. (2019, May 8). Machine Learning Model Evaluation Metrics. AnacondaCON.

Mutuvi, S. (2019, Apr. 16). Introduction to Machine Learning Model Evaluation. Heartbeat.

Blais, O. (2020, Mar. 20). The Comprehensive Guide to Model Validation Framework: What Is a Robust Machine Learning Model? ODSC Blog.

Srinivasa, C., & Castiglione, G. (2020, Dec. 2). Model Validation: A Vital Tool for Building Trust in AI. Borealis AI Blog.

Validation of Machine Learning Models: Challenges and Alternatives. Protiviti Blog.