Can you trust your model?
Best Practices For Selecting and Evaluating Predictive Models
by Marco Vriens
Model selection and evaluation via cross-validation
In both commercial practice and academic papers, I sometimes encounter analytics work that does not report model fit, does not explain how the “best” model was selected, and offers no evaluation of how strong the model is. What is a reader supposed to make of such results?
This topic is important in the commercial practice of advanced analytics, because the stakeholders and decision-makers who have to use the model (read: implement the recommendations you believe follow from it) are going to have to trust the model. They face what is sometimes referred to as information usage risk (see Grover & Vriens, 2006): they know that if for some reason the recommendations don’t result in the desired outcomes, the analyst may say “oops”, but ultimately it is the decision-maker who will feel the most impact.
How to build trust in your analytical results and make sure they are valid is a broad topic, but one useful angle is cross-validation (not to be confused with external validation).
What are some of the best practices to consider for selecting and evaluating predictive models? Here are some thoughts.
In-sample fit
First, in any type of predictive model, practitioners usually look at what we refer to as in-sample fit, i.e. how well does the model fit the data on which it was estimated? Users need to be aware that in most situations one can come up with several (if not many) models that all “fit the data” roughly equally well. There are also quite a few in-sample fit measures, especially for classification models (see our book From Data to Decision for more on this topic). One potential problem with in-sample fit is that one can often increase the fit of a model just by adding variables to it, which raises the risk of over-fitting. Over-fitting is a situation where a model fits the data at hand very well but does a very poor job of predicting new data. For this reason, in-sample fit is not a good guide for selecting the best model, let alone for determining whether we have a good model at hand.
So, researchers and data scientists turn to approaches that evaluate the predictive model on data that was not used to estimate it. Counting in-sample fit as approach (1), I see three further types of approaches: (2) the two-step hold-out accuracy approach, (3) the cross-validation approach (and its varieties such as k-fold cross-validation, k-fold averaging cross-validation, and sometimes the nested k-I-fold cross-validation), and (4) the three-step approach (which I have seen mostly in Machine Learning applications). These concepts are similar but not identical, and the differences between them can be important. Below is a brief description of each.
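To make the over-fitting risk described above concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the data, the sample size of 60, the 20 pure-noise predictors, and the helper name r_squared. It fits ordinary least squares with NumPy and shows that in-sample R^2 never goes down when variables are added, even when those variables have nothing to do with the outcome.

```python
import numpy as np

def r_squared(X, y, coef):
    """Share of the variance in y explained by the linear predictions X @ coef."""
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 60
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)      # only x truly drives y (simulated)
noise = rng.normal(size=(n, 20))      # 20 predictors unrelated to y

in_sample_r2 = []
for k in range(0, 21, 5):
    # Design matrix: intercept, the real predictor, and the first k noise predictors
    X = np.column_stack([np.ones(n), x, noise[:, :k]])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    in_sample_r2.append(r_squared(X, y, coef))
    print(f"{k:2d} noise variables: in-sample R^2 = {in_sample_r2[-1]:.3f}")
```

Because each model contains the previous one, least squares can only match or improve the fit on the estimation data, which is exactly why in-sample fit alone cannot tell a useful variable from a useless one.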
Hold-out accuracy
Say I have a dataset on which I want to estimate my predictive model (e.g. a linear regression model, a Decision Tree, etc.). I take a random sub-sample of my dataset, say 90% of it, and set the other 10% aside (we call this the hold-out part). On the 90%, I run the different models that, based on their in-sample fit, may all look acceptable. Then I use each of these models to predict the remaining 10% hold-out data and pick the one that predicts best.
It is possible to have two models that differ vastly in in-sample fit but perform very similarly on a new set of (hold-out) data. For example, Mullainathan & Spiess (2017) compare several models to predict house values. A Lasso regression approach yielded a 46% in-sample fit, and a Random Forest approach yielded an 85.1% fit. Yet, when both models were evaluated on a hold-out set, the results were 43.3% versus 45.5%, a much smaller difference. In fact, among the full set of models they compared, the model with the best hold-out accuracy was not the model with the best in-sample fit.
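The 90/10 hold-out procedure can be sketched as follows. This is an illustration under assumptions, not a reproduction of any study cited here: the data are simulated, the two candidate models (“small” and “large”) are hypothetical, and fit is measured with R^2 computed against the mean of whichever sample is being scored.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares coefficients for design matrix X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def r2(X, y, coef):
    """R^2 of the predictions X @ coef on the sample (X, y)."""
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
noise = rng.normal(size=(n, 30))      # predictors unrelated to y
y = 2.0 * x + rng.normal(size=n)

# Random 90/10 split: the 10% hold-out part is never used for estimation
idx = rng.permutation(n)
n_holdout = n // 10
holdout_idx, train_idx = idx[:n_holdout], idx[n_holdout:]

candidates = {
    "small": np.column_stack([np.ones(n), x]),
    "large": np.column_stack([np.ones(n), x, noise]),
}

results = {}
for name, X in candidates.items():
    coef = fit_ols(X[train_idx], y[train_idx])
    results[name] = (
        r2(X[train_idx], y[train_idx], coef),      # in-sample fit
        r2(X[holdout_idx], y[holdout_idx], coef),  # hold-out fit
    )
    print(f"{name}: in-sample R^2 = {results[name][0]:.3f}, "
          f"hold-out R^2 = {results[name][1]:.3f}")

# Pick the model that predicts the hold-out part best
best = max(results, key=lambda name: results[name][1])
```

The larger model always wins on in-sample fit here because it contains the smaller one; the hold-out column is the one to compare when choosing between them.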
Cross-validation & k-fold cross-validation
In this approach I split the dataset into two equal parts (two random subsets of 50%). I can now compare the modeling results across the two datasets and see which model is most robust. For example, say I identify three different models in sub-dataset 1 that all fit roughly equally well, or all fit decently. I now run the same analysis on the second sub-dataset. Do I identify the same models, and if so, how close are the coefficients? The model that is most robust across the two datasets is the one we select.
Sometimes the original data is split up into three parts, say of size 45%, 45%, and 10%. The first 45% is called the training sample because we “train” our model on this data. In this step we may find several possible models that are all acceptable from an in-sample fit perspective. The second 45% is called the test sample. The models selected in step 1 are now used to predict the data in this second 45% part, and whichever model does that best is selected. The selected model is then used to predict values in the remaining 10% of the data, which is called the validation set.
In the k-fold cross-validation approach the dataset is split into K equal parts. For example, if I have a sample of N=1000, I can create 10 random sub-samples of size 100 (without replacement, so each observation belongs to only one of the 10 parts). We take out one of the 10 parts and develop our best model on the remaining 90%. The 90% is called the training set and the 10% is called the test set. We then evaluate how well this model predicts the held-out 10%. We do this for each of the 10 parts.
In the statistical literature, this method is described and implemented slightly differently: if we have p independent variables, we can run 2^p - 1 possible models (one for every non-empty subset of the variables). So, with K parts we run K*(2^p - 1) models. We compare these models on mean-squared error (MSE) and select the model with the lowest MSE value.
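The k-fold procedure with all 2^p - 1 candidate models can be sketched as below. The simulated data, the choice of p = 3 and K = 10, and the decision to average each model's MSE over the K test folds before comparing are my assumptions for illustration; the papers cited in this post may define the comparison differently.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 1000, 3, 10
X = rng.normal(size=(n, p))
# Simulated outcome: the third variable is irrelevant
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

# Every non-empty subset of the p variables: 2^p - 1 candidate models
subsets = [c for r in range(1, p + 1) for c in combinations(range(p), r)]

# K random parts without replacement: each observation is in exactly one part
folds = np.array_split(rng.permutation(n), K)

def design(cols, rows):
    """Design matrix (intercept plus the chosen columns) for the given rows."""
    return np.column_stack([np.ones(len(rows)), X[rows][:, list(cols)]])

cv_mse = {}
for cols in subsets:
    fold_mse = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coef, *_ = np.linalg.lstsq(design(cols, train), y[train], rcond=None)
        err = y[test] - design(cols, test) @ coef
        fold_mse.append(np.mean(err ** 2))
    cv_mse[cols] = float(np.mean(fold_mse))   # K fits per model, K*(2^p - 1) in total

best = min(cv_mse, key=cv_mse.get)
print("selected variables:", best, "cross-validated MSE:", round(cv_mse[best], 3))
```

With p = 3 this is only 70 fits, but the count doubles with every extra variable, so an exhaustive search of this kind is practical only for small p.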
k-fold averaging cross-validation
This approach (Jung & Hu, 2015) differs slightly from the k-fold cross-validation approach described above. The difference is that in each of the K sub-datasets an optimal model is selected among the 2^p - 1 candidates, and the model parameters are then averaged across the K “optimal” models. So, this can be thought of as a kind of ensemble method.
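A rough sketch of the averaging idea follows. It rests on several assumptions of mine and should not be read as the exact procedure of Jung & Hu (2015): the data are simulated, each fold's “optimal” model is taken to be the subset with the lowest MSE on that fold after fitting on the other K - 1 folds, and a variable left out of a winning model contributes a coefficient of zero to the average.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(3)
n, p, K = 500, 3, 5
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)   # simulated data

subsets = [c for r in range(1, p + 1) for c in combinations(range(p), r)]
folds = np.array_split(rng.permutation(n), K)

def fit_subset(cols, rows):
    """OLS on the given rows; returns a length p+1 vector (intercept first)
    with zeros for the variables that are not in the subset."""
    A = np.column_stack([np.ones(len(rows)), X[rows][:, list(cols)]])
    coef, *_ = np.linalg.lstsq(A, y[rows], rcond=None)
    full = np.zeros(p + 1)
    full[0] = coef[0]
    full[1 + np.array(cols)] = coef[1:]
    return full

def mse(full_coef, rows):
    """Mean-squared prediction error of a full-length coefficient vector."""
    pred = full_coef[0] + X[rows] @ full_coef[1:]
    return float(np.mean((y[rows] - pred) ** 2))

fold_winners = []
for k in range(K):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    fits = [fit_subset(cols, train) for cols in subsets]
    # This fold's "optimal" model: lowest MSE on the part left out
    fold_winners.append(min(fits, key=lambda c: mse(c, test)))

# Average the parameters of the K fold-level winners
averaged_coef = np.mean(fold_winners, axis=0)
print("averaged coefficients (intercept first):", np.round(averaged_coef, 3))
```

Unlike plain k-fold cross-validation, the end product here is a single averaged coefficient vector, which is what gives the method its ensemble flavor.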
The three-step approach
In basic linear regression, we may have to decide which and how many independent variables to retain in the model, but there is usually not much else to decide. When I move to Decision Trees, I have to decide on the variables, but also on the splitting method and on when to stop (at purity or some other criterion). In the three-step approach we divide a given dataset into three parts: a training sample, a test sample, and a validation sample (e.g. Guyon et al., 2007). For example, a sample of N=1000 can be split into a training sub-sample of 500, a test sub-sample of 250, and a validation sub-sample of 250. The training sample is used to estimate/train a set of possible models that all show decent in-sample fit. These are then “tested” on the test sample, and a “best model” is selected. However, given that both training and test data were used to arrive at my “best” model, I still need to verify independently that my model is good. So this best model is used to predict the values in the validation sample.
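The 500/250/250 split can be sketched as below. The simulated data and the candidate models (polynomials of degree 1 to 6, standing in for whatever tuning choices a real model family offers) are assumptions for illustration. The sample names follow this post; many Machine Learning texts swap the labels “test” and “validation”.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(-2, 2, size=n)
# Simulated outcome with a quadratic relationship
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.5, size=n)

# Random split into training (500), test (250), and validation (250) samples
idx = rng.permutation(n)
train, test, valid = idx[:500], idx[500:750], idx[750:]

def mse(coef, rows):
    """Mean-squared prediction error of a fitted polynomial on the given rows."""
    return float(np.mean((y[rows] - np.polyval(coef, x[rows])) ** 2))

# Step 1: estimate the candidate models on the training sample only
fitted = {deg: np.polyfit(x[train], y[train], deg) for deg in range(1, 7)}

# Step 2: select the "best" model by its error on the test sample
test_mse = {deg: mse(coef, test) for deg, coef in fitted.items()}
best_degree = min(test_mse, key=test_mse.get)

# Step 3: check the selected model once on the untouched validation sample
validation_mse = mse(fitted[best_degree], valid)
print(f"selected degree {best_degree}: test MSE = {test_mse[best_degree]:.3f}, "
      f"validation MSE = {validation_mse:.3f}")
```

The validation error is the number to report, because the test error was already used to pick the winner and is therefore somewhat optimistic.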
In commercial practice we often do not use the three-step approach, because it adds time to the analysis that is not always available. Other considerations usually play a role in the model selection process as well, such as the average size of the coefficients, how intuitive the model is, and how parsimonious it is (the latter is a topic in and of itself, so we will leave that for now).
Thanks to Dr. Lambert Schomaker, Professor of Artificial Intelligence, University of Groningen, for some insights into this topic and for suggesting some references.
Disclaimer
Blogs are not scientific/academic papers. The topic of cross-validation has attracted some serious statistical papers and a few papers in the Machine Learning literature. Below we refer to a few if you want to dive deeper into the topic.
Grover, R. & Vriens, M. (2006). Trusted Advisor: How it lays the foundation for insights. In: Grover, R. & Vriens, M. (eds.), Handbook of Marketing Research. Sage Publications.
Guyon, I., Li, J., Mader, T., Pletscher, P.A., Schneider, G. & Uhr, M. (2007). Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recognition Letters, 28, 1438-1444.
Jung, Y. & Hu, J. (2015). A k-fold averaging cross-validation procedure. Journal of Nonparametric Statistics, 27, 2, 167-179.
Mullainathan, S. & Spiess, J. (2017). Machine Learning: An applied econometric approach. Journal of Economic Perspectives, 31, 2, 87-106.
Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88, 486-494.
Vriens, M., Chen, S. & Vidden, C. (2018). From Data to Decision: Handbook for the modern business analyst. Cognella Academic Publishing.