Train-Test split and Cross-validation

Building an optimum model which neither underfits nor overfits the dataset takes effort. To know the performance of our model on unseen data, we can split the dataset into train and test sets and also perform cross-validation.

Train-Test split


To know the performance of a model, we should test it on unseen data. For that purpose, we partition dataset into training set (around 70 to 90% of the data) and test set (10 to 30%).  In sklearn, we use train_test_split function from sklearn.model_selection.

from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=1)

  • stratify option tells sklearn to split the dataset into test and training set in such a fashion that the ratio of class labels in the variable specified (y in this case) is constant. If there 40% 'yes' and 60% 'no' in y, then in both y_train and y_test, this ratio will be same. This is helpful in achieving fair split when data is imbalanced.
  • test_size option helps to determine the size of test set (0.2=20%)
  • Further there is shuffle option (by default shuffle=True) which shuffles the data before splitting.

Cross-validation (CV): why we need it?


When we have to tune hyperparameters of a model, to know whether the value of hyperparameter that we chose is optimal or not, we have to run the model on test set. But if we use the test set more than once, then the information from test dataset leaks to the model. This leads to over-fitting or byhearting the value of dependent variable. To avoid that, we use cross-validation.

We use one more test set, that is called validation set to tune the hyperparameters.  Following picture depicts the 3-fold CV. K-fold CV corresponds to subdividing the dataset into k folds such that each fold gets the chance to be in both training set and validation set.

3-fold CV
CV-accuracy is computed by averaging accuracy levels in these three folds.

By default, the cross val score function uses StratifiedKFold for classification and KFold for other tasks. There are different ways to perform CV:
  • Leave One Out (LOO): Computationally expensive. Suppose if there are three observations, then one observation will be left out from the training model and that observation will be the entire validation set. Since there are three observations, there will be three training sets and three validation sets (each observation will be the validation set once).
  • Leave P Out (LPO): Same as LOO, except P observations will be kept out instead of one.
  • ShuffleSplit: Data shuffled before splitting. Use only when observations are independent and there is no pattern unlike the time series data.
  • KFold: It is described in the above example, where folds were three.
  • StratifiedKFold: It is modified version of KFold CV for classification tasks. Here same percentage of y classes are preserved in both training and validation sets. It is useful when there is class imbalance.
  • Stratified Shuffle Split: Same as StratifiedKFold but data is shuffled before splitting.
  • GroupKFold: It is also modified version of KFold CV. Suppose you have data on four groups. Groups may be four individuals, or four groups of plants or four patients. Since we want to know the performance of our model on unseen group data, we don't want observations from same group to be in both training and testing set. GroupKFold does exactly that.
This scikit-learn document clearly explains all these different CV approaches with visualizations.


Cross-validation for Time series data


Timeseries data are not independent of each other. There is a pattern. Hence randomly splitting the data into test and train is not meaningful. There are excellent sources to know more about cross-validation for time series data (blogpostNested Cross-Validation, stackexchange answer and research paper on hv-block cross-validation).


In scikit-learn, TimeSeriesSplit approach splits the timeseries data in such a way that validation/test set follows training set as shown below.

Cross-validation for Timeseries data

There are other approaches as well. In Nested CV, we use test set which follows the validation set. In hv-block CV, it is suggested to leave few observations between training and validation sets to get an accurate measure of accuracy.

Summary


In this post, we have explored:
  • Train-test split
  • Cross-validation
  • Different approaches of CV
  • Cross-validation for Timeseries data

References


If there are any questions or suggestions feel free to share.