This is sometimes known as "postdiction", "retrodiction", or "backcasting".
Testing Multiple Date Ranges
A very easy time series cross-validation technique is one that uses a single train/validation split rather than multiple splits. Generally, when you use multiple splits you do not remove the validation data, as the results are generalisable across the entire dataset; you do not test on the validation data, but instead fold it back into the training set to predict on the future test set. The problem with this technique is that you cannot use it to test different time periods, because the data would have been selected based on the validation sets. The best way to do validation without removing a large chunk of sequential data is a simple process where you use 20% of the data to train and a random scattering of 15% of the last 80% of the data for validation and model-parameter selection, after which you remove it from the dataset. This lets you test the model's performance over different periods with different training samples, and, because there are multiple test folds, calculate the statistical significance of the tests. A sketch of this split follows.
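Below is a minimal sketch of this split, assuming X and y are NumPy arrays ordered from oldest to newest observation; the function name and the seed argument are illustrative, not from the original.

import numpy as np

def scattered_time_split(X, y, seed=0):
    # Train on the first 20% of the series; validate on a random
    # scattering of 15% of the observations in the last 80%
    n = len(X)
    train_end = int(n * 0.20)
    rest = np.arange(train_end, n)
    rng = np.random.default_rng(seed)
    valid_idx = np.sort(rng.choice(rest, size=int(0.15 * len(rest)),
                                   replace=False))
    return X[:train_end], X[valid_idx], y[:train_end], y[valid_idx]

Re-running this with different seeds produces different scattered folds, which is what makes a significance test over the fold errors possible.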
Testing A Single Final Date Range
When you test a single final date range, you can use the normal chronological split for time series, and then simply use the final model to do the testing.
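As a minimal sketch, assuming X and y are ordered chronologically (the 20% test fraction is illustrative):

from sklearn.model_selection import train_test_split

# shuffle=False keeps the split chronological: train on the past,
# test on the final date range
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=False)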
Random Cross Validation Inside Loop:
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split

# X, y and X_test are assumed to come from the preceding data preparation

def rmsle(y_true, y_pred):
    # Standard RMSLE; assumed to be defined elsewhere in the original notebook
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    # LightGBM has no built-in 'rmsle' metric; the target here is already
    # log-transformed, so 'rmse' on the log scale plays the same role
    'metric': 'rmse',
    'max_depth': 6,
    'learning_rate': 0.1,
    'verbose': 0,
    'early_stopping_round': 20}

n_estimators = 100
n_iters = 5
preds_buf = []
err_buf = []

for i in range(n_iters):
    # A fresh random 90/10 train/validation split on each iteration
    x_train, x_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.10, random_state=i)
    d_train = lgb.Dataset(x_train, label=y_train)
    d_valid = lgb.Dataset(x_valid, label=y_valid)
    watchlist = [d_valid]
    model = lgb.train(params, d_train, num_boost_round=n_estimators,
                      valid_sets=watchlist)
    # Score this fold on the original scale (the target was log-transformed)
    preds = model.predict(x_valid)
    preds = np.exp(preds)
    preds[preds < 0] = median_trip_duration  # safety clamp; assumed defined earlier
    err = rmsle(np.exp(y_valid), preds)
    err_buf.append(err)
    print('RMSLE = ' + str(err))

    # Predict on the held-out future test set with this fold's model
    preds = model.predict(X_test)
    preds = np.exp(preds)
    preds[preds < 0] = median_trip_duration
    preds_buf.append(preds)

print('Mean RMSLE = ' + str(np.mean(err_buf)) + ' +/- ' + str(np.std(err_buf)))

# Average predictions across the folds
preds = np.mean(preds_buf, axis=0)
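Averaging the per-seed predictions in preds_buf is a simple bagging step: each random split trains a slightly different model, and the mean of their outputs is typically more stable than any single fold's prediction. The spread printed next to the mean RMSLE also gives a rough sense of how sensitive the model is to the particular split.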