Model selection utilities

Utilities that help with model selection, e.g. by splitting a dataset.

darts.utils.model_selection.train_test_split(data, test_size=0.25, axis=0, input_size=0, horizon=0, vertical_split_type='simple', lazy=False)

Splits the provided series into training and test series.

Supports splitting along the sample axis or the time axis. If the input is a single TimeSeries, only splitting along the time axis is available, so input_size and horizon have to be provided.

When splitting along the time axis, the splitter tries to greedily satisfy the requested test set size, i.e., when one of the timeseries in the sequence is too short, all of its samples go to the test set and an exception is raised.

A time-axis split with the 'model-aware' split type tries to reclaim as many data points for training as possible by letting the test and training sets partially overlap. This is possible because only the forecasted part of the test set cannot be used for training. The last timestep available to the training set is calculated as follows:

train end index = ts_length - horizon - test_size

And the formula to calculate the first timestep of the test dataset is:

test start index = ts_length - horizon - test_size - input_size + 1
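The two formulas above can be evaluated for a concrete series as a quick sanity check. The following is a minimal sketch rather than the library's implementation: the helper name model_aware_indices is hypothetical, and test_size is assumed to be an absolute number of timesteps (not a proportion).

```python
def model_aware_indices(ts_length: int, test_size: int, input_size: int, horizon: int):
    """Evaluate the two documented formulas for a single series.

    Hypothetical helper for illustration; test_size is assumed to be an
    absolute count of timesteps here.
    """
    train_end = ts_length - horizon - test_size
    test_start = ts_length - horizon - test_size - input_size + 1
    return train_end, test_start


train_end, test_start = model_aware_indices(
    ts_length=100, test_size=10, input_size=20, horizon=5
)
print(train_end, test_start)  # prints: 85 66

# By construction the two indices differ by input_size - 1, which is the
# stretch of the series shared between the training and test sets.
print(train_end - test_start)  # prints: 19
```

With these numbers the test set starts before the training set ends, which is the partial overlap that the 'model-aware' split type relies on.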

Parameters
  • data (Union[TimeSeries, Sequence[TimeSeries]]) – original dataset to split into training and test

  • test_size (Union[float, int, None]) – size of the test set. If the value is between 0 and 1, it is treated as a split proportion. Otherwise, it is treated as the absolute number of samples from each timeseries that will go to the test set. [default = 0.25]

  • axis (Optional[int]) – axis to split the dataset on. When 0 (default), the dataset is split on samples. If axis = 1, each timeseries is split along the time axis. Note that for a single timeseries the default is 1, since splitting on samples is not meaningful there. [default: 0 for sequence of timeseries, 1 for timeseries]

  • input_size (Optional[int]) – size of the input. Only valid with vertical_split_type == 'model-aware'. [default: None]

  • horizon (Optional[int]) – forecast horizon. Only valid with vertical_split_type == 'model-aware'. [default: None]

  • vertical_split_type (Optional[str]) – either 'simple', where exactly the number of timesteps given by test_size is taken from each timeseries for the test set and the remainder goes to the training set; or 'model-aware', where input_size and horizon have to be provided as well. Note that the second option is more efficient timestep-wise, since the training and test sets partially overlap. [default: 'simple']

  • lazy (bool) – by default, the train and test datasets are returned as lists of timeseries. This may be memory-inefficient if the dataset is large; setting this flag returns a Sequence object that loads the data lazily instead. Warning: turning lazy on disables some sanity checks on the datasets, which may result in exceptions during sample generation. [default: False]
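The interaction between test_size and the 'simple' split type described above can be sketched on a plain Python list. This is an illustration only, not the library's code: resolve_test_size and simple_time_split are hypothetical helpers, and truncating the proportion with int() is an assumption about rounding that the description above does not specify.

```python
from typing import List, Sequence, Tuple, Union


def resolve_test_size(test_size: Union[float, int], length: int) -> int:
    """Turn test_size into an absolute number of timesteps.

    Values strictly between 0 and 1 are read as a proportion of the series;
    anything else is read as an absolute count. Truncation via int() is an
    assumption made for this sketch.
    """
    if 0 < test_size < 1:
        return int(length * test_size)
    return int(test_size)


def simple_time_split(
    values: Sequence[float], test_size: Union[float, int] = 0.25
) -> Tuple[List[float], List[float]]:
    """Mimic a 'simple' time-axis split: the last n points form the test set."""
    n_test = resolve_test_size(test_size, len(values))
    if not 0 < n_test < len(values):
        raise ValueError("test_size leaves no data for training or for testing")
    return list(values[:-n_test]), list(values[-n_test:])


series = list(range(20))

# Proportion: 25% of 20 timesteps go to the test set.
train, test = simple_time_split(series, test_size=0.25)
print(len(train), len(test))  # prints: 15 5

# Absolute count: exactly 4 timesteps go to the test set.
train_abs, test_abs = simple_time_split(series, test_size=4)
print(len(train_abs), len(test_abs))  # prints: 16 4
```

Unlike the 'model-aware' type, the two parts here never overlap: every timestep lands in exactly one of them.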

Returns

Training and test datasets tuple.

Return type

tuple of two Sequence[TimeSeries], or tuple of two TimeSeries