[03] Settings, settings everywhere!
Yes, there are a lot of settings, but the good news is that they can be saved and used again.
We have personally applied CPD to a wide variety of time series across a range of industries. The methodology has evolved over time to cater for most of the patterns we have encountered, hence the large number of settings now available. It has reached the stage where edge cases are few and far between. You will find the algorithm to be forgiving and great results can be obtained even when the settings are not strictly optimal.
Quick Tip: You may well be able to run CPD across your data simply by (i) dragging a source file onto the Input Data tab, and (ii) ensuring the correct periodicity is selected on the Time Series tab. Beyond that, the default settings may well suffice.
In this post we will provide guidance on using the best settings for your data, starting with the Input Data tab.
The Source Data box is waiting for you to drag and drop either a CSV file or an Excel file. In either case the file must contain three labelled columns:
- Series - the name of the time series
- Period - the date of the observation (typically the start of the period if not daily) or a numeric sequence if your data is ordered but does not have a date
- Quantity - a numeric value
Your source data can contain other columns but the above three columns must be included amongst them.
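If you want to check a file before dragging it in, the column requirement above can be sketched in a few lines of Python. This is only an illustration and not part of the CPD Wizard: the function name `missing_columns` and the sample rows are made up for the example.

```python
import csv
import io

# The three labelled columns the post says every source file must contain.
REQUIRED_COLUMNS = {"Series", "Period", "Quantity"}


def missing_columns(csv_text):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return sorted(REQUIRED_COLUMNS - header)


# Made-up sample: an extra "Region" column is fine as long as the
# three required columns are present.
sample = "Series,Region,Period,Quantity\nWidgets,North,01/02/2023,120\n"
print(missing_columns(sample))               # []
print(missing_columns("Series,Period\nA,1\n"))  # ['Quantity']
```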
If your data is dated then the Period needs to be formatted as a recognisable date in Excel if that is the source, or either as dd/mm/yyyy or mm/dd/yyyy if it is a CSV file. The below extract represents an acceptable CSV format but in this instance the CPD Wizard will get agitated about whether this is daily data in the European style or monthly data in the US style.
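To see why such a file is ambiguous, here is a small Python sketch that tries each Period value in both styles. The date values are invented for the illustration and the helper names are hypothetical; the CPD Wizard's own detection logic is not described in this post.

```python
from datetime import datetime

STYLES = (("dd/mm/yyyy", "%d/%m/%Y"), ("mm/dd/yyyy", "%m/%d/%Y"))


def parse_both_styles(text):
    """Parse a date string under each style; None where the style is invalid."""
    results = {}
    for style, fmt in STYLES:
        try:
            results[style] = datetime.strptime(text, fmt).date()
        except ValueError:
            results[style] = None
    return results


def is_ambiguous(periods):
    """True when every value is a valid date under both styles."""
    return all(None not in parse_both_styles(p).values() for p in periods)


# Invented values: daily data if read as dd/mm/yyyy (1, 2, 3 January),
# monthly data if read as mm/dd/yyyy (1 January, 1 February, 1 March).
periods = ["01/01/2023", "02/01/2023", "03/01/2023"]
print(is_ambiguous(periods))         # True
print(is_ambiguous(["13/01/2023"]))  # False: there is no month 13
```

A single day-of-month above 12 anywhere in the column is enough to settle the question, which is why short extracts from early in a month are the awkward case.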
You can assist by ensuring the correct Date Style is selected on the Input Data tab along with the correct Periodicity on the Time Series tab. The CPD Wizard can currently handle daily, weekly, monthly and quarterly data. If your data has a different periodicity (e.g. hourly) then you can apply a number sequence instead.
Multiple series can be included in long format in a single CSV file or on a single Excel tab, but they must have the same periodicity throughout (i.e. do not include weekly and monthly data in the same table). You can, however, include different periodicities on different Excel tabs.
After dragging an Excel file into the Source Data box the input options will automatically extend for you to choose the required Excel Tab from the drop-down menu.
If your data is dated and represents monetary amounts then there is an option to adjust your time series for inflation and use constant prices. We have provided a spreadsheet containing CPI (consumer price index) data for a limited number of countries and we will add more if users request it.
The final setting on the Input Data tab is the selection of the Currency to be displayed on your CPD plots when appropriate. You can either choose one of the options on offer or you can enter your own currency symbol.
And on to the Time Series tab.
The correct Periodicity of your data needs to be specified. The options are currently Days, Weeks, Months, Quarters and Undeclared, with the latter being used when your data is undated or fits another periodicity (in which case a numeric sequence is required).
If there are records missing from your data then there is an option to Fill Missing Values with the Latest Value observed or a Fixed Value (typically zero). This option only works for dated series and only when an entire record is missing (i.e. it does not fill NA or null values).
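The two fill options can be pictured with a short Python sketch for daily data. It is an illustration of the idea only, under the assumption that a "missing record" means a date with no row at all; the function name and arguments are hypothetical and do not come from the CPD Wizard.

```python
from datetime import date, timedelta


def fill_missing_days(observations, method="latest", fixed_value=0.0):
    """Add a record for every absent day between the first and last date.

    observations maps date -> quantity. method is "latest" (carry the
    last observed value forward) or "fixed" (use fixed_value).
    """
    if method not in ("latest", "fixed"):
        raise ValueError("method must be 'latest' or 'fixed'")
    if not observations:
        return {}
    filled = {}
    current, end = min(observations), max(observations)
    latest = None
    while current <= end:
        if current in observations:
            latest = observations[current]
            filled[current] = latest
        else:
            filled[current] = latest if method == "latest" else fixed_value
        current += timedelta(days=1)
    return filled


# Made-up series with 2 and 3 January missing.
obs = {date(2023, 1, 1): 5.0, date(2023, 1, 4): 8.0}
print(list(fill_missing_days(obs, "latest").values()))  # [5.0, 5.0, 5.0, 8.0]
print(list(fill_missing_days(obs, "fixed").values()))   # [5.0, 0.0, 0.0, 8.0]
```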
When your data is dated and represents monetary amounts from specific countries then there is an option to adjust your time series for inflation. Inflation Adjustment should only be considered when the series is long enough to warrant an adjustment (typically more than two years).
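For readers unfamiliar with constant prices, the standard calculation rescales each amount by the ratio of the CPI in a chosen base period to the CPI in the period of the observation. The Python sketch below shows that textbook formula with made-up figures; the post does not state which base period or exact method the CPD Wizard uses, so treat the function and its arguments as assumptions.

```python
def to_constant_prices(nominal, cpi, base_period):
    """Restate nominal amounts in the prices of base_period.

    nominal and cpi both map period -> value. Each amount is multiplied
    by CPI(base_period) / CPI(period).
    """
    base = cpi[base_period]
    return {period: amount * base / cpi[period]
            for period, amount in nominal.items()}


# Made-up figures: sales rose 10% in cash terms while prices also rose 10%,
# so in 2022 prices there was no real growth.
nominal = {"2021": 100.0, "2022": 110.0}
cpi = {"2021": 100.0, "2022": 110.0}
print(to_constant_prices(nominal, cpi, base_period="2022"))
```

Without the adjustment, a series like this could show an apparent upward change that is purely the effect of inflation.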
A Display Units setting is available should you want to reduce the scale of your raw data. You can either divide the Quantity by a thousand (K) or by a million (M) and you can also include a currency unit (if specified on the Input Data tab) in the CPD plot labels.
The remaining three Exclusion Levels settings are used to exclude from consideration those series that have only a few observations, that have only low values, or that have not had any recent observations.
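As a rough picture of this kind of filtering, the Python sketch below applies three thresholds to a single series. The parameter names, the default values, and the exact rules (counting non-zero periods, averaging them, and counting trailing zero periods) are all assumptions made for the example; the post does not define how the CPD Wizard measures each level.

```python
def passes_exclusions(quantities, min_observations=12,
                      min_average=1.0, max_idle_periods=6):
    """Decide whether one series is worth analysing.

    quantities is in period order, with zero meaning no activity.
    All three thresholds are hypothetical.
    """
    active = [q for q in quantities if q != 0]
    # Too few observations (or none at all).
    if not active or len(active) < min_observations:
        return False
    # Observations are too small on average.
    if sum(active) / len(active) < min_average:
        return False
    # Nothing observed recently: count zero periods at the end.
    idle = 0
    for q in reversed(quantities):
        if q != 0:
            break
        idle += 1
    return idle <= max_idle_periods


print(passes_exclusions([5] * 12))            # True
print(passes_exclusions([5] * 3))             # False: too few observations
print(passes_exclusions([5] * 12 + [0] * 7))  # False: no recent activity
```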
The Change Point tab provides settings used by the CPD algorithm.
The first setting specifies the number of series to be included in the CPD Execution. CPD can either be run against all the series in your source data or against a random sample. Specifying a Sample Size is a good option when you are exploring your data and you have not yet settled on your desired options. When you are ready to productionise the process you can select All Series and run CPD across hundreds or even thousands of time series.
The Number of Shuffles specifies the number of bootstrapping iterations used to determine the probability that the identified change is significant at the required Confidence Level. If a very large number of series are being considered then it may be necessary to reduce the number of shuffles below the default value of 1,000.
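To give a feel for what the shuffles do, here is a minimal Python sketch of one common shuffle-based test: compare the range of the cumulative sum (CUSUM) of the real series with the ranges obtained after randomly reordering it. This is an assumption about the general technique, not a description of the CPD algorithm itself, and the function names are made up.

```python
import random


def cusum_range(values):
    """Range (max minus min) of the cumulative sum of deviations from the mean."""
    mean = sum(values) / len(values)
    running = lowest = highest = 0.0
    for v in values:
        running += v - mean
        lowest = min(lowest, running)
        highest = max(highest, running)
    return highest - lowest


def change_confidence(values, n_shuffles=1000, seed=0):
    """Share of shuffled copies whose CUSUM range is below the observed one.

    A value close to 1 suggests the ordering of the data matters,
    i.e. that a change occurred somewhere in the series.
    """
    rng = random.Random(seed)
    observed = cusum_range(values)
    shuffled = list(values)
    below = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if cusum_range(shuffled) < observed:
            below += 1
    return below / n_shuffles


# Made-up series with a clear step from 10 to 20 half-way through.
step = [10.0] * 20 + [20.0] * 20
print(change_confidence(step) >= 0.95)  # True
print(change_confidence([5.0] * 10))    # 0.0: a flat series shows no change
```

The cost grows in proportion to the number of shuffles times the number of series, which is why fewer shuffles can be a sensible trade-off on very large data sets.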
CPD Iterations specifies the number of iterations to be used when identifying multiple change points in a series. On the first iteration the full series is tested while subsequent iterations successively split the series at the point of significant changes in order to detect further changes.
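The iterative splitting described above can be sketched in Python as follows. This is a simplified illustration that assumes a CUSUM shuffle test for each segment; the function names, the minimum segment length, and the way the change position is chosen are assumptions and may differ from the actual CPD algorithm.

```python
import random


def cusum_path(values):
    """Cumulative sum of deviations from the segment mean."""
    mean = sum(values) / len(values)
    total, path = 0.0, []
    for v in values:
        total += v - mean
        path.append(total)
    return path


def significant_split(values, n_shuffles, confidence, rng):
    """Offset of the first point after a significant change, else None."""
    if len(values) < 4:  # hypothetical minimum segment length
        return None
    path = cusum_path(values)
    observed = max(path) - min(path)
    shuffled, below = list(values), 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        p = cusum_path(shuffled)
        if max(p) - min(p) < observed:
            below += 1
    if below / n_shuffles < confidence:
        return None
    # The change is placed just after the largest CUSUM excursion.
    peak = max(range(len(path)), key=lambda i: abs(path[i]))
    return peak + 1


def find_change_points(values, iterations=3, n_shuffles=500,
                       confidence=0.95, seed=0):
    """Test the full series, then re-test the pieces on each further iteration."""
    rng = random.Random(seed)
    segments, found = [(0, len(values))], []
    for _ in range(iterations):
        next_segments = []
        for start, end in segments:
            split = significant_split(values[start:end], n_shuffles,
                                      confidence, rng)
            if split is not None:
                found.append(start + split)
                next_segments += [(start, start + split), (start + split, end)]
        if not next_segments:
            break
        segments = next_segments
    return sorted(found)


# Made-up series with level shifts at positions 30 and 60.
series = [10.0] * 30 + [20.0] * 30 + [5.0] * 30
print(find_change_points(series, iterations=1))  # [60]
print(find_change_points(series, iterations=3))  # [30, 60]
```

With a single iteration only the most prominent change is reported; the second change only appears once the series has been split and the left-hand piece is tested on its own.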
The remaining settings on this tab have been described in a separate post on meaningful and sustained changes.
And finally, the Output Data tab specifies what outputs are produced along with how the settings are to be stored and used.
Viewing the CPD plots is a valuable exercise when exploring your data and establishing the appropriate settings. You can either view All Series or a random sample of Sample Size if you have a lot of data. When performing production runs you may choose to view None to speed up execution.
As for the CPD output, you can write it to a CSV file with a name of your choosing (there is no need to include the .csv extension) and/or you can Display summary results on the screen.
A very useful feature is the ability to save all your settings to a Settings File. Once you have saved your settings using the Save button you can load them by clicking the Load button at any time or you can elect to automatically load them upon program start-up. There is also an option to automatically run the program upon start-up using your saved settings. This is a handy option when performing production runs; you may even be able to set up a calling routine to automatically call the CPD executable.