How to Create Training and Test Data Sets for your Stock Market Machine Learning Pipeline
2017-12-12 - by Wilton
One of the key components of a successful machine learning system is the training file. The training file is a specially formatted file that is used as input to a machine learning algorithm. The algorithm uses the data in the file to "learn" how to predict success. However, for this to work properly, there are a few concepts you must understand first.
First of all, the most common machine learning algorithms are trained to predict "success." Success, of course, is whatever we define it to be. With stock market algos, success generally means that the stock price rose over a given period of time.
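To make the definition of success concrete, here is a minimal sketch in Python. The function name `label_success`, the sample prices, and the horizon of 2 periods are all assumptions made for illustration; your own definition of success may differ.

```python
def label_success(prices, horizon):
    """Label each period 1 if the price is higher `horizon` periods later, else 0.

    Periods too close to the end of the series get None, because the
    future price needed to decide success is not available yet.
    """
    labels = []
    for i in range(len(prices)):
        j = i + horizon
        if j >= len(prices):
            labels.append(None)  # outcome not yet observable
        else:
            labels.append(1 if prices[j] > prices[i] else 0)
    return labels


# Hypothetical closing prices, oldest first.
prices = [10.0, 11.0, 9.5, 12.0, 9.0]
labels = label_success(prices, horizon=2)
print(labels)
```

Note that the last `horizon` periods cannot be labeled, so they would be left out of the training file.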
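This takes us to our next concept. When running in production, the machine learning algorithm has no ability to see the future. So, when you create your training file, you must ensure that the features you provide are lagged. For stock market algos, you want to lag the features by your expected investment horizon: each row pairs the features as they were known on a given date with the outcome observed one horizon later. If your horizon is 30 days, for example, the label for a row says whether the price rose over the 30 days following the date of that row's features.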
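One way to picture the lag is as a pairing of "what was known on date t" with "what happened by date t + horizon." The sketch below assumes a hypothetical `Bar` record with a closing price and a volume, and a horizon measured in rows; it is an illustration of the idea, not a prescribed implementation.

```python
from typing import NamedTuple


class Bar(NamedTuple):
    """One observation of a stock (hypothetical record layout)."""
    date: str
    close: float
    volume: int


def build_examples(bars, horizon):
    """Pair features known at time t with the outcome at t + horizon.

    Each example is (label, close, volume). The features come only from
    bars[t]; the label is the only place a future value is used.
    """
    examples = []
    for t in range(len(bars) - horizon):
        known, future = bars[t], bars[t + horizon]
        label = 1 if future.close > known.close else 0
        examples.append((label, known.close, known.volume))
    return examples


# Hypothetical data, oldest first.
bars = [
    Bar("2017-12-01", 10.0, 1000),
    Bar("2017-12-04", 11.0, 1200),
    Bar("2017-12-05", 9.5, 900),
    Bar("2017-12-06", 12.0, 1500),
]
examples = build_examples(bars, horizon=1)
for row in examples:
    print(row)
```

The final bar produces no example, since its outcome lies in the future. This mirrors production, where the algo sees only today's features.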
The "features" are the various items that you want your algo to examine when making an investment decision. These features are typically financial metrics such as price, volume, or book value per share.
The standard format for a training file is a delimited text file with a series of columns. The first column is the "label" column, which contains a "1" to indicate success or a "0" to indicate failure. The rest of the columns contain the lagged features.
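As a rough sketch of that layout, the snippet below writes a comma-delimited training file with Python's standard `csv` module. The header row, the comma delimiter, the column names, and the helper name `write_training_file` are assumptions; check what your machine learning tool expects, since some want no header or a different delimiter.

```python
import csv
import io


def write_training_file(examples, feature_names, out):
    """Write rows as: label first, then the lagged features."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["label"] + list(feature_names))  # header row (assumed)
    for label, *features in examples:
        writer.writerow([label, *features])


# Hypothetical examples: (label, lagged price, lagged volume).
examples = [(1, 10.0, 1000), (0, 11.0, 1200)]

# An in-memory buffer stands in for a real file here;
# open("training.csv", "w", newline="") would work the same way.
buffer = io.StringIO()
write_training_file(examples, ["price", "volume"], buffer)
training_text = buffer.getvalue()
print(training_text)
```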