Data Driven Workflow for Stock

Data is the basis

Data is the basis of all knowledge! That’s why we also use a data driven workflow for stock tickers. If we want to use a model to predict the price of Bitcoin, Ripple or Bonk, the price of a share of Apple or Nvidia, or for that matter the price of any other financial asset, we need data on the price history of that asset. As the saying goes: past behavior is the best predictor of future behavior. And this is as true for the price behavior of financial assets as it is for human beings! Given that historical price movements are rarely if ever linear, we use intelligent, learning models, which we must train with at least a sufficient amount of data, but preferably with as much data as we can possibly get!

Step-by-step

Model training and price prediction is not a one-time event: we have to build a working model in several steps and we want to train it for repeated use. Thus we need access to both historical data and more recent data, and for decent forecasting we even need the most recent data. This means we need a data driven approach for stock just as we do for other assets. There is a difference, however, between data handling for cryptocurrencies and the use of data in training models for stock tickers. For cryptocurrencies the best way to start is to create and manage your own collection of historical data, so there we use a slightly different workflow, as we explain in this post.

Streamlined workflow

With stock the data driven way of working is a bit more streamlined. For the big picture, take a look at the scheme below. We describe the six steps in detail underneath it.

A Streamlined Workflow

It all starts with the data and its source. For stock ticker data that source is Yahoo Finance.

First Step: Get Stock Data

The first step in our data driven way of working is to get access to the best source of stock ticker data. In a separate post we explain in detail how to download the historical data you need from Yahoo Finance and store it locally, not in a separate archive but in a plain, default-formatted comma-separated (.csv) file for immediate or later use.
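As a rough illustration of this first step, the sketch below stores a price history as a .csv file and reads it back. It is not the code from the separate post: the function names are hypothetical, and the download helper assumes the third-party yfinance package, which the text above does not name.

```python
from pathlib import Path

import pandas as pd


def download_history(ticker: str) -> pd.DataFrame:
    """Fetch the daily price history (assumption: the yfinance package is used)."""
    import yfinance as yf  # imported lazily so the CSV helpers work without it

    return yf.Ticker(ticker).history(period="max", interval="1d")


def save_history(prices: pd.DataFrame, csv_path: Path) -> None:
    """Write the price history as a plain comma-separated file."""
    prices.to_csv(csv_path, index_label="Date")


def load_history(csv_path: Path) -> pd.DataFrame:
    """Read the file back, using the Date column as a datetime index."""
    return pd.read_csv(csv_path, index_col="Date", parse_dates=True)
```

A typical call would be `save_history(download_history("AAPL"), Path("AAPL.csv"))`, after which `load_history` can serve the data to any later step without a new download.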

Second Step: Preprocess the Data

In a data driven way of working it is essential that the data we use meets the quality requirements necessary for modelling. For this purpose we use a special module. The Preprocess class, described in this post, checks whether enough data is available and whether it is up-to-date (and if not, offers to add the latest price information), indexes it and offers the possibility to strip the fields not needed for the intended use. Finally it hands the data to the calling module as a standardized pandas DataFrame.
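To make these checks concrete, here is a minimal sketch of what such a preprocessing step could look like. It is a simplified function, not the actual Preprocess class: the name `preprocess`, its parameters, the thresholds and the `Date` column name are all assumptions, and where the real class offers to add missing data this sketch simply raises an error.

```python
from __future__ import annotations

import pandas as pd


def preprocess(
    raw: pd.DataFrame,
    keep_columns: list[str],
    min_rows: int = 250,        # hypothetical minimum for modelling
    max_age_days: int = 3,      # hypothetical freshness threshold
    today: pd.Timestamp | None = None,
) -> pd.DataFrame:
    """Validate raw price data and return a clean, date-indexed frame."""
    if len(raw) < min_rows:
        raise ValueError(f"need at least {min_rows} rows, got {len(raw)}")

    # Index the data by date, oldest row first.
    frame = raw.copy()
    frame["Date"] = pd.to_datetime(frame["Date"])
    frame = frame.set_index("Date").sort_index()

    # Freshness check (simplification: fail instead of offering an update).
    if today is None:
        today = pd.Timestamp.today().normalize()
    age_days = (today - frame.index[-1]).days
    if age_days > max_age_days:
        raise ValueError(f"data is {age_days} days old; refresh it first")

    # Strip every field the caller did not ask for.
    return frame[keep_columns]
```

Passing `today` explicitly is only there to make the freshness check reproducible; in normal use it can be left out.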

Third Step: Hyperparameter Tuning

Now that we have a complete dataset in the needed format, we are ready to start training the model. This begins with the preparatory step of tuning the hyperparameters of the chosen (LSTM or ARIMA/SARIMAX) model. We describe the optimizing process with the Keras Tuner for the LSTM family of models in this post. Once the best parameters are found, we save them for later re-use. After training the model we save it as well.
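Saving the best parameters for later re-use can be as simple as writing them to a JSON file. The sketch below shows one possible way, assuming the parameters fit in a plain dictionary; the function names and the example keys (`units`, `dropout`, `learning_rate`) are hypothetical and not taken from the tuning post.

```python
import json
from pathlib import Path


def save_best_params(params: dict, path: Path) -> None:
    """Store the tuned hyperparameters as human-readable JSON."""
    path.write_text(json.dumps(params, indent=2, sort_keys=True))


def load_best_params(path: Path) -> dict:
    """Load the hyperparameters saved by save_best_params."""
    return json.loads(path.read_text())


# Hypothetical result of a tuning run. With Keras Tuner, a dictionary like
# this is available as tuner.get_best_hyperparameters(num_trials=1)[0].values
example_params = {"units": 64, "dropout": 0.2, "learning_rate": 0.001}
```

Keeping the parameters in a separate file means the training step can be repeated on fresh data without running the tuner again.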

Fourth Step: Training the Model

Now it’s time to finally train the model with the best parameters and a complete dataset. The more complex the model and the bigger the dataset, the more time and resources this process consumes. To keep the workstation responsive during the training of the model, the (Python) code for this kind of application has to be robust and thread-safe. Luckily there is no need to repeat this step very often. Once we are content with the model, we save it to disk for later use. We describe this process in detail in this post for the LSTM model. For the ARIMA/SARIMAX family of models we describe it in this post.
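One common way to keep an application responsive during a long training run is to move the training into a worker thread and pass progress back through a thread-safe queue. The sketch below illustrates that pattern under our own assumptions: `fake_training` is a stand-in for the real training loop, and the returned file name is hypothetical.

```python
import queue
import threading
import time


def fake_training(report, epochs: int = 3) -> str:
    """Stand-in for a real training loop; reports progress after each epoch."""
    for epoch in range(1, epochs + 1):
        time.sleep(0.01)  # placeholder for the real work
        report(("epoch", epoch))
    return "model.keras"  # hypothetical path of the saved model


def train_in_background(train_fn, messages: queue.Queue) -> threading.Thread:
    """Run train_fn in a worker thread; all results travel through the queue."""

    def worker() -> None:
        try:
            messages.put(("done", train_fn(messages.put)))
        except Exception as exc:  # hand failures to the caller instead of losing them
            messages.put(("error", exc))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

The calling thread (for example a GUI loop) only reads from the queue and never touches the model while it is being trained, which avoids the need for explicit locks.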

Fifth Step: Get Current Data for our Data driven Workflow

Once we have a model trained and saved with the best parameters, we are ready to use it repeatedly over a certain period of time (for as long as the market circumstances do not change significantly) to predict short-term price developments. For this purpose we only need a limited set of the most recent data. Thus we make use of a special module that quickly downloads the most recent 500 hours of data and makes it available as a standardized pandas DataFrame, in the form the model we want to reuse expects. That is the purpose of the GetCurrent class, which we describe in this post.
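The core of such a step is cutting the data down to the most recent window. The sketch below shows one way to do that on an already downloaded, datetime-indexed DataFrame. It is not the GetCurrent class itself: the function name is hypothetical, and we assume that "500 hours" means wall-clock hours counted back from the newest row.

```python
import pandas as pd


def latest_window(prices: pd.DataFrame, hours: int = 500) -> pd.DataFrame:
    """Keep only the rows from the last `hours` hours before the newest timestamp."""
    prices = prices.sort_index()
    cutoff = prices.index[-1] - pd.Timedelta(hours=hours)
    return prices.loc[prices.index > cutoff]
```

If the model instead expects exactly 500 hourly rows regardless of market closures, `prices.sort_index().tail(500)` would be the simpler choice.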

Sixth Step: Completing the Data Driven Workflow by Using the Model

Equipped with the price data of our stock ticker for the past 500 hours and a model previously trained and saved on this or a closely related asset, we can now easily predict the price development of this asset for the coming hours. This completes our data driven workflow as intended! Compared to the original training of the model, this forecasting with a limited dataset can be done very quickly: it is a matter of a few minutes instead of (sometimes several) hours! For the details see this post.
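Predicting several hours ahead is often done iteratively: the model predicts one step, that prediction is appended to the input window, and the process repeats. The sketch below illustrates this idea only. Whether the linked post forecasts this way is an assumption on our part, and `naive_mean` is a deliberately trivial stand-in for the saved model.

```python
from typing import Callable, Sequence


def forecast(
    history: Sequence[float],
    predict_next: Callable[[Sequence[float]], float],
    steps: int,
    window: int,
) -> list[float]:
    """Predict `steps` values ahead, feeding each prediction back as input."""
    values = list(history)
    predictions: list[float] = []
    for _ in range(steps):
        next_value = predict_next(values[-window:])
        predictions.append(next_value)
        values.append(next_value)
    return predictions


def naive_mean(recent: Sequence[float]) -> float:
    """Stand-in for a trained model: the mean of the input window."""
    return sum(recent) / len(recent)
```

With a real model, `predict_next` would wrap the scaling, reshaping and `predict` call that the saved model requires. Note that errors accumulate with every step in this scheme, which is one reason to keep the forecast horizon short.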

This completes the overview of the data driven way of working with stock ticker data. Every step is described in detail with fully working code in the posts mentioned above.
