Data is the basis
Data is the basis of all knowledge! That’s why we use a data-driven workflow. If we want to use a model to predict the price of Bitcoin, Ripple or Bonk, the price of a share of Apple or Nvidia, or the price of any other financial asset, we need data on that asset’s price history. As the saying goes: past behavior is the best predictor of future behavior. This is as true for the price behavior of financial assets as it is for human beings! Because historical price movements are rarely if ever linear, we use intelligent, learning models, which must be trained with at least a sufficient amount of data, and preferably with as much data as we can possibly get!
Step-by-step
Model training and price prediction are not a one-time event: we build a working model in several steps, and we want to train it once and use it many times. We therefore need access to historical data and to more recent data, and for decent forecasting we need the very latest data as well. This calls for a data-driven workflow, for cryptocurrency as much as for other assets. The need for large amounts of data is particularly strong for cryptocurrency, where we use the data-driven way of working shown in the big picture below.

It all starts with the data, and for cryptocurrencies our source is the world’s biggest cryptocurrency exchange: Binance.
Starting the Data-Driven Workflow: Dump Historical Data
The first step in our data-driven workflow is to get access to the best source of data. Binance sits on a mountain of data, containing the complete price history of almost every cryptocurrency we know today. You download the history of a specific coin as a market pair, for instance BTC-USDT, which gives the opening, closing, high and low price at a chosen frequency (daily, hourly or per minute), together with the volume and the number of trades. We download a specific price history as monthly CSV files, plus daily CSV files for the current month, and store these in our own archive. In each new run we update this archive, replacing the daily files with the monthly file once a month is complete. This is detailed in a separate post.
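To make the monthly-plus-daily layout concrete, here is a minimal sketch of how one archive run could decide which files it wants. The function name `archive_file_names` and the file naming scheme are assumptions made for illustration; they are not necessarily the names used by the exchange or by our archive.

```python
from datetime import date, timedelta


def archive_file_names(pair: str, interval: str, first_month: date, today: date) -> list[str]:
    """List the files one archive run wants: one per complete month,
    then one per finished day of the current month.

    The scheme "<PAIR>-<interval>-<YYYY>-<MM>[-<DD>].csv" is an assumption
    made for this sketch.
    """
    names = []

    # Monthly files for every complete month before the current one.
    year, month = first_month.year, first_month.month
    while (year, month) < (today.year, today.month):
        names.append(f"{pair}-{interval}-{year:04d}-{month:02d}.csv")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

    # Daily files for the current month, up to and including yesterday.
    day = today.replace(day=1)
    while day < today:
        names.append(f"{pair}-{interval}-{day:%Y-%m-%d}.csv")
        day += timedelta(days=1)

    return names


files = archive_file_names("BTCUSDT", "1h", date(2023, 11, 1), date(2024, 1, 3))
```

Run on a later date, the same function returns a monthly name where it previously returned daily ones, which is the point at which the daily files in the archive can be replaced.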
For stock data we don’t need to create and manage our own archive: we collect it as needed from Yahoo Finance. This results in a simpler workflow, as shown in this post.
Step Two: Get Complete Dataset
Once we have local historical data, the next step in our data-driven workflow is to pull in just over 20 days of the most recent data (the most recent 500 hourly datapoints cover almost 21 days) and add it to the previously saved historical data to create an up-to-date dataset. This process is detailed in this post.
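The merge itself can be sketched in a few lines. This is a simplified illustration in plain Python rather than the code from the post: the field name `open_time`, the function `merge_candles` and the rule that the recent candle wins on overlap are all assumptions, and the sample values are made up.

```python
def merge_candles(historical: list[dict], recent: list[dict]) -> list[dict]:
    """Combine archived and recent candles into one dataset ordered by time.

    Each candle is a dict with at least an "open_time" key (a hypothetical
    field name). Where both sources cover the same timestamp, the recent
    candle replaces the archived one.
    """
    by_time = {candle["open_time"]: candle for candle in historical}
    by_time.update({candle["open_time"]: candle for candle in recent})
    return [by_time[t] for t in sorted(by_time)]


# 500 hourly datapoints expressed in days: 500 / 24 is roughly 20.8.
HOURS_PER_DAY = 24
days_covered = 500 / HOURS_PER_DAY

# Made-up sample candles; timestamps are simple counters for readability.
historical = [{"open_time": 1, "close": 100.0}, {"open_time": 2, "close": 101.0}]
recent = [{"open_time": 2, "close": 101.5}, {"open_time": 3, "close": 102.0}]
dataset = merge_candles(historical, recent)
```

Keying on the timestamp means the overlap between archive and recent download does not produce duplicate rows, whatever the size of that overlap.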
Step Three: Preprocess the Dataset
In a data-driven way of working it is essential that the data we use meets the quality requirements for modelling. For this purpose we use a special module. The Preprocess class, described in this post, checks whether the data is available and up to date (and if not, offers to add the latest price information), indexes it, and offers the option to strip fields not needed for the intended use. Finally it hands the data to the calling module as a standardized pandas DataFrame.
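Two of these checks are easy to illustrate. The sketch below is not the Preprocess class itself: the helper names `is_up_to_date` and `strip_fields` are hypothetical, the freshness rule is one possible definition, and plain dicts stand in for the DataFrame so the example runs with the standard library only.

```python
from datetime import datetime, timedelta, timezone


def is_up_to_date(last_open: datetime, now: datetime, interval: timedelta) -> bool:
    """True if no complete candle is missing after the last stored one.

    The candle after `last_open` opens at `last_open + interval` and is
    complete at `last_open + 2 * interval`; before that moment there is
    nothing new to add.
    """
    return now < last_open + 2 * interval


def strip_fields(rows: list[dict], keep: list[str]) -> list[dict]:
    """Return copies of the rows that contain only the fields in `keep`."""
    return [{name: row[name] for name in keep} for row in rows]


hour = timedelta(hours=1)
last_open = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)

fresh = is_up_to_date(last_open, datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc), hour)
stale = is_up_to_date(last_open, datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc), hour)

# Made-up sample row with one field the model does not need.
rows = [{"open_time": last_open, "close": 105.0, "trades": 42}]
slim = strip_fields(rows, keep=["open_time", "close"])
```

When the check returns False, the caller can fetch the missing candles before handing the data on.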
Step Four: Hyperparameter Tuning
Now that we have a complete dataset in the desired format, we start training the model. This begins with tuning the hyperparameters of the chosen model (LSTM or ARIMA/SARIMAX). For the LSTM model we describe the optimization with Keras Tuner in this post. Once we find the best parameters, we save these for later reuse. We then train the model and save it as well.
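The tune-then-save pattern can be shown without any deep learning library. The sketch below deliberately swaps in a plain random search as a stand-in and is not Keras Tuner; the search space, the placeholder `validation_loss` function and the file name `best_params.json` are all assumptions for illustration.

```python
import json
import random
import tempfile
from pathlib import Path

# Hypothetical search space; names and values are illustrative only.
SEARCH_SPACE = {
    "units": [32, 64, 128],
    "dropout": [0.0, 0.2, 0.4],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}


def validation_loss(params: dict) -> float:
    """Placeholder objective; a real one would train briefly and validate."""
    return abs(params["units"] - 64) / 64 + params["dropout"] + params["learning_rate"]


def random_search(space: dict, objective, trials: int, seed: int = 0):
    """Try `trials` random combinations and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(SEARCH_SPACE, validation_loss, trials=20)

# Save the winning parameters as JSON and read them back for later reuse.
with tempfile.TemporaryDirectory() as folder:
    params_file = Path(folder) / "best_params.json"
    params_file.write_text(json.dumps(best_params))
    reloaded = json.loads(params_file.read_text())
```

Storing the parameters separately from the model means a later retraining run can skip the search and start from the saved values.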
Step Five: Training the Model
Now it’s time to finally train the model with the best parameters and a complete dataset. The more complex the model and the bigger the dataset, the more time and resources this process consumes. To keep the workstation responsive while the model trains, the (Python) code for this kind of application has to be robust and thread-safe. Luckily there is no need to repeat this step very often. Once we are content with the model, we save it to disk for later use. We describe this process in detail for the LSTM model in this post.
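One common way to keep the main program responsive is to run the long job in a worker thread and report progress through a thread-safe queue. The sketch below shows that pattern only; whether the training post uses it is an assumption, and the `train` function is a placeholder that sleeps instead of fitting a model.

```python
import queue
import threading
import time


def train(epochs: int, progress: queue.Queue) -> None:
    """Placeholder training loop; a real one would fit the model here."""
    for epoch in range(1, epochs + 1):
        time.sleep(0.01)     # stands in for one epoch of work
        progress.put(epoch)  # queue.Queue is thread-safe, no extra locking needed
    progress.put(None)       # sentinel: training has finished


progress: queue.Queue = queue.Queue()
worker = threading.Thread(target=train, args=(3, progress), daemon=True)
worker.start()

# The main thread is free for other work; here it only collects progress.
finished_epochs = []
while (item := progress.get()) is not None:
    finished_epochs.append(item)
worker.join()
```

Because the two threads share nothing but the queue, there is no shared state to protect with locks.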
Step Six: Get Current Data
Once we have a trained and saved model, we can use it repeatedly for a certain period, as long as market circumstances do not change significantly, to predict short-term price developments from a limited set of the most recent data. We use a module that quickly downloads the most current data and makes it available as a standardized pandas DataFrame, in the form the model we want to reuse expects. That is the purpose of the GetCurrent class we describe in this post.
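The standardizing step can be sketched as follows. This is not the GetCurrent class: the download is left out, the helper names `to_record` and `latest_window` are hypothetical, and the position of each value in a raw candle is an assumption about the downloaded format (timestamp in milliseconds first, prices and volume as strings, trade count at index 8). The sample values are made up.

```python
from datetime import datetime, timezone


def to_record(raw: list) -> dict:
    """Turn one raw candle into a record with named, typed fields.

    The index of each value in `raw` is an assumption made for this sketch.
    """
    return {
        "open_time": datetime.fromtimestamp(raw[0] / 1000, tz=timezone.utc),
        "open": float(raw[1]),
        "high": float(raw[2]),
        "low": float(raw[3]),
        "close": float(raw[4]),
        "volume": float(raw[5]),
        "trades": int(raw[8]),
    }


def latest_window(raw_candles: list[list], size: int = 500) -> list[dict]:
    """Return the `size` most recent candles, oldest first."""
    records = [to_record(raw) for raw in raw_candles]
    records.sort(key=lambda record: record["open_time"])
    return records[-size:]


# Two made-up hourly candles, deliberately out of order.
later = [1704070800000, "105.0", "112.0", "101.0", "108.0", "9.5", 1704074399999, "0", 31]
earlier = [1704067200000, "100.0", "110.0", "90.0", "105.0", "12.5", 1704070799999, "0", 42]

window = latest_window([later, earlier], size=500)
newest_only = latest_window([later, earlier], size=1)
```

Whatever the source returns, the model always sees the same field names, types and ordering, which is what makes a saved model reusable.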
Completing the Data-Driven Workflow: Using the Model
With the data for the past 500 hours of a certain asset and a saved model previously trained on this or a closely related asset, we predict the price development of this asset for the coming hours. This completes our data-driven workflow. Compared to the original training of the model, forecasting with a limited dataset is very quick: a matter of a few minutes instead of (sometimes several) hours! For the details see this post.
This completes the overview of the data-driven way of working with stock ticker data. We describe every step in detail, with fully working code, in the posts mentioned above.