ARIMA – SARIMAX models: Optimizing for Training

ARIMA Family

Time series forecasting with ARIMA and SARIMAX provides powerful tools for predicting future price values from historical data patterns. Among the collection of statistical techniques, ARIMA (Autoregressive Integrated Moving Average) stands out as a fundamental model for capturing temporal dependencies and trends in time series data.

ARIMA and SARIMAX (Seasonal ARIMA with eXogenous variables) both belong to the statistical modeling family. These are built on the following assumptions.

Stationarity is required (or achieved via differencing). Relationships are linear and parametric (i.e., specified via lags by which older values are used to explain the newer). SARIMAX can incorporate seasonality and external regressors (e.g., volume, technical and sentiment indicators as additional predictors).

The models have the following characteristics, which explain their names:

  • AutoRegressive: past values influence the current value.
  • Integrated: differencing is applied to make the data stationary.
  • Moving Average: past forecast errors (residuals) are used to improve the next forecast.
  • SARIMAX adds Seasonal components and eXogenous variables.

The specific strengths of these models are transparency & interpretability. They excel in short-term forecasting when patterns are stable and work well when linear trends, seasonality, or cyclic behavior dominate.

ARIMA – SARIMAX models and Differencing

An ARIMA model is denoted ARIMA(p, d, q), where:

  • p = number of AR terms (lags of the series)
  • d = number of differencing passes needed to make the series stationary
  • q = number of MA terms (lags of the forecast errors)
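
Written out, the three orders combine as follows. This is the standard textbook form in backshift notation; the symbols B (backshift operator, B y_t = y_{t-1}), φ_i, θ_j, c and ε_t are notation introduced here for illustration and do not appear in the code further down.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right) (1 - B)^{d}\, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \varepsilon_t
```

The left factor with the φ_i is the AR part, (1 - B)^d applies d rounds of differencing, and the right factor with the θ_j is the MA part acting on the forecast errors ε_t.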

A stationary time series has a constant mean, constant variance, no seasonality and an autocovariance that doesn’t change over time. Most financial time series (stock prices, crypto), however, are not stationary: they have trends, they exhibit volatility clustering, and their mean and variance may shift over time.

Differencing (d) removes trends and stabilizes the mean of the time series. First-order differencing removes a linear trend; second-order differencing (rarely needed) removes a quadratic trend. We determine the need for differencing by running the Augmented Dickey-Fuller (ADF) test on our time series data. Among other information, this test delivers a p-value: if the p-value is greater than 0.05, we cannot reject the hypothesis of a unit root, so we treat the data as non-stationary and try differencing.
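
The effect of differencing on a trend can be checked by hand. The sketch below is a minimal illustration on made-up numbers (the `linear` and `quadratic` series are hypothetical, not market data): one pass flattens a linear trend, two passes flatten a quadratic one.

```python
def difference(values, interval=1):
    """Return values[i] - values[i - interval] for every valid i."""
    return [values[i] - values[i - interval] for i in range(interval, len(values))]

# Hypothetical toy series, chosen only to make the trend obvious
linear = [3 + 2 * t for t in range(6)]        # 3, 5, 7, 9, 11, 13
quadratic = [t * t for t in range(6)]         # 0, 1, 4, 9, 16, 25

first = difference(linear)                    # one pass removes the linear trend
second = difference(difference(quadratic))    # two passes remove the quadratic trend

print(first)   # constant series
print(second)  # constant series
```

Note that each pass shortens the series by one observation, which is why the positions of the train and test sets have to be adjusted after differencing.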

Differencing is a way to reveal the essence of temporal dependencies by stripping away the non-informative trends. It’s a subtle, mechanical form of data regularization — not to eliminate information, but to better isolate predictable dynamics from the backdrop of market drift or external noise.

After differencing the time series data, we determine the remaining ARIMA(p, d, q) orders using grid search. Grid search is a form of hyperparameter optimization: the model configuration is tuned before the final model is trained, just as we do for the LSTM models.

Grid Search is essential for identifying the best-performing configuration before final model training. You define a grid (a set of combinations) of possible hyperparameter values. The algorithm trains and evaluates a model on each combination. It selects the configuration that yields the best performance (based on a scoring metric like AIC, BIC, MSE, etc.).

For ARIMA, the parameters are:

  • p: lag order (AR part)
  • d: degree of differencing (I part)
  • q: order of moving average (MA part)

A basic grid might look like:

p = [0, 1, 2]

d = [0, 1]

q = [0, 1, 2]

Total combinations = 3 × 2 × 3 = 18 models to evaluate.

The Grid Search Algorithm works as follows:  

  • Define the search space (e.g., parameter ranges).
  • Loop through every combination.
  • Train and evaluate the model for each.
  • Store and compare performance metrics.
  • Select the best combination.
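
The steps above can be sketched in a few lines. This is a generic illustration, not the class shown later: `grid_search` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for a real criterion such as the AIC of a fitted model.

```python
from itertools import product

def grid_search(score, p_values, d_values, q_values):
    """Score every (p, d, q) combination and return the best one (lowest score)."""
    results = []
    for order in product(p_values, d_values, q_values):
        try:
            results.append((order, score(order)))
        except ValueError:
            continue                      # skip combinations that cannot be fitted
    best_order, best_score = min(results, key=lambda item: item[1])
    return best_order, best_score, results

def toy_score(order):
    """Hypothetical stand-in for AIC: lowest at (1, 1, 1), with a small penalty per term."""
    p, d, q = order
    return (p - 1) ** 2 + (d - 1) ** 2 + (q - 1) ** 2 + 0.1 * (p + q)

best_order, best_score, results = grid_search(toy_score, [0, 1, 2], [0, 1], [0, 1, 2])
print(len(results), best_order)
```

In a real run, `score` would fit the model for the given order and return its AIC, so the lowest value wins.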

Fitting the ARIMA – SARIMAX Models & Making the Forecast

Once you’ve prepared the data through differencing and identified the optimal ARIMA(p,d,q) parameters via Grid Search, you’re ready for the two most critical steps in time series modeling: fitting the ARIMA model and using it to forecast future values. The model doesn’t just extrapolate a trend; it uses the learned structure of past data (autocorrelation and error propagation) to project forward. Note: if the series was differenced before fitting, the predictions are in differenced units and must be converted back to the original scale by reversing the differencing.
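
Reversing first-order differencing is a running sum that starts from the last known value on the original scale. A minimal sketch on made-up numbers (the function here is a simplified, standalone variant of the idea and the prices are hypothetical):

```python
from itertools import accumulate

def reverse_differencing(diff_forecast, base_value):
    """Rebuild level values from first differences, starting after base_value."""
    # accumulate(..., initial=base_value) yields base_value first, so drop it
    return list(accumulate(diff_forecast, initial=base_value))[1:]

prices = [100, 102, 101, 105]                       # hypothetical original series
diffs = [b - a for a, b in zip(prices, prices[1:])]  # 2, -1, 4

restored = reverse_differencing(diffs, base_value=prices[0])
print(restored)
```

The choice of `base_value` matters: it has to be the last actual observation before the first predicted step, otherwise the whole reconstructed path is shifted.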

SARIMAX

In our example provided here, the TrainSarimax class, we repeat these steps starting with the Grid search to see if we can improve on the ARIMA result using a SARIMAX model. This makes sense especially if the time series exhibits seasonality and we have access to exogenous variables (like technical and sentiment features).

SARIMAX has two sets of parameters:

Non-seasonal:

  • p (AR terms)
  • d (Differencing)
  • q (MA terms)

Seasonal:

  • P, D, Q, s (seasonal AR, I, MA, and period)

So the model is:

SARIMAX(p, d, q)(P, D, Q, s)

The Grid Search therefore has to cover a larger and more structured parameter space.
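
To get a feeling for how quickly the search space grows, the sketch below builds the combined grid of non-seasonal and seasonal orders. The value ranges and the fixed `d`, `D` and `s` are assumptions chosen for illustration only.

```python
from itertools import product

# Assumed candidate values, for illustration only
p_values, q_values = [0, 1, 2], [0, 1, 2]
P_values, Q_values = [0, 1], [0, 1]
d, D, s = 1, 1, 24        # kept fixed here; s=24 would suit hourly data with a daily cycle

# Each entry pairs an (p, d, q) order with a (P, D, Q, s) seasonal order
grid = [((p, d, q), (P, D, Q, s))
        for p, q, P, Q in product(p_values, q_values, P_values, Q_values)]

print(len(grid))          # 3 * 3 * 2 * 2 combinations
print(grid[0])
```

Every extra candidate value multiplies the number of models to fit, so it pays to narrow the ranges first, for example with the ACF and PACF.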

Explanatory variables like volume, sentiment, or technical indicators should be aligned and preprocessed (normalized, no missing values).

 model = SARIMAX(y,exog=X,order=(p,d,q),seasonal_order=(P,D,Q,s))
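
The exogenous matrix has to share the target's index and must not contain gaps. A minimal sketch of that alignment step, assuming a slower-moving indicator that is forward-filled onto an hourly price index; all values and the `vix` column name are made up for illustration:

```python
import pandas as pd

# Hypothetical hourly closing prices
idx = pd.date_range("2025-01-01 00:00", periods=6, freq=pd.Timedelta(hours=1))
close = pd.Series([100.0, 101.0, 103.0, 102.0, 104.0, 105.0], index=idx, name="close")

# Hypothetical indicator that is only observed every three hours
vix = pd.Series(
    [20.0, 22.0],
    index=pd.DatetimeIndex(["2025-01-01 00:00", "2025-01-01 03:00"]),
    name="vix",
)

# Align on the price index; each hour takes the latest known indicator value
exog = vix.reindex(close.index, method="ffill").to_frame()

# Normalize to zero mean and unit variance (population standard deviation)
exog_scaled = (exog - exog.mean()) / exog.std(ddof=0)

print(exog_scaled)
```

In practice the mean and standard deviation would normally be estimated on the training portion only, so that no information from the test period leaks into the features.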

Where ARIMA encodes memory via lags, SARIMAX extends it with rhythmicity (seasonal cycles) and awareness (external context). It’s still explainable, still fast, and often surprisingly effective in structured financial environments (like intraday crypto cycles).

The TrainSarimax Class for ARIMA – SARIMAX models

This Python class is used with the PyQt6 GUI framework. Because model training can be resource-intensive and time-consuming, it runs inside a separate worker thread that communicates with the Main Window using signals. Of course, it can be used standalone just as well.

Importing Libraries for the Statistical Models

To handle the time series data we use pandas and numpy, as always. Plotly is employed so we can plot in the browser. Several utilities are used for file I/O, datetime manipulation, etc. Most important are the scipy, sklearn and statsmodels libraries, which provide the core functions needed for the ARIMA – SARIMAX models.

Two separate modules are loaded to perform specific tasks: get_sentiment_data, which loads three different indices indicating the general market sentiment, and make_features, which assembles a vector of explanatory factors to be used in predicting the target values.

# Copyright (c) 2025 Hans De Weme
# Licensed under the MIT License (https://opensource.org/licenses/MIT).
# Class TrainSarimax
# Purpose: find the best hyperparameters for (S)ARIMA(X) time series models and compare predictions
# on previously collected and preprocessed historical data
"""
Imports necessary libraries such as NumPy, Pandas, ARIMA
Loads preprocessed data set
Optionally prepares additional predictors / features 
Divides the data into training and testing sets 
Tests the data for stationarity and if needed performs differencing
Performs a Grid Search for the best ARIMA parameters
Fits the ARIMA model to make future price predictions and plots these
Performs a Grid Search for the best SARIMAX parameters
Fits the SARIMAX model to make future price predictions and plots these compared to the ARIMA predictions
"""
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.io as pio
from   plotly.subplots        import make_subplots
import plotly.express as px
from   pathlib                import Path
from   itertools              import product
from   datetime               import datetime
from   pandas.tseries.offsets import DateOffset
import joblib
import scipy.stats as stats
from   sklearn.metrics        import mean_absolute_error, mean_squared_error, r2_score
import statsmodels.api as sm
from   statsmodels.tsa.stattools          import adfuller
from   statsmodels.tsa.statespace.sarimax import SARIMAX
from   PyQt6.QtCore           import QThread, pyqtSignal, pyqtSlot
from   make_features          import MakeFeatures
from   get_sentiment_data     import getSentiment
import warnings
warnings.filterwarnings('ignore')

Initiating The Class

After defining the signals for communicating progress to the Main Window, the TrainSarimax class initializes the settings used throughout the class, such as a reference to the calling ‘parent’, a flag indicating whether to save the final model to disk and a flag indicating whether features including external indicators should be engineered. Other settings are the same ones we use with the LSTM model classes, including the name of the financial asset whose price we want to predict, the target to predict (the closing price per data point), the number of data points to predict and the relative sizes of the training and test data sets.

Things start to get interesting when we call in the getSentiment module to collect the Fear & Greed Index, the S&P 500 Index and the CBOE Volatility Index. The MakeFeatures module is then called to add these indices to the internal momentum indicators of the time series data, such as the Relative Strength Index (RSI), the Stochastic Oscillator (SO), the Rate of Change (RoC), the Williams %R indicator, several moving average indicators, Bollinger Bands (BB), the Keltner Channel and the Schaff Trend Cycle (STC) indicator. The correlation of each feature with the target, the closing price of the asset (per hour), is then plotted to show its relative strength as a predictor.
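
As an example of such an internal indicator, the Rate of Change compares each closing price with the one a fixed number of periods earlier. The MakeFeatures module itself is not shown in this article, so the function below is a hypothetical sketch of the standard formula, not its actual implementation.

```python
def rate_of_change(closes, period):
    """Percentage change versus the close `period` steps earlier (None until enough history)."""
    roc = [None] * period
    for i in range(period, len(closes)):
        previous = closes[i - period]
        roc.append((closes[i] - previous) / previous * 100.0)
    return roc

closes = [100.0, 110.0, 99.0]          # hypothetical hourly closes
roc = rate_of_change(closes, period=1)
print(roc)
```

The leading `None` entries are why feature engineering usually shortens the usable data set: rows without a complete feature vector have to be dropped or filled.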

For further processing, two DataFrames with identical length and the same datetime index are prepared: one with the engineered features, the other with the target to be predicted. The time series data is then split chronologically into training and testing data sets.
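
For time series the split must keep the chronological order: the most recent part becomes the test set and nothing is shuffled. A minimal sketch of that split with a 20 % test share; `chronological_split` is a hypothetical helper name used only here.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split an ordered sequence into (train, test) without shuffling."""
    train_size = int(len(rows) * (1 - test_fraction))
    return rows[:train_size], rows[train_size:]

observations = list(range(10))          # stand-in for ten consecutive data points
train, test = chronological_split(observations)
print(train, test)
```

The same `train_size` has to be applied to the feature DataFrame, so that every row of exogenous values still lines up with its target.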

class TrainSarimax(QThread):       
    progress_signal     = pyqtSignal(str)                               # Signal to communicate progress (string message) back to the main thread        
    request_save_signal = pyqtSignal(str)                               # Signal to request saving the model
    response_signal     = pyqtSignal(bool)                              # Signal to receive the user's response        
    
    def __init__(self, asset, data, set, strip, parent=None): 
        super().__init__()                                              # necessary for QObject, needed for pyqtSignal  
        self.parent = parent
        self.save_model_flag = None                                     # Variable to store the user's response (Yes/No)
        self.suc = False
        self.settings = set
        self.df  = pd.DataFrame(data)
        self.exo = None
        if self.df.empty:
            print('* * * Time Series Data missing  * * * ')
            return                                                     # __init__ must return None; self.suc stays False
        self.MARKT  = asset                                            # USDT spot markt coin-pair or stock ticker to process 
        self.EXO = False                                               # If strip == False, columns needed for feature engineering are included,
        if strip == False:                                             # so we need to prepare the features like RSI, SO, STC 
            self.EXO = True                                            # with SARIMAX in this case we include exogenous sentiment factors as well
        self.N_INPUT       = 12                                        # number of new datapoints to predict
        self.N_TARGETS     = 1                                         # only 1 target to predict: future (close) price 
        self.TRAIN_SPLIT   = 0.2                                       # size of test data set apart from train data
        self.DIFF          = False                                     # differencing needed?  initially set to not

        # load extra sentiment data to use as exogenous features next to internal explanatory variables like RSI
        if self.EXO == True:
            self.progress_signal.emit('Getting Fear & Greed Sentiment Data')
            self.getdata= getSentiment()
            self.fag = True
            fg = self.getdata.get_fagi_data()
            if fg is None or fg.empty:
                self.progress_signal.emit('Failed to get Fear & Greed Sentiment Data')
                self.fag = False
            else:
                fg = fg.reindex(self.df.index, method='ffill')              # add fear & greed index to features
                self.df = self.df.join(fg)      
                self.df.drop(columns=['label'], inplace=True)               # the text label is not needed as a feature
            self.progress_signal.emit('Getting Stock Index Sentiment Data') 
            index = '^GSPC'                                                 # '^GSPC' S&P 500 index 
            colnm = 'gspc'
            self.gspc = True
            sp = self.getdata.get_indices_data(index, colnm)                # get S&P500 index
            if sp is None or sp.empty:
                self.progress_signal.emit('Failed to get S&P500 Sentiment Data')
                self.gspc = False
            else:           
                sp = sp.reindex(self.df.index, method='ffill')              # add GSPC to features
                self.df = self.df.join(sp, how='left')
                self.df['gspc'] = self.df['gspc'].ffill()
            index = '^VIX'                                                  # '^VIX'  CBOE Volatility Index 
            colnm = 'vix'
            self.vix = True
            sp = self.getdata.get_indices_data(index, colnm)                # get volatility index 
            if sp is None or sp.empty:
                self.progress_signal.emit('Failed to get CBOE Volatility Index Sentiment Data')
                self.vix = False
            else:             
                sp = sp.reindex(self.df.index, method='ffill')              # add VIX to features                                
                self.df = self.df.join(sp, how='left')
                self.df['vix'] = self.df['vix'].ffill()      
            if self.fag == False and self.gspc == False and self.vix == False:
                self.EXO = False
                self.progress_signal.emit('No Additional Features Available for Exogenous Factor with SARIMAX')
            else:
                print(self.df)
                self.progress_signal.emit('Creating Features for Exogenous Factor with SARIMAX')    
                mf = MakeFeatures(self.df)
                dfplus = mf.do_make_features()
                corr_matrix = dfplus.corr()['close'].drop('close')
                fig = px.bar(x=corr_matrix.index, y=corr_matrix.values, labels={'x': 'Features', 'y': 'Correlations'}, title='corr', color=corr_matrix.values, color_continuous_scale='Viridis')
                pio.show(fig)
                min_length = min(len(self.df), len(dfplus))                 # Ensure both arrays have the same length
                self.df  = self.df[:min_length]
                dfplus   = dfplus[:min_length]                                
                self.df  = dfplus[['close']]
                self.exo = dfplus
                print('Raw dataset with features:')
                print(self.exo.tail())   
        
        # Split data into train and test
        self.train_size = int(len(self.df) * (1-self.TRAIN_SPLIT))
        self.train, self.test = self.df[0:self.train_size], self.df[self.train_size:len(self.df)]
        if self.EXO == True:
            self.xtrain, self.xtest = self.exo[0:self.train_size], self.exo[self.train_size:len(self.df)]
            self.xtrain_size = int(len(self.exo) * (1-self.TRAIN_SPLIT))
        self.endp   = self.test.tail(1).index.item()
        self.startp = self.test.head(1).index.item()
        print('* * * enddt   testset:  '+str(self.endp))
        print('* * * startdt testset:  '+str(self.startp))
        print('* * * length  testset:  '+str(len(self.test)))
        print('* * * length  trainset: '+str(len(self.train)))
        # save raw data 
        self.train_r = self.train.copy()                               # copies, so later differencing cannot alter the raw data
        self.test_r  = self.test.copy()
        self.suc = True

Running the TrainSarimax Class for ARIMA – SARIMAX models

Testing & Differencing

After initializing the class, the Main Window calls the run() method that orchestrates this worker thread. This one is pretty straightforward: first the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) are computed and plotted. Then the ADF test is run and, if necessary, the data is differenced and the positioning of the test and training sets is adjusted. Suggestions for the parameter ranges to be explored in the grid search are taken from the ACF and PACF values.
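
The rule used to turn ACF or PACF values into a search range can be isolated as follows. This is a simplified restatement of the logic in do_ADF, assuming the same 0.2 threshold: lags are included up to and including the first one whose correlation drops below the threshold. The function name and the sample values are hypothetical.

```python
def suggest_lag_range(corr_values, threshold=0.2):
    """Return candidate lags 0..k, where k is the first lag with correlation below threshold."""
    # corr_values[0] is lag 0 (always 1.0), so the scan starts at lag 1
    for lag, value in enumerate(corr_values[1:], start=1):
        if value < threshold:
            return list(range(0, lag + 1))
    return [0, 1]                         # fallback when no lag falls below the threshold

decaying = [1.0, 0.6, 0.3, 0.1, 0.05]     # hypothetical correlation values
persistent = [1.0, 0.9, 0.8, 0.7]         # never drops below the threshold

print(suggest_lag_range(decaying))
print(suggest_lag_range(persistent))
```

Because the comparison uses the signed value, a negative correlation also counts as "below the threshold"; comparing absolute values would be a stricter alternative.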

[Figure: ACF and PACF plots]
    # main method orchestrating the processing
    def run(self):
        self.do_ADF()
        self.do_GridSearch(False)
        self.do_prediction(model_label="ARIMA", prompt_to_save=True)
        self.do_GridSearch(True)
        self.do_prediction(model_label="SARIMAX", prompt_to_save=True)
        self.plot_prediction_comparison()
        self.do_forecast()        
        
    # sync time series dataframes on the index earliest shared datetime point    
    def sync_dataframes_by_start_date(self, df1, df2):
        latest_start_date = max(df1.index.min(), df2.index.min())       # Find the latest start date between the two dataframes
        df1_filtered = df1[df1.index >= latest_start_date]              # Filter both DataFrames based on the latest start date
        df2_filtered = df2[df2.index >= latest_start_date]
        return df1_filtered, df2_filtered

    # display a message in the GUI and print it on the terminal 
    def do_message(self, the_message):                                  
        self.progress_signal.emit(the_message)
        print(the_message)

    #create timestamp as string
    def time_stamp(self):
        now  = datetime.now() 
        d = now.strftime("%d")
        m = now.strftime("%m")
        j = now.strftime("%Y")
        h = now.strftime("%H")
        n = now.strftime("%M")
        nu = j+m+d+h+n
        return nu

    # Function to test stationarity
    # A stationary time series has a constant mean, variance, and autocovariance over time
    # Augmented Dickey-Fuller (ADF) test, tests if a time series has a unit root, if so: it is not stationary (differencing necessary before ARIMA!)
    # if p-value is low (typically less than 0.05), the time series is stationary
    def test_stationarity(self, timeseries):
        dftest = adfuller(timeseries, autolag='AIC')
        return dftest[1]  # p-value

    def difference(self, dataset, interval=1):                          # 1 for first-order differencing
        diff = list()
        for i in range(interval, len(dataset)):
            value = dataset[i] - dataset[i - interval]
            diff.append(value)
        return diff

    # Reconstruct original values from differenced forecast output.
    def reverse_differencing(self, forecast, base_value, target_index):
        """
        Parameters:
        - forecast: array-like of differenced predicted values
        - base_value: the last actual known value before forecast starts
        - target_index: index (DatetimeIndex) to align the result to
        Returns:
        - pd.Series of inverted forecast values with target_index
        """
        reconstructed = []
        current_value = base_value
        for yhat in forecast:
            current_value += yhat
            reconstructed.append(current_value)
        # Trim if necessary to match target index length
        if len(reconstructed) > len(target_index):
            print("⚠ Forecast longer than index — trimming forecast.")
            reconstructed = reconstructed[:len(target_index)]
        elif len(reconstructed) < len(target_index):
            print("⚠ Forecast shorter than index — truncating index.")
            target_index = target_index[:len(reconstructed)]
        return pd.Series(reconstructed, index=target_index)

    def do_ADF(self):
        from statsmodels.tsa.stattools import acf, pacf
        print("* * * Running ADF test and auto-suggesting ARIMA parameters * * *")
        self.DIFF = False
        self.d = 0
        # ACF & PACF test & plots before differencing
        lag_acf  = sm.tsa.acf(self.df['close'],  nlags=20)
        lag_pacf = sm.tsa.pacf(self.df['close'], nlags=20)
        fig = make_subplots(rows=1, cols=2, subplot_titles=('Autocorrelation Function', 'Partial Autocorrelation Function'))
        fig.add_trace(go.Bar(x=np.arange(len(lag_acf)), y=lag_acf), row=1, col=1)
        fig.add_trace(go.Bar(x=np.arange(len(lag_pacf)), y=lag_pacf), row=1, col=2)
        fig.update_layout(height=600, width=1200, title_text="ACF and PACF Plots (Before Differencing)")
        fig.update_xaxes(title_text="Lag", row=1, col=1)
        fig.update_xaxes(title_text="Lag", row=1, col=2)
        fig.update_yaxes(title_text="Autocorrelation", row=1, col=1)
        fig.update_yaxes(title_text="Partial Autocorrelation", row=1, col=2)
        pio.show(fig)

        # ADF stationarity test
        p_value = self.test_stationarity(self.df['close'])
        print(f"* * * Augmented Dickey-Fuller (ADF) p-value: {p_value:.4f}")
        if p_value < 0.05:
            print("* * * Series is stationary → no differencing needed")
            self.DIFF = False
            self.d = 0
        else:
            print("* * * Series is non-stationary → differencing applied (d=1)")
            self.do_message("Differencing non-stationary dataset, ADF p-value: "+str(p_value))    
            self.d = 1
            self.DIFF = True
            self.df['close'] = self.df['close'].diff().fillna(0)
            if self.EXO:
                print("Testset last rows before and after differencing:")
                print(self.exo.tail())
                if self.gspc: self.exo['gspc'] = self.exo['gspc'].diff().fillna(0)
                if self.fag:  self.exo['fag']  = self.exo['fag'].diff().fillna(0)
                if self.vix:  self.exo['vix']  = self.exo['vix'].diff().fillna(0)
                print(self.exo.tail())

        # Positioning and index logic
        self.train_size = int(len(self.df) * (1 - self.TRAIN_SPLIT))
        self.train, self.test = self.df[0:self.train_size], self.df[self.train_size:]
        if self.EXO:
            self.xtrain, self.xtest = self.exo[0:self.train_size], self.exo[self.train_size:]
            self.xtrain_size = len(self.xtrain)

        self.endp   = self.test.tail(1).index.item()
        self.startp = self.test.head(1).index.item()
        print(f"* * * Start index: {self.startp}, End index: {self.endp}")
        print(f"Type of index: {type(self.df.index)}")

        # Needed for differencing offset in prediction
        if self.DIFF:
            start_pos = max(0, self.train_size - 1)
            end_pos   = len(self.df) - 1
            self.startp = start_pos
            self.endp   = end_pos
            print(f"Updated for differencing: start = {self.startp}, end = {self.endp}")

        # Now suggest p and q based on differenced (stationary) series
        stationary_series = self.df['close']
        pacf_vals = pacf(stationary_series, nlags=20)
        acf_vals  = acf(stationary_series, nlags=20)
        self.p_range = list(range(0, np.where(pacf_vals[1:] < 0.2)[0][0] + 2)) if any(pacf_vals[1:] < 0.2) else [0, 1]
        self.q_range = list(range(0, np.where(acf_vals[1:] < 0.2)[0][0] + 2)) if any(acf_vals[1:] < 0.2) else [0, 1]

        # Check for seasonality
        self.s = None
        for s_candidate in [4, 6, 12, 24]:
            if s_candidate < len(acf_vals) and abs(acf_vals[s_candidate]) > 0.2:
                self.s = s_candidate
                print(f"📈 Detected seasonal lag ≈ {self.s} (ACF = {acf_vals[s_candidate]:.2f})")
                break

        print(f"Suggested p range: {self.p_range}")
        print(f"Suggested q range: {self.q_range}")
        print(f"Using d = {self.d}")
        if self.s:
            print(f"Using seasonal period s = {self.s}")
        else:
            print("No strong seasonality detected.")

Grid Search for the Model Parameters

Using the results of these tests, candidate parameters for the ARIMA and SARIMAX models are tried out, and the combination with the lowest AIC is kept for prediction.

    def do_GridSearch(self, seasonal: bool = False):
        import gc

        model_type = "SARIMAX" if seasonal and getattr(self, 's', 0) else "ARIMA"
        print(f"* * * GRID SEARCH {model_type} PARAMETER TUNING * * *")
        self.do_message(f"Grid search for {model_type} parameters based on ADF/ACF")
        # Fallbacks
        ps = getattr(self, 'p_range', [1, 2])
        qs = getattr(self, 'q_range', [1, 2])
        d =  getattr(self, 'd', 1)
        s =  getattr(self, 's', 0) or 0  # Ensures s is an int, not None
        D = 1 if s > 0 else 0
        Ps = range(0, 2) if s > 0 else [0]
        Qs = range(0, 2) if s > 0 else [0]

        if seasonal and s == 0:
            print("⚠ No significant seasonality detected — switching to non-seasonal ARIMA mode")
            seasonal = False
        if seasonal:
            param_grid = product(ps, qs, Ps, Qs)
        else:
            param_grid = product(ps, qs)

        results = []
        best_aic = float("inf")
        i = 0
        
        if self.EXO:
            # Define which exo features are active
            self.exo_cols = [
                'BB_mid',
                'kc_middle',
                'fag'  if self.fag  else None,
                'gspc' if self.gspc else None,
                'vix'  if self.vix  else None
            ]
            self.exo_cols = [col for col in self.exo_cols if col is not None]
            # Align close prices with EXOG features on time index
            self.aligned_data = self.df.join(self.exo[self.exo_cols], how='inner')
            # Sanity info
            print("EXOG alignment:")
            print("df      index range:", self.df.index.min(), "to", self.df.index.max())
            print("exo     index range:", self.exo.index.min(), "to", self.exo.index.max())
            print("aligned index range:", self.aligned_data.index.min(), "to", self.aligned_data.index.max())
            print("Lengths — df:", len(self.df), "exo:", len(self.exo), "aligned:", len(self.aligned_data))
            # Needed for forecasting later
            self.startp = self.xtrain_size - 1
            self.endp   = len(self.aligned_data) - 1
        
        for param in param_grid:
            i += 1
            order = (param[0], d, param[1])
            seasonal_order = (param[2], D, param[3], s) if seasonal else (0, 0, 0, 0)
            print(f'* * * Grid-search #{i} — order={order}, seasonal_order={seasonal_order}')
            self.do_message(f'* * * Grid-search round #{i}')
            model = None
            try:
                if self.EXO:
                    model = SARIMAX(self.aligned_data['close'], exog=self.aligned_data[self.exo_cols], order=order, seasonal_order=seasonal_order).fit(disp=False)
                else:
                    model = SARIMAX(self.df['close'], order=order, seasonal_order=seasonal_order).fit(disp=False)
                aic = model.aic
                results.append([order + seasonal_order if seasonal else order, aic])
                if aic < best_aic:
                    self.best_model = model
                    best_aic = aic
                    best_param = param
                else:
                    del model
            except (ValueError, np.linalg.LinAlgError, MemoryError) as e:
                print(f"⚠️ Skipping bad combo {param}: {str(e)}")
                continue
            finally:
                gc.collect()

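        if not results:                                                 # every fit failed, so best_param was never set
            raise RuntimeError(f"Grid search found no valid {model_type} model")
        result_table = pd.DataFrame(results, columns=['parameters', 'aic'])
        print(result_table.sort_values(by='aic', ascending=True).head())
        print(f"Best {model_type} params: {best_param}")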
        print(self.best_model.summary())
        print("ADF test on residuals: p = %f" % adfuller(self.best_model.resid[13:])[1])
        self.residuals = self.best_model.resid
        self._plot_diagnostics()

    def _plot_diagnostics(self):
        fig = make_subplots(rows=2, cols=2, subplot_titles=('Standardized Residuals', 'Histogram + Density', 'Normal Q-Q', 'Correlogram'))
        fig.add_trace(go.Scatter(y=self.residuals, mode='lines'), row=1, col=1)
        fig.add_trace(go.Histogram(x=self.residuals, nbinsx=30, name='Histogram'), row=1, col=2)
        fig.add_trace(go.Scatter(
            x=np.linspace(min(self.residuals), max(self.residuals), 100),
            y=stats.gaussian_kde(self.residuals)(np.linspace(min(self.residuals), max(self.residuals), 100)),
            name='Density', mode='lines'), row=1, col=2)
        qq = stats.probplot(self.residuals, dist="norm")
        fig.add_trace(go.Scatter(x=qq[0][0], y=qq[1], mode='markers', name='Q-Q Plot'), row=2, col=1)
        fig.add_trace(go.Scatter(x=qq[0][0], y=qq[0][0], mode='lines', name='Reference Line'), row=2, col=1)
        acf_vals = sm.tsa.acf(self.residuals, nlags=40)
        fig.add_trace(go.Bar(x=np.arange(len(acf_vals)), y=acf_vals), row=2, col=2)
        fig.update_layout(height=1200, width=1500, title_text="Model Diagnostics")
        pio.show(fig)

Predicting

With the parameters established, both trained models predict the test period and the predictions are compared with the actual test set. The user is then asked whether the models should be saved. Finally, a forecast beyond the test period is made.

For the SARIMAX model, next to the internal features, such as trade volume and momentum indicators like RSI, external sentiment indicators, such as the ‘Fear & Greed Index’ or the CBOE Volatility Index, are also gathered and incorporated. With all the features at hand, the correlation of each feature with the target, the closing price per data point, is computed. This information is then used to combine promising predictors into a feature vector that serves as the exogenous factor in the SARIMAX prediction.
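
The predictions are judged with three standard error measures: MAE, RMSE and R². The class uses the sklearn implementations; the sketch below spells the formulas out in plain Python on made-up numbers so that the meaning of each figure is clear.

```python
from math import sqrt

def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R2) for two equally long sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n                 # mean absolute error
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    rmse = sqrt(ss_res / n)                               # root mean squared error
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                            # share of variance explained
    return mae, rmse, r2

mae, rmse, r2 = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(mae, rmse, r2)
```

MAE and RMSE are expressed in price units, so they are only meaningful when actual and predicted values are on the same scale; R² can become negative when the model does worse than simply predicting the mean.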

[Figure: correlation of the features with the closing price]
    # Generalized prediction function for ARIMA or SARIMAX models. Stores results in self.predictions_arima or self.predictions_sarimax.
    def do_prediction(self, model_label="ARIMA", prompt_to_save=True):
        self.do_message(f"* * * Running prediction for {model_label} model * * *")
        if not hasattr(self, 'best_model'):
            raise ValueError("No fitted model found. Please run grid search first.")

        test_index = self.test.index
        # Generate forecast
        if self.EXO:
            forecast = self.best_model.predict(start=self.startp, end=self.endp, exog=self.xtest[self.exo_cols])
        else:
            forecast = self.best_model.predict(start=self.startp, end=self.endp)

        # Inverse differencing
        if self.DIFF:
            self.do_message("Reversing differencing for forecast...")
            last_value = self.train_r['close'].iloc[-1]
            forecast = self.reverse_differencing(forecast, base_value=last_value, target_index=test_index)
            actual_series = self.test_r['close']
        else:
            forecast = pd.Series(forecast, index=test_index)
            actual_series = self.test['close']

        # Create predictions dataframe
        predictions = pd.DataFrame({'dt': test_index, 'actual': actual_series.values, 'predicted': forecast.values})   # actuals on the same scale as the forecast

        # Store to appropriate attribute
        if model_label.upper() == "ARIMA":
            self.predictions_arima = predictions
        elif model_label.upper() == "SARIMAX":
            self.predictions_sarimax = predictions

        # Compute and print metrics
        mae = mean_absolute_error(predictions['actual'], predictions['predicted'])
        rmse = np.sqrt(mean_squared_error(predictions['actual'], predictions['predicted']))
        r2 = r2_score(predictions['actual'], predictions['predicted'])
        print(f"{model_label} Prediction Metrics:")
        print(f"MAE  = {mae:.4f}")
        print(f"RMSE = {rmse:.4f}")
        print(f"R2   = {r2:.4f}")

        # Plotly visualization of train + test + prediction
        trains = self.train_r.tail(len(self.test))                           # shortening for clearer plot
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=trains.index, y=trains['close'], mode='lines', name='Train', line=dict(color='blue')))
        fig.add_trace(go.Scatter(x=actual_series.index, y=actual_series, name="Actual", line=dict(color='green')))
        fig.add_trace(go.Scatter(x=forecast.index, y=forecast, name=f"{model_label} Prediction", line=dict(color='red', dash='dash')))
        fig.update_layout(title=f"{model_label} Prediction vs Actual", xaxis_title="Date", yaxis_title="Price")
        pio.show(fig)

        if prompt_to_save:
            self.save_model_flag = None                                          # Reset so a stale answer from an earlier run is not reused
            self.request_save_signal.emit(model_label)                           # Emit the signal to request model saving
            while self.save_model_flag is None:                                  # Poll every 100 ms until the slot has stored the user's response
                self.msleep(100)
            if self.save_model_flag:
                self.do_message("Saving the model to disk...")
                pad = Path(self.settings['models'])                              # Target directory for saved models
                suffix = '_exo_' if self.EXO else '_'
                filename = f"{self.MARKT}{model_label}{suffix}{self.time_stamp()}.pkl"
                full_path = pad / filename
                joblib.dump(self.best_model, full_path)                          # Save the fitted model to disk
                print(f"✅ {model_label} model saved.")
            else:
                self.do_message("Model save skipped.")
    
    # Overlays ARIMA and SARIMAX predictions along with actual prices.
    def plot_prediction_comparison(self):
        if not hasattr(self, 'predictions_arima') or not hasattr(self, 'predictions_sarimax'):
            raise ValueError("Both ARIMA and SARIMAX predictions must be available.")
        actual_series = self.test_r['close']
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=actual_series.index, y=actual_series, name="Actual", line=dict(color='green')))
        fig.add_trace(go.Scatter(x=self.predictions_arima['dt'], y=self.predictions_arima['predicted'], name="ARIMA", line=dict(color='blue', dash='dot')))
        fig.add_trace(go.Scatter(x=self.predictions_sarimax['dt'], y=self.predictions_sarimax['predicted'], name="SARIMAX", line=dict(color='red', dash='dash')))
        fig.update_layout(title="ARIMA vs SARIMAX Prediction Comparison", xaxis_title="Date", yaxis_title="Price")
        pio.show(fig)

    # Forecasts the next N_INPUT hours, saves the result to CSV and plots it with confidence bounds.
    def do_forecast(self):
        from sklearn.linear_model import LinearRegression
        # Note: get_forecast() continues from the last observation the fitted model has seen.
        if self.EXO:
            # Linearly extrapolate each exogenous variable over the forecast horizon, using its last 6 observations
            future_exog = pd.DataFrame(columns=self.exo_cols)
            for col in self.exo_cols:
                y = self.exo[col].tail(6).values
                X = np.arange(len(y)).reshape(-1, 1)
                model = LinearRegression().fit(X, y)
                X_future = np.arange(len(y), len(y) + self.N_INPUT).reshape(-1, 1)
                future_exog[col] = model.predict(X_future)
            forecast_obj = self.best_model.get_forecast(steps=self.N_INPUT, exog=future_exog)
        else:
            forecast_obj = self.best_model.get_forecast(steps=self.N_INPUT)
        forecast = forecast_obj.predicted_mean
        ci = forecast_obj.conf_int()                                    # Confidence interval for the forecasts
        last_known_timestamp = pd.to_datetime(self.test_r.index[-1])    # Last known timestamp as a datetime
        # One hourly timestamp per forecast step: +1h ... +N_INPUT h (the upper bound of range() is exclusive)
        future_timestamps = [last_known_timestamp + DateOffset(hours=x) for x in range(1, self.N_INPUT + 1)]
        if len(future_timestamps) != len(forecast):
            print("Mismatch between future timestamps and forecast length.")
        if self.DIFF:
            print('* * * Reverse differencing SARIMAX forecasts to original scale')
            last_value = self.test_r['close'].iloc[-1]                  # Last actual price anchors the reversed series
            sarimax_forecasts = self.reverse_differencing(forecast, base_value=last_value, target_index=future_timestamps)
            # Approximation: cumulating the bounds of the differenced series gives only a rough band on the price scale
            ci_lower = self.reverse_differencing(ci.iloc[:, 0], base_value=last_value, target_index=future_timestamps)
            ci_upper = self.reverse_differencing(ci.iloc[:, 1], base_value=last_value, target_index=future_timestamps)
            forecast_df = pd.DataFrame({'dt': future_timestamps, 'Forecast': np.asarray(sarimax_forecasts)})
        else:
            # Bounds are already on the price scale; only re-index them to the future timestamps
            ci_lower = pd.Series(ci.iloc[:, 0].values, index=future_timestamps)
            ci_upper = pd.Series(ci.iloc[:, 1].values, index=future_timestamps)
            forecast_df = pd.DataFrame({'dt': future_timestamps, 'Forecast': forecast.values})
        forecast_df = forecast_df.set_index('dt')                       # Set 'dt' as the index
        print("Final forecast DataFrame:")                              # Print the forecast for verification
        print(forecast_df)
        filename = f'forecast_{self.MARKT}_next_{self.N_INPUT}_hours_{self.time_stamp()}.csv'
        forecast_df.to_csv(filename)                                    # Save the forecast to a CSV file
        print(f"✅ Forecast saved to {filename}")
        df48 = self.test_r.tail(48)                                     # Last 48 hours of historical data for context
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=df48.index, y=df48['close'], mode='lines', name='Historical Data'))
        fig.add_trace(go.Scatter(x=forecast_df.index, y=forecast_df['Forecast'], mode='lines', name=f'Forecast Next {self.N_INPUT} Hours'))
        fig.add_trace(go.Scatter(x=ci_lower.index, y=ci_lower, name='Lower Bound', line=dict(dash='dot', color='gray')))
        fig.add_trace(go.Scatter(x=ci_upper.index, y=ci_upper, name='Upper Bound', line=dict(dash='dot', color='gray')))
        fig.update_layout(title=f'SARIMAX Model Forecast for the Next {self.N_INPUT} Hours', xaxis_title='Date', yaxis_title='Price', legend_title="Legend")
        pio.show(fig)
            
    @pyqtSlot(bool)
    def set_save_model_flag(self, flag):                                # Slot to receive the user's response
        self.save_model_flag = flag
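
The two steps in do_forecast that are easiest to get wrong are building the future timestamps and bringing a differenced forecast back to the price scale. Below is a minimal, standalone sketch of both, assuming hourly data and first-order differencing. The helper names future_hourly_index and undo_first_difference are hypothetical and only illustrate the idea; they are not the reverse_differencing method used by the class above.

```python
import numpy as np
import pandas as pd


def future_hourly_index(last_timestamp, steps):
    """Return `steps` hourly timestamps that follow `last_timestamp`."""
    step = pd.Timedelta(hours=1)
    # Start one hour after the last known observation, so the index
    # has exactly one entry per forecast step.
    return pd.date_range(start=pd.Timestamp(last_timestamp) + step,
                         periods=steps, freq=step)


def undo_first_difference(diffs, base_value, index):
    """Rebuild price levels from first differences (assumes d = 1).

    level_t = base_value + (diff_1 + ... + diff_t)
    """
    levels = base_value + np.cumsum(np.asarray(diffs, dtype=float))
    return pd.Series(levels, index=index)


# Example with made-up numbers: last close of 100.0, three forecast steps.
idx = future_hourly_index("2024-01-01 23:00", steps=3)
levels = undo_first_difference([1.0, -0.5, 2.0], base_value=100.0, index=idx)
print(levels)
```

Using pd.date_range with periods=steps avoids the off-by-one that is easy to introduce with range(), because the number of timestamps is stated directly instead of being derived from an exclusive upper bound.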