Feature Engineering

Feature engineering is the process of transforming raw data into informative inputs for machine learning models. In other words, feature engineering is the process of creating the features of a predictive model. In Machine Learning (ML) this is typically done in a supervised learning setting, in which we use ‘independent variables’ to model a ‘dependent variable’.

The independent variables are also called features, predictors or explanatory variables. In time series, features can be:

  • Lagged variables (past values of the same series or related ones);
  • Derived indicators (technical indicators or transformations);
  • Exogenous variables (sentiment indices, macroeconomic data).
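
As a minimal sketch of the first item, lagged variables can be built with pandas' shift(). The helper name add_lags and the toy prices are illustrative assumptions, not part of the MakeFeatures class discussed later.

```python
import pandas as pd

def add_lags(df, column, lags=(1, 2)):
    """Return a copy of df with one extra column per lag of `column`."""
    out = df.copy()
    for lag in lags:
        # shift(lag) moves past values forward, so row t holds the value from t - lag
        out[f"{column}_lag{lag}"] = out[column].shift(lag)
    return out

prices = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0]})  # toy data
lagged = add_lags(prices, "close", lags=(1, 2))
```

The first rows of each lag column are NaN because no earlier value exists; they are usually dropped before training.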

Though the terms are often used interchangeably, ‘feature’ is more ML-centric, ‘predictor’ is more statistical and ‘explanatory variable’ suggests a causal intuition (though not necessarily proven).

The dependent variable, also called the outcome or target, is what we’re trying to model. The task is either regression, with a continuous target (e.g., future price, return), or classification, with a categorical target (e.g., up/down, bull/bear, volatility regimes). Time series models often convert a continuous future variable (like a return) into discrete classes (e.g., +1 if return > 0, else -1), which turns the forecasting problem into a classification problem.
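
The conversion from a continuous return to a +1/-1 class can be sketched as follows. The function name make_direction_target and the horizon parameter are assumptions made for illustration. Note that the label at time t looks forward on purpose; it is the target, not a feature.

```python
import pandas as pd

def make_direction_target(close, horizon=1):
    """Label time t with +1 if the return from t to t + horizon is positive, else -1."""
    future_return = close.shift(-horizon) / close - 1.0
    target = pd.Series(-1, index=close.index, dtype="int64")
    target[future_return > 0] = 1
    # The last `horizon` rows have no known future price, so they get no label
    return target[future_return.notna()]

close = pd.Series([100.0, 101.0, 100.5, 100.5, 102.0])  # toy prices
labels = make_direction_target(close, horizon=1)
```

A zero return falls in the -1 class here, following the "+1 if return > 0, else -1" rule above.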

Feature Engineering for Financial Time Series

Feature engineering is both an art and a science—particularly in time series where temporal dependencies, autocorrelation, and non-stationarity dominate. A fundamental distinction in the nature of features is that between internal and external variables.

📈 Internal Features (Endogenous)

These are features derived directly from the asset’s historical price or volume:

Momentum-Based

  • RSI (Relative Strength Index) – recent gains vs. losses.
  • Stochastic Oscillator (SO) – compares close to high-low range.
  • Rate of Change (RoC) – percent change over n periods.
  • Williams %R – similar to SO but inverted scale.
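
To make the definitions concrete, here is a rough pandas-only sketch of three of these indicators. The function names and toy data are assumptions for illustration. Library implementations such as pandas_ta or ta may add smoothing or differ in edge-case handling, so the numbers need not match theirs exactly.

```python
import pandas as pd

def rate_of_change(close, n=14):
    """Percent change of the close over n periods."""
    return 100.0 * (close / close.shift(n) - 1.0)

def stochastic_k(high, low, close, n=14):
    """Raw %K: position of the close inside the n-period high-low range (0..100)."""
    highest = high.rolling(n).max()
    lowest = low.rolling(n).min()
    return 100.0 * (close - lowest) / (highest - lowest)

def williams_r(high, low, close, n=14):
    """Williams %R: the same range as %K, measured from the top (-100..0)."""
    highest = high.rolling(n).max()
    lowest = low.rolling(n).min()
    return -100.0 * (highest - close) / (highest - lowest)

high = pd.Series([11.0, 12.0, 13.0, 12.0, 14.0])   # toy data
low = pd.Series([9.0, 10.0, 11.0, 10.0, 12.0])
close = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0])

roc = rate_of_change(close, n=3)
k = stochastic_k(high, low, close, n=3)
r = williams_r(high, low, close, n=3)
```

With these definitions %R equals %K minus 100, which is the "inverted scale" mentioned in the list. Feeding both to a model therefore adds little new information.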

Trend-Following / Smoothing

  • Simple Moving Average (SMA) / Exponential Moving Average (EMA)
  • MACD – difference between a fast and a slow EMA.
  • STC – combines MACD and cycle concepts.

Volatility-Based

  • Bollinger Bands – moving average ± k standard deviations.
  • Keltner Channel – uses ATR instead of standard deviation.
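
The Bollinger Band construction can be sketched in a few lines of pandas. The function name bollinger, the derived width column and the use of the population standard deviation (ddof=0) are assumptions of this sketch.

```python
import pandas as pd

def bollinger(close, window=20, k=2.0):
    """Moving average +/- k rolling standard deviations, plus a relative band width."""
    mid = close.rolling(window).mean()
    std = close.rolling(window).std(ddof=0)   # assumption: population std
    upper = mid + k * std
    lower = mid - k * std
    width = (upper - lower) / mid             # band width relative to the price level
    return pd.DataFrame({"mid": mid, "upper": upper, "lower": lower, "width": width})

closes = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # toy data
bb = bollinger(closes, window=5, k=2.0)
```

Dividing the width by the middle band makes the feature comparable across assets with different price levels.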

These indicators often introduce lags but can be de-noised or combined for richer signals.


🌍 External Features (Exogenous)

Market Sentiment

  • Fear & Greed Index – aggregates several behavioral metrics.
  • VIX (Volatility Index) – expected volatility of S&P500, often a “risk-off” proxy.

Macro/Meso Indicators

  • S&P500 Index – reflects broad market sentiment.
  • Interest Rates, Inflation, or Exchange Rates (macro layer).

These help contextualize the local dynamics of an asset within broader economic/psychological regimes. In this post we discuss some of these indices and show you how to get them.

📊 Feature Engineering Best Practices

Several techniques can be used to enhance the predictive value of features. These include:

⚖️ Normalization

  • Crucial for distance-based models or regularized regression.
  • Scale time series features with rolling z-score, min-max, or robust scaling.
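
A rolling z-score might look like the sketch below. The name rolling_zscore and the window length are illustrative assumptions. The window ends at the current row, so each value is scaled using only data available up to time t.

```python
import pandas as pd

def rolling_zscore(x, window):
    """Scale each value by the mean and standard deviation of its trailing window."""
    mean = x.rolling(window).mean()
    std = x.rolling(window).std()
    # A flat window has zero spread; mark it as missing instead of dividing by zero
    std = std.replace(0.0, float("nan"))
    return (x - mean) / std

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # toy data
z = rolling_zscore(x, window=3)
```

Fitting one global scaler on the whole series would use future statistics, which is why the rolling form is the safer default for time series.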

🔄 Stationarity Checks

  • Differencing or log-transforms help with non-stationary inputs.
  • Always check for autocorrelation and unit roots (ADF/KPSS tests).
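
The effect of log-differencing can be illustrated on a simulated random walk. The simulated prices, the seed and the variable names are assumptions of this sketch. The lag-1 autocorrelation is only a quick diagnostic and does not replace a formal ADF or KPSS test.

```python
import numpy as np
import pandas as pd

# Simulated price path: a random walk in log space (illustrative data only)
rng = np.random.default_rng(0)
log_price = np.log(100.0) + np.cumsum(rng.normal(0.0, 0.01, size=500))
price = pd.Series(np.exp(log_price))

# Log-transform, then first difference: this gives log returns
log_returns = np.log(price).diff().dropna()

# Quick diagnostic: price levels are strongly autocorrelated, log returns are not
price_ac = price.autocorr(lag=1)
return_ac = log_returns.autocorr(lag=1)
```

On real data the returns may still show volatility clustering, so a stationary mean does not imply constant variance.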

🧩 Interaction Features

  • Combine indicators: e.g., RSI * Bollinger Width, or relative ratios between short- and long-term MAs.
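
Both kinds of interaction can be sketched as below. The column names RSI and BB_width, the helper name add_interactions and the window lengths are assumptions; adapt them to the columns your own pipeline produces.

```python
import pandas as pd

def add_interactions(df, short=10, long=50):
    """Add a product feature and a short/long moving-average ratio."""
    out = df.copy()
    # Product: momentum scaled by how wide the volatility bands currently are
    out["rsi_x_bbwidth"] = out["RSI"] * out["BB_width"]
    # Ratio above 1 means the short-term average is above the long-term one
    sma_short = out["close"].rolling(short).mean()
    sma_long = out["close"].rolling(long).mean()
    out["ma_ratio"] = sma_short / sma_long
    return out

toy = pd.DataFrame({
    "close": [1.0, 2.0, 3.0, 4.0],
    "RSI": [50.0, 60.0, 70.0, 80.0],
    "BB_width": [0.1, 0.2, 0.1, 0.5],
})
enriched = add_interactions(toy, short=2, long=4)
```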

🔀 Lookahead Bias & Data Leakage

  • Avoid using future information.
  • All engineered features should be strictly based on information available up to time t.
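
One way to test a feature for lookahead is a truncation check: the value at time t should not change when later rows are removed. The helper is_causal and the two example features below are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def is_causal(feature_fn, series, cut):
    """True if the first `cut` feature values are unchanged when later data is removed."""
    full = feature_fn(series).iloc[:cut].to_numpy()
    truncated = feature_fn(series.iloc[:cut]).to_numpy()
    return bool(np.allclose(full, truncated, equal_nan=True))

def trailing_mean(s):
    return s.rolling(3).mean()                # uses t-2 .. t only

def centered_mean(s):
    return s.rolling(3, center=True).mean()   # uses t+1 as well: leaks the future

prices = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])  # toy data
trailing_ok = is_causal(trailing_mean, prices, cut=4)
centered_ok = is_causal(centered_mean, prices, cut=4)
```

The trailing mean passes the check and the centered mean fails it. The check covers one cut point at a time, so running it for several cuts gives more confidence.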

Our post on ARIMA – SARIMAX Models Optimizing for Training uses some of these methods.


🛠 Feature Engineering Example: Binary Classification of Price Movement

Target: Will the price go up in the next 3 hours?

Feature Vector at time t might include:

  • RSI_14[t], SO_14[t], BB_Width[t]
  • Price[t] / SMA_50[t] – a normalized trend feature
  • SP500_return[t], VIX[t]
  • FearGreedIndex[t]
  • HourOfDay[t], DayOfWeek[t]
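
Part of this feature vector, together with the 3-hour target, can be assembled as in the sketch below. The synthetic hourly prices and the column names are assumptions, and the indicator and external columns are left out for brevity.

```python
import numpy as np
import pandas as pd

# Synthetic hourly closes starting on Monday 2024-01-01 (illustrative data only)
index = pd.date_range("2024-01-01 00:00", periods=60, freq=pd.Timedelta(hours=1))
close = pd.Series(np.linspace(100.0, 110.0, 60), index=index)

features = pd.DataFrame(index=index)
features["price_over_sma50"] = close / close.rolling(50).mean()  # normalized trend feature
features["hour_of_day"] = index.hour
features["day_of_week"] = index.dayofweek                        # Monday = 0

# Target: 1 if the price is higher 3 hours ahead, else 0
future_return = close.shift(-3) / close - 1.0
features["target_up_3h"] = (future_return > 0).astype(int)

# Keep only rows where both the features and the future price are known
features = features[future_return.notna()].dropna()
```

Every feature column uses data up to time t, while the target is the only column that looks forward.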

We feed this into:

  • Logistic Regression
  • Tree-based models (XGBoost, LightGBM)
  • LSTM (with temporal dependencies encoded)

The MakeFeatures Class for Feature Engineering

The boundary between a feature and an assumption can be subtle in time series. When you select certain technical indicators, you are also embedding a hypothesis about how markets behave (e.g., trend-following, mean-reversion). Feature engineering, therefore, is not just a mechanical task—it’s an epistemological act that encodes your beliefs about market structure into a mathematical form.

The MakeFeatures class is a utility module used in technical analysis and predictive modelling. We use it to engineer the internal features derived directly from the asset’s historical price. It is ready for use in a PyQt6 GUI application, but you can just as easily use it in a command-line script or a Jupyter Notebook.

Things start with loading the necessary libraries: pandas, pandas_ta and ta. On initialization, the class checks that the dataframe is not empty.

The class has one public entry point, do_make_features(), which orchestrates the feature engineering and cleans up the resulting dataframe before returning it to the caller.

Class MakeFeatures

# Copyright (c) 2025 Hans De Weme
# Licensed under the MIT License (https://opensource.org/licenses/MIT).
# Class MakeFeatures
# Purpose: Engineering Internal Features derived directly from the asset's historical price 
"""
Imports necessary libraries such as pandas, pandas_ta, ta
Loads preprocessed data set
Calculates predictors / features:
- Momentum-Based
•	RSI (Relative Strength Index) – recent gains vs. losses.
•	Stochastic Oscillator (SO) – compares close to high-low range.
•	Rate of Change (RoC) – percent change over n periods.
•	Williams %R – similar to SO but inverted scale.
- Trend-Following / Smoothing
•	Simple Moving Average (SMA) / Exponential Moving Average (EMA)
•	MACD – differential of two EMAs.
•	Schaff Trend Cycle (STC) – combines MACD and cycle concepts.
- Volatility-Based
•	Bollinger Bands – moving average ± k standard deviations.
•	Keltner Channel – uses ATR instead of standard deviation.
Cleans up the data frame before returning the results
"""
import pandas as pd
import pandas_ta as pta
from   ta.volatility           import BollingerBands
from   ta.trend                import STCIndicator
from   ta                      import momentum
from   PyQt6.QtCore            import QObject
import warnings
warnings.filterwarnings("ignore")

class MakeFeatures(QObject):       
    def __init__(self, data):
        super().__init__()                                              # necessary for QObject, needed for pyqtSignal  
        self.df  = pd.DataFrame(data)
        if self.df.empty:
            print('* * * Time Series Data missing  * * * ')
            return

    def do_make_features(self):
        D = self.df
        print('calculate RSI  over complete dataset')
        D['RSI'] = pta.rsi(close=D['close'], length=14)                                 # calculate RSI (pandas_ta names the lookback 'length')
        print('calculate SO   over complete dataset')
        D[['SO', 'SO3']] = pta.stoch(D['high'], D['low'], D['close'], k=14, d=3, smooth_k=3)
        print('calculate RoC  over complete dataset')
        D['RoC'] = pta.roc(D['close'], length=14)                                       # calculate Rate of Change 
        print('calculate Wil  over complete dataset')
        D['Wil'] = momentum.williams_r(D['high'], D['low'], D['close'], lbp=14)         # calculate Williams %R 
        self.calc_macd()
        self.calc_BB()
        self.calc_STC()
        self.do_kelter()
        self.clean()
        return self.df

    def calc_macd(self):
        print('calculate MACD over complete dataset')
        self.df['20_day_EM']   = self.df['close'].ewm(span=20, adjust=False).mean()
        self.df['50_day_EM']   = self.df['close'].ewm(span=50, adjust=False).mean()
        self.df['MACD']        = self.df['20_day_EM'] - self.df['50_day_EM']
        self.df['Signal_Line'] = self.df['MACD'].ewm(span=7, adjust=False).mean()

    def calc_BB(self):
        print('calculate BB   over complete dataset')
        STD_DEV = 2
        SMA_PERIOD  = 28        # Exchange is open 5 * 4 = 20 days per month, crypto exchanges 7 * 4 = 28
        indicator_bb = BollingerBands(close=self.df['close'], window=SMA_PERIOD, window_dev=STD_DEV)
        self.df['BB_mid']  = indicator_bb.bollinger_mavg()   
        self.df['BB_high'] = indicator_bb.bollinger_hband()
        self.df['BB_low']  = indicator_bb.bollinger_lband()  
        
    # KELTNER CHANNEL CALCULATION
    def get_kc(self, high, low, close, kc_lookback, multiplier, atr_lookback):
        tr1 = pd.DataFrame(high - low)
        tr2 = pd.DataFrame(abs(high - close.shift()))
        tr3 = pd.DataFrame(abs(low - close.shift()))
        frames = [tr1, tr2, tr3]
        tr = pd.concat(frames, axis = 1, join = 'inner').max(axis = 1)
        atr = tr.ewm(alpha = 1/atr_lookback).mean()
        kc_middle = close.ewm(kc_lookback).mean()       # note: the first positional argument of ewm() is com (center of mass), not span
        kc_upper = kc_middle + multiplier * atr
        kc_lower = kc_middle - multiplier * atr
        return kc_middle, kc_upper, kc_lower

    def do_kelter(self):
        print('calculate ATR  over complete dataset')
        self.df['high']  = pd.to_numeric(self.df['high'], errors='coerce')       # Convert columns to numeric to avoid string operations
        self.df['low']   = pd.to_numeric(self.df['low'], errors='coerce')
        self.df['close'] = pd.to_numeric(self.df['close'], errors='coerce')
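        self.df = self.df.dropna(subset=['high', 'low', 'close'])    # dropna() is not in-place: assign the result; drops rows where the conversion failed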
        self.df['kc_middle'], self.df['kc_upper'], self.df['kc_lower'] = self.get_kc(self.df['high'], self.df['low'], self.df['close'], 20, 2, 10)
        
    def calc_STC(self):
        # The Schaff Trend Cycle (STC) indicator identifies market trends and potential buy or sell signals.
        stc_window_slow = 50    # slow window of around 50 periods: a smoother trend, less sensitive to price changes
        stc_window_fast = 23    # fast window of around 23 periods: captures the shorter-term price trends
        stc_cycle = 10          # cycle length (default 10) sets the sensitivity to market trends and cycles: higher values for volatile markets, lower for sideways markets
        indicator_stc = STCIndicator(close=self.df['close'], window_slow=stc_window_slow, window_fast=stc_window_fast, cycle=stc_cycle, smooth1=3, smooth2=3)
        # Add features
        self.df['STC'] = indicator_stc.stc()

    def clean(self):
        # drop columns not needed
        self.df.drop(['20_day_EM', '50_day_EM', 'Signal_Line'], axis = 'columns', inplace = True)
        self.df = self.df.dropna()          # drop NaN values