Current Data for Price Prediction
Stock price prediction is a hot topic, and so is predicting the Bitcoin price for the coming hours or days; a quick look at GeeksforGeeks, Medium or GitHub confirms it. Training machine learning models requires large amounts of data, whether they are advanced LSTM (Long Short-Term Memory) networks, the somewhat simpler but very effective models of the ARIMA (Autoregressive Integrated Moving Average) family, classifiers (Gradient Boosting, Random Forest, Logistic Regression) or regressors (Ridge). Time series of many tens of thousands of data points, covering years of hourly price history of a financial asset such as Ethereum (ETH), Microsoft (MSFT), gold or silver, are the rule rather than the exception. For an introduction to using AI models for Technical Analysis, see this post.
Current Bitcoin or NVIDIA Data for Price Prediction
Once a model has been trained on an asset's price history from past years up to the present day, we can save it and reuse it for a certain period, as long as circumstances do not change significantly, to predict short-term price developments from a limited set of the most recent data. We download current Bitcoin or NVIDIA data, load a model previously trained and saved on this or a related asset, and use it to predict the price development of that asset for the coming hours. Compared with the original training this is very fast: a matter of a few minutes instead of (sometimes several) hours.
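The train-once, reuse-often workflow can be sketched as follows. This is only an illustration of the pattern, not one of the models discussed above: `fit_drift` and `predict` are hypothetical stand-ins for a real LSTM or ARIMA model, and storing the fitted parameters as JSON is an assumption made to keep the example dependency-free.

```python
import json

import pandas as pd


def fit_drift(close: pd.Series) -> dict:
    """'Training' (hypothetical toy model): average hourly price change."""
    return {"drift": float(close.diff().mean())}


def predict(params: dict, recent_close: pd.Series, steps: int) -> list:
    """Extrapolate from the last known price using the stored parameters."""
    last = float(recent_close.iloc[-1])
    return [last + params["drift"] * (i + 1) for i in range(steps)]


# 1) Slow step, done once: fit on the long history and serialize the result.
history = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0])
saved = json.dumps(fit_drift(history))  # in practice this would go to a file

# 2) Fast step, repeated: restore the parameters and predict from current data.
params = json.loads(saved)
recent = pd.Series([110.0, 111.0])
forecast = predict(params, recent, steps=3)
print(forecast)
```

Only step 2 needs fresh data, which is why a small download of recent prices is enough for day-to-day use.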
Complete up-to-date price data
For this we need a module that quickly downloads the most recent data and makes it available as a standardized Pandas DataFrame, in the form expected by the model we want to reuse. That is the purpose of the getCurrent class. It is designed for use in a PyQt GUI application but can easily be used in a standalone Python script.
The getCurrent Class
# Copyright (c) 2024 Hans De Weme
# Licensed under the MIT License (https://opensource.org/licenses/MIT).
# Class getCurrent
# Purpose: collect most recent 500 hourly datapoints for crypto or stock assets,
# trim fields not needed by calling module, offer to manually add most recent value if missing using callback function
# return the obtained time series as a Pandas Dataframe
import requests
import pandas as pd
import numpy as np
import os
from datetime import datetime, timedelta
import yfinance as yf
from PyQt6.QtCore import QObject
from time_handle import handleTime
# class init arguments:
# asset - crypto asset to collect data from Binance / stock ticker to collect from Yahoo Finance
# settings - json object used as Python dictionary
# Binance doc: https://binance-docs.github.io/apidocs/spot/en/#kline-candlestick-data
# All timestamps from Binance's REST API are in UTC (milliseconds since epoch):
'''
[
  1499040000000,      // Open time (UTC in ms)
  "0.01634790",       // Open
  "0.80000000",       // High
  "0.01575800",       // Low
  "0.01577100",       // Close
  "148976.11427815",  // Volume
  1499644799999,      // Close time (UTC in ms)
  ...
]
'''
class getCurrent(QObject):
    def __init__(self, kind, asset, trim, input_callback=None):
        super().__init__()  # necessary for QObject, needed for pyqtSignal (currently not used!)
        self.callback = input_callback
        self.kind = kind
        self.trim = trim  # trim data set down to just 'close' or keep open, high, low, volume, number trades
        self.FREQ = '1h'
        self.asset = asset
        self.df = None
        self.time_handle = handleTime('settings.json')
    def get_data(self, kind):
        if kind == 'C':
            self.MARKT = self.asset + 'USDT'
            suc, data = self.download_data(self.MARKT, self.FREQ)
            if not suc:
                print('* * * No Data obtained for this asset from Binance * * *')
                return suc
        elif kind == 'S':
            suc, data = self.download_stock_data(self.asset, self.FREQ)
            if not suc:
                print('* * * No Data obtained for this asset from Yahoo Finance * * *')
                return suc
        else:
            print('Kind unknown: ' + kind)
            return False
        self.df = pd.DataFrame(data)
        if self.df.empty:
            print('* * * Time Series Data missing * * * ')
            return False
        if self.trim:
            # keep only 'close' price for LSTM and SARIMAX training
            self.df.drop(['open', 'high', 'low', 'volume', 'number_of_trades'], axis='columns', inplace=True)
        return True
    def download_data(self, markt, interval):  # Download Binance most recent data and store in dataframe - last 500 datapoints
        columns = ['open_time', 'open', 'high', 'low', 'close', 'volume', 'close_time', 'quote_asset_volume', 'number_of_trades', 'taker_buy_base_asset_volume', 'taker_buy_quote_asset_volume', 'ignore']
        print(f'Downloading data for {markt}. Interval {interval}.')
        # get most recent hourly data from Binance (the endpoint returns 500 klines by default)
        url = 'https://api.binance.com/api/v3/klines?symbol=' + markt + '&interval=' + interval
        try:
            data = requests.get(url, timeout=30).json()
        except Exception:
            return False, None  # get_data unpacks two values, so always return a pair
        if not isinstance(data, list):  # Binance reports errors as a JSON object, not a list
            return False, None
        suc = True
        df = pd.DataFrame(data)
        # round-trip through a temporary CSV in the current dir; this also turns the string fields into numbers
        current = markt + '-' + datetime.now().strftime('%Y-%m-%d') + '.csv'
        df.to_csv(current, header=columns, index=False)
        print('\n* * * latest data collected from Binance')
        df = pd.read_csv(current)
        os.remove(current)
        df.columns = columns
        df['dt'] = pd.to_datetime(df['open_time'], unit='ms', origin='unix')
        df['dt'] = df['dt'].dt.tz_localize('UTC')  # <- this line is critical for time-zone conversion
        df.drop(['open_time', 'close_time', 'quote_asset_volume', 'taker_buy_base_asset_volume', 'taker_buy_quote_asset_volume', 'ignore'], axis='columns', inplace=True)
        df = df.dropna()  # clean up: remove rows with missing values
        df = df.drop_duplicates()
        df.set_index('dt', inplace=True)  # set index
        df = df.sort_index()
        df = self.time_handle.convert_dataframe_timezone(df, self.time_handle.tzone, original_tz='UTC')  # convert
        # reindex with full hourly range and check for missing hours
        full_range = pd.date_range(start=df.index.min(), end=df.index.max(), freq=pd.Timedelta(hours=1))
        df_full = df.reindex(full_range)
        missing_times = df_full[df_full.isnull().any(axis=1)].index
        if len(missing_times) > 3:
            print(f"\n* * * MISSING DATA AT: {missing_times}")
            print('Number Data Points missing: ' + str(len(missing_times)))
            df = df_full.rename_axis('dt')  # continue with the complete index, otherwise there is nothing to interpolate
            df['close'] = df['close'].interpolate(method='spline', order=3)  # requires SciPy
            print('* * * Missing values filled in with spline order 3 * * * ')
        else:
            print("\n* * * No more than 3 missing data points detected. * * *")
        return suc, df
    def download_stock_data(self, markt, interval):  # Download Yahoo Finance most recent data and store in dataframe - hourly data for the last 100 days
        print(f'Downloading data for {markt}. Interval {interval}.')
        data = yf.download(markt, period='100d', interval=interval)
        data = pd.DataFrame(data)
        if data.empty:
            print("Failed to retrieve current data from Yahoo Finance")
            return False, None
        if isinstance(data.columns, pd.MultiIndex):  # some yfinance versions return (field, ticker) columns
            data.columns = data.columns.get_level_values(0)
        data.index.name = 'Datetime'
        data.drop(['Adj Close'], axis='columns', inplace=True, errors='ignore')  # drop not needed column (absent in some yfinance versions)
        data.rename(columns={'Open': 'open', 'High': 'high', 'Low': 'low', 'Close': 'close', 'Volume': 'volume'}, inplace=True)  # rename columns to generic names
        data['number_of_trades'] = pd.Series(0.0, index=data.index, dtype='float64')
        data.reset_index(inplace=True)  # reset index
        data.rename(columns={'Datetime': 'dt'}, inplace=True)
        data['dt'] = pd.to_datetime(data['dt'], utc=True)  # naive timestamps are localized to UTC, aware ones converted
        data['close'] = data['close'].astype(float)
        data.set_index('dt', inplace=True)
        df = data.dropna()  # clean up
        df = df.drop_duplicates()
        df = self.time_handle.convert_dataframe_timezone(df, self.time_handle.tzone, original_tz='UTC')  # convert
        df = df.sort_index()
        return True, df
getCurrent Overview
The getCurrent class retrieves the most recent hourly data points for cryptocurrency assets (500 from Binance) or stock assets (the last 100 days from Yahoo Finance). It trims unnecessary fields, fills gaps in the Binance data, and accepts a callback function intended for manually adding the most recent value if it is missing. The resulting dataset is made available as a Pandas DataFrame for use with pre-trained ML models.
Key Features
Supports Multiple Asset Types: Fetch data for cryptocurrency assets from Binance or stock assets from Yahoo Finance.
Field Trimming: Retain only essential fields such as the close price, or keep all fields for advanced analysis.
Missing Data Handling: Detects missing hourly records in the Binance data and fills them using spline interpolation.
Callback for User Input: Accepts a function for manual entry of the most recent value when API latency causes a missing record.
Output Format: Provides a cleaned, datetime-indexed Pandas DataFrame that is easy to feed into ML models.
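The missing-data handling listed above boils down to reindexing onto a complete hourly grid and interpolating what is absent. A minimal sketch, assuming time-weighted linear interpolation instead of the cubic spline used in the class (so SciPy is not needed); `fill_hourly_gaps` is a hypothetical helper name:

```python
import pandas as pd


def fill_hourly_gaps(df: pd.DataFrame, column: str = "close"):
    """Reindex to a complete hourly grid and interpolate gaps in one column.

    Returns the gap-free frame and the timestamps that were missing.
    """
    full_range = pd.date_range(df.index.min(), df.index.max(),
                               freq=pd.Timedelta(hours=1))
    full = df.reindex(full_range)
    missing = full.index[full[column].isna()]
    # Assumption: linear-in-time interpolation rather than a spline of order 3.
    full[column] = full[column].interpolate(method="time")
    return full, missing


# Four hourly points with the 02:00 record missing (made-up values).
idx = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00",
                      "2024-01-01 03:00", "2024-01-01 04:00"]).tz_localize("UTC")
prices = pd.DataFrame({"close": [100.0, 101.0, 103.0, 104.0]}, index=idx)

filled, missing = fill_hourly_gaps(prices)
print(missing)
print(filled)
```

The important detail is that interpolation runs on the reindexed frame: the original frame contains no rows for the missing hours, so there would be nothing to fill.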
Functional Details
Initialization
__init__(kind, asset, trim, input_callback=None)
Arguments:
kind (str): Type of asset (‘C’ for cryptocurrency, ‘S’ for stock).
asset (str): Asset identifier (e.g., “BTC” for cryptocurrency, “AAPL” for Apple stock).
trim (bool): Whether to trim the dataset to include only the ‘close’ column.
input_callback (callable, optional): A function to invoke for manual entry of the latest value if missing.
Methods
get_data(kind)
Fetches data for the specified asset type.
Arguments:
- kind (str): Asset type (‘C’ or ‘S’).
Returns:
- suc (bool): Success status of the data retrieval.
The cleaned and indexed time series is not returned; it is stored in self.df (DataFrame).
Process Flow:
- Calls download_data for cryptocurrency or download_stock_data for stocks.
- Handles trimming if self.trim is True.
download_data(markt, interval)
Fetches the most recent 500 hourly data points from Binance.
Arguments:
- markt (str): The market symbol (e.g., “BTCUSDT”).
- interval (str): Data frequency (e.g., “1h”).
Returns:
- suc (bool): Success status.
- df (DataFrame): Processed time series data.
Additional Features:
- Saves data temporarily as a CSV file for processing.
- Detects missing hourly records and interpolates values using spline order 3.
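The class writes the raw response to a temporary CSV and reads it back, which among other things converts Binance's string fields to numbers. The same result can be obtained in memory. The following is a sketch under that assumption, not the author's method; `klines_to_frame` is a hypothetical name and the sample rows are made up in the documented kline layout.

```python
import pandas as pd

COLUMNS = ["open_time", "open", "high", "low", "close", "volume",
           "close_time", "quote_asset_volume", "number_of_trades",
           "taker_buy_base_asset_volume", "taker_buy_quote_asset_volume",
           "ignore"]
KEEP = ["open", "high", "low", "close", "volume", "number_of_trades"]


def klines_to_frame(klines: list) -> pd.DataFrame:
    """Turn a raw kline list into a numeric frame with a UTC datetime index."""
    df = pd.DataFrame(klines, columns=COLUMNS)
    # Prices and volumes arrive as strings; convert them to numbers.
    df[KEEP] = df[KEEP].apply(pd.to_numeric)
    # Open time is given in milliseconds since the epoch, in UTC.
    df["dt"] = pd.to_datetime(df["open_time"], unit="ms", utc=True)
    return df.set_index("dt")[KEEP].sort_index()


# Two hand-written sample rows (values are invented for illustration).
sample = [
    [1704067200000, "42000.0", "42100.0", "41900.0", "42050.0", "12.5",
     1704070799999, "525000.0", 100, "6.0", "252000.0", "0"],
    [1704070800000, "42050.0", "42200.0", "42000.0", "42150.0", "10.0",
     1704074399999, "421000.0", 80, "5.0", "210500.0", "0"],
]
frame = klines_to_frame(sample)
print(frame)
```

Avoiding the temporary file also removes the risk of two concurrent downloads for the same market overwriting each other's CSV.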
download_stock_data(markt, interval)
Fetches the hourly data points of the last 100 days (period='100d') from Yahoo Finance.
Arguments:
- markt (str): Asset ticker symbol.
- interval (str): Data frequency (e.g., “1h”).
Returns:
- suc (bool): Success status.
- df (DataFrame): Processed time series data.
Additional Features:
- Renames columns to standardized names.
- Drops irrelevant fields such as ‘Adj Close’.
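The column standardization can be tried without a network connection on a hand-made frame. This is a sketch with assumptions: `normalize_yahoo` is a hypothetical helper, the input imitates what an hourly download for a single ticker may look like (time-zone-aware index, possibly two-level columns, depending on the yfinance version), and the prices are invented.

```python
import pandas as pd


def normalize_yahoo(data: pd.DataFrame) -> pd.DataFrame:
    """Map a Yahoo-style OHLCV frame onto the lower-case, UTC-indexed layout."""
    data = data.copy()
    # Some yfinance versions return (field, ticker) MultiIndex columns.
    if isinstance(data.columns, pd.MultiIndex):
        data.columns = data.columns.get_level_values(0)
    data = data.drop(columns=["Adj Close"], errors="ignore")
    data = data.rename(columns=str.lower)
    data["number_of_trades"] = 0.0  # Yahoo provides no trade count
    # Localize naive timestamps to UTC, convert aware ones.
    if data.index.tz is None:
        data.index = data.index.tz_localize("UTC")
    else:
        data.index = data.index.tz_convert("UTC")
    data.index.name = "dt"
    return data.sort_index()


# Hand-made frame imitating an hourly download for one ticker.
idx = pd.date_range("2024-01-02 09:30", periods=2, freq=pd.Timedelta(hours=1),
                    tz="America/New_York", name="Datetime")
cols = pd.MultiIndex.from_product([["Open", "High", "Low", "Close", "Volume"],
                                   ["MSFT"]])
raw = pd.DataFrame([[370.0, 371.0, 369.5, 370.5, 1000],
                    [370.5, 372.0, 370.0, 371.5, 1200]], index=idx, columns=cols)
clean = normalize_yahoo(raw)
print(clean)
```

After this step the stock frame has the same columns and index name as the Binance frame, so the calling code does not need to know where the data came from.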
Technical Details
Dependencies
requests: For API calls to Binance.
pandas: For data manipulation and cleaning.
numpy: For numerical operations and handling missing values.
os: For file management.
datetime: For handling timestamps.
yfinance: For retrieving stock data from Yahoo Finance.
PyQt6.QtCore: For integrating with PyQt applications.
Class Attributes
FREQ (str): Frequency of the data, set to “1h”.
df (DataFrame): Holds the fetched and processed data.
Error Handling
API Failures: Prints error messages if data retrieval fails.
Missing Data: Identifies missing hourly records and fills them using spline interpolation.
User Input: input_callback is intended for manually entering the most recent value when it is missing; the class as listed stores it in self.callback but does not invoke it itself.
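How the callback could be applied is not shown in the class listing. One possible approach, purely as an illustration: the hypothetical helper `append_latest` below checks whether the current hour is present and, if not, asks the callback for a value and appends it.

```python
import pandas as pd


def append_latest(df: pd.DataFrame, now: pd.Timestamp, callback=None) -> pd.DataFrame:
    """Append a 'close' value for the current hour if the feed lacks it."""
    current_hour = now.floor("h")
    if callback is None or df.index.max() >= current_hour:
        return df  # already up to date, or nobody to ask
    value = float(callback())
    extra = pd.DataFrame(
        {"close": [value]},
        index=pd.DatetimeIndex([current_hour], name=df.index.name),
    )
    return pd.concat([df, extra])


# Three hourly closes; the 03:00 record has not arrived yet (made-up values).
idx = pd.date_range("2024-01-01 00:00", periods=3, freq=pd.Timedelta(hours=1),
                    tz="UTC", name="dt")
df = pd.DataFrame({"close": [100.0, 101.0, 102.0]}, index=idx)
now = pd.Timestamp("2024-01-01 03:20", tz="UTC")

updated = append_latest(df, now, callback=lambda: 103.5)
print(updated.tail(2))
```

In a GUI application the lambda would be replaced by a function that opens an input dialog; `now` and the DataFrame index must use the same time zone for the comparison to work.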
Output
The class processes data into a Pandas DataFrame with:
Index: Datetime.
Columns: ‘open’, ‘high’, ‘low’, ‘close’, ‘volume’, ‘number_of_trades’ (if not trimmed).
Example Usage
# Example callback function
def user_input_callback():
    return float(input("Enter the most recent value: "))

# Initialize the class
getter = getCurrent(kind='C', asset='BTC', trim=True, input_callback=user_input_callback)

# Fetch data
data_retrieved = getter.get_data(kind='C')
if data_retrieved:
    print(getter.df.head())
else:
    print("Failed to retrieve data.")