Executive Summary
This repository documents the first stage of a broader carry-trade-oriented financial machine learning project. The current implementation focuses on SPY market data and was designed to build, test, and evaluate a complete forecasting pipeline before extending the approach to FX and carry-specific signals.
The project includes:

- data cleaning and exploratory data analysis
- feature engineering and scaling
- baseline and classical time-series benchmarks
- deep learning forecasting with the Temporal Fusion Transformer (TFT)
- classification experiments
- distribution-shift analysis
A key result of the project is that most models showed weak out-of-sample predictive power, consistent with the difficulty of forecasting returns in liquid financial markets. At the same time, the project surfaced important methodological insights, including the greater robustness of return-based features relative to price-level features and the impact of train-test distribution shift on model stability.
One notable positive result was that a simple EMA12-based benchmark outperformed several more complex classical models in the hourly setup. However, the broader set of experiments did not show a robust and repeatable forecasting edge across the full pipeline.
Overall, this repository should be understood as a practical financial forecasting and modeling foundation, developed as the first phase of a larger carry-trade research direction.
Data Availability
The original raw SPY dataset used in this project is not included in the repository because it was obtained from a paid data source and cannot be redistributed publicly.
The project expects the raw input file at:
data/SPY.txt
To reproduce the workflow, users can:

- provide their own SPY intraday dataset in a compatible format,
- obtain equivalent data from a licensed provider, or
- contact me for clarification regarding the expected structure and preprocessing assumptions.
This repository therefore focuses on the research workflow, feature engineering, modeling pipeline, and evaluation framework, while the proprietary raw market data remain excluded for licensing reasons.
This repository contains a practical financial machine learning project focused on building and evaluating an end-to-end forecasting pipeline for financial time series.
The broader long-term idea behind the project is related to carry trade research, but the current implementation uses SPY ETF data as a liquid and structured benchmark dataset to develop and validate the modeling workflow first.
This repository should therefore be understood as:
Phase 1: forecasting pipeline development, model comparison, and robustness analysis on SPY
The project covers:
- data cleaning and exploratory data analysis
- feature engineering and scaling
- benchmark and baseline models
- deep learning forecasting with Temporal Fusion Transformer (TFT)
- classification experiments
- distribution-shift analysis
- and critical evaluation of weak predictive performance in financial markets
The goal of this project was to test whether short- and medium-horizon market returns can be forecast using:
- technical indicators
- volatility features
- lagged returns
- time-based features
- classical time-series methods
- and deep learning models such as the Temporal Fusion Transformer (TFT)
A second objective was to understand why predictive performance is weak in many financial settings, including the roles of:
- noisy return series
- class imbalance
- feature design
- and train-test distribution shift
The long-term research direction is carry-trade-related forecasting and strategy design. However, instead of beginning directly with FX carry-trade data, the current project first develops the forecasting and feature-engineering pipeline on SPY.
This was intentional:
- SPY is highly liquid and easy to structure
- it allows the pipeline to be tested under realistic market conditions
- the methodology can later be transferred to FX pairs, interest-rate differentials, and carry-specific signals
So the current repository is best interpreted as:
A practical modeling foundation for a future carry-trade extension
The project uses SPY intraday market data.
Depending on the notebook, the raw minute-level data are resampled into:
- 15-minute OHLCV candles
- hourly OHLCV candles
- daily OHLCV candles
- monthly OHLCV candles
The datasets contain standard market variables:
- Open
- High
- Low
- Close
- Volume
Target variables are defined as either:
- simple returns
- or log returns
depending on the experiment.
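The two target definitions can be sketched as follows; the price values here are illustrative, not from the repository's dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical close prices; the series is illustrative only.
close = pd.Series([100.0, 101.0, 100.5, 102.0])

simple_ret = close.pct_change()   # (P_t / P_{t-1}) - 1
log_ret = np.log(close).diff()    # ln(P_t) - ln(P_{t-1})

# For small moves the two are nearly identical: log(1 + r) ~ r
```

Log returns are additive across time, which is why they are often preferred for multi-step forecasting targets.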
The notebooks are not all part of one single linear pipeline. There is one main early pipeline and then several later TFT experiments that were rebuilt independently.
Notebook: 1 data_cleaning EDA.ipynb
This is the starting point of the whole project.
What it does:
- loads raw SPY minute-level data
- combines date and time into a datetime index
- filters regular trading hours
- checks data quality and duplicates
- performs exploratory data analysis
- visualizes prices, volume, and correlations
- resamples data
- saves cleaned datasets for later use
This notebook is the base for the later workflow.
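The core cleaning steps above can be sketched with pandas. Column names, the date format, and the trading-hours window are assumptions for illustration, not the repository's exact schema:

```python
import pandas as pd

# Toy minute-level bars; 'date'/'time' columns are assumed names.
raw = pd.DataFrame({
    "date": ["2024-01-02"] * 3,
    "time": ["09:30", "09:31", "16:05"],
    "Open": [470.0, 470.5, 471.0],
    "High": [470.6, 470.9, 471.2],
    "Low": [469.8, 470.2, 470.7],
    "Close": [470.5, 470.4, 471.1],
    "Volume": [1000, 800, 500],
})

# 1) combine date and time into a DatetimeIndex
raw.index = pd.to_datetime(raw["date"] + " " + raw["time"])

# 2) keep regular trading hours and drop duplicate timestamps
clean = raw.between_time("09:30", "16:00")
clean = clean[~clean.index.duplicated(keep="first")]

# 3) resample minute bars into hourly OHLCV candles
hourly = clean.resample("1h").agg(
    {"Open": "first", "High": "max", "Low": "min",
     "Close": "last", "Volume": "sum"}
).dropna()
```

The same `resample(...).agg(...)` pattern produces the 15-minute, daily, and monthly candles by changing the rule string.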
Notebook: 2 Feautures Scaling.ipynb
This notebook builds on the cleaned data from the first notebook.
What it does:
- creates forecasting targets
- engineers technical and time-based features
- removes missing values
- splits the data chronologically into train / validation / test
- scales numerical features
- saves prepared datasets for modeling
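The split-then-scale step is the leakage-sensitive part of this pipeline: the scaler must be fitted on the training window only. A minimal sketch, with synthetic features and assumed split ratios:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix standing in for the engineered features.
rng = np.random.default_rng(0)
df = pd.DataFrame({"feat_1": rng.normal(size=100),
                   "feat_2": rng.normal(size=100)})

# Chronological split (70 / 15 / 15 is an assumption, not the
# repository's exact choice) -- no shuffling for time series.
n = len(df)
train = df.iloc[: int(0.7 * n)]
val = df.iloc[int(0.7 * n): int(0.85 * n)]
test = df.iloc[int(0.85 * n):]

# Fit scaling statistics on the training window only, then apply
# the same transform to validation and test to avoid look-ahead.
scaler = StandardScaler().fit(train)
train_s = scaler.transform(train)
val_s = scaler.transform(val)
test_s = scaler.transform(test)
```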
Important note:
This was the original feature-engineering pipeline. Later in the project, some of its design choices were revised because this setup turned out to be more sensitive to data shift, especially when features were still strongly tied to the raw price level.
Notebook: 3 Base Lines.ipynb
This notebook uses the prepared data from the earlier pipeline and compares simpler and classical benchmark models.
Models compared include:
- Linear Regression
- Random Walk
- EMA12
- MACD Signal
- ARIMA
- SARIMA
- GARCH Mean
Key result:
The strongest benchmark result came from the EMA12-based model, which outperformed the other tested baseline and classical models in the hourly setup.
Performance of EMA12:
- MAE: 0.002434
- RMSE: 0.004085
- R²: 0.2182
- SMAPE: 67.32%
This is an important positive finding in the project: a simple trend-based benchmark captured more useful signal than several more complex classical approaches.
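One plausible reading of an EMA12-style benchmark is sketched below: use the 12-period EMA of close prices as the one-step-ahead price forecast and score the implied return. The price series is synthetic and this is not necessarily the repository's exact model definition; the SMAPE variant is also an assumption:

```python
import numpy as np
import pandas as pd

# Synthetic hourly close prices (geometric random walk).
rng = np.random.default_rng(1)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.004, 500))))

# 12-period exponential moving average of the close
ema12 = close.ewm(span=12, adjust=False).mean()

# Prediction for the return at t, using information up to t-1 only
pred = (ema12 / close - 1).shift(1)
actual = close.pct_change()

mask = pred.notna() & actual.notna()
err = actual[mask] - pred[mask]

mae = err.abs().mean()
rmse = np.sqrt((err ** 2).mean())

# Symmetric MAPE in percent (this exact variant is assumed)
smape = 100 * (err.abs() /
               ((actual[mask].abs() + pred[mask].abs()) / 2)).mean()
```

SMAPE values well above 50% for return targets, as in the table above, are common: the denominator is tiny whenever returns are near zero.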
Notebook: 4 TFT Hour with data shift.ipynb
This notebook still builds on the earlier feature-scaling pipeline.
What it does:
- trains a Temporal Fusion Transformer (TFT) on hourly log returns
- evaluates forecasting performance
- compares train and test feature distributions
- explicitly checks for distribution shift / data shift
This notebook is important because it showed that the earlier feature setup was vulnerable to train-test instability.
Main insight:
- some predictors were still too closely tied to the price level
- train and test distributions differed more strongly
- this likely contributed to weak out-of-sample performance
This notebook is therefore best understood as the transition point in the project: it revealed a methodological issue that motivated the later redesign.
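A simple way to make the train-test comparison above quantitative is a two-sample Kolmogorov-Smirnov test per feature. The features here are synthetic stand-ins, chosen so one of them is deliberately shifted:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic feature samples: the test-period distribution is shifted
# relative to training, mimicking a price-level-tied feature.
rng = np.random.default_rng(2)
train_feat = rng.normal(0.0, 1.0, 2000)
test_feat = rng.normal(0.8, 1.0, 2000)

# KS statistic in [0, 1]; a tiny p-value flags distribution shift
stat, p_value = ks_2samp(train_feat, test_feat)
shifted = p_value < 0.01
```

Running such a check per feature makes it easy to see which predictors (typically the price-level ones) drift between the training and test windows.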
After the TFT data-shift experiment, the later notebooks were rebuilt more independently.
Instead of continuing to rely on the old feature-scaling pipeline, the later notebooks were connected more directly to the cleaned data from the Data Cleaning and EDA notebook and used revised feature engineering.
This means:
- the first four notebooks form the original main pipeline
- the later TFT notebooks are new standalone experiments
- they were designed to reduce the earlier data-shift problem
- they rely more on return-based and more stationary features
These notebooks should be understood as separate follow-up experiments, each with its own full modeling setup.
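The feature redesign behind these follow-up experiments can be illustrated as follows: replace a raw price-level feature (e.g. a moving average) with a return-style ratio that stays roughly stationary as the price level drifts. The indicator choice here is an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic drifting price series (geometric random walk with drift).
rng = np.random.default_rng(3)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0.0002, 0.01, 1000))))

sma20 = close.rolling(20).mean()

price_level_feat = sma20               # drifts with the price level
return_based_feat = close / sma20 - 1  # oscillates around zero

# The ratio feature keeps a similar distribution across regimes,
# which is what reduces the train-test shift problem.
```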
Notebook: TFT Model 15 min.ipynb
Purpose:
- forecast 15-minute SPY log returns
Main result:
- weak predictive performance
- predictions tended to stay close to zero
- no clear strong visual data shift for the target distribution
Main takeaway:
- short-horizon returns are extremely noisy
- weak performance is more likely due to low signal than obvious train-test mismatch
Notebook: TFT Day.ipynb
Purpose:
- forecast daily SPY log returns
Main takeaway:
- daily forecasting pipeline implemented successfully
- but predictive power remained very weak
Notebook: TFT Day log-return.ipynb
Purpose:
- forecast the next 5 daily log returns
Main takeaway:
- similar setup and conclusion to the daily TFT experiments
- more complex modeling still did not produce a robust edge
Notebook: TFT Day return.ipynb
Purpose:
- forecast simple daily returns instead of log returns
Main takeaway:
- useful comparison against the log-return formulation
- overall predictive performance remained weak
Notebook: TFT Day with lag.ipynb
Purpose:
- use lagged daily returns and technical indicators to predict future daily returns
Main takeaway:
- adding lag features did not materially change the conclusion
- signal remained weak
Notebook: TFT Day Classification.ipynb
Purpose:
- reformulate daily prediction as a 3-class classification problem:
- short
- hold
- long
Main takeaway:
- model accuracy looked acceptable at first glance
- but the model largely collapsed to the majority class
- highlights class imbalance and weak directional predictability
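The accuracy trap described above is easy to demonstrate: with mostly "hold" labels, always predicting the majority class already scores high raw accuracy while per-class recall exposes the collapse. The labels and class proportions below are synthetic assumptions:

```python
import numpy as np

# Synthetic imbalanced labels (~70% "hold" is an assumed proportion).
rng = np.random.default_rng(4)
labels = rng.choice(["short", "hold", "long"], size=1000, p=[0.15, 0.7, 0.15])

# A degenerate "model" that always predicts the majority class
majority = max(set(labels), key=lambda c: (labels == c).sum())
preds = np.full_like(labels, majority)

accuracy = (preds == labels).mean()   # looks acceptable at first glance

# Per-class recall reveals the majority-class collapse
recalls = {c: (preds[labels == c] == c).mean()
           for c in ["short", "hold", "long"]}
balanced_acc = np.mean(list(recalls.values()))   # 1/3 for three classes
```

This is why balanced accuracy or per-class recall, not raw accuracy, is the right yardstick for the 3-class setup.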
Notebook: TFT Month.ipynb
Purpose:
- forecast monthly SPY returns
Main takeaway:
- even at a longer horizon, predictive power remained limited
- no strong evidence of a robust forecasting edge
If you want to use the repository in a structured way, the recommended order is:
- 1 data_cleaning EDA.ipynb
- 2 Feautures Scaling.ipynb
- 3 Base Lines.ipynb
- 4 TFT Hour with data shift.ipynb
This sequence shows the original pipeline and the key methodological finding about data shift.
After that, continue with the later standalone notebooks:
- TFT Model 15 min.ipynb
- TFT Day.ipynb
- TFT Day log-return.ipynb
- TFT Day return.ipynb
- TFT Day with lag.ipynb
- TFT Day Classification.ipynb
- TFT Month.ipynb
Across the project, the main findings are:
- forecasting SPY returns is difficult, especially at short horizons
- many models showed weak out-of-sample predictive power
- deep learning models often produced near-zero or negative R²
- simple baselines can outperform more complex models in some settings
- EMA12 was the strongest benchmark result in the hourly setup
- classification introduces additional problems such as class imbalance
- feature design strongly affects robustness
- price-level-based features are more vulnerable to data shift than return-based features
A central insight from the project is that the models did not generate a robust and repeatable forecasting advantage across the full pipeline.
This is consistent with the idea that liquid financial markets are difficult to predict and broadly supports the weak-form Efficient Market Hypothesis in this setup.
At the same time, the project still produced clear methodological value:
- building a realistic financial forecasting workflow
- comparing several model families
- identifying where and why models fail
- improving feature engineering after diagnosing data shift
- and documenting the difference between local positive results and robust predictive edge
This repository does not yet include a full trading strategy backtest.
That was intentional.
The current focus is on:
- data pipeline construction
- forecasting model development
- out-of-sample evaluation
- and robustness analysis
A full trading backtest was not prioritized because most models did not demonstrate a sufficiently robust predictive edge in out-of-sample testing.
In other words:
The main limitation at the current stage appears to be the weakness of the predictive signal itself, not the absence of a backtesting layer.
    .
    ├── data/
    ├── images/
    ├── modeling/
    ├── models/
    ├── notebooks/
    ├── requirements.txt
    ├── requirements_dev.txt
    └── README.md
Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

Install the core dependencies:

pip install -r requirements.txt

Install development and notebook dependencies:

pip install -r requirements_dev.txt

This project uses two dependency files.
requirements.txt contains the core dependencies needed for modeling and analysis.
requirements_dev.txt contains additional development dependencies for:
- Jupyter notebooks
- testing
- formatting
- experiment utilities
Main project dependencies include:
- arch
- lightning
- matplotlib
- numpy
- pandas
- plotly
- pytorch-forecasting
- scikit-learn
- seaborn
- statsmodels
- ta
- torch
- tqdm
Development dependencies include:
- black
- jupyterlab
- mlflow
- parsenvy
- protobuf
- pytest
- testbook
- pipreqs
This repository is useful as a practical example of how a financial forecasting workflow can be built, tested, revised, and critically evaluated.
It shows:
- how to clean and transform market data
- how to engineer finance-specific features
- how to compare simple and complex models
- how to detect and interpret data shift
- how to handle weak results honestly
- and how a research idea can evolve in stages before becoming a strategy project
The current state of the repository is best described as:
Phase 1: practical forecasting pipeline development on SPY
It is not yet a full carry-trade implementation with FX data, rate differentials, and strategy-level portfolio backtesting.
Instead, it provides the methodological foundation for that next step.
Possible next steps include:
- extending the pipeline from SPY to FX pairs
- adding carry-related macro and rate-differential features
- building genuine carry-trade signals
- adding walk-forward validation
- optimizing hyperparameters
- integrating a strategy-level backtesting layer
- evaluating portfolio-level performance
This project was developed as a practical financial machine learning project focused on time-series forecasting, feature engineering, model comparison, robustness analysis, and realistic evaluation of financial prediction models.
The current implementation uses SPY as the first-stage benchmark dataset and forms the modeling foundation for a broader carry-trade-oriented research direction.