Scientific ML

A lightweight end-to-end pipeline to forecast next-day Bitcoin returns and evaluate a simple trading strategy.

This project demonstrates a novel combination of data-driven equation discovery and ensemble machine learning for the challenging problem of financial time-series prediction. We begin by applying Sparse Identification of Nonlinear Dynamics (SINDy) to uncover simple, interpretable ordinary differential equations that describe the day-to-day dynamics of Bitcoin’s open, high, low, close, volume, returns, and technical indicators. The resulting SINDy-derived rate features serve as physics-inspired inputs that capture first-order autoregressive and mean-reverting behavior. Together with classic technical features (moving averages, RSI, MACD, Bollinger Bands, and ATR), they form a feature matrix that reflects both the raw market state and the underlying tendencies of price movement.
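To make the idea of a SINDy-derived rate feature concrete, here is a heavily simplified sketch. It is not the project's implementation: it assumes a single series, a two-term candidate library `[1, x]`, a forward finite difference for the derivative, and a single thresholding pass, whereas a full SINDy fit would use a larger library and iterate the thresholding. The names `fit_rate_model`, `rate_feature`, and `threshold` are hypothetical.

```python
def fit_rate_model(series, dt=1.0, threshold=1e-3):
    """Fit dx/dt ~ a + b*x by least squares, then zero small coefficients.

    A one-pass, two-term stand-in for SINDy's sparse regression.
    `threshold` is a hypothetical sparsity cutoff.
    """
    if len(series) < 3:
        raise ValueError("need at least three observations")
    xs = series[:-1]
    # Forward finite difference approximates the time derivative.
    dxs = [(nxt - cur) / dt for cur, nxt in zip(series[:-1], series[1:])]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_dx = sum(dxs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxd = sum((x - mean_x) * (d - mean_dx) for x, d in zip(xs, dxs))
    # Ordinary least squares for slope b and intercept a.
    b = sxd / sxx if sxx > 0 else 0.0
    a = mean_dx - b * mean_x
    # Sparsity step: drop negligible terms and refit what remains.
    if abs(b) < threshold:
        b = 0.0
        a = mean_dx
    if abs(a) < threshold:
        a = 0.0
        if b != 0.0:
            b = sum(x * d for x, d in zip(xs, dxs)) / sum(x * x for x in xs)
    return a, b


def rate_feature(series, a, b):
    """Evaluate the fitted rate a + b*x at every observation."""
    return [a + b * x for x in series]


# Synthetic mean-reverting series: x moves halfway toward 2.0 each step.
series = [0.0]
for _ in range(20):
    series.append(series[-1] + 0.5 * (2.0 - series[-1]))

a, b = fit_rate_model(series)
features = rate_feature(series, a, b)
```

On this synthetic series the fit recovers `a` close to 1.0 and `b` close to -0.5. A negative `b` is the mean-reverting signature: the fitted rate pulls the series toward `-a / b`, which is 2.0 here.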

Once the features are engineered, we frame next-day log-return prediction as a supervised learning task. Three gradient-boosted learners (XGBoost with pseudo-Huber loss, LightGBM with Huber loss, and CatBoost with MAE loss) are tuned using time-series cross-validation, which keeps every validation fold later in time than its training data and so prevents look-ahead bias while limiting overfitting. Their individual forecasts are then combined by simple averaging to balance bias and variance. Finally, we convert the ensemble’s log-return predictions to simple returns for realistic backtesting, generate long/short trading signals net of transaction costs, and record every trade in a log. The model explains approximately 12% of out-of-sample variance in returns, which indicates some predictive signal, but the backtest highlights the importance of careful risk controls and feature refinement. The pipeline is intended as a transparent framework for further research and enhancement.
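The last steps of the pipeline (averaging, log-to-simple conversion, signal generation with costs, and trade logging) can be sketched in a few lines. This is an illustration under stated assumptions, not the project's backtester: the sign-based long/short rule, the proportional `cost` per unit of position change, and all function names are hypothetical, and each prediction is assumed to be aligned with the return realized over the day it forecasts.

```python
import math


def ensemble_average(model_predictions):
    """Equal-weight average of per-day predictions across models."""
    return [sum(day) / len(day) for day in zip(*model_predictions)]


def log_to_simple(log_return):
    """Convert a log return r to the simple return exp(r) - 1."""
    return math.exp(log_return) - 1.0


def backtest(pred_log_returns, realized_simple_returns, cost=0.001):
    """Long/short backtest with proportional costs (assumed rule).

    Goes long when the predicted simple return is positive, short when
    negative, flat when zero. `cost` is charged per unit of position change.
    """
    position = 0
    equity = 1.0
    trade_log = []
    pairs = zip(pred_log_returns, realized_simple_returns)
    for day, (pred, realized) in enumerate(pairs):
        expected = log_to_simple(pred)
        target = 1 if expected > 0 else -1 if expected < 0 else 0
        turnover = abs(target - position)  # 2 when flipping long <-> short
        fee = cost * turnover
        if turnover:
            trade_log.append(
                {"day": day, "from": position, "to": target, "fee": fee}
            )
        position = target
        equity *= 1.0 + position * realized - fee
    return equity, trade_log


# Hypothetical predictions from three models over three days.
preds = ensemble_average([
    [0.02, 0.01, -0.02],
    [0.01, 0.03, -0.01],
    [0.00, 0.02, 0.00],
])
equity, trade_log = backtest(preds, [0.02, -0.01, -0.03], cost=0.001)
```

In this toy run the strategy opens a long position on day 0, holds it through day 1, and flips short on day 2, so the log contains two entries and the flip pays the cost twice.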

This project combines web scraping with data analysis and has two main parts. In the first part, we extract data from an online book catalog website, gathering book titles, prices, availability, and ratings from multiple pages. We use the Python libraries requests and BeautifulSoup to navigate the site, collect the relevant fields, and save them to a CSV file, so that the data is structured for later analysis. We then perform exploratory data analysis (EDA) to better understand the dataset: inspecting summary statistics, examining the distribution of values, and identifying potential patterns or anomalies.
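As a rough illustration of the hand-off between scraping and EDA, the sketch below writes already-scraped rows to CSV and computes summary statistics with the standard library only. The scraping itself is omitted because the site's markup is not described here. The column names follow the fields listed above, while the sample rows and the helper names `write_books_csv` and `summarize` are hypothetical.

```python
import csv
import io
import statistics

FIELDS = ["title", "price", "availability", "rating"]


def write_books_csv(rows, handle):
    """Write scraped book records (dicts keyed by FIELDS) as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def summarize(rows, column):
    """Basic summary statistics for one numeric column."""
    values = [float(row[column]) for row in rows]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical records standing in for scraped data.
books = [
    {"title": "Book A", "price": 10.0, "availability": "In stock", "rating": 3},
    {"title": "Book B", "price": 20.0, "availability": "In stock", "rating": 5},
    {"title": "Book C", "price": 30.0, "availability": "Out of stock", "rating": 1},
]

buffer = io.StringIO()  # stands in for open("books.csv", "w", newline="")
write_books_csv(books, buffer)

# Read the CSV back, as the analysis step would, and summarize prices.
loaded = list(csv.DictReader(io.StringIO(buffer.getvalue())))
price_summary = summarize(loaded, "price")
```

Values read back from CSV are strings, which is why `summarize` converts each one with `float` before computing statistics.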

In the second part, we use machine learning to cluster the data and to predict prices from book ratings. Specifically, we apply KMeans and DBSCAN clustering to segment the books into groups based on similarity, and we fit a polynomial regression model that predicts a book's price from its rating. We evaluate the clustering models using metrics such as the silhouette score, which measures how well each point fits its assigned cluster relative to the nearest other cluster. The project shows not only how to scrape and analyze data but also how to apply machine learning techniques to make predictions and find hidden patterns in real-world data.
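To show what the silhouette score measures, here is a from-scratch sketch using Euclidean distance and only the standard library. It is meant for intuition on small inputs. In practice a library routine such as scikit-learn's `silhouette_score` would normally be used, and the source does not say which implementation the project relies on. The example points and labels are made up, and DBSCAN noise labels are not treated specially here.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    For each point: a = mean distance to the other members of its cluster,
    b = smallest mean distance to the members of any other cluster,
    s = (b - a) / max(a, b). Points in singleton clusters score 0.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # The sum includes the zero distance to the point itself,
        # so divide by the number of *other* members.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two well-separated groups of hypothetical (price, rating) pairs.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across them
```

Scores near 1 indicate compact, well-separated clusters. Here the labeling that matches the two groups scores about 0.90, while the labeling that cuts across them scores below zero.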

Releases: parsabe/MLMatrix

DS-Challenge dataset

Scraper
