A C++ implementation of the Burrows-Wheeler Transform and its inverse. This project provides tools for transforming files using BWT and recovering them, along with a comprehensive test suite and performance benchmarks.
- C++17 compatible compiler (g++, clang++, etc.)
- Make
- Python 3 (for Python benchmarks and plotting)
-
Navigate to the project root directory.
-
Run
maketo build all executables (main tools, tests, and benchmarks):make
-
All executables and build artifacts will be created in the
build/directory.
makeormake all: Build everything (main tools, tests, benchmarks).make clean: Remove thebuild/directory and all artifacts.make rebuild: Clean and rebuild everything.make test: Build and run the correctness test suite.make performance: Build and run the C++ performance benchmarks.make python_performance: Run the Python performance benchmarks.
The project produces two main executables for file processing.
Applies the BWT transform to an input file and writes the result to an output file.
Usage:
./build/bwt <input_file> <output_file> [block_size]Arguments:
input_file: Path to the input file to transform (required).output_file: Path where the transformed output will be written (required).block_size: Size of each block in bytes (optional, default: 128).
Example:
# Transform a file using 1KB blocks
./build/bwt data/canterbury_corpus/alice29.txt alice29.bwt 1024Reverses the BWT transform on an input file and writes the original data to an output file.
Usage:
./build/inverse_bwt <input_file> <output_file> [block_size]Arguments:
input_file: Path to the BWT-transformed input file (required).output_file: Path where the original (recovered) output will be written (required).block_size: Size of each block in bytes (optional, default: 128).- Note: The block size must match the block size used during the forward transform.
Example:
# Recover the original file
./build/inverse_bwt alice29.bwt recovered.txt 1024The project includes a comprehensive testing infrastructure to verify correctness.
To run the full correctness test suite:
make testThis runs two sets of tests:
- Small Tests (
build/test_small): Unit tests for basic functionality, edge cases, and small strings. - Medium Tests (
build/test_medium): Integration tests that run round-trip conversions (Forward BWT -> Inverse BWT) on real files from the Canterbury Corpus. Verifies that the recovered file is byte-for-byte identical to the original.
You can also run the test executables directly:
./build/test_small
./build/test_mediumTo run C++ performance benchmarks on the default Canterbury Corpus:
make performanceThis runs the benchmark suite which measures:
- Forward BWT execution time
- Inverse BWT execution time
- Total round-trip time
- Throughput (MB/s)
The benchmark tests multiple block sizes (512B, 1KB, 4KB, 16KB, 64KB, 256KB, 1MB) and runs 5 trials per test case for statistical accuracy.
You can also run the benchmark on a custom dataset:
./build/performance <dataset_path> [num_trials]Arguments:
dataset_path: Path to the directory containing test files.num_trials: Number of times to run each test (optional, default: 5).
Example:
# Run on Silesia corpus with 2 trials per file
./build/performance data/silesia 2To compare your BWT implementation's performance with bzip2's BWT:
make bzip2_benchmark BENCH_DIR=<directory>Arguments:
BENCH_DIR: Directory containing test files (required).
Example:
# Compare on Canterbury Corpus
make bzip2_benchmark BENCH_DIR=data/canterbury_corpusThis benchmark automatically tests multiple block sizes (64KB, 128KB, 256KB, 512KB, 1MB) and performs a comprehensive comparison across three phases:
- Forward BWT: Compares forward transform performance
- Inverse BWT: Compares inverse transform performance
- Round Trip: Compares total time for forward + inverse transform
For each phase, the benchmark runs 5 trials per file, compares execution time and throughput, calculates speedup ratios, and displays a detailed comparison report with aggregate summary.
You can also run the comparison tool directly on a single file:
./build/compare_bwt_performance <input_file>This will automatically test forward BWT, inverse BWT, and round trip performance for the specified file across all default block sizes.
The project includes Python scripts in util/ for generating performance plots. Install dependencies with pip install -r util/requirements.txt, then run scripts like plot_performance.py, plot_bzip2_bench.py, etc. to visualize benchmark results.
src/: Source code for the main application and BWT logic.tests/: Source code for test suites and benchmarks.util/: Utility helpers for file I/O, testing, and plotting.data/: Test data files (Canterbury Corpus, Silesia Corpus, etc.).build/: Generated directory containing compiled executables and temporary test output.Makefile: Root build configuration.
This project uses test data from the Canterbury Corpus, a collection of standard benchmark files for evaluating compression algorithms. http://corpus.canterbury.ac.nz/
This project also uses test data from Silesia Corpus, a comprehensive compression benchmark corpus containing diverse file types representative of real-world data. Maintained by the Silesian University of Technology. http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
This project uses bzip2 (included as a submodule) for performance comparison benchmarks. bzip2 is a program and library for lossless, block-sorting data compression. https://gitlab.com/bzip2/bzip2
This project leveraged AI assistance for coding and development.