Caution
Maintenance Notice: The user-friendly interface is currently under maintenance.
5ULTRA is a computational pipeline designed to annotate and score genetic variants located in the 5′ untranslated regions (5′UTRs) of genes. By focusing on upstream open reading frames (uORFs), Kozak sequences, and, optionally, splice sites (via SpliceAI), it provides detailed insights into how 5′UTR variants can affect gene regulation, translation efficiency, and disease pathogenesis.
The 5ULTRA pipeline keeps the variants that lie within a 5′UTR and characterizes their potential impact on translation. Using a machine learning model (random forest), it computes a single score that helps prioritize variants that could alter mRNA splicing or the regulation of the coding sequence (CDS).
Key Highlights:
- Pinpoints uORF changes (e.g., gain/loss of start or stop codons).
- Evaluates Kozak sequence disruptions.
- Integrates optional splicing analysis through SpliceAI.
- Provides a single, interpretable score for rapid variant prioritization.
- Python ≥ 2.1.0
- pip
- Clone the repository
git clone https://github.com/mchaldebas/5ULTRA.git
- Navigate into the 5ULTRA directory
cd 5ULTRA
- Install locally via pip
pip install .
- Test installation
5ULTRA -h
- Download the required data (default location: the hidden directory .5ULTRA/data in your home directory)
Specify --data-dir if you wish to store reference data in a non-default location.
5ULTRA-download-data [--data-dir [path/to/data]]
Note: Ensure you have ~5 GB of available disk space to accommodate all reference data.
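Before downloading, the free-space requirement in the note above can be checked programmatically. The following is a minimal sketch and not part of 5ULTRA: the function name `has_enough_space` and the constant `REQUIRED_BYTES` are hypothetical, and the 5 GB threshold is taken from the approximate figure in the note.

```python
import shutil
from pathlib import Path

# Approximate requirement from the note above (~5 GB); treat it as a rough threshold.
REQUIRED_BYTES = 5 * 1000**3


def has_enough_space(data_dir, required=REQUIRED_BYTES):
    """Return True if the filesystem that would hold data_dir has enough free space."""
    path = Path(data_dir).expanduser()
    # The data directory may not exist yet, so walk up to the nearest existing parent.
    while not path.exists():
        path = path.parent
    return shutil.disk_usage(path).free >= required


# Check the default data location mentioned above.
print(has_enough_space("~/.5ULTRA/data"))
```

If the check returns False, pointing --data-dir at a larger volume is one option.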
Once installed, 5ULTRA can be run directly as a command-line tool:
5ULTRA [-h] -I [input] [-O [output]] [--data-dir [path/to/data]] [--splice] [--full] [--mane]
- -h: Show the help message and exit.
- -I: Path to the input VCF or TSV file containing genetic variants.
- -O [output]: Path for the output TSV file. Defaults to <input_file>.5ULTRA.tsv, or <input_file>.5ULTRA.splice.tsv when --splice is used, if not specified.
- --data-dir [path/to/data]: Path to the data directory. Defaults to ~/.5ULTRA/data if not specified.
- --splice: Enable SpliceAI processing and report the variants that affect splicing (SNVs only).
- --full: Output a more detailed annotation (see Input and Output File Format) and all other columns from the input.
- --mane: Output only the variants affecting MANE transcripts.
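The default output naming described for -O can be sketched as follows. This is an illustration, not 5ULTRA's own code: the function name `default_output_path` is hypothetical, and it assumes that only the final file extension of the input is dropped and that the output is written next to the input file.

```python
from pathlib import Path


def default_output_path(input_path, splice=False):
    """Derive the default output path from the input path (illustrative assumption)."""
    p = Path(input_path)
    suffix = ".5ULTRA.splice.tsv" if splice else ".5ULTRA.tsv"
    # Assumption: only the last extension (e.g. .tsv or .vcf) is removed.
    return p.with_name(p.stem + suffix)


print(default_output_path("tests/test-variants.tsv"))
print(default_output_path("tests/test-variants.tsv", splice=True))
```

Passing -O explicitly avoids relying on this naming convention.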
5ULTRA -I tests/test-variants.tsv
This command reads test-variants.tsv, filters for 5′ UTR variants, annotates them, calculates scores, and writes the output to test-variants.5ULTRA.tsv.
5ULTRA -I tests/test-variants.tsv \
-O tests/fully_annotated_variants.tsv \
--data-dir ~/.5ULTRA/data \
--splice \
--mane \
--full
This command uses a custom data path, analyzes only the SNVs that affect splicing in MANE transcripts, and produces a fully annotated output with additional columns.
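For scripted runs, the same command line can be assembled and launched from Python. This is a minimal sketch that uses only the flags documented above; the helper names `build_5ultra_command` and `run_5ultra` are hypothetical and not part of the package.

```python
import os
import subprocess


def build_5ultra_command(input_path, output_path=None, data_dir=None,
                         splice=False, full=False, mane=False):
    """Build the argument list for the 5ULTRA command-line tool."""
    cmd = ["5ULTRA", "-I", str(input_path)]
    if output_path is not None:
        cmd += ["-O", str(output_path)]
    if data_dir is not None:
        # No shell is involved, so expand "~" explicitly.
        cmd += ["--data-dir", os.path.expanduser(str(data_dir))]
    for flag, enabled in (("--splice", splice), ("--full", full), ("--mane", mane)):
        if enabled:
            cmd.append(flag)
    return cmd


def run_5ultra(*args, **kwargs):
    """Run 5ULTRA; check=True raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_5ultra_command(*args, **kwargs), check=True)


print(build_5ultra_command("tests/test-variants.tsv", splice=True, mane=True, full=True))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.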
- Input: VCF or TSV file with genetic variants (columns: #CHROM, POS, ID, REF, ALT).
- Output: TSV file containing annotated and scored variants.
- CHROM, POS, ID, REF, ALT: Same as input.
- CSQ: Type of variant
- Translation: Increased, Decreased, or N-terminal Extension
- 5ULTRA_Score: Prioritization metric
- GENE: Gene Symbol
- TRANSCRIPT: Ensembl transcript ID
- Splicing-specific
When --splice is specified:
- SpliceAI: SpliceAI predictions for the variant.
- Splicing_CSQ: Mis-splicing consequence on the 5′UTR sequence (DG: new donor splice site used; AG: new acceptor splice site used).
- Full Annotation
When --full is specified, additional columns are appended:
- MANE: NCBI transcript ID if applicable (e.g., NM_123456789.1)
- 5UTR_START / 5UTR_END: Genomic coordinates of the 5′ UTR.
- STRAND: DNA strand (+ or -).
- 5UTR_LENGTH: Length of the 5′UTR.
- START_EXON: The exon number where the CDS starts.
- mKOZAK / mKOZAK_STRENGTH: Kozak sequence/context strength around the CDS start.
- uORF_count: Total number of uORFs in the transcript.
- Overlapping_count, Nterminal_count, NonOverlapping_count: Counts of specific uORF types.
- uORF_START / uORF_END: Genomic coordinates of the uORF.
- Ribo_seq: Indicates evidence of translation (True, False or New uORF)
- uSTART_mSTART_DIST / uSTART_CAP_DIST: Distance from the uORF start to the CDS start and to the 5′ cap, respectively.
- uSTOP_CODON: Type of stop codon (TAA, TGA, or TAG).
- uORF_TYPE: uORF type (Non-overlapping, Overlapping, N-terminal extension).
- uKOZAK / uKOZAK_STRENGTH: Kozak sequence/context strength around the uORF start.
- uORF_LENGTH / uORF_AA_LENGTH: Length of the uORF in nucleotides and amino acids.
- uORF_rank: Relative rank based on proximity to the CDS.
- uSTART_PHYLOP / uSTART_PHASTCONS: Conservation scores (PhyloP, PhastCons).
- pLI / LOEUF: Gene-level intolerance metrics.
- Other columns from the input file.
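The output columns listed above can be consumed with any TSV reader. As a minimal sketch, the snippet below ranks variants by the 5ULTRA_Score column using only the standard library. The function name `top_variants` is hypothetical, the sample rows are invented placeholders rather than real results, and the sketch assumes that the score is numeric and that a higher score means a higher priority.

```python
import csv
import io


def top_variants(tsv_handle, n=10):
    """Return the n rows with the highest 5ULTRA_Score from an open TSV handle."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    scored = []
    for row in reader:
        try:
            score = float(row["5ULTRA_Score"])
        except (KeyError, TypeError, ValueError):
            # Skip rows with a missing or non-numeric score.
            continue
        scored.append((score, row))
    # Assumption: higher scores should be reviewed first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:n]]


# Invented placeholder rows, for illustration only.
sample = (
    "ID\tGENE\t5ULTRA_Score\n"
    "var1\tGENE_A\t0.2\n"
    "var2\tGENE_B\t0.9\n"
)
for row in top_variants(io.StringIO(sample), n=2):
    print(row["ID"], row["GENE"], row["5ULTRA_Score"])
```

For a real run, replace the in-memory sample with an opened output file, for example `open("test-variants.5ULTRA.tsv", newline="")`.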
Chaldebas et al., Genome-wide detection of human 5’UTR variants that impact protein translation. The American Journal of Human Genetics (2026), https://doi.org/10.1016/j.ajhg.2026.02.020
Contributions are welcome! Please submit pull requests or open issues on the GitHub repository.
This project is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0). See the LICENSE file for details.
Developer: Matthieu Chaldebas, Ph.D. candidate
Email: mchaldebas@rockefeller.edu
Laboratory: St. Giles Laboratory of Human Genetics of Infectious Diseases
Institution: The Rockefeller University, New York, NY, USA