spectrafuse

Incremental spectral clustering pipeline from quantms data. quantms is a workflow for reanalysis of public proteomics data. quantms not only releases a workflow to the public but also performs reanalysis of public proteomics data in a systematic way for TMT, LFQ, ITRAQ and other DDA methods.

quantms has reanalyzed an extensive number of datasets with almost 1 billion MS/MS (Mass Spectrometry/Mass Spectrometry) MS2 analyzed, comprising nearly 100 million PSMs (Peptide-Spectrum Matches) derived from various tissues, cell lines, and diseases. In light of this vast wealth of data, spectrafuse aims to apply spectral clustering techniques to organize this data and construct spectral libraries.

spectrafuse is a Nextflow workflow that performs incremental clustering of quantms data and is based on the tool MaRaCluster.

Workflow Overview

The pipeline converts QPX parquet files directly into MaRaCluster's internal .dat binary format (~100 bytes/spectrum), making it feasible to cluster datasets with millions of PSMs on standard hardware.

The pipeline supports three modes:

Full Mode (default)

Parquet → .dat → MaRaCluster → Cluster DB

Parquet to Dat (PARQUET_TO_DAT): Converts PSM parquet files directly to MaRaCluster's binary .dat format, optionally filtering by charge state. Replicates MaRaCluster's internal binning algorithm (bin = floor(mz / 1.000508 + 0.32), top-40 peaks).
MaRaCluster (RUN_MARACLUSTER_DAT): Clusters spectra using the -D flag to read pre-existing .dat files, skipping MaRaCluster's own file conversion step.
Build Cluster DB (BUILD_CLUSTER_DB): Generates cluster_metadata.parquet and psm_cluster_membership.parquet from MaRaCluster's clustering output and the scan-title mappings produced during dat conversion.

Incremental Mode

Adds new data to an existing cluster DB without re-clustering everything.

Extract Representatives (EXTRACT_REPRESENTATIVES): Reads cluster metadata and writes one consensus spectrum per existing cluster.
Convert New Data: Converts new project PSMs to charge-specific spectrum files.
MaRaCluster (RUN_MARACLUSTER): Clusters representatives + new data together. Representatives that land in the same new cluster as new spectra link those spectra to the existing cluster.
Merge Clusters (MERGE_INCREMENTAL): Resolves cluster IDs (representative → original cluster mapping, multi-representative merges, fresh UUIDs for new-only clusters) and updates the cluster DB.

Key Features

Compact binary format: Parquet-to-dat conversion produces ~1.3 GB for 12.8M PSMs
Incremental clustering: Add new datasets without re-clustering existing data
Parallel processing: Projects and charge partitions are processed in parallel
Partitioned by metadata: Clustering is performed separately for each species/instrument/charge combination
Consensus spectra: Multiple strategies (best, bin, most, average) for generating consensus spectra from clusters

Input Requirements

The workflow requires:

QPX Parquet files: A directory containing one or more project subdirectories. Each project directory should contain the QPX file set:

File	Purpose
`{id}.psm.parquet`	PSM spectra (mz_array, intensity_array, charge, sequence, etc.)
`{id}.run.parquet`	Run metadata (run_file_name, instrument, samples, enzymes)
`{id}.sample.parquet`	Sample metadata (organism, organism_part)

Directory structure:

parquet_dir/
├── PXD004732/
│   ├── PXD004732.psm.parquet
│   ├── PXD004732.run.parquet
│   └── PXD004732.sample.parquet
├── PXD004733/
│   ├── PXD004733.psm.parquet
│   ├── PXD004733.run.parquet
│   └── PXD004733.sample.parquet
└── ...

Installation

Prerequisites

Nextflow >= 23.04.0
Docker, Singularity, Podman, or another container engine
Java 11 or later (for Nextflow)

Install Nextflow

curl -s https://get.nextflow.io | bash

Clone the Repository

git clone https://github.com/bigbio/spectrafuse.git
cd spectrafuse

Usage

Full Mode (default)

nextflow run main.nf \
    --parquet_dir /path/to/projects \
    --default_species "Homo sapiens" \
    --default_instrument "Q Exactive HF" \
    -profile docker

Incremental Mode

nextflow run main.nf \
    --parquet_dir /path/to/new_projects \
    --mode incremental \
    --existing_cluster_db /path/to/cluster_db \
    -profile docker

Resume a Previous Run

nextflow run main.nf \
    --parquet_dir /path/to/projects \
    -profile docker \
    -resume

Test Run

nextflow run main.nf -profile test,docker

Parameters

Required Parameters

Parameter	Description	Example
`--parquet_dir`	Directory containing project subdirectories with QPX parquet files	`/data/projects`

Pipeline Mode Parameters

Parameter	Description	Default
`--mode`	Pipeline mode: `full` or `incremental`	`full`
`--existing_cluster_db`	Path to existing cluster DB (required for incremental mode)	`null`
`--default_species`	Species label for cluster DB metadata	`null`
`--default_instrument`	Instrument label for cluster DB metadata	`null`
`--skip_instrument`	Cluster all instruments together (no instrument partitioning)	`false`

MaRaCluster Parameters

Parameter	Description	Default
`--maracluster_pvalue_threshold`	Log10(p-value) threshold for MaRaCluster clustering (e.g., -10.0 = p-value 1e-10)	`-10.0`
`--maracluster_precursor_tolerance`	Precursor m/z tolerance in ppm for MaRaCluster	`20.0`
`--cluster_threshold`	P-value threshold for MaRaCluster output file naming (e.g., `*_p30.tsv`)	`30`
`--maracluster_verbose`	Enable verbose output from MaRaCluster	`false`

Consensus Strategy Parameters

Parameter	Description	Default	Options
`--strategytype`	Consensus spectrum generation method	`best`	`best`, `most`, `bin`, `average`

For most method (Most Similar Spectrum):

Parameter	Description	Default
`--sim`	Similarity measure method	`dot`
`--fragment_mz_tolerance`	Fragment m/z tolerance for spectrum comparison	`0.02`

For bin method (Binning Strategy):

Parameter	Description	Default
`--min_mz`	Minimum m/z value to consider	`100`
`--max_mz`	Maximum m/z value to consider	`2000`
`--bin_size`	Bin size in m/z units	`0.02`
`--peak_quorum`	Relative number of spectra in a cluster that need to contain a peak for it to be included (0.0-1.0)	`0.25`
`--edge_case_threshold`	Threshold for correcting m/z edge cases during binning (0.0-1.0)	`0.5`

For average method (Average Spectrum):

Parameter	Description	Default
`--diff_thresh`	Minimum distance between MS/MS peak clusters	`0.01`
`--dyn_range`	Dynamic range to apply to output spectra	`1000`
`--min_fraction`	Minimum fraction of cluster spectra where MS/MS peak is present (0.0-1.0)	`0.5`
`--pepmass`	Peptide mass calculation method	`lower_median`
`--msms_avg`	MS/MS averaging method	`weighted`

Output Parameters

Parameter	Description	Default
`--outdir`	Output directory for results	`./results`
`--publish_dir_mode`	Publish directory mode	`copy`

Resource Limits

Parameter	Description	Default
`--max_memory`	Maximum memory per process	`128.GB`
`--max_cpus`	Maximum CPUs per process	`16`
`--max_time`	Maximum time per process	`240.h`

Output Structure

Full Mode

results/
├── cluster_db/                     # Cluster database (per charge partition)
│   └── partition__charge{N}/
│       ├── cluster_metadata.parquet       # One row per cluster (consensus spectrum, purity, member count)
│       └── psm_cluster_membership.parquet # One row per PSM (USI, cluster_id, sequence, q-value)
├── maracluster_results/            # MaRaCluster raw output
│   └── clusters_p30.tsv
└── pipeline_info/                  # Pipeline execution metadata
    ├── execution_report.html
    ├── execution_timeline.html
    ├── execution_trace.txt
    └── pipeline_dag.html

Incremental Mode

Same structure as full mode, with updated cluster metadata and membership parquets that include both existing and new PSMs.

Profiles

Container Engine Profiles

docker - Use Docker containers
singularity - Use Singularity containers
podman - Use Podman containers
apptainer - Use Apptainer containers
charliecloud - Use Charliecloud containers
shifter - Use Shifter containers

Platform Profiles

arm64 - For ARM64 architecture (e.g., Apple Silicon)
emulate_amd64 - Emulate AMD64 on ARM64 hosts

Other Profiles

test - Use test configuration with minimal dataset
debug - Enable debug mode with detailed logging
gpu - Enable GPU support
wave - Use Wave containers

Example Profile Combinations

# Docker on x86_64
nextflow run main.nf -profile docker

# Docker on Apple Silicon (ARM64)
nextflow run main.nf -profile docker,arm64

# Singularity on HPC cluster
nextflow run main.nf -profile singularity

# Test run with Docker
nextflow run main.nf -profile test,docker

Consensus Spectrum Strategies

The workflow supports four different strategies for generating consensus spectra from clusters:

best (default): Selects the spectrum with the best quality metric (lowest posterior error probability or global q-value) as the consensus spectrum.
most: Selects the spectrum most similar to all other spectra in the cluster using similarity measures (e.g., dot product).
bin: Uses binning-based approach where peaks are binned by m/z, and peaks present in a quorum of spectra are included in the consensus.
average: Computes an averaged spectrum from all spectra in the cluster, with configurable peak alignment and filtering.

Choose the strategy based on your use case:

best: Fast, good for high-quality data
most: Good for finding representative spectra
bin: Good for noisy data, robust peak detection
average: Good for high-quality clusters, smooth spectra

Troubleshooting

Common Issues

"Please provide a folder containing the files that will be clustered (--parquet_dir)"
- Solution: Ensure you provide the --parquet_dir parameter with a valid path.
Container platform mismatch (ARM64 vs AMD64)
- Solution: Use -profile docker,arm64 on Apple Silicon, or -profile docker,emulate_amd64 to emulate AMD64.
Memory errors during clustering
- Solution: Increase --max_memory or reduce the number of parallel processes.
Missing QPX parquet files
- Solution: Ensure each project directory contains .psm.parquet, .run.parquet, and .sample.parquet files.

Getting Help

Check the Nextflow log for detailed error messages
Review the execution report in results/pipeline_info/execution_report.html
Open an issue on GitHub

Citation

If you use spectrafuse in your research, please cite:

The spectrafuse workflow (when published)
MaRaCluster for the clustering algorithm
quantms for the reanalysis workflow

License

This workflow is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please see the Contributing Guidelines for more information.

Acknowledgments

MaRaCluster for the clustering algorithm
quantms for the proteomics reanalysis workflow
Nextflow for the workflow management system
nf-core for workflow best practices and infrastructure

Name		Name	Last commit message	Last commit date
Latest commit History 125 Commits
.github/workflows		.github/workflows
bin		bin
conf		conf
docs		docs
modules/local		modules/local
subworkflows/nf-core/utils_nextflow_pipeline		subworkflows/nf-core/utils_nextflow_pipeline
workflows		workflows
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
e2e.config		e2e.config
incremental_quality.config		incremental_quality.config
main.nf		main.nf
multi_project.config		multi_project.config
nextflow.config		nextflow.config
phase0_split.config		phase0_split.config
pxd004452.config		pxd004452.config

Folders and files

Latest commit

History

Repository files navigation

spectrafuse

Workflow Overview

Full Mode (default)

Incremental Mode

Key Features

Input Requirements

Installation

Prerequisites

Install Nextflow

Clone the Repository

Usage

Full Mode (default)

Incremental Mode

Resume a Previous Run

Test Run

Parameters

Required Parameters

Pipeline Mode Parameters

MaRaCluster Parameters

Consensus Strategy Parameters

Output Parameters

Resource Limits

Output Structure

Full Mode

Incremental Mode

Profiles

Container Engine Profiles

Platform Profiles

Other Profiles

Example Profile Combinations

Consensus Spectrum Strategies

Troubleshooting

Common Issues

Getting Help

Citation

License

Contributing

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages