Add DIA-NN scoring mode parameter (Generic/Proteoforms/Peptidoforms) for proteogenomics support #43

@ypriverol

Summary

quantmsdiann currently runs DIA-NN in Generic scoring mode (the default). DIA-NN offers three scoring modes that significantly affect FDR estimation and variant peptide confidence, but none are exposed as pipeline parameters.

Background: DIA-NN Scoring Modes

| Mode | Flag | Decoy strategy | Best for |
|------|------|----------------|----------|
| Generic | (default) | Shuffles most of the peptide sequence | Standard proteomics; maximizes protein IDs |
| Proteoforms | --proteoforms | Mutates a single residue per decoy | Amino acid substitutions, distinguishing paralogues, proteogenomics |
| Peptidoforms | --peptidoforms | Generic main q-values + extra peptidoform q-values | PTM analysis when also wanting max protein IDs |

Per DIA-NN documentation (PTMs and peptidoforms):

"If the purpose of the experiment is to identify/quantify specific PTMs, amino acid substitutions or distinguish proteins with high sequence identity, then the Peptidoforms (or Proteoforms) scoring option is recommended."

"It is only the Proteoforms mode that can be used to reliably distinguish paralogue/orthologue proteins originating due to amino acid substitutions, hence the name for this mode."

Why this matters for proteogenomics

When searching with a variant-containing FASTA (e.g., COSMIC mutations), the database contains both canonical and variant protein sequences. For missense mutations, the canonical and variant peptides often differ by a single amino acid and share many of their fragment ions:

Canonical KRAS:    LVVVGAGGVGK   (WT, protein residues 6-16)
KRAS G12D variant: LVVVGADGVGK   (G→D at protein position 12)
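
To make the overlap concrete, the following Groovy sketch builds the G12D peptide from the canonical one and counts how many singly charged b- and y-ions are unaffected by the substitution. It assumes the peptide spans protein residues 6-16 and considers only the plain b/y series (no neutral losses or higher charge states); the variable names are illustrative.

```groovy
// Canonical KRAS tryptic peptide, assumed to span protein residues 6-16.
def canonical = 'LVVVGAGGVGK'
def peptideStart = 6
def variantPosition = 12                      // G12D

// 0-based index of the substituted residue within the peptide.
def idx = variantPosition - peptideStart
assert canonical[idx] == 'G'

def variant = canonical.substring(0, idx) + 'D' + canonical.substring(idx + 1)

def n = canonical.size()
// b_j covers the first j residues, y_j the last j residues (j = 1..n-1).
// A fragment is shared when it does not contain the substituted residue.
def sharedB = (1..<n).count { j -> j <= idx }
def sharedY = (1..<n).count { j -> j <= n - idx - 1 }
def total = 2 * (n - 1)

println "variant peptide: ${variant}"
println "shared b/y ions: ${sharedB + sharedY} of ${total}"
```

Only the fragments that contain the substituted residue shift in mass, so they are the only b/y ions that can discriminate between the two sequences.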

In Generic mode:

  • Decoys are fully shuffled → very different from both canonical and variant
  • Both the canonical and variant peptide easily beat the decoy
  • The q-value does NOT specifically validate whether the single-residue difference (G vs D) is correctly assigned
  • FDR may be underestimated for variant-specific peptides (confirmed by Armando et al. 2024)

In Proteoforms mode:

  • Decoys are single-residue mutations → directly model the confusion between canonical and variant peptides
  • The q-value provides confidence that the exact amino acid sequence is correct
  • FDR is properly estimated for single-residue variants
  • May slightly reduce total IDs (~5-10% fewer in some datasets)

For frameshift/nonsense variants (completely different sequences from canonical), Generic mode is sufficient. But missense variants represent ~85% of detected COSMIC variants, making Proteoforms mode important for this use case.

Proposed change

Add a new parameter diann_scoring_mode with three allowed values:

// nextflow.config
diann_scoring_mode = 'generic'  // default, backward-compatible

// nextflow_schema.json
"diann_scoring_mode": {
    "type": "string",
    "default": "generic",
    "enum": ["generic", "proteoforms", "peptidoforms"],
    "description": "DIA-NN scoring mode. 'generic' maximizes IDs (default). 'proteoforms' recommended for proteogenomics/variant detection — validates single-residue differences. 'peptidoforms' provides extra peptidoform q-values for PTM analysis."
}

In the DIA-NN modules, conditionally add the flag:

scoring_mode = params.diann_scoring_mode == 'proteoforms' ? '--proteoforms' :
               params.diann_scoring_mode == 'peptidoforms' ? '--peptidoforms' : ''
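
One possible alternative to the nested ternary is a map lookup that fails loudly on unexpected values, since the ternary silently maps any typo to an empty string. This is only a sketch: the helper name `scoringFlag` is hypothetical and not part of the pipeline, and schema validation of the enum may already catch bad values before a module runs.

```groovy
// Hypothetical helper: maps a diann_scoring_mode value to the DIA-NN flag.
String scoringFlag(String mode) {
    def flags = [
        generic     : '',                 // default mode needs no flag
        proteoforms : '--proteoforms',
        peptidoforms: '--peptidoforms',
    ]
    if (!flags.containsKey(mode)) {
        throw new IllegalArgumentException(
            "Unknown diann_scoring_mode '${mode}'; expected one of: ${flags.keySet().join(', ')}")
    }
    return flags[mode]
}

println "generic      -> '${scoringFlag('generic')}'"
println "proteoforms  -> '${scoringFlag('proteoforms')}'"
println "peptidoforms -> '${scoringFlag('peptidoforms')}'"
```

Inside a module this would be called as `scoringFlag(params.diann_scoring_mode)` and the result interpolated into the DIA-NN command line.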

The flag should be added to all DIA-NN steps (in-silico library generation, preliminary analysis, individual analysis, final quantification) and added to each module's blocked flags list.
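
The blocked-list check could look roughly like the sketch below. It does not reproduce how the existing modules implement their blocked lists; the helper name `findBlockedFlags` and the example argument string are assumptions made for illustration.

```groovy
// Hypothetical helper: returns the scoring flags found in user-supplied extra arguments.
List<String> findBlockedFlags(String extraArgs, List<String> blocked) {
    (extraArgs ?: '').tokenize().findAll { it in blocked }.unique()
}

// Flags that should only be set through diann_scoring_mode.
def blocked = ['--proteoforms', '--peptidoforms']

// Example extra arguments as a user might pass them to a module.
def extraArgs = '--threads 8 --proteoforms --qvalue 0.01'

def hits = findBlockedFlags(extraArgs, blocked)
if (hits) {
    println "WARNING: ${hits.join(', ')} is controlled by diann_scoring_mode; remove it from the extra arguments"
}
```

Keeping the scoring flags on the blocked list means the mode is set in exactly one place, so the four DIA-NN steps cannot end up running with different scoring modes.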

Additional output columns

When --proteoforms or --peptidoforms is enabled, DIA-NN produces additional columns in the report:

  • Peptidoform.Q.Value — run-specific peptidoform confidence
  • Global.Peptidoform.Q.Value — global peptidoform confidence
  • Lib.Peptidoform.Q.Value — library peptidoform confidence

These should be preserved in the output and documented.
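
A simple way to confirm that the columns survive downstream processing is to check the report header. The sketch below assumes a tab-separated header line is available as a string (newer DIA-NN versions may write the main report in another format, in which case the column names would have to be read differently); `missingColumns` is a hypothetical helper.

```groovy
// Hypothetical helper: returns the expected columns that are absent from a TSV header line.
List<String> missingColumns(String headerLine, List<String> expected) {
    def present = headerLine.split('\t') as Set
    expected.findAll { !(it in present) }
}

def expected = [
    'Peptidoform.Q.Value',
    'Global.Peptidoform.Q.Value',
    'Lib.Peptidoform.Q.Value',
]

// Example header with one of the three columns deliberately left out.
def header = ['Run', 'Precursor.Id', 'Q.Value',
              'Peptidoform.Q.Value', 'Global.Peptidoform.Q.Value'].join('\t')

def missing = missingColumns(header, expected)
println missing ? "Missing columns: ${missing.join(', ')}" : 'All peptidoform columns present'
```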

DIA-NN version compatibility

  • --proteoforms: Available since DIA-NN 2.0
  • --peptidoforms: Available since DIA-NN 1.8+
  • --no-peptidoforms: Available since DIA-NN 1.8+ (disables automatic activation with --var-mod)

Note: Since --proteoforms requires DIA-NN >= 2.0, the pipeline should validate that the selected DIA-NN version supports the chosen scoring mode.
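
The version check could be sketched as follows, using the minimum versions listed above. The function names are hypothetical, and the comparison assumes purely numeric dotted versions such as `1.8.1` or `2.0`; suffixes like `-beta` would need extra handling.

```groovy
// Compare dotted version strings numerically, e.g. '1.8.1' vs '2.0'.
int compareVersions(String a, String b) {
    def pa = a.tokenize('.')*.toInteger()
    def pb = b.tokenize('.')*.toInteger()
    for (int i = 0; i < Math.max(pa.size(), pb.size()); i++) {
        int x = i < pa.size() ? pa[i] : 0   // missing components count as 0
        int y = i < pb.size() ? pb[i] : 0
        if (x != y) return x <=> y
    }
    return 0
}

// Hypothetical check: throws if the DIA-NN version is too old for the chosen mode.
void checkScoringModeSupported(String mode, String diannVersion) {
    def minVersion = [proteoforms: '2.0', peptidoforms: '1.8']   // 'generic' needs no flag
    def required = minVersion[mode]
    if (required && compareVersions(diannVersion, required) < 0) {
        throw new IllegalArgumentException(
            "diann_scoring_mode '${mode}' requires DIA-NN >= ${required}, found ${diannVersion}")
    }
}

checkScoringModeSupported('peptidoforms', '1.8.1')   // passes silently
checkScoringModeSupported('proteoforms', '2.0')      // passes silently
println 'Selected scoring modes are supported'
```

Failing early, before any DIA-NN step starts, avoids discovering an unsupported flag only after library generation has already run.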

Files to modify

  1. nextflow.config — add diann_scoring_mode = 'generic'
  2. nextflow_schema.json — add parameter definition with enum and description
  3. modules/local/diann/insilico_library_generation/main.nf — add scoring flag + blocked list
  4. modules/local/diann/preliminary_analysis/main.nf — add scoring flag + blocked list
  5. modules/local/diann/individual_analysis/main.nf — add scoring flag + blocked list
  6. modules/local/diann/final_quantification/main.nf — add scoring flag + blocked list
