Add DIA-NN scoring mode parameter (Generic/Proteoforms/Peptidoforms) for proteogenomics support

## Summary

quantmsdiann currently runs DIA-NN in **Generic** scoring mode (the default). DIA-NN offers three scoring modes that significantly affect FDR estimation and variant peptide confidence, but the pipeline provides no parameter for selecting one.

## Background: DIA-NN Scoring Modes

| Mode | Flag | Decoy strategy | Best for |
|------|------|---------------|----------|
| **Generic** | *(default)* | Shuffles most of the peptide sequence | Standard proteomics — maximizes protein IDs |
| **Proteoforms** | `--proteoforms` | Mutates a **single residue** per decoy | Amino acid substitutions, distinguishing paralogues, proteogenomics |
| **Peptidoforms** | `--peptidoforms` | Generic main q-values + extra peptidoform q-values | PTM analysis when also wanting max protein IDs |

Per DIA-NN documentation ([PTMs and peptidoforms](https://github.com/vdemichev/DiaNN/blob/master/README.md#ptms-and-peptidoforms)):

> *"If the purpose of the experiment is to identify/quantify specific PTMs, amino acid substitutions or distinguish proteins with high sequence identity, then the Peptidoforms (or Proteoforms) scoring option is recommended."*

> *"It is only the **Proteoforms** mode that can be used to reliably distinguish paralogue/orthologue proteins originating due to amino acid substitutions, hence the name for this mode."*

## Why this matters for proteogenomics

When searching with a variant-containing FASTA (e.g., COSMIC mutations), the database contains both canonical and variant protein sequences. For missense mutations, the canonical and variant peptides often differ by a single amino acid and share most fragment ions:

```
Canonical KRAS:    LVVVGAGGVGK   (WT, residues 6-16)
KRAS G12D variant: LVVVGADGVGK   (G→D at protein position 12, the 7th residue of this peptide)
```

In **Generic** mode:
- Decoys are produced by shuffling most of the peptide sequence → very different from both canonical and variant
- Both the canonical and variant peptide easily beat the decoy
- The q-value does NOT specifically validate whether the **single-residue difference** (G vs D) is correctly assigned
- FDR may be underestimated for variant-specific peptides (confirmed by [Armando et al. 2024](https://www.mdpi.com/2227-7382/12/4/33))

In **Proteoforms** mode:
- Decoys are single-residue mutations → directly model the confusion between canonical and variant peptides
- The q-value provides confidence that the exact amino acid sequence is correct
- FDR is properly estimated for single-residue variants
- May slightly reduce total IDs (~5-10% fewer in some datasets)

For frameshift/nonsense variants (completely different sequences from canonical), Generic mode is sufficient. But **missense variants represent ~85% of detected COSMIC variants**, making Proteoforms mode important for this use case.
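To make the single-residue argument concrete, the sketch below counts how many positions differ between two peptides. The helper name `residueDifferences` and the shuffled sequence are hypothetical and only for illustration; the shuffled string is not an actual DIA-NN decoy.

```groovy
// Count positions at which two equal-length peptides differ (Hamming distance).
def residueDifferences(String a, String b) {
    assert a.length() == b.length() : 'peptides must have equal length'
    (0..<a.length()).count { a[it] != b[it] }
}

def canonical = 'LVVVGAGGVGK'   // KRAS residues 6-16
def variant   = 'LVVVGADGVGK'   // G12D: protein position 12 is the 7th residue here
def shuffled  = 'LGVGVAVGVGK'   // hypothetical shuffled sequence, same composition

println "canonical vs variant:  ${residueDifferences(canonical, variant)} difference(s)"
println "canonical vs shuffled: ${residueDifferences(canonical, shuffled)} difference(s)"
```

The variant sits one residue away from the canonical peptide, while the shuffled sequence differs at several positions. A decoy of the latter kind says little about whether the one changed residue was assigned correctly.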

## Proposed change

Add a new parameter `diann_scoring_mode` with three allowed values:

```groovy
// nextflow.config
params {
    diann_scoring_mode = 'generic'  // default, backward-compatible
}
```

```json
// nextflow_schema.json
"diann_scoring_mode": {
    "type": "string",
    "default": "generic",
    "enum": ["generic", "proteoforms", "peptidoforms"],
    "description": "DIA-NN scoring mode. 'generic' maximizes IDs (default). 'proteoforms' recommended for proteogenomics/variant detection — validates single-residue differences. 'peptidoforms' provides extra peptidoform q-values for PTM analysis."
}
```

In the DIA-NN modules, conditionally add the flag:

```groovy
// Map the pipeline parameter to the DIA-NN flag; 'generic' adds no flag
def scoring_mode = params.diann_scoring_mode == 'proteoforms'  ? '--proteoforms'  :
                   params.diann_scoring_mode == 'peptidoforms' ? '--peptidoforms' : ''
```

The flag should be passed to **all DIA-NN steps** (in-silico library generation, preliminary analysis, individual analysis, final quantification), and `--proteoforms` and `--peptidoforms` should be added to each module's blocked flags list.
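A minimal sketch of how the mapping and the blocked-flag check could be factored out so that all four modules share them. The helper names `scoringFlag` and `checkExtraArgs` are hypothetical, and the sketch assumes the blocked flags list exists to reject flags that arrive through a module's extra arguments; the actual mechanism in the modules may differ.

```groovy
// Hypothetical helper: map the proposed parameter value to a DIA-NN flag.
def scoringFlag(String mode) {
    def flags = [generic: '', proteoforms: '--proteoforms', peptidoforms: '--peptidoforms']
    if (!flags.containsKey(mode)) {
        throw new IllegalArgumentException(
            "Unknown diann_scoring_mode '${mode}'; expected one of ${flags.keySet().join(', ')}")
    }
    flags[mode]
}

// Hypothetical guard (assumed behaviour of the blocked list):
// reject the scoring flags if they are also passed as extra arguments.
def checkExtraArgs(String extraArgs) {
    def blocked = ['--proteoforms', '--peptidoforms']
    def found = extraArgs.tokenize().findAll { it in blocked }
    if (found) {
        throw new IllegalArgumentException(
            "Set diann_scoring_mode instead of passing ${found.join(', ')} directly")
    }
}

println "flag for proteoforms: '${scoringFlag('proteoforms')}'"
println "flag for generic:     '${scoringFlag('generic')}'"
```

Failing on an unknown value, rather than silently falling back to Generic, keeps a typo such as `proteoform` from producing results scored in the wrong mode.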

## Additional output columns

When `--proteoforms` or `--peptidoforms` is enabled, DIA-NN produces additional columns in the report:
- `Peptidoform.Q.Value` — run-specific peptidoform confidence
- `Global.Peptidoform.Q.Value` — global peptidoform confidence
- `Lib.Peptidoform.Q.Value` — library peptidoform confidence

These should be preserved in the output and documented.
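One way to check that the columns survive downstream processing is to compare a report header against the three names above. This is a sketch only: `missingColumns` is a hypothetical helper, the header line is made up for illustration, and how the header is read depends on the report format.

```groovy
// Columns listed in this issue as specific to the peptidoform-aware modes.
def peptidoformColumns = [
    'Peptidoform.Q.Value',
    'Global.Peptidoform.Q.Value',
    'Lib.Peptidoform.Q.Value',
]

// Hypothetical helper: return the expected columns absent from a header.
def missingColumns(List<String> header, List<String> expected) {
    expected.findAll { !(it in header) }
}

// Illustrative tab-separated header; a real report has many more columns.
def header = 'Run\tPrecursor.Id\tQ.Value\tPeptidoform.Q.Value\tGlobal.Peptidoform.Q.Value'
    .split('\t').toList()

println "missing: ${missingColumns(header, peptidoformColumns)}"
```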

## DIA-NN version compatibility

- `--proteoforms`: available since DIA-NN 2.0
- `--peptidoforms`: available since DIA-NN 1.8
- `--no-peptidoforms`: available since DIA-NN 1.8 (disables the automatic activation of peptidoform scoring when `--var-mod` is used)

**Note**: Since `--proteoforms` requires DIA-NN >= 2.0, the pipeline should validate that the selected DIA-NN version supports the chosen scoring mode.
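The validation could look roughly like the sketch below. It takes the minimum versions from the list above and assumes the DIA-NN version is available as a purely numeric dotted string such as `1.8.1`; `compareVersions` and `supportsMode` are hypothetical names, and how the pipeline actually learns the DIA-NN version is left open.

```groovy
// Compare dotted version strings numerically, component by component.
// Assumes purely numeric components (e.g. '1.8.1'); missing components count as 0.
def compareVersions(String a, String b) {
    def pa = a.tokenize('.').collect { it.toInteger() }
    def pb = b.tokenize('.').collect { it.toInteger() }
    for (i in 0..<Math.max(pa.size(), pb.size())) {
        def cmp = (pa[i] ?: 0) <=> (pb[i] ?: 0)
        if (cmp != 0) return cmp
    }
    0
}

// Hypothetical check: does this DIA-NN version support the chosen scoring mode?
def supportsMode(String diannVersion, String mode) {
    // Minimum versions as stated in this issue; 'generic' needs no flag.
    def minVersion = [proteoforms: '2.0', peptidoforms: '1.8']
    def required = minVersion[mode]
    required == null || compareVersions(diannVersion, required) >= 0
}

['1.8.1', '2.1.0'].each { v ->
    println "DIA-NN ${v}: proteoforms supported = ${supportsMode(v, 'proteoforms')}"
}
```

Comparing components numerically rather than as strings avoids ranking a version like `1.10` below `1.8`.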

## Files to modify

1. `nextflow.config` — add `diann_scoring_mode = 'generic'`
2. `nextflow_schema.json` — add parameter definition with enum and description
3. `modules/local/diann/insilico_library_generation/main.nf` — add scoring flag + blocked list
4. `modules/local/diann/preliminary_analysis/main.nf` — add scoring flag + blocked list
5. `modules/local/diann/individual_analysis/main.nf` — add scoring flag + blocked list
6. `modules/local/diann/final_quantification/main.nf` — add scoring flag + blocked list

## Related

- bigbio/quantmsdiann#42 — FDR parameter naming and matrix-level FDR controls
- [Armando et al. 2024 — Assessment of DIA-MS for SAAV Identification](https://www.mdpi.com/2227-7382/12/4/33)
