Improve protein ID normalization and expand DE analysis in Mokume#18
Improve protein ID normalization and expand DE analysis in Mokume#18Shen-YuFei wants to merge 9 commits intobigbio:devfrom
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Not up to standards ⛔🔴 Issues
|
| Category | Results |
|---|---|
| Documentation | 14 minor |
| CodeStyle | 1 minor |
🟢 Metrics 235 complexity · 8 duplication
Metric Results Complexity 235 Duplication 8
TIP This summary will be updated as you push new changes. Give us feedback
There was a problem hiding this comment.
Pull request overview
This PR adds new differential expression (DE) analysis capabilities to mokume (LimROTS, DEqMS, proDA), improves protein accession normalization for FASTA matching, and expands CLI/config/docs to expose the new options.
Changes:
- Introduces a new
mokume.analysisDE API (DifferentialExpression) plus implementations of LimROTS, DEqMS, and proDA, and wires DE into the pipeline with an"auto"method selector. - Improves protein identifier normalization via
get_accession()/build_accession_map()and applies it to FASTA-based peptide counting / molecular weight calculation paths. - Adds censored-aware imputation utilities and LOESS normalization, plus new tests and documentation/CLI/config updates for these features.
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_features2proteins_cli.py | Adds CLI tests covering DirectLFQ and batch correction option plumbing/validation. |
| tests/test_de_and_imputation.py | Adds tests for DE methods, IHW path, censored imputation, and LOESS normalization. |
| tests/test_accession_normalization.py | Adds tests for accession parsing/mapping and FASTA sequence cleaning helper. |
| mokume/quantification/ibaq.py | Normalizes protein identifiers when matching FASTA entries for iBAQ/TPA calculations. |
| mokume/pipeline/stages.py | Adds DE method auto-selection and optional peptide-count loading for DEqMS; pins pandas groupby observed. |
| mokume/pipeline/features_to_proteins.py | Extends pipeline entrypoint parameters/config wiring for new DE and DirectLFQ options. |
| mokume/pipeline/config.py | Adds ImputationConfig, extends DEConfig with fdr_method, and adds imputation to PipelineConfig. |
| mokume/normalization/loess.py | Introduces LOESS normalization implementation and functional wrapper. |
| mokume/normalization/init.py | Exports LOESS normalization API. |
| mokume/io/fasta.py | Applies accession normalization and shared nonstandard-AA stripping in FASTA extraction and MW lookup. |
| mokume/imputation/censored.py | Adds censored-aware imputation methods (MinProb/MinDet/KNN) and missingness classification. |
| mokume/imputation/init.py | Exports new censored-aware imputation functions. |
| mokume/core/constants.py | Enhances get_accession() and adds build_accession_map() for normalization. |
| mokume/commands/features2proteins.py | Expands CLI with batch correction options, new DE methods, and --de-fdr-method; adds argument validation. |
| mokume/analysis/proda.py | Adds simplified proDA-style dropout-aware DE implementation. |
| mokume/analysis/limrots.py | Adds LimROTS DE implementation with bootstrap/permutation optimization. |
| mokume/analysis/differential_expression.py | Refactors DE orchestration, adds IHW support helpers, and integrates LimROTS/DEqMS/proDA dispatch. |
| mokume/analysis/deqms.py | Adds DEqMS implementation (spectra-count eBayes with variance–count modeling). |
| mokume/analysis/init.py | Exports new analysis entrypoints for direct import. |
| mokume/init.py | Suppresses a pyopenms OPENMS_DATA_PATH warning. |
| docs/user-guide/features2proteins.md | Updates user guide examples and option tables for new DE/FDR options. |
| docs/reference/configuration.md | Documents new DE config defaults and fdr_method. |
| docs/reference/cli.md | Updates CLI reference with batch correction and new DE/FDR options. |
| README.md | Documents new analysis/imputation/normalization modules and provides usage examples. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def _finalize_results(self, de_df: pd.DataFrame) -> pd.DataFrame: | ||
| """Apply FDR correction and classify significance.""" | ||
| # FDR correction (Benjamini-Hochberg) | ||
| reject, adj_pvalues, _, _ = multipletests( | ||
| de_df["pvalue"].values, method="fdr_bh" | ||
| ) | ||
| de_df["adj_pvalue"] = adj_pvalues | ||
| if self.fdr_method == "ihw" and "adj_pvalue" not in de_df.columns: | ||
| de_df["adj_pvalue"] = _ihw_correction( | ||
| de_df["pvalue"].values, | ||
| de_df, | ||
| alpha=self.fdr_threshold, | ||
| ) | ||
| elif "adj_pvalue" not in de_df.columns: | ||
| # Benjamini-Hochberg (default) | ||
| de_df["adj_pvalue"] = multipletests( | ||
| de_df["pvalue"].values, method="fdr_bh" | ||
| )[1] | ||
|
|
There was a problem hiding this comment.
fdr_method is effectively ignored for all DE methods right now: run_limrots, run_deqms, and run_proda each return an adj_pvalue column, and _finalize_results only applies IHW/BH when adj_pvalue is missing. This means --de-fdr-method ihw (and even BH recomputation) won’t take effect. Consider always recomputing/overriding adj_pvalue based on self.fdr_method (optionally preserving any method-specific adjusted values under a different column name).
| if n_obs < 5 or n_obs == n_total: | ||
| return np.nanmedian(observed) - 2.0, 1.0 | ||
|
|
There was a problem hiding this comment.
_fit_dropout_curve() calls np.nanmedian(observed) even when observed is empty (n_obs == 0), which yields nan parameters and can propagate NaNs through the DE results. Handle the all-missing case explicitly (e.g., return a fixed fallback rho/zeta based on a global default) before calling nanmedian.
README.md
Outdated
| # Multiple contrasts | ||
| contrasts = [("Treatment", "Control"), ("Drug", "Control")] | ||
| results = de.run_comparisons(protein_df, sample_to_condition, contrasts) | ||
| # Returns dict: {"Treatment_vs_Control": DataFrame, "Drug_vs_Control": DataFrame} |
There was a problem hiding this comment.
The README example for run_comparisons() shows keys like "Treatment_vs_Control", but the implementation uses f"{contrast[0]}-{contrast[1]}" (e.g., "Treatment-Control"). Update the example to match the actual keys returned so users can index results correctly.
| # Returns dict: {"Treatment_vs_Control": DataFrame, "Drug_vs_Control": DataFrame} | |
| # Returns dict: {"Treatment-Control": DataFrame, "Drug-Control": DataFrame} |
README.md
Outdated
| imputed = impute_censored(log2_matrix, method="minprob", q=0.01, tune_sigma=1.0) | ||
|
|
||
| # MinDet: replace NaN with per-column quantile | ||
| imputed = impute_censored(log2_matrix, method="mindet", q=0.01) | ||
|
|
||
| # KNN: k-nearest neighbor imputation | ||
| imputed = impute_censored(log2_matrix, method="knn", k=10) |
There was a problem hiding this comment.
The impute_censored README examples use keyword args (q, tune_sigma, k) that don’t match the actual function signature (quantile, shift, scale, n_neighbors, …). As written, these snippets will raise TypeError. Please update the example arguments to the real parameter names.
| imputed = impute_censored(log2_matrix, method="minprob", q=0.01, tune_sigma=1.0) | |
| # MinDet: replace NaN with per-column quantile | |
| imputed = impute_censored(log2_matrix, method="mindet", q=0.01) | |
| # KNN: k-nearest neighbor imputation | |
| imputed = impute_censored(log2_matrix, method="knn", k=10) | |
| imputed = impute_censored(log2_matrix, method="minprob", quantile=0.01, scale=1.0) | |
| # MinDet: replace NaN with per-column quantile | |
| imputed = impute_censored(log2_matrix, method="mindet", quantile=0.01) | |
| # KNN: k-nearest neighbor imputation | |
| imputed = impute_censored(log2_matrix, method="knn", n_neighbors=10) |
| class TestIHW: | ||
| def test_ihw_via_de(self): | ||
| mat, sa, sb, _ = _make_protein_matrix() | ||
| wide, s2c = _make_wide_df(mat, sa, sb) | ||
| de = DifferentialExpression( | ||
| method="deqms", log2fc_threshold=1.0, fdr_method="ihw", skip_log2=True, | ||
| ) | ||
| result = de.run(wide, s2c, ("A", "B")) | ||
| assert len(result) > 0 | ||
| assert "adj_pvalue" in result.columns | ||
|
|
||
| def test_ihw_fallback_on_small_data(self): | ||
| # Very few proteins → should fall back to BH | ||
| mat, sa, sb, _ = _make_protein_matrix(n_proteins=5) | ||
| wide, s2c = _make_wide_df(mat, sa, sb) | ||
| de = DifferentialExpression( | ||
| method="deqms", log2fc_threshold=1.0, fdr_method="ihw", skip_log2=True, | ||
| ) | ||
| result = de.run(wide, s2c, ("A", "B")) | ||
| assert len(result) > 0 | ||
|
|
There was a problem hiding this comment.
The IHW tests currently only assert that an adj_pvalue column exists, which would also pass if BH is being applied (or if IHW is ignored). Since --de-fdr-method ihw is a user-facing option, add an assertion that specifically verifies IHW is applied (or that the BH fallback path is taken for small datasets).
…y-obs guard, README examples
This pull request introduces several new features and documentation improvements to the mokume project, with a focus on expanding support for differential expression analysis, normalization methods (including LOESS), and the new TissueMap workflow for tissue atlas analysis. The CLI and configuration documentation have been updated to reflect new options and workflows, and installation instructions now cover optional dependencies for specialized features.
Major new features and documentation updates:
Differential Expression and Normalization Enhancements
analysis/module with support for modern differential expression methods (limrots,deqms,proda), and documented their usage in the mainREADME.md, including Python API examples and CLI integration. [1] [2]README.mdandconcepts/normalization.md. [1] [2] [3]TissueMap Workflow
tissuemapworkflow, enabling per-dataset tissue atlas analysis with AdaTiSS scoring, AnnData outputs, and atlas plots. This includes CLI documentation, quickstart examples, and optional dependency installation guidance. [1] [2] [3] [4] [5] [6] [7]Installation and Quickstart Improvements
Miscellaneous