Improve protein ID normalization and expand DE analysis in Mokume by Shen-YuFei · Pull Request #18 · bigbio/mokume

Shen-YuFei · 2026-04-14T07:09:59Z

This pull request introduces several new features and documentation improvements to the mokume project, with a focus on expanding support for differential expression analysis, normalization methods (including LOESS), and the new TissueMap workflow for tissue atlas analysis. The CLI and configuration documentation have been updated to reflect new options and workflows, and installation instructions now cover optional dependencies for specialized features.

Major new features and documentation updates:

Differential Expression and Normalization Enhancements

Added a new analysis/ module with support for modern differential expression methods (limrots, deqms, proda), and documented their usage in the main README.md, including Python API examples and CLI integration. [1] [2]
Introduced LOESS normalization as a standalone utility, with code examples and documentation in both the main README.md and concepts/normalization.md. [1] [2] [3]
Updated CLI and configuration documentation to reflect new normalization and DE options, including expanded choices for normalization methods, FDR correction, and DE method auto-selection. [1] [2] [3] [4] [5] [6]

TissueMap Workflow

Added documentation and installation instructions for the new tissuemap workflow, enabling per-dataset tissue atlas analysis with AdaTiSS scoring, AnnData outputs, and atlas plots. This includes CLI documentation, quickstart examples, and optional dependency installation guidance. [1] [2] [3] [4] [5] [6] [7]

Installation and Quickstart Improvements

Clarified installation instructions for standard and optional features, including batch correction and TissueMap extras. [1] [2] [3]
Updated quickstart and workflow documentation to guide users through both standard quantification and tissue atlas analysis paths. [1] [2]

Miscellaneous

Updated CLI reference to reflect expanded commands (now six main commands), new options, and improved descriptions for normalization, DE, and batch correction. [1] [2] [3]
Minor corrections and clarifications in documentation, including code identifier updates and improved workflow descriptions. [1] [2]

…n, LOESS, docs

coderabbitai · 2026-04-14T07:10:07Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 635b5ef0-652e-4992-b9d1-50cd1d0f711f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codacy-production · 2026-04-14T07:11:34Z

Not up to standards ⛔

🔴 Issues 15 minor

Alerts:
⚠ 15 issues (≤ 0 issues of at least minor severity)

Results:
15 new issues

Category Results

Documentation 14 minor

CodeStyle 1 minor

View in Codacy

🟢 Metrics 235 complexity · 8 duplication

Metric Results

Complexity 235

Duplication 8

View in Codacy

_{TIP This summary will be updated as you push new changes. Give us feedback}

Copilot

Pull request overview

This PR adds new differential expression (DE) analysis capabilities to mokume (LimROTS, DEqMS, proDA), improves protein accession normalization for FASTA matching, and expands CLI/config/docs to expose the new options.

Changes:

Introduces a new mokume.analysis DE API (DifferentialExpression) plus implementations of LimROTS, DEqMS, and proDA, and wires DE into the pipeline with an "auto" method selector.
Improves protein identifier normalization via get_accession()/build_accession_map() and applies it to FASTA-based peptide counting / molecular weight calculation paths.
Adds censored-aware imputation utilities and LOESS normalization, plus new tests and documentation/CLI/config updates for these features.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/test_features2proteins_cli.py	Adds CLI tests covering DirectLFQ and batch correction option plumbing/validation.
tests/test_de_and_imputation.py	Adds tests for DE methods, IHW path, censored imputation, and LOESS normalization.
tests/test_accession_normalization.py	Adds tests for accession parsing/mapping and FASTA sequence cleaning helper.
mokume/quantification/ibaq.py	Normalizes protein identifiers when matching FASTA entries for iBAQ/TPA calculations.
mokume/pipeline/stages.py	Adds DE method auto-selection and optional peptide-count loading for DEqMS; pins pandas groupby `observed`.
mokume/pipeline/features_to_proteins.py	Extends pipeline entrypoint parameters/config wiring for new DE and DirectLFQ options.
mokume/pipeline/config.py	Adds `ImputationConfig`, extends `DEConfig` with `fdr_method`, and adds `imputation` to `PipelineConfig`.
mokume/normalization/loess.py	Introduces LOESS normalization implementation and functional wrapper.
mokume/normalization/init.py	Exports LOESS normalization API.
mokume/io/fasta.py	Applies accession normalization and shared nonstandard-AA stripping in FASTA extraction and MW lookup.
mokume/imputation/censored.py	Adds censored-aware imputation methods (MinProb/MinDet/KNN) and missingness classification.
mokume/imputation/init.py	Exports new censored-aware imputation functions.
mokume/core/constants.py	Enhances `get_accession()` and adds `build_accession_map()` for normalization.
mokume/commands/features2proteins.py	Expands CLI with batch correction options, new DE methods, and `--de-fdr-method`; adds argument validation.
mokume/analysis/proda.py	Adds simplified proDA-style dropout-aware DE implementation.
mokume/analysis/limrots.py	Adds LimROTS DE implementation with bootstrap/permutation optimization.
mokume/analysis/differential_expression.py	Refactors DE orchestration, adds IHW support helpers, and integrates LimROTS/DEqMS/proDA dispatch.
mokume/analysis/deqms.py	Adds DEqMS implementation (spectra-count eBayes with variance–count modeling).
mokume/analysis/init.py	Exports new analysis entrypoints for direct import.
mokume/init.py	Suppresses a pyopenms OPENMS_DATA_PATH warning.
docs/user-guide/features2proteins.md	Updates user guide examples and option tables for new DE/FDR options.
docs/reference/configuration.md	Documents new DE config defaults and `fdr_method`.
docs/reference/cli.md	Updates CLI reference with batch correction and new DE/FDR options.
README.md	Documents new analysis/imputation/normalization modules and provides usage examples.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-14T07:14:30Z

mokume/analysis/differential_expression.py

    def _finalize_results(self, de_df: pd.DataFrame) -> pd.DataFrame:
        """Apply FDR correction and classify significance."""
-        # FDR correction (Benjamini-Hochberg)
-        reject, adj_pvalues, _, _ = multipletests(
-            de_df["pvalue"].values, method="fdr_bh"
-        )
-        de_df["adj_pvalue"] = adj_pvalues
+        if self.fdr_method == "ihw" and "adj_pvalue" not in de_df.columns:
+            de_df["adj_pvalue"] = _ihw_correction(
+                de_df["pvalue"].values,
+                de_df,
+                alpha=self.fdr_threshold,
+            )
+        elif "adj_pvalue" not in de_df.columns:
+            # Benjamini-Hochberg (default)
+            de_df["adj_pvalue"] = multipletests(
+                de_df["pvalue"].values, method="fdr_bh"
+            )[1]



fdr_method is effectively ignored for all DE methods right now: run_limrots, run_deqms, and run_proda each return an adj_pvalue column, and _finalize_results only applies IHW/BH when adj_pvalue is missing. This means --de-fdr-method ihw (and even BH recomputation) won’t take effect. Consider always recomputing/overriding adj_pvalue based on self.fdr_method (optionally preserving any method-specific adjusted values under a different column name).

Copilot · 2026-04-14T07:14:31Z

mokume/analysis/proda.py

+    if n_obs < 5 or n_obs == n_total:
+        return np.nanmedian(observed) - 2.0, 1.0
+


_fit_dropout_curve() calls np.nanmedian(observed) even when observed is empty (n_obs == 0), which yields nan parameters and can propagate NaNs through the DE results. Handle the all-missing case explicitly (e.g., return a fixed fallback rho/zeta based on a global default) before calling nanmedian.

Copilot · 2026-04-14T07:14:31Z

README.md

+# Multiple contrasts
+contrasts = [("Treatment", "Control"), ("Drug", "Control")]
+results = de.run_comparisons(protein_df, sample_to_condition, contrasts)
+# Returns dict: {"Treatment_vs_Control": DataFrame, "Drug_vs_Control": DataFrame}


The README example for run_comparisons() shows keys like "Treatment_vs_Control", but the implementation uses f"{contrast[0]}-{contrast[1]}" (e.g., "Treatment-Control"). Update the example to match the actual keys returned so users can index results correctly.

Suggested change

# Returns dict: {"Treatment_vs_Control": DataFrame, "Drug_vs_Control": DataFrame}

# Returns dict: {"Treatment-Control": DataFrame, "Drug-Control": DataFrame}

Copilot · 2026-04-14T07:14:31Z

README.md

+imputed = impute_censored(log2_matrix, method="minprob", q=0.01, tune_sigma=1.0)
+
+# MinDet: replace NaN with per-column quantile
+imputed = impute_censored(log2_matrix, method="mindet", q=0.01)
+
+# KNN: k-nearest neighbor imputation
+imputed = impute_censored(log2_matrix, method="knn", k=10)


The impute_censored README examples use keyword args (q, tune_sigma, k) that don’t match the actual function signature (quantile, shift, scale, n_neighbors, …). As written, these snippets will raise TypeError. Please update the example arguments to the real parameter names.

Suggested change

imputed = impute_censored(log2_matrix, method="minprob", q=0.01, tune_sigma=1.0)

# MinDet: replace NaN with per-column quantile

imputed = impute_censored(log2_matrix, method="mindet", q=0.01)

# KNN: k-nearest neighbor imputation

imputed = impute_censored(log2_matrix, method="knn", k=10)

imputed = impute_censored(log2_matrix, method="minprob", quantile=0.01, scale=1.0)

# MinDet: replace NaN with per-column quantile

imputed = impute_censored(log2_matrix, method="mindet", quantile=0.01)

# KNN: k-nearest neighbor imputation

imputed = impute_censored(log2_matrix, method="knn", n_neighbors=10)

Copilot · 2026-04-14T07:14:32Z

tests/test_de_and_imputation.py

+class TestIHW:
+    def test_ihw_via_de(self):
+        mat, sa, sb, _ = _make_protein_matrix()
+        wide, s2c = _make_wide_df(mat, sa, sb)
+        de = DifferentialExpression(
+            method="deqms", log2fc_threshold=1.0, fdr_method="ihw", skip_log2=True,
+        )
+        result = de.run(wide, s2c, ("A", "B"))
+        assert len(result) > 0
+        assert "adj_pvalue" in result.columns
+
+    def test_ihw_fallback_on_small_data(self):
+        # Very few proteins → should fall back to BH
+        mat, sa, sb, _ = _make_protein_matrix(n_proteins=5)
+        wide, s2c = _make_wide_df(mat, sa, sb)
+        de = DifferentialExpression(
+            method="deqms", log2fc_threshold=1.0, fdr_method="ihw", skip_log2=True,
+        )
+        result = de.run(wide, s2c, ("A", "B"))
+        assert len(result) > 0
+


The IHW tests currently only assert that an adj_pvalue column exists, which would also pass if BH is being applied (or if IHW is ignored). Since --de-fdr-method ihw is a user-facing option, add an assertion that specifically verifies IHW is applied (or that the BH fallback path is taken for small datasets).

…y-obs guard, README examples

Shen-YuFei added 2 commits April 8, 2026 15:57

fix(fasta): normalize protein ID matching for non-standard formats

6f850ef

feat(analysis): add LimROTS/DEqMS/proDA DE, auto-selection, imputatio…

709ad5c

…n, LOESS, docs

Copilot AI review requested due to automatic review settings April 14, 2026 07:10

Copilot started reviewing on behalf of Shen-YuFei April 14, 2026 07:10 View session

Copilot AI reviewed Apr 14, 2026

View reviewed changes

Shen-YuFei added 7 commits April 14, 2026 16:43

fix(analysis): clear codacy warnings

98fa792

fix(analysis): clear codacy warnings

ad4e619

Merge branch 'dev' of https://github.com/Shen-YuFei/mokume into dev

43ee40d

fix

45c9a62

docs: align documentation with current codebase and fix API kwarg bug

804f8a8

refactor(analysis): remove dead code, YAGNI guards, and unused exports

ac113d9

fix(analysis): address Copilot PR review — IHW fdr_method, proda empt…

c20ef3d

…y-obs guard, README examples

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve protein ID normalization and expand DE analysis in Mokume#18

Improve protein ID normalization and expand DE analysis in Mokume#18
Shen-YuFei wants to merge 9 commits intobigbio:devfrom
Shen-YuFei:dev

Shen-YuFei commented Apr 14, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Apr 14, 2026 •

edited

Loading

Review skipped

Uh oh!

codacy-production bot commented Apr 14, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 14, 2026

Uh oh!

Copilot AI Apr 14, 2026

Uh oh!

Copilot AI Apr 14, 2026

Uh oh!

Copilot AI Apr 14, 2026

Uh oh!

Copilot AI Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if n_obs < 5 or n_obs == n_total:
		return np.nanmedian(observed) - 2.0, 1.0

	# Returns dict: {"Treatment_vs_Control": DataFrame, "Drug_vs_Control": DataFrame}
	# Returns dict: {"Treatment-Control": DataFrame, "Drug-Control": DataFrame}

Conversation

Shen-YuFei commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Differential Expression and Normalization Enhancements

TissueMap Workflow

Installation and Quickstart Improvements

Miscellaneous

Uh oh!

coderabbitai bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

codacy-production bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Not up to standards ⛔

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Shen-YuFei commented Apr 14, 2026 •

edited

Loading

coderabbitai bot commented Apr 14, 2026 •

edited

Loading

codacy-production bot commented Apr 14, 2026 •

edited

Loading