Skip to content

Improve protein ID normalization and expand DE analysis in Mokume#18

Open
Shen-YuFei wants to merge 9 commits intobigbio:devfrom
Shen-YuFei:dev
Open

Improve protein ID normalization and expand DE analysis in Mokume#18
Shen-YuFei wants to merge 9 commits intobigbio:devfrom
Shen-YuFei:dev

Conversation

@Shen-YuFei
Copy link
Copy Markdown
Contributor

@Shen-YuFei Shen-YuFei commented Apr 14, 2026

This pull request introduces several new features and documentation improvements to the mokume project, with a focus on expanding support for differential expression analysis, normalization methods (including LOESS), and the new TissueMap workflow for tissue atlas analysis. The CLI and configuration documentation have been updated to reflect new options and workflows, and installation instructions now cover optional dependencies for specialized features.

Major new features and documentation updates:

Differential Expression and Normalization Enhancements

  • Added a new analysis/ module with support for modern differential expression methods (limrots, deqms, proda), and documented their usage in the main README.md, including Python API examples and CLI integration. [1] [2]
  • Introduced LOESS normalization as a standalone utility, with code examples and documentation in both the main README.md and concepts/normalization.md. [1] [2] [3]
  • Updated CLI and configuration documentation to reflect new normalization and DE options, including expanded choices for normalization methods, FDR correction, and DE method auto-selection. [1] [2] [3] [4] [5] [6]

TissueMap Workflow

  • Added documentation and installation instructions for the new tissuemap workflow, enabling per-dataset tissue atlas analysis with AdaTiSS scoring, AnnData outputs, and atlas plots. This includes CLI documentation, quickstart examples, and optional dependency installation guidance. [1] [2] [3] [4] [5] [6] [7]

Installation and Quickstart Improvements

  • Clarified installation instructions for standard and optional features, including batch correction and TissueMap extras. [1] [2] [3]
  • Updated quickstart and workflow documentation to guide users through both standard quantification and tissue atlas analysis paths. [1] [2]

Miscellaneous

  • Updated CLI reference to reflect expanded commands (now six main commands), new options, and improved descriptions for normalization, DE, and batch correction. [1] [2] [3]
  • Minor corrections and clarifications in documentation, including code identifier updates and improved workflow descriptions. [1] [2]

Copilot AI review requested due to automatic review settings April 14, 2026 07:10
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 14, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 635b5ef0-652e-4992-b9d1-50cd1d0f711f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codacy-production
Copy link
Copy Markdown

codacy-production bot commented Apr 14, 2026

Not up to standards ⛔

🔴 Issues 15 minor

Alerts:
⚠ 15 issues (≤ 0 issues of at least minor severity)

Results:
15 new issues

Category Results
Documentation 14 minor
CodeStyle 1 minor

View in Codacy

🟢 Metrics 235 complexity · 8 duplication

Metric Results
Complexity 235
Duplication 8

View in Codacy

TIP This summary will be updated as you push new changes. Give us feedback

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds new differential expression (DE) analysis capabilities to mokume (LimROTS, DEqMS, proDA), improves protein accession normalization for FASTA matching, and expands CLI/config/docs to expose the new options.

Changes:

  • Introduces a new mokume.analysis DE API (DifferentialExpression) plus implementations of LimROTS, DEqMS, and proDA, and wires DE into the pipeline with an "auto" method selector.
  • Improves protein identifier normalization via get_accession()/build_accession_map() and applies it to FASTA-based peptide counting / molecular weight calculation paths.
  • Adds censored-aware imputation utilities and LOESS normalization, plus new tests and documentation/CLI/config updates for these features.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/test_features2proteins_cli.py Adds CLI tests covering DirectLFQ and batch correction option plumbing/validation.
tests/test_de_and_imputation.py Adds tests for DE methods, IHW path, censored imputation, and LOESS normalization.
tests/test_accession_normalization.py Adds tests for accession parsing/mapping and FASTA sequence cleaning helper.
mokume/quantification/ibaq.py Normalizes protein identifiers when matching FASTA entries for iBAQ/TPA calculations.
mokume/pipeline/stages.py Adds DE method auto-selection and optional peptide-count loading for DEqMS; pins pandas groupby observed.
mokume/pipeline/features_to_proteins.py Extends pipeline entrypoint parameters/config wiring for new DE and DirectLFQ options.
mokume/pipeline/config.py Adds ImputationConfig, extends DEConfig with fdr_method, and adds imputation to PipelineConfig.
mokume/normalization/loess.py Introduces LOESS normalization implementation and functional wrapper.
mokume/normalization/init.py Exports LOESS normalization API.
mokume/io/fasta.py Applies accession normalization and shared nonstandard-AA stripping in FASTA extraction and MW lookup.
mokume/imputation/censored.py Adds censored-aware imputation methods (MinProb/MinDet/KNN) and missingness classification.
mokume/imputation/init.py Exports new censored-aware imputation functions.
mokume/core/constants.py Enhances get_accession() and adds build_accession_map() for normalization.
mokume/commands/features2proteins.py Expands CLI with batch correction options, new DE methods, and --de-fdr-method; adds argument validation.
mokume/analysis/proda.py Adds simplified proDA-style dropout-aware DE implementation.
mokume/analysis/limrots.py Adds LimROTS DE implementation with bootstrap/permutation optimization.
mokume/analysis/differential_expression.py Refactors DE orchestration, adds IHW support helpers, and integrates LimROTS/DEqMS/proDA dispatch.
mokume/analysis/deqms.py Adds DEqMS implementation (spectra-count eBayes with variance–count modeling).
mokume/analysis/init.py Exports new analysis entrypoints for direct import.
mokume/init.py Suppresses a pyopenms OPENMS_DATA_PATH warning.
docs/user-guide/features2proteins.md Updates user guide examples and option tables for new DE/FDR options.
docs/reference/configuration.md Documents new DE config defaults and fdr_method.
docs/reference/cli.md Updates CLI reference with batch correction and new DE/FDR options.
README.md Documents new analysis/imputation/normalization modules and provides usage examples.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 204 to 217
def _finalize_results(self, de_df: pd.DataFrame) -> pd.DataFrame:
"""Apply FDR correction and classify significance."""
# FDR correction (Benjamini-Hochberg)
reject, adj_pvalues, _, _ = multipletests(
de_df["pvalue"].values, method="fdr_bh"
)
de_df["adj_pvalue"] = adj_pvalues
if self.fdr_method == "ihw" and "adj_pvalue" not in de_df.columns:
de_df["adj_pvalue"] = _ihw_correction(
de_df["pvalue"].values,
de_df,
alpha=self.fdr_threshold,
)
elif "adj_pvalue" not in de_df.columns:
# Benjamini-Hochberg (default)
de_df["adj_pvalue"] = multipletests(
de_df["pvalue"].values, method="fdr_bh"
)[1]

Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fdr_method is effectively ignored for all DE methods right now: run_limrots, run_deqms, and run_proda each return an adj_pvalue column, and _finalize_results only applies IHW/BH when adj_pvalue is missing. This means --de-fdr-method ihw (and even BH recomputation) won’t take effect. Consider always recomputing/overriding adj_pvalue based on self.fdr_method (optionally preserving any method-specific adjusted values under a different column name).

Copilot uses AI. Check for mistakes.
Comment on lines +41 to +43
if n_obs < 5 or n_obs == n_total:
return np.nanmedian(observed) - 2.0, 1.0

Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_fit_dropout_curve() calls np.nanmedian(observed) even when observed is empty (n_obs == 0), which yields nan parameters and can propagate NaNs through the DE results. Handle the all-missing case explicitly (e.g., return a fixed fallback rho/zeta based on a global default) before calling nanmedian.

Copilot uses AI. Check for mistakes.
README.md Outdated
# Multiple contrasts
contrasts = [("Treatment", "Control"), ("Drug", "Control")]
results = de.run_comparisons(protein_df, sample_to_condition, contrasts)
# Returns dict: {"Treatment_vs_Control": DataFrame, "Drug_vs_Control": DataFrame}
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README example for run_comparisons() shows keys like "Treatment_vs_Control", but the implementation uses f"{contrast[0]}-{contrast[1]}" (e.g., "Treatment-Control"). Update the example to match the actual keys returned so users can index results correctly.

Suggested change
# Returns dict: {"Treatment_vs_Control": DataFrame, "Drug_vs_Control": DataFrame}
# Returns dict: {"Treatment-Control": DataFrame, "Drug-Control": DataFrame}

Copilot uses AI. Check for mistakes.
README.md Outdated
Comment on lines +229 to +235
imputed = impute_censored(log2_matrix, method="minprob", q=0.01, tune_sigma=1.0)

# MinDet: replace NaN with per-column quantile
imputed = impute_censored(log2_matrix, method="mindet", q=0.01)

# KNN: k-nearest neighbor imputation
imputed = impute_censored(log2_matrix, method="knn", k=10)
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The impute_censored README examples use keyword args (q, tune_sigma, k) that don’t match the actual function signature (quantile, shift, scale, n_neighbors, …). As written, these snippets will raise TypeError. Please update the example arguments to the real parameter names.

Suggested change
imputed = impute_censored(log2_matrix, method="minprob", q=0.01, tune_sigma=1.0)
# MinDet: replace NaN with per-column quantile
imputed = impute_censored(log2_matrix, method="mindet", q=0.01)
# KNN: k-nearest neighbor imputation
imputed = impute_censored(log2_matrix, method="knn", k=10)
imputed = impute_censored(log2_matrix, method="minprob", quantile=0.01, scale=1.0)
# MinDet: replace NaN with per-column quantile
imputed = impute_censored(log2_matrix, method="mindet", quantile=0.01)
# KNN: k-nearest neighbor imputation
imputed = impute_censored(log2_matrix, method="knn", n_neighbors=10)

Copilot uses AI. Check for mistakes.
Comment on lines +183 to +203
class TestIHW:
def test_ihw_via_de(self):
mat, sa, sb, _ = _make_protein_matrix()
wide, s2c = _make_wide_df(mat, sa, sb)
de = DifferentialExpression(
method="deqms", log2fc_threshold=1.0, fdr_method="ihw", skip_log2=True,
)
result = de.run(wide, s2c, ("A", "B"))
assert len(result) > 0
assert "adj_pvalue" in result.columns

def test_ihw_fallback_on_small_data(self):
# Very few proteins → should fall back to BH
mat, sa, sb, _ = _make_protein_matrix(n_proteins=5)
wide, s2c = _make_wide_df(mat, sa, sb)
de = DifferentialExpression(
method="deqms", log2fc_threshold=1.0, fdr_method="ihw", skip_log2=True,
)
result = de.run(wide, s2c, ("A", "B"))
assert len(result) > 0

Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IHW tests currently only assert that an adj_pvalue column exists, which would also pass if BH is being applied (or if IHW is ignored). Since --de-fdr-method ihw is a user-facing option, add an assertion that specifically verifies IHW is applied (or that the BH fallback path is taken for small datasets).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants