Summary
Currently, the MZML_STATISTICS step only processes .mzML files. Bruker .d files are skipped entirely, meaning no QC statistics are generated for timsTOF datasets.
The previous --convert_dotd option (removed in PR #27) converted .d to mzML so statistics could be computed, but this is wasteful: it converts hundreds of GB just to produce QC metrics.
Proposed Solution
Use pyopenms's .d reading capabilities (or another library that supports Bruker .d natively) to compute QC statistics directly from .d files without mzML conversion.
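A minimal sketch of the reader-agnostic part of this idea, assuming spectra can be streamed one at a time from whichever .d reader is chosen. The Spectrum record, the summarize function, and the statistic names are hypothetical illustrations, not part of pyopenms or the existing module. The actual .d reading call is deliberately left out, since it depends on the library selected.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class Spectrum:
    """Hypothetical reader-agnostic record; not part of any library."""
    ms_level: int
    rt_seconds: float
    intensities: Sequence[float]


def summarize(spectra: Iterable[Spectrum]) -> dict[int, dict[str, float]]:
    """Stream over spectra once and accumulate per-MS-level statistics."""
    stats: dict[int, dict[str, float]] = {}
    for s in spectra:
        # One accumulator per MS level (1 = MS1, 2 = MS2, ...).
        level = stats.setdefault(
            s.ms_level,
            {"num_spectra": 0, "num_peaks": 0, "total_ion_current": 0.0},
        )
        level["num_spectra"] += 1
        level["num_peaks"] += len(s.intensities)
        level["total_ion_current"] += float(sum(s.intensities))
    return stats
```

Because summarize accepts any iterable, a generator that yields spectra from a .d directory could feed it without holding the whole run in memory, and the same function could serve the existing mzML path.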
Context
Per @jpfeuffer's review in PR #27:
"I think the number one priority for this should now be to not convert hundreds of gigabytes of .d to mzml just for a bit of qc. Use the new .d reading capabilities of pyopenms if they are sufficient, otherwise there are many other libraries that can read .d these days."
Acceptance Criteria
- MZML_STATISTICS (or a new module) can compute MS1/MS2 statistics from .d files
- Output uses the *_ms_info.parquet format for consistency