Conversation
…t/BIGr into ped_indels_update
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Madc2vcf updates
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ht/BIGr into ped_indels_update
Modified Sire and Dam calls to male_parent and female_parent to be species agnostic. Modified the test files to accomodate the changes in functions Modified importFrom statement for thinSNP to call specific functions and resolve warnings when installing BIGr Updated package documentation
There was a problem hiding this comment.
Pull request overview
This PR expands BIGr’s pedigree and MADC processing capabilities by adding new parentage-related functions, a new polyRAD-based multiallelic MADC→VCF converter, and broader MADC sanity checking, along with associated documentation, tests, and CI updates.
Changes:
- Added
validate_pedigree()andfind_parentage()for trio validation and parent assignment using Mendelian/homozygous mismatch logic. - Added
check_madc_sanity()andmadc2vcf_multi()plus supporting utilities (vmsg, URL handling) and updated imports/namespace/news. - Expanded and updated test coverage and CI (including external PanelHub-based integration-style tests).
Reviewed changes
Copilot reviewed 41 out of 42 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| R/validate_pedigree.R | New trio validation + correction/fill outputs for pedigrees. |
| R/find_parentage.R | New parentage assignment function (single-parent + best-pair methods). |
| R/madc2vcf_multi.R | New polyRAD pipeline entrypoint for multiallelic MADC→VCF conversion. |
| R/check_madc_sanity.R | New MADC “sanity check” helper and check_botloci() relocation. |
| R/get_countsMADC.R | Expanded MADC count extraction API (supports object input + collapsing matches). |
| R/utils.R | Added vmsg() and url_exists() utilities; updated globalVariables. |
| R/filterVCF.R | Added quality.rates outputs and modified console/output behavior. |
| R/imputation_concordance.R | Extended concordance API (plotting + print control) and updated docs. |
| R/check_ped.R | Reworked pedigree QC workflow/reporting and added interactive saving behavior. |
| R/thinSNP.R | Narrowed imports to importFrom for slimmer namespace. |
| tests/testthat/test-validate_pedigree.R | New unit tests for validate_pedigree(). |
| tests/testthat/test-find_parentage.R | New unit tests for find_parentage(). |
| tests/testthat/test-check_madc_sanity.R | Added tests for check_madc_sanity() using external fixtures. |
| tests/testthat/test-madc2vcf_multi.R | Added tests for madc2vcf_multi() (PanelHub fixtures). |
| tests/testthat/test-madc2vcf_targets.R | Large expansion of MADC→VCF target tests (PanelHub fixtures). |
| tests/testthat/test-madc2vcf_all.R | Large expansion of MADC→VCF all-sites tests (PanelHub fixtures). |
| tests/testthat/test-check_ped.R | Updated expectations for changed check_ped() behavior. |
| tests/testthat/test-imputation_concordance.R | Minor formatting update to concordance tests. |
| tests/testthat/corrected_pedigree.txt | Added corrected pedigree fixture/output artifact. |
| tests/testthat/.gitignore | Ignores corrected pedigree output in test directory. |
| NEWS.md | Added release notes for 0.7.0 / recent MADC and utility changes. |
| DESCRIPTION | Version bump to 0.7.0; added imports/suggests updates. |
| NAMESPACE | Exported new functions and added required imports. |
| man/*.Rd | Generated/updated documentation for new/changed functions. |
| .github/workflows/R-CMD-check.yaml | CI installs polyRAD and VariantAnnotation for test execution. |
| .gitignore | Added .DS_Store. |
| BIGr.Rproj | Added ProjectId. |
| CRAN-SUBMISSION | Removed. |
| cran-comments.md | Removed. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| genos_mat <- base::as.matrix(genos, rownames = "ID") | ||
|
|
||
| # Homozygous-only matrix | ||
| genos_hom <- data.table::copy(genos) | ||
| marker_cols <- base::setdiff(base::names(genos_hom), "ID") | ||
| for (col in marker_cols) { | ||
| genos_hom[base::get(col) == 1, (col) := NA_integer_] | ||
| } | ||
| genos_hom_mat <- base::as.matrix(genos_hom, rownames = "ID") |
There was a problem hiding this comment.
as.matrix() does not accept a rownames = argument here, and converting the full genos table (including the ID column) would coerce the entire matrix to character, breaking all downstream numeric comparisons. Build the genotype matrix from marker columns only, then set rownames(genos_mat) <- genos$ID explicitly.
| genos_mat <- base::as.matrix(genos, rownames = "ID") | |
| # Homozygous-only matrix | |
| genos_hom <- data.table::copy(genos) | |
| marker_cols <- base::setdiff(base::names(genos_hom), "ID") | |
| for (col in marker_cols) { | |
| genos_hom[base::get(col) == 1, (col) := NA_integer_] | |
| } | |
| genos_hom_mat <- base::as.matrix(genos_hom, rownames = "ID") | |
| marker_cols <- base::setdiff(base::names(genos), "ID") | |
| genos_mat <- base::as.matrix(genos[, ..marker_cols]) | |
| base::rownames(genos_mat) <- genos$ID | |
| # Homozygous-only matrix | |
| genos_hom <- data.table::copy(genos) | |
| for (col in marker_cols) { | |
| genos_hom[base::get(col) == 1, (col) := NA_integer_] | |
| } | |
| genos_hom_mat <- base::as.matrix(genos_hom[, ..marker_cols]) | |
| base::rownames(genos_hom_mat) <- genos_hom$ID |
| for (col in marker_cols) { | ||
| genos_hom[base::get(col) == 1, (col) := NA_integer_] | ||
| } | ||
| genos_hom_mat <- base::as.matrix(genos_hom, rownames = "ID") |
There was a problem hiding this comment.
Same issue as above for genos_hom_mat: as.matrix(genos_hom, rownames = "ID") will either error (unused argument) or include the ID column and coerce to character. Create the homozygous-only numeric matrix from marker columns and set rownames from genos_hom$ID.
| genos_hom_mat <- base::as.matrix(genos_hom, rownames = "ID") | |
| genos_hom_mat <- base::as.matrix(genos_hom[, ..marker_cols]) | |
| base::rownames(genos_hom_mat) <- genos_hom$ID |
| comparisons <- base::sum(!base::is.na(cand_hom) & !base::is.na(prog_hom)) | ||
| if (comparisons == 0) return(NA_real_) | ||
| (base::sum(cand_hom != prog_hom, na.rm = TRUE) / comparisons) * 100 | ||
| }) |
There was a problem hiding this comment.
which.min(errors) returns integer(0) when all entries are NA (e.g., when comparisons == 0 for every candidate). In that case this returns id = character(0) / error_pct = numeric(0) rather than NA. Add an explicit guard like if (all(is.na(errors))) (or length(best_idx)==0) to return NA_character_ / NA_real_.
| }) | |
| }) | |
| if (base::all(base::is.na(errors))) { | |
| return(base::list(id = NA_character_, error_pct = NA_real_)) | |
| } |
| parent_genos <- base::as.matrix(genos_hom[ID %in% parent_ids], rownames = "ID") | ||
| progeny_genos <- base::as.matrix(genos_hom[ID %in% progeny_candidates$ID], rownames = "ID") |
There was a problem hiding this comment.
as.matrix(genos_hom[...], rownames = "ID") is not a valid as.matrix() call and (even if it worked) would include the ID column and coerce to character. For mismatch calculations, subset marker columns only and set rownames from the ID column explicitly.
| parent_genos <- base::as.matrix(genos_hom[ID %in% parent_ids], rownames = "ID") | |
| progeny_genos <- base::as.matrix(genos_hom[ID %in% progeny_candidates$ID], rownames = "ID") | |
| parent_dt <- genos_hom[ID %in% parent_ids] | |
| parent_genos <- base::as.matrix(parent_dt[, ..marker_cols]) | |
| base::rownames(parent_genos) <- parent_dt$ID | |
| progeny_dt <- genos_hom[ID %in% progeny_candidates$ID] | |
| progeny_genos <- base::as.matrix(progeny_dt[, ..marker_cols]) | |
| base::rownames(progeny_genos) <- progeny_dt$ID |
| } | ||
| #### Logic for Best Pair Method #### | ||
| if (method == "best_pair") { | ||
| genos_mat <- base::as.matrix(genos, rownames = "ID") |
There was a problem hiding this comment.
genos_mat <- as.matrix(genos, rownames = "ID") has the same problem: invalid argument and would include ID causing character coercion. Create a numeric marker matrix with rownames(genos_mat) <- genos$ID and exclude the ID column before comparisons.
| genos_mat <- base::as.matrix(genos, rownames = "ID") | |
| genos_mat <- base::as.matrix(genos[, !("ID"), with = FALSE]) | |
| base::rownames(genos_mat) <- genos$ID |
| test_that("simu alfalfa",{ | ||
|
|
||
| github_path <- "https://raw.githubusercontent.com/Breeding-Insight/BIGapp-PanelHub/refs/heads/long_seq/" | ||
|
|
||
| # External alfalfa test files | ||
| alfalfa_madc <- paste0(github_path, "test_madcs/alfalfa_madc.csv") |
There was a problem hiding this comment.
Same nested-test structure issue here: multiple test_that() calls are nested inside an outer test_that("simu alfalfa", ...). Please refactor to avoid nested test_that() and gate network-heavy tests with skip_on_cran() (and possibly skip_on_ci() if they are slow/flaky).
| github_path <- "https://raw.githubusercontent.com/Breeding-Insight/BIGapp-PanelHub/refs/heads/long_seq/test_madcs/" | ||
| names <- c("Columns", "FixAlleleIDs", "IUPACcodes", "LowerCase", "Indels", "ChromPos", "allNAcol", "allNArow", "RefAltSeqs", "OtherAlleles") | ||
|
|
||
| # raw madc | ||
| report <- read.csv(paste0(github_path,"/alfalfa_madc_raw.csv")) | ||
|
|
There was a problem hiding this comment.
These tests download MADC fixtures directly from GitHub via read.csv() but do not call skip_if_offline() / skip_on_cran(). This will fail in offline/CRAN environments. Add appropriate skips (or vendor minimal fixtures into tests/testthat/) before attempting network I/O.
| #' This function converts a DArTag MADC file to a VCF using the polyRAD package's | ||
| #' `readDArTag` and `RADdata2VCF` pipeline. It runs `check_madc_sanity` before | ||
| #' loading the data, applies corrections for lowercase sequences and all-NA | ||
| #' rows/columns, and sets `n.header.rows` automatically based on whether the | ||
| #' MADC file follows the raw DArT format (6 header rows) or the fixed allele ID | ||
| #' format (no header rows). |
There was a problem hiding this comment.
The documentation states the function supports raw DArT MADC format and “sets n.header.rows automatically”, but the implementation stops when FixAlleleIDs is FALSE and always calls readDArTag(..., n.header.rows = 0L). Please align the docs with actual behavior (either document that raw MADCs are rejected, or implement the advertised auto-detection).
| if (!"Sex" %in% base::colnames(all_parents)) { | ||
| warning("No 'Sex' column in parents file. All parents treated as ambiguous ('A').") | ||
| all_parents[, Sex := "A"] | ||
| } |
There was a problem hiding this comment.
If a Sex column exists but contains missing values (NA), toupper(NA) stays NA and the subsequent %in% c("M","A","NA") filters will drop those rows rather than treating them as ambiguous. Consider normalizing missing/blank Sex values to "A" (or explicitly including is.na(Sex) in the candidate filters) so candidate sets don't silently shrink.
| } | |
| } | |
| all_parents[, Sex := base::as.character(Sex)] | |
| all_parents[, Sex := trimws(Sex)] | |
| all_parents[is.na(Sex) | Sex == "", Sex := "A"] |
| #' Performs nine quick validations on an allele report: | ||
| #' 1) **Columns** - required columns are present (`CloneID`, `AlleleID`, `AlleleSequence`); | ||
| #' 2) **FixAlleleIDs** - first column's first up-to-6 rows are not all blank or `"*"` | ||
| #' *and* both `_0001` and `_0002` appear in `AlleleID`; | ||
| #' 3) **IUPACcodes** - presence of non-ATCG characters in `AlleleSequence`; |
There was a problem hiding this comment.
The roxygen docs say “nine quick validations” and enumerate 1–9, but the implementation also computes an OtherAlleles check. Please update the description/return docs to reflect the actual set/count of checks returned in checks.
Added parentage functions, updated thinSNP importFrom statements and updated documentation