Skip to content

Add parentage functions#57

Open
josuechinchilla wants to merge 50 commits intodevelopmentfrom
add_parentage_functions
Open

Add parentage functions#57
josuechinchilla wants to merge 50 commits intodevelopmentfrom
add_parentage_functions

Conversation

@josuechinchilla
Copy link
Copy Markdown
Collaborator

Added parentage functions, updated thinSNP importFrom statements and updated documentation

Cristianetaniguti and others added 30 commits October 3, 2025 15:11
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands BIGr’s pedigree and MADC processing capabilities by adding new parentage-related functions, a new polyRAD-based multiallelic MADC→VCF converter, and broader MADC sanity checking, along with associated documentation, tests, and CI updates.

Changes:

  • Added validate_pedigree() and find_parentage() for trio validation and parent assignment using Mendelian/homozygous mismatch logic.
  • Added check_madc_sanity() and madc2vcf_multi() plus supporting utilities (vmsg, URL handling) and updated imports/namespace/news.
  • Expanded and updated test coverage and CI (including external PanelHub-based integration-style tests).

Reviewed changes

Copilot reviewed 41 out of 42 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
R/validate_pedigree.R New trio validation + correction/fill outputs for pedigrees.
R/find_parentage.R New parentage assignment function (single-parent + best-pair methods).
R/madc2vcf_multi.R New polyRAD pipeline entrypoint for multiallelic MADC→VCF conversion.
R/check_madc_sanity.R New MADC “sanity check” helper and check_botloci() relocation.
R/get_countsMADC.R Expanded MADC count extraction API (supports object input + collapsing matches).
R/utils.R Added vmsg() and url_exists() utilities; updated globalVariables.
R/filterVCF.R Added quality.rates outputs and modified console/output behavior.
R/imputation_concordance.R Extended concordance API (plotting + print control) and updated docs.
R/check_ped.R Reworked pedigree QC workflow/reporting and added interactive saving behavior.
R/thinSNP.R Narrowed imports to importFrom for slimmer namespace.
tests/testthat/test-validate_pedigree.R New unit tests for validate_pedigree().
tests/testthat/test-find_parentage.R New unit tests for find_parentage().
tests/testthat/test-check_madc_sanity.R Added tests for check_madc_sanity() using external fixtures.
tests/testthat/test-madc2vcf_multi.R Added tests for madc2vcf_multi() (PanelHub fixtures).
tests/testthat/test-madc2vcf_targets.R Large expansion of MADC→VCF target tests (PanelHub fixtures).
tests/testthat/test-madc2vcf_all.R Large expansion of MADC→VCF all-sites tests (PanelHub fixtures).
tests/testthat/test-check_ped.R Updated expectations for changed check_ped() behavior.
tests/testthat/test-imputation_concordance.R Minor formatting update to concordance tests.
tests/testthat/corrected_pedigree.txt Added corrected pedigree fixture/output artifact.
tests/testthat/.gitignore Ignores corrected pedigree output in test directory.
NEWS.md Added release notes for 0.7.0 / recent MADC and utility changes.
DESCRIPTION Version bump to 0.7.0; added imports/suggests updates.
NAMESPACE Exported new functions and added required imports.
man/*.Rd Generated/updated documentation for new/changed functions.
.github/workflows/R-CMD-check.yaml CI installs polyRAD and VariantAnnotation for test execution.
.gitignore Added .DS_Store.
BIGr.Rproj Added ProjectId.
CRAN-SUBMISSION Removed.
cran-comments.md Removed.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +144 to +152
genos_mat <- base::as.matrix(genos, rownames = "ID")

# Homozygous-only matrix
genos_hom <- data.table::copy(genos)
marker_cols <- base::setdiff(base::names(genos_hom), "ID")
for (col in marker_cols) {
genos_hom[base::get(col) == 1, (col) := NA_integer_]
}
genos_hom_mat <- base::as.matrix(genos_hom, rownames = "ID")
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as.matrix() does not accept a rownames = argument here, and converting the full genos table (including the ID column) would coerce the entire matrix to character, breaking all downstream numeric comparisons. Build the genotype matrix from marker columns only, then set rownames(genos_mat) <- genos$ID explicitly.

Suggested change
genos_mat <- base::as.matrix(genos, rownames = "ID")
# Homozygous-only matrix
genos_hom <- data.table::copy(genos)
marker_cols <- base::setdiff(base::names(genos_hom), "ID")
for (col in marker_cols) {
genos_hom[base::get(col) == 1, (col) := NA_integer_]
}
genos_hom_mat <- base::as.matrix(genos_hom, rownames = "ID")
marker_cols <- base::setdiff(base::names(genos), "ID")
genos_mat <- base::as.matrix(genos[, ..marker_cols])
base::rownames(genos_mat) <- genos$ID
# Homozygous-only matrix
genos_hom <- data.table::copy(genos)
for (col in marker_cols) {
genos_hom[base::get(col) == 1, (col) := NA_integer_]
}
genos_hom_mat <- base::as.matrix(genos_hom[, ..marker_cols])
base::rownames(genos_hom_mat) <- genos_hom$ID

Copilot uses AI. Check for mistakes.
for (col in marker_cols) {
genos_hom[base::get(col) == 1, (col) := NA_integer_]
}
genos_hom_mat <- base::as.matrix(genos_hom, rownames = "ID")
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as above for genos_hom_mat: as.matrix(genos_hom, rownames = "ID") will either error (unused argument) or include the ID column and coerce to character. Create the homozygous-only numeric matrix from marker columns and set rownames from genos_hom$ID.

Suggested change
genos_hom_mat <- base::as.matrix(genos_hom, rownames = "ID")
genos_hom_mat <- base::as.matrix(genos_hom[, ..marker_cols])
base::rownames(genos_hom_mat) <- genos_hom$ID

Copilot uses AI. Check for mistakes.
comparisons <- base::sum(!base::is.na(cand_hom) & !base::is.na(prog_hom))
if (comparisons == 0) return(NA_real_)
(base::sum(cand_hom != prog_hom, na.rm = TRUE) / comparisons) * 100
})
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which.min(errors) returns integer(0) when all entries are NA (e.g., when comparisons == 0 for every candidate). In that case this returns id = character(0) / error_pct = numeric(0) rather than NA. Add an explicit guard like if (all(is.na(errors))) (or length(best_idx)==0) to return NA_character_ / NA_real_.

Suggested change
})
})
if (base::all(base::is.na(errors))) {
return(base::list(id = NA_character_, error_pct = NA_real_))
}

Copilot uses AI. Check for mistakes.
Comment on lines +148 to +149
parent_genos <- base::as.matrix(genos_hom[ID %in% parent_ids], rownames = "ID")
progeny_genos <- base::as.matrix(genos_hom[ID %in% progeny_candidates$ID], rownames = "ID")
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as.matrix(genos_hom[...], rownames = "ID") is not a valid as.matrix() call and (even if it worked) would include the ID column and coerce to character. For mismatch calculations, subset marker columns only and set rownames from the ID column explicitly.

Suggested change
parent_genos <- base::as.matrix(genos_hom[ID %in% parent_ids], rownames = "ID")
progeny_genos <- base::as.matrix(genos_hom[ID %in% progeny_candidates$ID], rownames = "ID")
parent_dt <- genos_hom[ID %in% parent_ids]
parent_genos <- base::as.matrix(parent_dt[, ..marker_cols])
base::rownames(parent_genos) <- parent_dt$ID
progeny_dt <- genos_hom[ID %in% progeny_candidates$ID]
progeny_genos <- base::as.matrix(progeny_dt[, ..marker_cols])
base::rownames(progeny_genos) <- progeny_dt$ID

Copilot uses AI. Check for mistakes.
}
#### Logic for Best Pair Method ####
if (method == "best_pair") {
genos_mat <- base::as.matrix(genos, rownames = "ID")
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

genos_mat <- as.matrix(genos, rownames = "ID") has the same problem: invalid argument and would include ID causing character coercion. Create a numeric marker matrix with rownames(genos_mat) <- genos$ID and exclude the ID column before comparisons.

Suggested change
genos_mat <- base::as.matrix(genos, rownames = "ID")
genos_mat <- base::as.matrix(genos[, !("ID"), with = FALSE])
base::rownames(genos_mat) <- genos$ID

Copilot uses AI. Check for mistakes.
Comment on lines +73 to +78
test_that("simu alfalfa",{

github_path <- "https://raw.githubusercontent.com/Breeding-Insight/BIGapp-PanelHub/refs/heads/long_seq/"

# External alfalfa test files
alfalfa_madc <- paste0(github_path, "test_madcs/alfalfa_madc.csv")
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same nested-test structure issue here: multiple test_that() calls are nested inside an outer test_that("simu alfalfa", ...). Please refactor to avoid nested test_that() and gate network-heavy tests with skip_on_cran() (and possibly skip_on_ci() if they are slow/flaky).

Copilot uses AI. Check for mistakes.
Comment on lines +3 to +8
github_path <- "https://raw.githubusercontent.com/Breeding-Insight/BIGapp-PanelHub/refs/heads/long_seq/test_madcs/"
names <- c("Columns", "FixAlleleIDs", "IUPACcodes", "LowerCase", "Indels", "ChromPos", "allNAcol", "allNArow", "RefAltSeqs", "OtherAlleles")

# raw madc
report <- read.csv(paste0(github_path,"/alfalfa_madc_raw.csv"))

Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests download MADC fixtures directly from GitHub via read.csv() but do not call skip_if_offline() / skip_on_cran(). This will fail in offline/CRAN environments. Add appropriate skips (or vendor minimal fixtures into tests/testthat/) before attempting network I/O.

Copilot uses AI. Check for mistakes.
Comment on lines +3 to +8
#' This function converts a DArTag MADC file to a VCF using the polyRAD package's
#' `readDArTag` and `RADdata2VCF` pipeline. It runs `check_madc_sanity` before
#' loading the data, applies corrections for lowercase sequences and all-NA
#' rows/columns, and sets `n.header.rows` automatically based on whether the
#' MADC file follows the raw DArT format (6 header rows) or the fixed allele ID
#' format (no header rows).
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation states the function supports raw DArT MADC format and “sets n.header.rows automatically”, but the implementation stops when FixAlleleIDs is FALSE and always calls readDArTag(..., n.header.rows = 0L). Please align the docs with actual behavior (either document that raw MADCs are rejected, or implement the advertised auto-detection).

Copilot uses AI. Check for mistakes.
if (!"Sex" %in% base::colnames(all_parents)) {
warning("No 'Sex' column in parents file. All parents treated as ambiguous ('A').")
all_parents[, Sex := "A"]
}
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a Sex column exists but contains missing values (NA), toupper(NA) stays NA and the subsequent %in% c("M","A","NA") filters will drop those rows rather than treating them as ambiguous. Consider normalizing missing/blank Sex values to "A" (or explicitly including is.na(Sex) in the candidate filters) so candidate sets don't silently shrink.

Suggested change
}
}
all_parents[, Sex := base::as.character(Sex)]
all_parents[, Sex := trimws(Sex)]
all_parents[is.na(Sex) | Sex == "", Sex := "A"]

Copilot uses AI. Check for mistakes.
Comment on lines +4 to +8
#' Performs nine quick validations on an allele report:
#' 1) **Columns** - required columns are present (`CloneID`, `AlleleID`, `AlleleSequence`);
#' 2) **FixAlleleIDs** - first column's first up-to-6 rows are not all blank or `"*"`
#' *and* both `_0001` and `_0002` appear in `AlleleID`;
#' 3) **IUPACcodes** - presence of non-ATCG characters in `AlleleSequence`;
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The roxygen docs say “nine quick validations” and enumerate 1–9, but the implementation also computes an OtherAlleles check. Please update the description/return docs to reflect the actual set/count of checks returned in checks.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants