ARGprep Pipeline

This repository provides a Snakemake workflow that processes AnchorWave MAFs directly into per-contig site outputs: all-sites VCFs, variant-only VCFs, and BED masks. Written with the aid of Codex and Claude. Note that v1.0 is a major rewrite of v0.4 and no longer uses Tassel or GATK.

Requirements

Conda
The environment defined in argprep.yml

Setup

conda env create -f argprep.yml
conda activate argprep

Configure

Create or edit a config file such as options.yaml. The workflow requires --configfile and will fail if it is omitted.

Required keys:

maf_dir: directory containing *.maf or *.maf.gz
reference_fasta: reference FASTA path

Optional path keys:

results_dir: output directory (default: results)

Optional controls (defaults shown):

max_missing_count - no default; see missingness thresholds below
max_missing_fraction - no default; see missingness thresholds below
mask_indels: true - mask reference positions overlapped by deletions
mask_indel_adjacent_snps: true - mask SNPs immediately flanking an indel (only applies when mask_indels: true)
treat_n_as_missing: false - treat N bases as missing rather than as a call
allow_multiallelic_snps: true - retain sites with more than two alleles
add_ref: false - append a synthetic REF sample (genotype 0) to both VCFs
summary_window_bp: 100000 - window size in bp for the per-contig plots in summary.html

SLURM resource overrides (for the direct_maf_sites rule):

maf_threads: 2
maf_mem_mb: 48000
maf_time: "24:00:00"

SLURM profile keys (required when using --profile profiles/slurm):

slurm_account
slurm_partition

Advanced overrides:

contigs: restrict the run to specific contigs instead of using the automatic shared-contig behavior
samples: restrict the run to specific sample basenames instead of using all *.maf / *.maf.gz files in maf_dir

Contig and sample selection behavior:

If samples is omitted, samples are auto-discovered from both *.maf and *.maf.gz in maf_dir.
If both <sample>.maf and <sample>.maf.gz exist, <sample>.maf is used.
If contigs is omitted, the workflow uses the intersection of contigs present in all selected MAFs.
Requested contigs are matched to reference .fai contigs with normalization (for example chr01 can map to 1 when unambiguous).
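The selection rules above can be sketched in Python. This is an illustrative sketch, not the workflow's actual code: the function names are hypothetical, and the normalization rule (dropping a leading "chr" and leading zeros) is an assumption based only on the chr01 to 1 example.

```python
import re


def select_maf_files(filenames):
    """Map sample basename -> filename, preferring <sample>.maf over <sample>.maf.gz."""
    chosen = {}
    for name in filenames:
        if name.endswith(".maf"):
            # A plain .maf always wins, even if a .maf.gz was seen first.
            chosen[name[: -len(".maf")]] = name
        elif name.endswith(".maf.gz"):
            # Use the gzipped file only if no plain .maf is recorded.
            chosen.setdefault(name[: -len(".maf.gz")], name)
    return chosen


def shared_contigs(contigs_per_sample):
    """Return the contigs present in every selected sample."""
    sets = [set(contigs) for contigs in contigs_per_sample.values()]
    return sorted(set.intersection(*sets)) if sets else []


def normalize_contig(name):
    """Assumed normalization: strip a 'chr' prefix and leading zeros."""
    key = re.sub(r"^chr", "", name, flags=re.IGNORECASE)
    return key.lstrip("0") or "0"


def match_contig(requested, reference_contigs):
    """Return the reference contig matching `requested`, or None if ambiguous or absent."""
    if requested in reference_contigs:
        return requested
    key = normalize_contig(requested)
    hits = [c for c in reference_contigs if normalize_contig(c) == key]
    return hits[0] if len(hits) == 1 else None
```

With these definitions, match_contig("chr01", ["1", "2"]) resolves to "1", while a reference that contains both "1" and "chr1" is treated as ambiguous for the request "chr01" and yields None.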

Example CLI override:

snakemake -j 8 --configfile options.yaml --config samples='["sampleA","sampleB"]' contigs='["chr1","chr2"]'

Missingness thresholds:

max_missing_count is an absolute number of missing samples allowed at a retained site.
max_missing_fraction is a fraction of samples allowed to be missing.
If both are set, the workflow uses the stricter threshold.
The fraction is converted to a count with downward truncation. For example, with 10 samples, 0.15 allows 1 missing sample.
If neither is set, the default is 0 - any site where even one sample is unaligned or missing is masked. Set one of these options explicitly if you want to retain sites with partial coverage.
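The threshold rules above amount to a small calculation. The following is a minimal sketch under the stated rules; the function name allowed_missing is hypothetical and is not taken from the workflow's source.

```python
import math


def allowed_missing(n_samples, max_missing_count=None, max_missing_fraction=None):
    """Return the effective number of samples allowed to be missing at a site."""
    limits = []
    if max_missing_count is not None:
        limits.append(int(max_missing_count))
    if max_missing_fraction is not None:
        # The fraction is converted to a count by rounding down.
        limits.append(math.floor(max_missing_fraction * n_samples))
    # If both are set, the stricter (smaller) limit applies; if neither, default to 0.
    return min(limits) if limits else 0


# 10 samples with max_missing_fraction 0.15 allows 1 missing sample.
assert allowed_missing(10, max_missing_fraction=0.15) == 1
# With both set, the stricter of 3 and 1 is used.
assert allowed_missing(10, max_missing_count=3, max_missing_fraction=0.15) == 1
# With neither set, no missing samples are tolerated.
assert allowed_missing(10) == 0
```

Because this sketch multiplies floats, a fraction that is not exactly representable can land just below an integer boundary; how the pipeline itself handles that edge case is not stated here.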

Indel masking behavior:

mask_indels: true masks reference positions directly overlapped by deletions.
mask_indel_adjacent_snps: true additionally masks SNPs immediately adjacent to an insertion or deletion.
mask_indels: false disables indel-based masking entirely, so indel-overlapped and indel-adjacent sites are judged only by the remaining filters such as missingness.
mask_indel_adjacent_snps only has an effect when mask_indels: true.

Reference-sample behavior:

add_ref: true appends a synthetic REF sample to both final VCFs.
The added sample is emitted as genotype 0 at every retained site in all_sites and variants.
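As a rough illustration of what add_ref: true implies for the VCF text, the sketch below appends a REF column to a list of VCF lines. It assumes one genotype field per sample (FORMAT of GT only) and a hypothetical function name; the workflow's own implementation may differ.

```python
def add_ref_sample(vcf_lines, name="REF"):
    """Append a synthetic reference sample with genotype 0 to every record."""
    out = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines are passed through unchanged.
            out.append(line)
        elif line.startswith("#CHROM"):
            # The column header gains one extra sample name.
            out.append(line + "\t" + name)
        elif line:
            # Every retained site gets genotype 0 for the added sample.
            out.append(line + "\t0")
    return out
```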

Run

Local:

snakemake -j 8 --configfile options.yaml

SLURM:

snakemake --profile profiles/slurm --configfile options.yaml

When using the SLURM profile, set slurm_account and slurm_partition in your config file. SLURM defaults for other resources are defined in profiles/slurm/config.yaml. Parsing the MAFs is the most computationally expensive step in the pipeline; resources for the direct_maf_sites rule can be overridden in options.yaml (maf_threads, maf_mem_mb, maf_time).

Outputs

Outputs are written under results/ by default (or under results_dir if provided):

sites/combined.<contig>.all_sites.vcf
sites/combined.<contig>.vcf
sites/combined.<contig>.mask.bed
sites/combined.<contig>.site_summary.tsv
summary.html

The site_summary.tsv contains one metric per row with columns metric and value:

metric	description
`contig`	contig name
`contig_length`	contig length in bp
`samples`	number of samples
`allowed_missing`	effective missing-sample threshold used
`all_sites`	retained sites (invariant + variant)
`variants`	retained variant sites
`invariant`	retained invariant sites
`masked_total`	total masked positions
`masked_intervals`	number of merged BED intervals in the mask
`masked_missingness`	positions masked due to too many missing samples
`masked_indel`	positions masked due to indel overlap or adjacency
`masked_multiallelic`	positions masked due to more than two alleles
`masked_no_alignment`	positions masked because at least one sample had no alignment
`masked_ref_non_acgt`	reference positions with non-ACGT bases (always masked)
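Since the summary is a two-column table, it can be loaded into a dictionary for downstream checks. The sketch below assumes the file is tab-separated with a header line naming the metric and value columns; the numbers in the example are made up for illustration.

```python
import csv
import io


def read_site_summary(handle):
    """Read a site_summary.tsv handle into a {metric: value} dict of strings."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["metric"]: row["value"] for row in reader}


# Made-up example content standing in for sites/combined.<contig>.site_summary.tsv.
example = io.StringIO(
    "metric\tvalue\n"
    "contig\tchr1\n"
    "contig_length\t1000\n"
    "all_sites\t900\n"
    "variants\t40\n"
    "invariant\t860\n"
    "masked_total\t100\n"
)
summary = read_site_summary(example)

# all_sites counts invariant plus variant retained sites.
assert int(summary["variants"]) + int(summary["invariant"]) == int(summary["all_sites"])
```

For a real run, pass an open file handle (for example from open(path, newline="")) in place of the StringIO object.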

The pipeline still validates that retained sites plus the mask span each contig exactly, but that check is now internal and is no longer written as a separate coverage.txt file.
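For readers who want to repeat that check on their own outputs, one way to express it is sketched below. This is an assumption about what "span each contig exactly" means (every reference position is either a retained site or masked, never both), not the pipeline's internal code. Retained positions are taken as 1-based VCF coordinates and mask intervals as 0-based half-open BED coordinates.

```python
def spans_contig_exactly(contig_length, retained_positions, mask_intervals):
    """Return True if retained sites and mask intervals cover each base exactly once."""
    hits = [0] * contig_length
    for pos in retained_positions:
        # VCF positions are 1-based.
        if not 1 <= pos <= contig_length:
            return False
        hits[pos - 1] += 1
    for start, end in mask_intervals:
        # BED intervals are 0-based and half-open: [start, end).
        if not 0 <= start <= end <= contig_length:
            return False
        for i in range(start, end):
            hits[i] += 1
    return all(h == 1 for h in hits)


# Positions 1, 2, 5 retained; BED interval [2, 4) masks positions 3 and 4.
assert spans_contig_exactly(5, [1, 2, 5], [(2, 4)])
# Overlap between a retained site and the mask fails the check.
assert not spans_contig_exactly(5, [1, 2, 5], [(1, 4)])
```

The per-base counter keeps the sketch simple but uses memory proportional to contig length; an interval-based comparison would scale better for large contigs.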

TODO

Per-sample assembly quality masking: support optional per-sample BED files (in each sample's own genome coordinates) with a 0–1 quality score. During MAF parsing, track the sample-coordinate position alongside the reference position (handling +/- strand). Aligned bases at positions below a user-specified quality_min threshold would be treated as missing rather than as a call, integrating transparently with the existing max_missing_count/max_missing_fraction framework. New config keys: quality_bed_dir (directory of <sample>.bed files) and quality_min (float threshold; feature disabled when omitted).

Testing

pytest -q

Simulation Helper

The repository includes scripts/simulate_msprime_indels.py for generating haploid test datasets with msprime SNP variation plus branch-based indels on the tree sequence. Note that these simulations are not intended to be evolutionarily accurate, but simply to provide reasonable example data.

Example:

python scripts/simulate_msprime_indels.py \
  --sequence-length 1000000 \
  --num-samples 8 \
  --theta 0.01 \
  --rho 0.01 \
  --ne 10000 \
  --indel-rate 1e-8 \
  --indel-lambda 0.001 \
  --seed 8675309 \
  --out-prefix example_data/example

Outputs:

<prefix>.reference.fa
<prefix>.samples.fa
<prefix>.indels.tsv
<prefix>.summary.tsv
<prefix>.maf/

Summary fields include:

seed
sequence_length
reference_bp_with_indel_in_ge1_sample
total_snps
snps_without_overlapping_indel
