This repository provides a Snakemake workflow for processing AnchorWave MAFs directly into per-contig site outputs. The workflow emits all-sites VCFs, variant-only VCFs, and BED masks from the alignments. Written with the aid of Codex and Claude. Note that v1.0 was a major rewrite from v0.4, and no longer uses Tassel or GATK.
- Conda
- The environment defined in
argprep.yml
conda env create -f argprep.yml
conda activate argprepCreate or edit a config file such as options.yaml.
The workflow requires --configfile and will fail if it is omitted.
Required keys:
maf_dir: directory containing*.mafor*.maf.gzreference_fasta: reference FASTA path
Optional path keys:
results_dir: output directory (default:results)
Optional controls (defaults shown):
max_missing_count- no default; see missingness thresholds belowmax_missing_fraction- no default; see missingness thresholds belowmask_indels: true- mask reference positions overlapped by deletionsmask_indel_adjacent_snps: true- mask SNPs immediately flanking an indel (only applies whenmask_indels: true)treat_n_as_missing: false- treatNbases as missing rather than as a callallow_multiallelic_snps: true- retain sites with more than two allelesadd_ref: false- append a syntheticREFsample (genotype0) to both VCFssummary_window_bp: 100000- window size in bp for the per-contig plots insummary.html
SLURM resource overrides (for the direct_maf_sites rule):
maf_threads: 2maf_mem_mb: 48000maf_time: "24:00:00"
SLURM profile keys (required when using --profile profiles/slurm):
slurm_accountslurm_partition
Advanced override:
contigs: restrict the run to specific contigs instead of using the automatic shared-contig behaviorsamples: restrict the run to specific sample basenames instead of using all*.maf/*.maf.gzfiles inmaf_dir
Contig and sample selection behavior:
- If
samplesis omitted, samples are auto-discovered from both*.mafand*.maf.gzinmaf_dir. - If both
<sample>.mafand<sample>.maf.gzexist,<sample>.mafis used. - If
contigsis omitted, the workflow uses the intersection of contigs present in all selected MAFs. - Requested contigs are matched to reference
.faicontigs with normalization (for examplechr01can map to1when unambiguous).
Example CLI override:
snakemake -j 8 --configfile options.yaml --config samples='["sampleA","sampleB"]' contigs='["chr1","chr2"]'Missingness thresholds:
max_missing_countis an absolute number of missing samples allowed at a retained site.max_missing_fractionis a fraction of samples allowed to be missing.- If both are set, the workflow uses the stricter threshold.
- The fraction is converted to a count with downward truncation. For example, with 10 samples,
0.15allows1missing sample. - If neither is set, the default is 0 - any site where even one sample is unaligned or missing is masked. Set one of these options explicitly if you want to retain sites with partial coverage.
Indel masking behavior:
mask_indels: truemasks reference positions directly overlapped by deletions.mask_indel_adjacent_snps: trueadditionally masks SNPs immediately adjacent to an insertion or deletion.mask_indels: falsedisables indel-based masking entirely, so indel-overlapped and indel-adjacent sites are judged only by the remaining filters such as missingness.mask_indel_adjacent_snpsonly has an effect whenmask_indels: true.
Reference-sample behavior:
add_ref: trueappends a syntheticREFsample to both final VCFs.- The added sample is emitted as genotype
0at every retained site inall_sitesandvariants.
Local:
snakemake -j 8 --configfile options.yamlSLURM:
snakemake --profile profiles/slurm --configfile options.yamlWhen using the SLURM profile, set slurm_account and slurm_partition in your config file. Slurm defaults for other resources are defined in profiles/slurm/config.yaml. Parsing the MAFs is the most computationally expensive step in the pipeline, and direct-maf rule resources can be overridden in options.yaml (maf_threads, maf_mem_mb, maf_time).
Outputs are written under results/ by default (or under results_dir if provided):
sites/combined.<contig>.all_sites.vcfsites/combined.<contig>.vcfsites/combined.<contig>.mask.bedsites/combined.<contig>.site_summary.tsvsummary.html
The site_summary.tsv contains one metric per row with columns metric and value:
| metric | description |
|---|---|
contig |
contig name |
contig_length |
contig length in bp |
samples |
number of samples |
allowed_missing |
effective missing-sample threshold used |
all_sites |
retained sites (invariant + variant) |
variants |
retained variant sites |
invariant |
retained invariant sites |
masked_total |
total masked positions |
masked_intervals |
number of merged BED intervals in the mask |
masked_missingness |
positions masked due to too many missing samples |
masked_indel |
positions masked due to indel overlap or adjacency |
masked_multiallelic |
positions masked due to more than two alleles |
masked_no_alignment |
positions masked because at least one sample had no alignment |
masked_ref_non_acgt |
reference positions with non-ACGT bases (always masked) |
The pipeline still validates that retained sites plus the mask span each contig exactly, but that check is now internal and is no longer written as a separate coverage.txt file.
- Per-sample assembly quality masking: support optional per-sample BED files (in each sample's own genome coordinates) with a 0–1 quality score. During MAF parsing, track the sample-coordinate position alongside the reference position (handling
+/-strand). Aligned bases at positions below a user-specifiedquality_minthreshold would be treated as missing rather than as a call, integrating transparently with the existingmax_missing_count/max_missing_fractionframework. New config keys:quality_bed_dir(directory of<sample>.bedfiles) andquality_min(float threshold; feature disabled when omitted).
pytest -qThe repository includes scripts/simulate_msprime_indels.py for generating haploid test datasets with msprime SNP variation plus branch-based indels on the tree sequence. Note that these simulations are not intended to be evolutionarily accurate, but simply to give a reasonable example data.
Example:
python scripts/simulate_msprime_indels.py \
--sequence-length 1000000 \
--num-samples 8 \
--theta 0.01 \
--rho 0.01 \
--ne 10000 \
--indel-rate 1e-8 \
--indel-lambda 0.001 \
--seed 8675309 \
--out-prefix example_data/exampleOutputs:
<prefix>.reference.fa<prefix>.samples.fa<prefix>.indels.tsv<prefix>.summary.tsv<prefix>.maf/
Summary fields include:
seedsequence_lengthreference_bp_with_indel_in_ge1_sampletotal_snpssnps_without_overlapping_indel