Note
🎉 2026-04 — ukbflow is now available on CRAN! Install with install.packages("ukbflow").
ukbflow provides a streamlined, RAP-native R workflow for UK Biobank analysis — from phenotype extraction and disease derivation to association analysis and publication-quality figures.
UK Biobank Data Policy (2024+): Individual-level data must remain within the RAP environment. Only summary-level outputs may be downloaded locally. All ukbflow functions are designed with this constraint in mind.
library(ukbflow)

# Simulate UKB-style data locally (on RAP: replace with extract_batch() + job_wait())
data <- ops_toy(n = 5000, seed = 2026) |>
  derive_missing()

# Derive lung cancer outcome (ICD-10 C34) and follow-up time
data <- data |>
  derive_icd10(name = "lung", icd10 = "C34",
               source = c("cancer_registry", "hes")) |>
  derive_followup(name = "lung",
                  event_col = "lung_icd10_date",
                  baseline_col = "p53_i0",
                  censor_date = as.Date("2022-10-31"),
                  death_col = "p40000_i0")

# Define exposure: ever vs. never smoker
data[, smoking_ever := factor(
  ifelse(p20116_i0 == "Never", "Never", "Ever"),
  levels = c("Never", "Ever")
)]

# Cox regression: smoking → lung cancer (3-model adjustment)
res <- assoc_coxph(data,
                   outcome_col = "lung_icd10",
                   time_col = "lung_followup_years",
                   exposure_col = "smoking_ever",
                   covariates = c("p21022", "p31", "p22189"))

# Forest plot
res_df <- as.data.frame(res)
plot_forest(
  data = res_df,
  est = res_df$HR,
  lower = res_df$CI_lower,
  upper = res_df$CI_upper,
  ci_column = 2L
)

# From CRAN (recommended)
install.packages("ukbflow")
# Latest development version from GitHub
pak::pkg_install("evanbio/ukbflow")
# or
remotes::install_github("evanbio/ukbflow")

Requirements: R ≥ 4.1 · dxpy (dx-toolkit, required for RAP interaction)

pip install dxpy

| Layer | Key Functions | Description |
|---|---|---|
| Connection | auth_login, auth_select_project | Authenticate to RAP via dx-toolkit |
| Data Access | fetch_metadata, extract_batch, job_wait | Retrieve phenotype data from UKB dataset on RAP |
| Data Processing | decode_names, decode_values, derive_icd10, derive_followup, derive_case | Harmonize multi-source records; derive analysis-ready cohort |
| Association Analysis | assoc_coxph, assoc_logistic, assoc_subgroup | Three-model adjustment; subgroup & trend analysis |
| Genomic Scoring | grs_bgen2pgen, grs_score, grs_standardize | Distributed plink2 scoring on RAP worker nodes |
| Visualization | plot_forest, plot_tableone | Publication-ready figures & tables |
| Utilities | ops_setup, ops_toy, ops_na, ops_snapshot, ops_withdraw | Environment check, synthetic data, pipeline diagnostics, and cohort management |

Auth & Fetch
- auth_login(), auth_status(), auth_logout(), auth_list_projects(), auth_select_project() — RAP authentication
- fetch_ls(), fetch_tree(), fetch_url(), fetch_file() — RAP file system
- fetch_metadata(), fetch_field() — UKB metadata shortcuts

Extract & Decode
- extract_ls(), extract_pheno(), extract_batch() — phenotype extraction
- decode_values() — integer codes → human-readable labels
- decode_names() — field IDs → snake_case column names

Job Monitoring
- job_status() — query job status by ID
- job_wait() — block until job completes (with timeout)
- job_path() — get output path of a completed job
- job_result() — retrieve job result object
- job_ls() — list recent jobs

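The "block until the job completes, with a timeout" behaviour described for job_wait() follows a common polling pattern. The base R sketch below illustrates that pattern only. `wait_until()`, its arguments, and the state labels are hypothetical names chosen for this example and are not part of ukbflow, which queries RAP through dx-toolkit.

```r
# Hypothetical helper (not the ukbflow implementation): poll a status
# function until it reports a terminal state or the timeout elapses.
wait_until <- function(get_state, timeout = 60, interval = 1,
                       terminal = c("done", "failed")) {
  deadline <- Sys.time() + timeout
  repeat {
    state <- get_state()
    if (state %in% terminal) return(state)
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for job; last state: ", state)
    }
    Sys.sleep(interval)
  }
}

# Fake status source that reports "done" on the third poll
polls <- 0
fake_state <- function() {
  polls <<- polls + 1
  if (polls < 3) "running" else "done"
}

final_state <- wait_until(fake_state, timeout = 5, interval = 0.1)
```

Checking the deadline after each poll guarantees at least one status query even with a very short timeout.
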
Derive — Phenotypes
- derive_missing() — handle "Do not know" / "Prefer not to answer"
- derive_covariate() — type conversion + summary
- derive_cut() — bin continuous variables
- derive_selfreport() — self-reported disease status + date
- derive_hes() — HES inpatient ICD-10
- derive_first_occurrence() — First Occurrence fields
- derive_cancer_registry() — cancer registry
- derive_death_registry() — death registry
- derive_icd10() — combine sources (wrapper)
- derive_case() — merge self-report + ICD-10

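One common way to handle non-informative answers such as "Do not know" and "Prefer not to answer" is to recode them to NA before modelling. Whether derive_missing() does exactly this is an assumption. The sketch below uses base R only, and `recode_nonresponse()` and the toy columns are hypothetical.

```r
# Hypothetical illustration: recode non-informative survey answers to NA.
nonresponse <- c("Do not know", "Prefer not to answer")

toy <- data.frame(
  eid = 1:5,
  smoking = c("Never", "Previous", "Prefer not to answer",
              "Current", "Do not know"),
  stringsAsFactors = FALSE
)

recode_nonresponse <- function(x, labels = nonresponse) {
  x[x %in% labels] <- NA   # informative answers are left unchanged
  x
}

toy$smoking <- recode_nonresponse(toy$smoking)
```
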
Derive — Survival
- derive_timing() — prevalent vs. incident classification
- derive_age() — age at event
- derive_followup() — follow-up end date and duration

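Follow-up duration in a cohort study is typically measured from baseline to the earliest of event, death, or administrative censoring. The sketch below shows that arithmetic in base R. It assumes, without confirmation from the package, that derive_followup() applies a similar earliest-date rule. `followup_years()` and the toy dates are hypothetical.

```r
# Hypothetical helper: follow-up ends at the earliest non-missing date among
# event, death, and the administrative censoring date.
followup_years <- function(baseline, event, death, censor_date) {
  # Dates are converted to days since 1970-01-01 so pmin() works on numbers
  end_day <- pmin(as.numeric(event), as.numeric(death),
                  as.numeric(censor_date), na.rm = TRUE)
  (end_day - as.numeric(baseline)) / 365.25
}

baseline <- as.Date(c("2008-01-01", "2009-06-15", "2010-03-01"))
event    <- as.Date(c("2015-01-01", NA, NA))   # participant 1 has the event
death    <- as.Date(c(NA, "2012-06-15", NA))   # participant 2 dies event-free
censor   <- as.Date("2022-10-31")              # participant 3 is censored

fu <- followup_years(baseline, event, death, censor)
```

Here the three participants accrue roughly 7, 3, and 12.7 years of follow-up respectively.
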
Association Analysis
- assoc_coxph() / assoc_cox() — Cox proportional hazards (HR)
- assoc_logistic() / assoc_logit() — logistic regression (OR)
- assoc_linear() / assoc_lm() — linear regression (β)
- assoc_coxph_zph() — proportional hazards assumption test
- assoc_subgroup() — stratified analysis + interaction LRT
- assoc_trend() — dose-response trend + p_trend
- assoc_competing() — Fine-Gray competing risks (SHR)
- assoc_lag() — lagged exposure sensitivity analysis

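To make the idea of reporting a hazard ratio under increasing adjustment concrete, the sketch below fits nested Cox models with the survival package on simulated data. It is not ukbflow's implementation: the covariate sets of the package's three models are not spelled out here, so only a crude and an adjusted model are shown. `fit_hr()` and the simulated columns are hypothetical, and the survival package is assumed to be installed.

```r
library(survival)

set.seed(1)
n <- 2000
toy <- data.frame(
  age = rnorm(n, 57, 8),
  sex = rbinom(n, 1, 0.5),
  smoking_ever = rbinom(n, 1, 0.4)
)

# Exponential event times; the true hazard ratio for smoking is set to 2
rate <- 0.01 * exp(log(2) * toy$smoking_ever + 0.03 * (toy$age - 57))
event_time <- rexp(n, rate)
toy$time <- pmin(event_time, 15)            # administrative censoring at 15 years
toy$status <- as.integer(event_time <= 15)  # 1 = event observed, 0 = censored

# Hypothetical helper: fit one Cox model and return HR with 95% CI for a term
fit_hr <- function(formula, data, term) {
  fit <- coxph(formula, data = data)
  ci <- confint(fit)[term, ]
  c(HR = unname(exp(coef(fit)[term])),
    CI_lower = unname(exp(ci[1])),
    CI_upper = unname(exp(ci[2])))
}

models <- list(
  crude    = Surv(time, status) ~ smoking_ever,
  adjusted = Surv(time, status) ~ smoking_ever + age + sex
)

# One row per model, columns HR / CI_lower / CI_upper
hr_table <- t(sapply(models, fit_hr, data = toy, term = "smoking_ever"))
```

A table shaped like `hr_table` has the HR, lower, and upper columns that a forest plot needs.
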
Visualisation
- plot_forest() — forest plot (PNG / PDF / JPG / TIFF, 300 dpi)
- plot_tableone() — Table 1 (DOCX / HTML / PDF / PNG)

Utilities & Diagnostics
- ops_setup() — environment health check (dx CLI, RAP auth, R packages)
- ops_toy() — generate synthetic UKB-like data for development and testing
- ops_na() — summarise missing values (NA and "") across all columns
- ops_snapshot() — record pipeline checkpoints and track dataset changes
- ops_snapshot_cols() — retrieve column list from a saved snapshot
- ops_snapshot_diff() — compare columns between two snapshots
- ops_snapshot_remove() — remove columns added after a given snapshot
- ops_set_safe_cols() — define protected columns that ops_snapshot_remove() will not drop
- ops_withdraw() — exclude UKB withdrawn participants from a cohort

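Counting both NA and empty strings as missing, as described for ops_na(), can be sketched in base R as follows. `count_missing()`, its output columns, and the toy data are hypothetical, and the actual output format of ops_na() may differ.

```r
# Hypothetical illustration: treat NA and "" as missing in every column.
toy <- data.frame(
  eid = 1:4,
  sex = c("Female", "", "Male", NA),
  bmi = c(24.1, NA, 30.2, 27.5),
  stringsAsFactors = FALSE
)

count_missing <- function(df) {
  n_missing <- vapply(df, function(col) {
    # Empty strings can only occur in character columns
    empty <- if (is.character(col)) !is.na(col) & col == "" else FALSE
    sum(is.na(col) | empty)
  }, integer(1))
  data.frame(column = names(df),
             n_missing = n_missing,
             pct_missing = 100 * n_missing / nrow(df),
             row.names = NULL)
}

na_summary <- count_missing(toy)
```
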
GRS Pipeline
- grs_check() — validate SNP weights file
- grs_bgen2pgen() — convert BGEN → PGEN on RAP (submits cloud jobs)
- grs_score() — score GRS across chromosomes with plink2
- grs_standardize() / grs_zscore() — Z-score standardisation
- grs_validate() — OR/HR per SD, high vs low, trend, AUC/C-index

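Z-score standardisation rescales a raw score to mean 0 and standard deviation 1, so that effect estimates can be read "per SD". A minimal base R sketch follows. `standardize_score()` and the example values are hypothetical and do not reflect the signature of grs_standardize().

```r
# Hypothetical helper: centre on the sample mean, scale by the sample SD.
standardize_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

raw_grs <- c(0.12, 0.35, 0.28, 0.41, 0.19)   # made-up raw scores
grs_z <- standardize_score(raw_grs)
```

Base R's `scale()` performs the same computation and returns a one-column matrix.
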
Full vignettes and function reference:
https://evanbio.github.io/ukbflow/
Bug reports, feature requests, and pull requests are welcome. See CONTRIBUTING.md.
MIT License © 2026 Yibin Zhou
Made with ❤️ by Yibin Zhou
