This vignette shows how to generate synthetic cohort datasets for method development before using real health data.
The package-style API supports:
A default call to generate_ofh_cohort() returns a named list of data frames and writes CSVs to an output folder in your current working directory.
To return objects only (without writing CSV files):
out_objects_only <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  save_csv = FALSE,
  return_objects = TRUE
)

If you run this interactively, the generated data frames are also available in your R environment (for example, questionnaire_data, clinic_measurements_data, and nhse_inpat_data).
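As a quick illustration of working with a named list of data frames like the one described above, the sketch below inspects table names and row counts. It does not call the package: out_objects_only here is a small hand-built stand-in (an assumption, not generator output). The table names come from the text above, while the columns smoker, sbp, and diag_01 are hypothetical.

```r
# Stand-in for the list returned by generate_ofh_cohort(); built by hand so
# the snippet runs without the package. Column names are illustrative only.
out_objects_only <- list(
  questionnaire_data = data.frame(
    pid = 1:5,
    smoker = c(TRUE, FALSE, FALSE, TRUE, FALSE)
  ),
  clinic_measurements_data = data.frame(
    pid = 1:5,
    sbp = c(120, 135, 118, 142, 127)
  ),
  nhse_inpat_data = data.frame(
    pid = c(2L, 2L, 4L),
    diag_01 = c("I210", "I500", "I210")
  )
)

# Row count of every table in one pass; names of the list are kept.
table_sizes <- vapply(out_objects_only, nrow, integer(1))
print(table_sizes)

# Pull out a single table by name and look at its structure.
inpat <- out_objects_only[["nhse_inpat_data"]]
str(inpat)
```

With real generator output, the same two steps give a fast sanity check before any analysis code runs.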
To supply custom ICD-10, OPCS-4, and BNF code lists directly in the call:

out <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  icd10 = c(
    I210 = "STEMI of anterolateral wall",
    I500 = "Congestive heart failure"
  ),
  opcs4 = c(
    K401 = "Percutaneous transluminal balloon angioplasty of coronary artery"
  ),
  bnf_codes = data.frame(
    BNFCode = c("0212000B0", "0601023A0"),
    BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
    Formulation = c("tablets", "tablets"),
    Strength = c("20 mg", "500 mg"),
    stringsAsFactors = FALSE
  )
)

You can also provide code files:
- Diagnosis (ICD-10) and procedure (OPCS-4) codes: columns code and description, as CSV (code,description) or tab-separated TXT (code<TAB>description)
- BNF medications: columns BNFCode, BNFName, Formulation (optional Strength)

To customise the proportions, record_multipliers, and code_config settings:

out_custom <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  proportions = list(
    nhse_outpat = 0.25,
    nhse_inpat = 0.20,
    nhse_ed = 0.30,
    nhse_primcare_meds = 0.75
  ),
  record_multipliers = list(
    nhse_outpat = 1.2,
    nhse_inpat = 1.1,
    nhse_ed = 1.3
  ),
  code_config = list(
    nhse_outpat_data = list(diag_4_02_missing_prob = 0.70),
    nhse_inpat_data = list(single_diag_prob = 0.85)
  )
)

Alternatively, use the OFHCohortSynthesizer object:

syn <- OFHCohortSynthesizer$new(project_root = ".", seed = 123)
syn$set_code_pools(
  icd10 = c(I210 = "STEMI of anterolateral wall"),
  opcs4 = c(K401 = "Percutaneous transluminal balloon angioplasty of coronary artery"),
  bnf_meds = data.frame(
    BNFCode = c("0212000B0", "0601023A0"),
    BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
    Formulation = c("tablets", "tablets"),
    Strength = c("20 mg", "500 mg"),
    stringsAsFactors = FALSE
  )
)
out <- syn$run_all(n = 800)

- Use a small n (for example, 200 to 1000) while developing.
- Set a fixed seed for reproducibility during method testing.
- Validate pid linkage assumptions in your analysis scripts.
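The linkage tip can be made concrete with a few base R checks. This is a minimal sketch, assuming each table carries a pid column, that the base table has one row per participant, and that every pid in a linked table should also appear in the base table. These are illustrative assumptions, not documented guarantees of the generator. The helper check_pid_linkage() is hypothetical, and the data frames are hand-built stand-ins so the snippet runs without the package.

```r
# Hypothetical helper: reports whether `base` has unique ids, which ids in
# `linked` are missing from `base`, and what share of base ids are linked.
check_pid_linkage <- function(base, linked, id = "pid") {
  stopifnot(id %in% names(base), id %in% names(linked))
  list(
    base_ids_unique = anyDuplicated(base[[id]]) == 0L,
    orphan_ids = setdiff(linked[[id]], base[[id]]),
    linked_fraction = mean(base[[id]] %in% linked[[id]])
  )
}

# Stand-in tables (not generator output): 10 participants, 3 of whom have
# inpatient records; participant 2 has two records.
questionnaire_data <- data.frame(pid = 1:10)
nhse_inpat_data <- data.frame(pid = c(2L, 2L, 5L, 9L))

res <- check_pid_linkage(questionnaire_data, nhse_inpat_data)
print(res)

# A deliberately broken table: pid 11 does not exist in the base table,
# so it is reported as an orphan id.
bad <- check_pid_linkage(questionnaire_data, data.frame(pid = c(3L, 11L)))
print(bad$orphan_ids)
```

Applied to real output, the same helper could be run over each linked table in the returned list, so that a violated assumption surfaces before it affects an analysis.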