Getting Started: OFH Synthetic Cohort Generation

Overview

This vignette shows how to generate synthetic cohort datasets for method development before using real health data.

The sections below walk through the package-style API (generate_ofh_cohort) and then the OOP interface (OFHCohortSynthesizer):

1. Load the package

library(ofhsyn)

2. Generate a basic cohort

out <- generate_ofh_cohort(
  n = 1000,
  seed = 123
)

names(out)

By default, this call returns a named list of data frames and also writes CSV files to an output folder in your current working directory.
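
A quick way to see what came back is to tabulate the size of each element. The sketch below uses a small stand-in list, mock_out, so that it runs without the package: the element names follow the examples in this vignette, but the columns and values are invented, and summarise_cohort is a hypothetical helper rather than a package function. With a real result you would pass out instead.

```r
# Stand-in for the list returned by generate_ofh_cohort().
# Element names follow this vignette; columns and values are invented.
mock_out <- list(
  questionnaire_data = data.frame(
    participant_id = 1:5,
    age = c(34, 51, 47, 62, 29)
  ),
  clinic_measurements_data = data.frame(
    participant_id = 1:5,
    sbp = c(118, 131, 125, 140, 110)
  ),
  nhse_inpat_data = data.frame(
    participant_id = c(2, 4),
    diag = c("I210", "I500"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: one row per dataset with its row and column counts.
summarise_cohort <- function(x) {
  data.frame(
    dataset = names(x),
    n_rows = vapply(x, nrow, integer(1)),
    n_cols = vapply(x, ncol, integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

cohort_summary <- summarise_cohort(mock_out)
print(cohort_summary)
```

Linked datasets such as nhse_inpat_data may well have fewer rows than the questionnaire data, so a summary like this is a cheap sanity check before any modelling.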

To return objects only (without writing CSV files):

out_objects_only <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  save_csv = FALSE,
  return_objects = TRUE
)

If you run this interactively, the generated data frames are also available in your R environment (for example, questionnaire_data, clinic_measurements_data, and nhse_inpat_data).

3. Restrict to specific code lists

out <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  icd10 = c(
    I210 = "STEMI of anterolateral wall",
    I500 = "Congestive heart failure"
  ),
  opcs4 = c(
    K401 = "Percutaneous transluminal balloon angioplasty of coronary artery"
  ),
  bnf_codes = data.frame(
    BNFCode = c("0212000B0", "0601023A0"),
    BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
    Formulation = c("tablets", "tablets"),
    Strength = c("20 mg", "500 mg"),
    stringsAsFactors = FALSE
  )
)
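
In the call above, icd10 and opcs4 are named character vectors whose names are the codes and whose values are the descriptions. If your code list lives in a two-column table, a conversion along these lines may help. This is a minimal sketch in base R; the table and its column names (code, description) are hypothetical.

```r
# Illustrative code list; the column names are hypothetical.
icd10_table <- data.frame(
  code = c("I210", "I500"),
  description = c("STEMI of anterolateral wall", "Congestive heart failure"),
  stringsAsFactors = FALSE
)

# Names carry the codes, values carry the descriptions,
# matching the shape used in the example call.
icd10 <- setNames(icd10_table$description, icd10_table$code)

# Guard against duplicated or empty codes before passing the vector on.
stopifnot(
  anyDuplicated(names(icd10)) == 0,
  all(nzchar(names(icd10)))
)

print(icd10)
```

The resulting icd10 vector has the same shape as the one written out by hand in the example.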

You can also supply the code lists as files instead of R objects:

out <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  icd10_file = "icd10_codes.txt",
  opcs4_file = "opcs4_codes.txt",
  bnf_codes_file = "bnf_medications.csv"
)
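
The file layouts are not described in this vignette. As a sketch only, and assuming that bnf_codes_file expects the same four columns as the bnf_codes data frame shown earlier (BNFCode, BNFName, Formulation, Strength), a CSV could be prepared in base R like this. The example writes to a temporary directory so it does not touch your project folder.

```r
# Same columns as the bnf_codes data frame used earlier in this vignette.
# Assumption: the CSV file is expected to use this layout as well.
bnf <- data.frame(
  BNFCode = c("0212000B0", "0601023A0"),
  BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
  Formulation = c("tablets", "tablets"),
  Strength = c("20 mg", "500 mg"),
  stringsAsFactors = FALSE
)

# row.names = FALSE avoids writing an extra index column.
bnf_path <- file.path(tempdir(), "bnf_medications.csv")
write.csv(bnf, bnf_path, row.names = FALSE)

# Read the file back with every column as character, so that codes
# with leading zeros are not converted to numbers.
bnf_check <- read.csv(bnf_path, colClasses = "character")
print(bnf_check)
```

Reading the file back before handing its path to the generator is a cheap check that the codes survived the round trip unchanged.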

4. Configure dataset generation probabilities

out_custom <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  proportions = list(
    nhse_outpat = 0.25,
    nhse_inpat = 0.20,
    nhse_ed = 0.30,
    nhse_primcare_meds = 0.75
  ),
  record_multipliers = list(
    nhse_outpat = 1.2,
    nhse_inpat = 1.1,
    nhse_ed = 1.3
  ),
  code_config = list(
    nhse_outpat_data = list(diag_4_02_missing_prob = 0.70),
    nhse_inpat_data = list(single_diag_prob = 0.85)
  )
)
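
Since these settings are probabilities, it can be worth checking them before a long run. The sketch below is base R only; check_proportions is a hypothetical helper, not a package function. The expected counts rest on an assumption that this vignette does not confirm: that each entry in proportions is the share of the n participants who receive records in that dataset.

```r
proportions <- list(
  nhse_outpat = 0.25,
  nhse_inpat = 0.20,
  nhse_ed = 0.30,
  nhse_primcare_meds = 0.75
)

# Hypothetical helper (not part of the package): stop early on values
# that cannot be probabilities.
check_proportions <- function(p) {
  vals <- unlist(p)
  if (!is.numeric(vals)) {
    stop("proportions must be numeric")
  }
  if (is.null(names(vals)) || any(names(vals) == "")) {
    stop("every proportion needs a dataset name")
  }
  bad <- names(vals)[is.na(vals) | vals < 0 | vals > 1]
  if (length(bad) > 0) {
    stop("proportions outside [0, 1]: ", paste(bad, collapse = ", "))
  }
  invisible(vals)
}

vals <- check_proportions(proportions)

# Assumption: n * proportion is the expected number of participants
# with at least one record in each dataset.
n <- 1000
expected_participants <- round(n * vals)
print(expected_participants)
```

If the generated row counts differ sharply from figures like these, that is a prompt to check how the package interprets proportions and record_multipliers rather than evidence of a bug.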

5. Use the OOP interface directly

syn <- OFHCohortSynthesizer$new(project_root = ".", seed = 123)

syn$set_code_pools(
  icd10 = c(I210 = "STEMI of anterolateral wall"),
  opcs4 = c(K401 = "Percutaneous transluminal balloon angioplasty of coronary artery"),
  bnf_meds = data.frame(
    BNFCode = c("0212000B0", "0601023A0"),
    BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
    Formulation = c("tablets", "tablets"),
    Strength = c("20 mg", "500 mg"),
    stringsAsFactors = FALSE
  )
)

out <- syn$run_all(n = 800)

6. Practical tips for researchers

7. Notes