CDC Data Exercise

About the dataset

The Botswana Combination Prevention Project (BCPP) was a collaborative research effort led by the Botswana Ministry of Health (MOH), the Harvard School of Public Health/Botswana Harvard AIDS Institute Partnership (BHP), and the U.S. Centers for Disease Control and Prevention (CDC). This community-based randomized trial aimed to assess the impact of various HIV prevention strategies on reducing HIV incidence across 15 intervention and 15 control communities. The intervention communities received comprehensive HIV testing, linkage to care, and universal treatment services, guided by the UNAIDS 90-90-90 targets: ensuring 90% of individuals with HIV are aware of their status, 90% of those diagnosed are on antiretroviral therapy (ART), and 90% of those on ART achieve viral suppression.

The BCPP study was structured around two interrelated protocols: the Evaluation Protocol and the Intervention Protocol. The Evaluation Protocol assessed the primary outcome—HIV incidence—along with key secondary outcomes, focusing on data collected from the Baseline Household Survey, the HIV Incidence Cohort, and the End of Study Survey. Meanwhile, the Intervention Protocol involved the implementation of a combination prevention (CP) package in combination prevention communities (CPCs), monitoring the uptake of interventions such as expanded HIV testing and counseling, enhanced male circumcision services, and improved access to HIV care and treatment.

The dataset is available and free to download at CDC Website:https://data.cdc.gov/Global-Health/Botswana-Combination-Prevention-Project-BCPP-Publi/qcw5-4m9q/about_data.

The study is a cohort study with 3 phases. In this exercise, I only work on year 3 dataset. There are many information covered in this dataset, including demographic information of the respondent, soccioeconomic factors, HIV exposure, HIV status and conditions.

Install all packages needed

Install and library all packages needed in this section.

install.packages("tidyverse")
The following package(s) will be installed:
- tidyverse [2.0.0]
These packages will be installed into "C:/Users/mn27712/MADA_NEW/muhammadnasir-mada2025-portofolio/renv/library/windows/R-4.4/x86_64-w64-mingw32".

# Installing packages --------------------------------------------------------
- Installing tidyverse ...                      OK [linked from cache]
Successfully installed 1 package in 14 milliseconds.
library(tidyverse)
Warning: package 'tibble' was built under R version 4.4.3
Warning: package 'tidyr' was built under R version 4.4.3
Warning: package 'readr' was built under R version 4.4.3
Warning: package 'purrr' was built under R version 4.4.3
Warning: package 'dplyr' was built under R version 4.4.3
Warning: package 'forcats' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
install.packages("ggplot2")
The following package(s) will be installed:
- ggplot2 [3.5.2]
These packages will be installed into "C:/Users/mn27712/MADA_NEW/muhammadnasir-mada2025-portofolio/renv/library/windows/R-4.4/x86_64-w64-mingw32".

# Installing packages --------------------------------------------------------
- Installing ggplot2 ...                        OK [linked from cache]
Successfully installed 1 package in 14 milliseconds.
library(ggplot2)
library(here)
here() starts at C:/Users/mn27712/MADA_NEW/muhammadnasir-mada2025-portofolio
install.packages("patchwork")  # This package is to redefine "/" operator for plot arrangement
The following package(s) will be installed:
- patchwork [1.3.0]
These packages will be installed into "C:/Users/mn27712/MADA_NEW/muhammadnasir-mada2025-portofolio/renv/library/windows/R-4.4/x86_64-w64-mingw32".

# Installing packages --------------------------------------------------------
- Installing patchwork ...                      OK [linked from cache]
Successfully installed 1 package in 13 milliseconds.
library(patchwork)
Warning: package 'patchwork' was built under R version 4.4.3
install.packages("writexl")
The following package(s) will be installed:
- writexl [1.5.4]
These packages will be installed into "C:/Users/mn27712/MADA_NEW/muhammadnasir-mada2025-portofolio/renv/library/windows/R-4.4/x86_64-w64-mingw32".

# Installing packages --------------------------------------------------------
- Installing writexl ...                        OK [linked from cache]
Successfully installed 1 package in 21 milliseconds.
library(writexl)
library(haven)
library(dplyr)
install.packages("janitor")
The following package(s) will be installed:
- janitor [2.2.1]
These packages will be installed into "C:/Users/mn27712/MADA_NEW/muhammadnasir-mada2025-portofolio/renv/library/windows/R-4.4/x86_64-w64-mingw32".

# Installing packages --------------------------------------------------------
- Installing janitor ...                        OK [linked from cache]
Successfully installed 1 package in 12 milliseconds.
library(janitor)
Warning: package 'janitor' was built under R version 4.4.3

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(procs)
Warning: package 'procs' was built under R version 4.4.3
Loading required package: common
Warning: package 'common' was built under R version 4.4.3

Loading the dataset

## set path  dictionary 
bostwana <- here::here("cdcdata-exercise","data", "cdcbostwana.csv") # set the pathway to create relative path 

bostwana <- read_csv(bostwana) # read the dataset stored in specified path
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 11369 Columns: 322
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (217): de_subj_idC, de_hh_idC, de_plot_idC, region, random_arm, survey, ...
dbl  (64): community_rndmN, pair_rndmN, interview_days, yeardone, age_at_int...
lgl  (41): religion_name, religious_affil, relig_theogrp, ethnicity, prev_hi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(bostwana)
# A tibble: 6 × 322
  de_subj_idC de_hh_idC de_plot_idC region community_rndmN pair_rndmN random_arm
  <chr>       <chr>     <chr>       <chr>            <dbl>      <dbl> <chr>     
1 10011A      16863A    14700A      North…              25         13 Intervent…
2 1001A       8552A     17400A      South…               4          1 Intervent…
3 10021A      13235A    4578A       Centr…               5          8 Standard …
4 10023A      7028A     7820A       Centr…               9          4 Standard …
5 10026A      5500A     4704A       South…              19          3 Intervent…
6 10032A      2667A     8896A       South…              14          2 Standard …
# ℹ 315 more variables: interview_days <dbl>, yeardone <dbl>, survey <chr>,
#   gender <chr>, age_at_interview <dbl>, age5cat <chr>,
#   length_residence <chr>, permanent_resident <chr>, intend_residency <chr>,
#   nights_away <chr>, cattle_postlands <chr>, religion_name <lgl>,
#   religious_affil <lgl>, relig_theogrp <lgl>, ethnicity <lgl>,
#   marital_status <chr>, num_wives <dbl>, husband_wives <dbl>,
#   live_alone <chr>, livewith_family <chr>, livewith_partner <chr>, …
dim(bostwana)
[1] 11369   322

The dataset contains 11369 observations and 322 columns.

Data Cleaning

I want to select interesting variables for further analysis and clean the data as needed including dropping/ lablig missing values.

bost_df <- bostwana %>% 
  select("region", "random_arm", "gender", "age_at_interview", "age5cat", "employment_status", "circumcised", "hiv_status_current", "hiv_status_time", "viral_load_yr3",  "cd4_survey_yr3", "cd4_surveydays_yr3", "arv_duration_days")

head(bost_df)
# A tibble: 6 × 13
  region   random_arm       gender age_at_interview age5cat    employment_status
  <chr>    <chr>            <chr>             <dbl> <chr>      <chr>            
1 Northern Intervention     M                  45.7 45-54 yea… Unemployed not l…
2 Southern Intervention     F                  51.8 45-54 yea… Unemployed looki…
3 Central  Standard of Care F                  40.9 35-44 yea… Unemployed looki…
4 Central  Standard of Care F                  18.7 16-24 yea… Unemployed not l…
5 Southern Intervention     F                  35.8 35-44 yea… Employed         
6 Southern Standard of Care F                  56.9 55-64 yea… Unemployed looki…
# ℹ 7 more variables: circumcised <chr>, hiv_status_current <chr>,
#   hiv_status_time <chr>, viral_load_yr3 <dbl>, cd4_survey_yr3 <dbl>,
#   cd4_surveydays_yr3 <dbl>, arv_duration_days <dbl>
summary(bost_df)
    region           random_arm           gender          age_at_interview
 Length:11369       Length:11369       Length:11369       Min.   :17.20   
 Class :character   Class :character   Class :character   1st Qu.:26.70   
 Mode  :character   Mode  :character   Mode  :character   Median :35.70   
                                                          Mean   :38.03   
                                                          3rd Qu.:48.20   
                                                          Max.   :67.90   
                                                                          
   age5cat          employment_status  circumcised        hiv_status_current
 Length:11369       Length:11369       Length:11369       Length:11369      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 hiv_status_time    viral_load_yr3    cd4_survey_yr3   cd4_surveydays_yr3
 Length:11369       Min.   :     40   Min.   :  20.0   Min.   : 358      
 Class :character   1st Qu.:     40   1st Qu.: 401.0   1st Qu.: 875      
 Mode  :character   Median :     40   Median : 573.0   Median : 895      
                    Mean   :   4305   Mean   : 602.1   Mean   : 881      
                    3rd Qu.:     40   3rd Qu.: 757.0   3rd Qu.: 911      
                    Max.   :2243242   Max.   :1567.0   Max.   :1184      
                    NA's   :8130      NA's   :10594    NA's   :8032      
 arv_duration_days
 Min.   :   0     
 1st Qu.: 908     
 Median :2393     
 Mean   :2432     
 3rd Qu.:3764     
 Max.   :5879     
 NA's   :8222     

Note: To make easy for further analysis, I selected several variables:

  • region = region of sbject
  • random_arm = randomized arm
  • age= age
  • age5cat = age category
  • employment_status = employment status
  • hiv_status_current = Current HIV status
  • viral_load_yr3 = viral load in year 3 of the study
  • cd4_survey_yr3 = CD4 count in year 3
  • cd4_surveydays_yr3 = numbers of days enrolled to survey year 3.
  • arv_duration_days = number of day (duration) taking ARV

In this excercise, I would like to see the relationship between viral load, lenght of enrollment, ARV duration, and CD4 count. Therefore, I will drop all missing data in those variables.

data <- bost_df %>% 
  drop_na(viral_load_yr3, cd4_survey_yr3, cd4_surveydays_yr3, arv_duration_days) %>% # I drop all observation wit N/A data
  filter(viral_load_yr3 != 0, 
         cd4_survey_yr3 != 0, 
         cd4_surveydays_yr3 != 0, 
         arv_duration_days !=0) # I found some observation contain 0 I drop those observation
sapply(data, class) # I want to check class of all variables 
            region         random_arm             gender   age_at_interview 
       "character"        "character"        "character"          "numeric" 
           age5cat  employment_status        circumcised hiv_status_current 
       "character"        "character"        "character"        "character" 
   hiv_status_time     viral_load_yr3     cd4_survey_yr3 cd4_surveydays_yr3 
       "character"          "numeric"          "numeric"          "numeric" 
 arv_duration_days 
         "numeric" 

For N/A information in categorical data, I lable those missing value wih 999. However, the values are in characters. I want to change the values into factors to make us easy in analysis.

data <- data %>% 
  mutate(region = recode(region, 
                         "Northern" = 0,
                         "Southern" = 1, 
                         "Central" = 2), 
         random_arm = recode(random_arm, 
                             "Standard of Care" = 0,
                             "Intervention" = 1), 
         gender = recode(gender,
                         "M" = 0,
                         "F" =1), 
         age5cat = recode (age5cat, 
                           "16-24 years" = 0, 
                           "25-34 years" = 1, 
                           "35-44 years" = 2, 
                           "45-54 years" = 3, 
                           "55-64 years" = 4), 
         employment_status = recode(employment_status , 
                                    "Employed" = 0, 
                                    "Unemployed looking for work" = 1, 
                                    "Unemployed not looking for work" = 2), 
         circumcised= recode(circumcised , 
                             "No" = 0, 
                             "Yes" = 1), 
         hiv_status_current = recode(hiv_status_current, 
                                     "HIV-uninfected" = 0, 
                                     "HIV-infected" = 1, 
                                     "Refused HIV testing" = 2), 
         hiv_status_time = recode(hiv_status_time, 
                                  "HIV-negative" = 0, 
                                  "HIV-positive: previously diagnosed" = 1, 
                                  "Refused HIV testing" = 2, 
                                  "HIV-positive: newly discovered" = 3)) # this part is to recode from character to numeric to allow us convert ito factors. 

sapply(data, class) # check character of the variables 
            region         random_arm             gender   age_at_interview 
         "numeric"          "numeric"          "numeric"          "numeric" 
           age5cat  employment_status        circumcised hiv_status_current 
         "numeric"          "numeric"          "numeric"          "numeric" 
   hiv_status_time     viral_load_yr3     cd4_survey_yr3 cd4_surveydays_yr3 
         "numeric"          "numeric"          "numeric"          "numeric" 
 arv_duration_days 
         "numeric" 

Now, I want to convert categorical variables into factors

data <- data %>%
  mutate(across (c("region", "random_arm", "gender", "age5cat", "employment_status", "circumcised", "hiv_status_current",  "hiv_status_time"), as.factor)) # this is to convert multiple variables into factors. as.factor() only can work in single variable. 

sapply(data, class) # check character of the variables 
            region         random_arm             gender   age_at_interview 
          "factor"           "factor"           "factor"          "numeric" 
           age5cat  employment_status        circumcised hiv_status_current 
          "factor"           "factor"           "factor"           "factor" 
   hiv_status_time     viral_load_yr3     cd4_survey_yr3 cd4_surveydays_yr3 
          "factor"          "numeric"          "numeric"          "numeric" 
 arv_duration_days 
         "numeric" 

I want to replace all N/A with 999 in categorical variables.

data <- data %>%
  mutate(across(where(is.factor), ~ replace_na(as.character(.), "999"))) %>% # replace the values
  mutate(across(where(is.character), as.factor)) # convert the variable back to factor. 

dim(data)
[1] 539  13

The data is clean now and ready for data exploration and analysis. There are 539 obserations and 13 variables in the final data.

summary_stats <- data %>%
  summarise(
    mean_viral_load = mean(viral_load_yr3, na.rm = TRUE),
    sd_viral_load = sd(viral_load_yr3, na.rm = TRUE),
    mean_cd4 = mean(cd4_survey_yr3, na.rm = TRUE),
    sd_cd4 = sd(cd4_survey_yr3, na.rm = TRUE)
  )
print(summary_stats)
# A tibble: 1 × 4
  mean_viral_load sd_viral_load mean_cd4 sd_cd4
            <dbl>         <dbl>    <dbl>  <dbl>
1           1784.        13182.     620.   279.

Now I want to make summary statistics for Viral load and visualize it ina plot.

summary_by_arm <- data %>%
  group_by(random_arm) %>%
  summarise(
    mean_cd4 = mean(cd4_survey_yr3, na.rm = TRUE),
    sd_cd4 = sd(cd4_survey_yr3, na.rm = TRUE)
  ) # create by randomise_arm 

plot1 <- ggplot(summary_by_arm, aes(x = random_arm, y = mean_cd4, fill = random_arm)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean_cd4 - sd_cd4, ymax = mean_cd4 + sd_cd4), width = 0.2) +
  labs(title = "Mean CD4 Count by randomise group with Standard Deviation",
       x = "Groups", y = "Mean CD4 Count", 
       caption = "Red (0)= Standard of Care, Green (1)= Intervention ") +
  theme_minimal() +
  theme(plot.background = element_rect(color = "black", size = 1)) 
Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.
print(plot1)

figure_file = here("cdcdata-exercise", "pictures", "CD4 Mean and SD .png") # to set up location for the pictures created 
ggsave(filename = figure_file, plot=plot1) # save the pictures created 
Saving 7 x 5 in image

Now I want to make summary statistics for viral load and visualize it in a plot.

summary_by_arm_vl <- data %>%
  group_by(random_arm) %>%
  summarise(
    mean_vl = mean(viral_load_yr3, na.rm = TRUE),
    sd_vl = sd(viral_load_yr3, na.rm = TRUE)
  ) # create by randomise_arm 

plot2 <- ggplot(summary_by_arm_vl, aes(x = random_arm, y = mean_vl, fill = random_arm)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean_vl - sd_vl, ymax = mean_vl + sd_vl), width = 0.2) +
  labs(title = "Mean Viral Load Count by randomise group with Standard Deviation",
       x = "Groups", y = "Mean Viral Load", 
       caption = "Red (0)= Standard of Care, Green (1)= Intervention ") +
  theme_minimal() +
  theme(plot.background = element_rect(color = "black", size = 1)) 

print(plot2)

figure_file = here("cdcdata-exercise", "pictures", "Viral Load Mean and SD .png") # to set up location for the pictures created 
ggsave(filename = figure_file, plot=plot2) # save the pictures created
Saving 7 x 5 in image

Data Exploration

In this part, I want to explore the data and vizualise it before data analysis.

plot3 <- ggplot(data, aes(x = viral_load_yr3, y = cd4_survey_yr3, color = region)) + 
  geom_point(size = 3) +  # Specify geom as geom_point to make a scatterplot
  labs(title = "Viral Load VS CD4 count based on the region", 
       x = "Viral load", 
       y = "CD4") +  # Rename title and axes
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate state names 
        legend.position = "bottom",  # Position legend at the bottom
        plot.title = element_text(size = 18, face = "bold", hjust = 0.5), 
        axis.title.x = element_text(size = 12, face = "bold"), 
        axis.title.y = element_text(size = 12, face = "bold")) + # Increase size and boldness of title and axes
  scale_color_manual(values = c("0" = "#53d127", "1" = "#2ce1b0", "2" = "#ff5733"))+ # crete color manually 
  theme(plot.background = element_rect(color = "black", size = 1))

print(plot3)

figure_file = here("cdcdata-exercise", "pictures", "Viral Load VS CD4 based on region .png") # to set up location for the pictures created 
ggsave(filename = figure_file, plot=plot3) # save the pictures created
Saving 7 x 5 in image
plot4 <- ggplot(data, aes(x = arv_duration_days , y = cd4_survey_yr3, color = region)) + 
  geom_boxplot() +  # Specify geom as geom_point to make a scatterplot
  labs(title = "ARV Duration VS CD4 count based on the region", 
       x = "ARV Duration (days)", 
       y = "CD4 count",
       caption = "Green = Nothern, Blue= Northern, and red= Central") +  # Rename title and axes
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate state names 
        legend.position = "bottom",  # Position legend at the bottom
        plot.title = element_text(size = 18, face = "bold", hjust = 0.5), 
        axis.title.x = element_text(size = 12, face = "bold"), 
        axis.title.y = element_text(size = 12, face = "bold")) + # Increase size and boldness of title and axes
  theme(plot.background = element_rect(color = "black", size = 1)) +
  scale_x_continuous(limits = c(0, 1000)) + # Set x-axis maximum limit to 1000
  scale_color_manual(values = c("0" = "#53d127", "1" = "#2ce1b0", "2" = "#ff5733")) # crete color manually 

print(plot4)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`stat_boxplot()`).

figure_file = here("cdcdata-exercise", "pictures", "ARV Duration VS CD4 count based on the region .png") # to set up location for the pictures created 
ggsave(filename = figure_file, plot=plot4) # save the pictures created
Saving 7 x 5 in image
Warning: Removed 2 rows containing missing values or values outside the scale range
(`stat_boxplot()`).

This section contributed by Rayleen Lewis.

Performing a few more checks

Prior to creating the synthetic dataset, I wanted to check on a few more things with the data, like number of persons in each region and arm of the study. It also looks like most people are virally suppressed, so I wanted to look at the viral load among people who had values >200. I also want to get the median and IQR of CD4 count by region.

#Number of persons in each region
table(data$region)

  0   1   2 
160 192 187 
#Number of persons in each arm of the study
table(data$random_arm)

  0   1 
239 300 
#Confirming that basically everyone is virally suppressed
data_supp <- data %>%
  filter(viral_load_yr3>200)
ggplot(data_supp, aes(x = viral_load_yr3)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

data_rl <- data %>%
  mutate(viral_2 = if_else(viral_load_yr3 < 200, 0, 1))
table(data_rl$viral_2)

  0   1 
510  29 
#95% are virally suppressed (viral load <200)

#Mean (SD) of CD4 count by arm
proc_means(data, var = cd4_survey_yr3, class = random_arm)
# A tibble: 3 × 9
  CLASS  TYPE  FREQ VAR                N  MEAN   STD   MIN   MAX
  <fct> <dbl> <int> <chr>          <int> <dbl> <dbl> <dbl> <dbl>
1 <NA>      0   539 cd4_survey_yr3   539  620.  279.    70  1567
2 0         1   239 cd4_survey_yr3   239  604.  265.    70  1540
3 1         1   300 cd4_survey_yr3   300  632.  290.    78  1567
proc_means(data, var = viral_load_yr3, class = random_arm)
# A tibble: 3 × 9
  CLASS  TYPE  FREQ VAR                N  MEAN    STD   MIN    MAX
  <fct> <dbl> <int> <chr>          <int> <dbl>  <dbl> <dbl>  <dbl>
1 <NA>      0   539 viral_load_yr3   539 1784. 13182.    40 177075
2 0         1   239 viral_load_yr3   239 3256. 19241.    40 177075
3 1         1   300 viral_load_yr3   300  610.  3854.    40  37415

Generating synthetic data

I aimed to reproduce a dataset similar to the dataset above. The following prompt was used in ChatGPT to generate the code in the next code chunk: using R, create a dataset with 539 observations and variables called region with 3 levels (values 0, 1, 2), arm with 2 levels (0, 1), cd4 (continuous), viral_load (continuous). 55% of observations should have an arm value of 1. 95% of the values for viral_load should be <200. Among observations with arm = 0, viral_load should have a mean of 3256 and standard deviation of 19241 and cd4 should have a mean of 604 and standard deviation of 265. Among observations with arm = 1, viral_load should have a mean of 610 and standard deviation of 3854 and cd4 should have a mean of 632 and standard deviation of 290. 160 observations should have a region value of 0, 192 should have a region value equal to 1 and the rest should have a region value equal to 2.

set.seed(123)  # For reproducibility

# Define the number of observations
n <- 539

# Define region values
region <- c(rep(0, 160), rep(1, 192), rep(2, n - 160 - 192))

# Define arm values ensuring 55% are 1
arm <- sample(c(0, 1), n, replace = TRUE, prob = c(0.45, 0.55))

# Generate cd4 values based on arm
cd4 <- ifelse(arm == 0,
              rnorm(sum(arm == 0), mean = 604, sd = 265),
              rnorm(sum(arm == 1), mean = 632, sd = 290))

# Generate viral_load values based on arm
viral_load <- ifelse(arm == 0,
                     rnorm(sum(arm == 0), mean = 3256, sd = 19241),
                     rnorm(sum(arm == 1), mean = 610, sd = 3854))

# Ensure 95% of viral_load values are <200
threshold <- quantile(viral_load, 0.95)
viral_load <- ifelse(viral_load > 200, runif(length(viral_load), min = 0, max = 200), viral_load)

# Create data frame
dataset <- data.frame(region = factor(region), arm = factor(arm), cd4 = cd4, viral_load = viral_load)

# Check dataset
summary(dataset)
 region  arm          cd4           viral_load       
 0:160   0:228   Min.   :-251.6   Min.   :-48569.28  
 1:192   1:311   1st Qu.: 446.1   1st Qu.: -3267.70  
 2:187           Median : 637.1   Median :    24.23  
                 Mean   : 625.9   Mean   : -3507.56  
                 3rd Qu.: 800.0   3rd Qu.:   125.58  
                 Max.   :1518.6   Max.   :   198.68  

I noticed that >25% of the viral load values were negative. I needed to update the code to set the minimum value to 40 (based on the original dataset). To update the code, I used the following follow-up prompt: can you update the code so that the min value for viral_load is 40?

set.seed(123)  # For reproducibility

# Define the number of observations
n <- 539

# Define region values
region <- c(rep(0, 160), rep(1, 192), rep(2, n - 160 - 192))

# Define arm values ensuring 55% are 1
arm <- sample(c(0, 1), n, replace = TRUE, prob = c(0.45, 0.55))

# Generate cd4 values based on arm
cd4 <- ifelse(arm == 0,
              rnorm(sum(arm == 0), mean = 604, sd = 265),
              rnorm(sum(arm == 1), mean = 632, sd = 290))

# Generate viral_load values based on arm
viral_load <- ifelse(arm == 0,
                     rnorm(sum(arm == 0), mean = 3256, sd = 19241),
                     rnorm(sum(arm == 1), mean = 610, sd = 3854))

# Ensure 95% of viral_load values are <200 while maintaining a minimum of 40
threshold <- quantile(viral_load, 0.95)
viral_load <- ifelse(viral_load > 200, runif(length(viral_load), min = 40, max = 200), viral_load)
viral_load <- pmax(viral_load, 40)  # Ensure minimum value is 40

# Create data frame
dataset <- data.frame(region = factor(region), arm = factor(arm), cd4 = cd4, viral_load = viral_load)

# Check dataset
summary(dataset)
 region  arm          cd4           viral_load    
 0:160   0:228   Min.   :-251.6   Min.   : 40.00  
 1:192   1:311   1st Qu.: 446.1   1st Qu.: 40.00  
 2:187           Median : 637.1   Median : 59.39  
                 Mean   : 625.9   Mean   : 87.73  
                 3rd Qu.: 800.0   3rd Qu.:140.22  
                 Max.   :1518.6   Max.   :198.95  

Recreating tables and plots from above

To check the synthetic dataset, I will recreate some of the results from above.

# Mean and SD of viral load and CD4 count by arm
#Synthetic dataset
proc_means(dataset, var = cd4, class = arm)
  CLASS TYPE FREQ VAR   N     MEAN      STD       MIN      MAX
1  <NA>    0  539 cd4 539 625.8906 261.6480 -251.6203 1518.566
2     0    1  228 cd4 228 599.0607 250.5687 -201.0123 1255.786
3     1    1  311 cd4 311 645.5601 268.1762 -251.6203 1518.566
#Original dataset
proc_means(data, var = cd4_survey_yr3, class = random_arm)
# A tibble: 3 × 9
  CLASS  TYPE  FREQ VAR                N  MEAN   STD   MIN   MAX
  <fct> <dbl> <int> <chr>          <int> <dbl> <dbl> <dbl> <dbl>
1 <NA>      0   539 cd4_survey_yr3   539  620.  279.    70  1567
2 0         1   239 cd4_survey_yr3   239  604.  265.    70  1540
3 1         1   300 cd4_survey_yr3   300  632.  290.    78  1567
#Synthetic dataset
proc_means(dataset, var = viral_load, class = arm)
  CLASS TYPE FREQ        VAR   N     MEAN      STD MIN      MAX
1  <NA>    0  539 viral_load 539 87.72695 54.60308  40 198.9452
2     0    1  228 viral_load 228 90.65165 55.76379  40 198.9452
3     1    1  311 viral_load 311 85.58280 53.72527  40 198.2108
#Original dataset
proc_means(data, var = viral_load_yr3, class = random_arm)
# A tibble: 3 × 9
  CLASS  TYPE  FREQ VAR                N  MEAN    STD   MIN    MAX
  <fct> <dbl> <int> <chr>          <int> <dbl>  <dbl> <dbl>  <dbl>
1 <NA>      0   539 viral_load_yr3   539 1784. 13182.    40 177075
2 0         1   239 viral_load_yr3   239 3256. 19241.    40 177075
3 1         1   300 viral_load_yr3   300  610.  3854.    40  37415

The means and standard deviations for each arm are very comparable between the original and synthetic datasets. The means are within 15 of each other. The number in each arm is also similar.

I also used ChatGPT to create plots similar to the plots above to see how well ChatGPT could continue to use the dataset. I supplied this prompt to create the code for the plots: write code to create three plots. first, plot the mean cd4 by arm with error bars showing 1 standard deviation, color arm = 0 as red and arm = 1 as teal. For the second plot, perform the same as the first plot using viral_load instead of cd4. For the third plot, create a scatterplot with cd4 as the y and viral_load as the x, coloring the dots by region.

As a note, based on my prompt, R assigned one of the colors in Plots 1 and 2 as teal, which isn’t a color R recognized. I changed teal to blue in the code below.

# Compute summary statistics for CD4 by arm
cd4_summary <- dataset %>%
  group_by(arm) %>%
  summarise(mean_cd4 = mean(cd4),
            sd_cd4 = sd(cd4)) 

# First plot: Mean CD4 by arm with error bars
p1 <- ggplot(cd4_summary, aes(x = arm, y = mean_cd4, fill = arm)) +
  geom_bar(stat = "identity", position = position_dodge(), color = "black") +
  geom_errorbar(aes(ymin = mean_cd4 - sd_cd4, ymax = mean_cd4 + sd_cd4), width = 0.2) +
  scale_fill_manual(values = c("0" = "red", "1" = "blue")) +
  labs(title = "Mean CD4 by Arm", x = "Arm", y = "Mean CD4 Count") +
  theme_minimal()

# Compute summary statistics for Viral Load by arm
viral_summary <- dataset %>%
  group_by(arm) %>%
  summarise(mean_viral = mean(viral_load),
            sd_viral = sd(viral_load)) 

# Second plot: Mean Viral Load by arm with error bars
p2 <- ggplot(viral_summary, aes(x = arm, y = mean_viral, fill = arm)) +
  geom_bar(stat = "identity", position = position_dodge(), color = "black") +
  geom_errorbar(aes(ymin = mean_viral - sd_viral, ymax = mean_viral + sd_viral), width = 0.2) +
  scale_fill_manual(values = c("0" = "red", "1" = "blue")) +
  labs(title = "Mean Viral Load by Arm", x = "Arm", y = "Mean Viral Load") +
  theme_minimal()

# Third plot: Scatterplot of CD4 vs. Viral Load, colored by Region
p3 <- ggplot(dataset, aes(x = viral_load, y = cd4, color = region)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("0" = "blue", "1" = "green", "2" = "purple")) +
  labs(title = "Scatterplot of CD4 vs Viral Load", x = "Viral Load", y = "CD4 Count") +
  theme_minimal()

# Display plots
print(p1)

print(p2)

print(p3)

The synthetic viral load variable has a max of 200, even though 5% of values were supposed to be above 200 leading to a mean value of 3300 for arm = 0 and 610 for arm = 1. To get around this, I told ChatGPT what the problem was and suggested a solution: For the dataset creation, the mean viral_load is too low. Can the order of the steps for confirming 95% of value are below 200 and sampling from the distribution be switched?

set.seed(123)  # For reproducibility

# Define the number of observations
n <- 539

# Define region values
region <- c(rep(0, 160), rep(1, 192), rep(2, n - 160 - 192))

# Define arm values ensuring 55% are 1
arm <- sample(c(0, 1), n, replace = TRUE, prob = c(0.45, 0.55))

# Generate cd4 values based on arm
cd4 <- ifelse(arm == 0,
              rnorm(sum(arm == 0), mean = 604, sd = 265),
              rnorm(sum(arm == 1), mean = 632, sd = 290))

# Generate viral_load values based on arm
viral_load <- numeric(n)  # Initialize vector
viral_load[arm == 0] <- rnorm(sum(arm == 0), mean = 3256, sd = 19241)
viral_load[arm == 1] <- rnorm(sum(arm == 1), mean = 610, sd = 3854)

# Ensure 95% of viral_load values are <200 while keeping original distribution
high_values <- which(viral_load >= quantile(viral_load, 0.95))
viral_load[high_values] <- runif(length(high_values), min = 40, max = 200)  # Keep min at 40

# Create data frame
dataset <- data.frame(region = factor(region), arm = factor(arm), cd4 = cd4, viral_load = viral_load)

# Check dataset
summary(dataset)
 region  arm          cd4           viral_load      
 0:160   0:228   Min.   :-251.6   Min.   :-48569.3  
 1:192   1:311   1st Qu.: 446.1   1st Qu.: -3309.8  
 2:187           Median : 637.1   Median :   268.9  
                 Mean   : 625.9   Mean   :   101.8  
                 3rd Qu.: 800.0   3rd Qu.:  4466.5  
                 Max.   :1518.6   Max.   : 25375.5  
# Compute summary statistics for CD4 by arm
cd4_summary <- dataset %>%
  group_by(arm) %>%
  summarise(mean_cd4 = mean(cd4),
            sd_cd4 = sd(cd4)) 

# First plot: Mean CD4 by arm with error bars
p1 <- ggplot(cd4_summary, aes(x = arm, y = mean_cd4, fill = arm)) +
  geom_bar(stat = "identity", position = position_dodge(), color = "black") +
  geom_errorbar(aes(ymin = mean_cd4 - sd_cd4, ymax = mean_cd4 + sd_cd4), width = 0.2) +
  scale_fill_manual(values = c("0" = "red", "1" = "blue")) +
  labs(title = "Mean CD4 by Arm", x = "Arm", y = "Mean CD4 Count") +
  theme_minimal()

# Compute summary statistics for Viral Load by arm
viral_summary <- dataset %>%
  group_by(arm) %>%
  summarise(mean_viral = mean(viral_load),
            sd_viral = sd(viral_load)) 

# Second plot: Mean Viral Load by arm with error bars
p2 <- ggplot(viral_summary, aes(x = arm, y = mean_viral, fill = arm)) +
  geom_bar(stat = "identity", position = position_dodge(), color = "black") +
  geom_errorbar(aes(ymin = mean_viral - sd_viral, ymax = mean_viral + sd_viral), width = 0.2) +
  scale_fill_manual(values = c("0" = "red", "1" = "blue")) +
  labs(title = "Mean Viral Load by Arm", x = "Arm", y = "Mean Viral Load") +
  theme_minimal()

# Third plot: Scatterplot of CD4 vs. Viral Load, colored by Region
p3 <- ggplot(dataset, aes(x = viral_load, y = cd4, color = region)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("0" = "blue", "1" = "green", "2" = "purple")) +
  labs(title = "Scatterplot of CD4 vs Viral Load", x = "Viral Load", y = "CD4 Count") +
  theme_minimal()

# Display plots
print(p1)

print(p2)

print(p3)

This resulted in viral_load values that were negative again, so I added a final prompt: Can you update so the lowest value for viral_load is 40?

The following code was created:

set.seed(123)  # For reproducibility

# Define the number of observations
n <- 539

# Define region values
region <- c(rep(0, 160), rep(1, 192), rep(2, n - 160 - 192))

# Define arm values ensuring 55% are 1
arm <- sample(c(0, 1), n, replace = TRUE, prob = c(0.45, 0.55))

# Generate cd4 values based on arm
cd4 <- ifelse(arm == 0,
              rnorm(sum(arm == 0), mean = 604, sd = 265),
              rnorm(sum(arm == 1), mean = 632, sd = 290))

# Generate viral_load values based on arm
viral_load <- numeric(n)  # Initialize vector
viral_load[arm == 0] <- rnorm(sum(arm == 0), mean = 3256, sd = 19241)
viral_load[arm == 1] <- rnorm(sum(arm == 1), mean = 610, sd = 3854)

# Ensure 95% of viral_load values are <200 while keeping original distribution
high_values <- which(viral_load >= quantile(viral_load, 0.95))
viral_load[high_values] <- runif(length(high_values), min = 40, max = 200)  # Keep min at 40

# Ensure minimum viral_load is 40
viral_load <- pmax(viral_load, 40)

# Create data frame
dataset <- data.frame(region = factor(region), arm = factor(arm), cd4 = cd4, viral_load = viral_load)

# Check dataset
summary(dataset)
 region  arm          cd4           viral_load     
 0:160   0:228   Min.   :-251.6   Min.   :   40.0  
 1:192   1:311   1st Qu.: 446.1   1st Qu.:   40.0  
 2:187           Median : 637.1   Median :  268.9  
                 Mean   : 625.9   Mean   : 3525.9  
                 3rd Qu.: 800.0   3rd Qu.: 4466.5  
                 Max.   :1518.6   Max.   :25375.5  
# Compute summary statistics for CD4 by arm
cd4_summary <- dataset %>%
  group_by(arm) %>%
  summarise(mean_cd4 = mean(cd4),
            sd_cd4 = sd(cd4)) 

# First plot: Mean CD4 by arm with error bars
p1 <- ggplot(cd4_summary, aes(x = arm, y = mean_cd4, fill = arm)) +
  geom_bar(stat = "identity", position = position_dodge(), color = "black") +
  geom_errorbar(aes(ymin = mean_cd4 - sd_cd4, ymax = mean_cd4 + sd_cd4), width = 0.2) +
  scale_fill_manual(values = c("0" = "red", "1" = "blue")) +
  labs(title = "Mean CD4 by Arm", x = "Arm", y = "Mean CD4 Count") +
  theme_minimal()

# Compute summary statistics for Viral Load by arm
viral_summary <- dataset %>%
  group_by(arm) %>%
  summarise(mean_viral = mean(viral_load),
            sd_viral = sd(viral_load)) 

# Second plot: Mean Viral Load by arm with error bars
p2 <- ggplot(viral_summary, aes(x = arm, y = mean_viral, fill = arm)) +
  geom_bar(stat = "identity", position = position_dodge(), color = "black") +
  geom_errorbar(aes(ymin = mean_viral - sd_viral, ymax = mean_viral + sd_viral), width = 0.2) +
  scale_fill_manual(values = c("0" = "red", "1" = "blue")) +
  labs(title = "Mean Viral Load by Arm", x = "Arm", y = "Mean Viral Load") +
  theme_minimal()

# Third plot: Scatterplot of CD4 vs. Viral Load, colored by Region
p3 <- ggplot(dataset, aes(x = viral_load, y = cd4, color = region)) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c("0" = "blue", "1" = "green", "2" = "purple")) +
  labs(title = "Scatterplot of CD4 vs Viral Load", x = "Viral Load", y = "CD4 Count") +
  theme_minimal()

# Display plots
print(p1)

print(p2)

print(p3)

The bar chart for CD4 by arm is very similar to the original data. Based on the bar chart for viral load, the means appear simialr but the standard errors are reduced compared to the original data. The final plot, the scatterplot of CD4 count by viral load, showed that many more observations had higher viral loads and many more had viral loads above 200. It seems like ChatGPT isn’t handling the prompts for this variable well. Overall, the synthetic data look good but not perfect.