Data Exercise

About the dataset

The data is obtained from The University of Michigan Health and Retirement Study (HRS). HRS is a longitudinal panel study that surveys a representative sample of approximately 20,000 people in America, supported by the National Institute on Aging (NIA U01AG009740) and the Social Security Administration.

Through its unique and in-depth interviews, the HRS provides an invaluable and growing body of multidisciplinary data that researchers can use to address important questions about the challenges and opportunities of aging.

Please visit: https://hrsdata.isr.umich.edu/data-products/public-survey-data?_gl=18ofhfg_gaMTc5MDM4ODE3NS4xNzM4MzU5MjU5_ga_FF28MW3MW2*MTczODM1OTI1OS4xLjEuMTczODM1OTI3NC4wLjAuMA..

The data is in STATA format.

I plan to conduct survival analysis for retired people who had high Cholesterol in 2014 until they Heart Attack.

Install all packages needed

Install and library all packages needed in this section.

install.packages("tidyverse")
The following package(s) will be installed:
- tidyverse [2.0.0]
These packages will be installed into "C:/Users/mn27712/MADA_NEW/muhammadnasir-mada2025-portofolio/renv/library/windows/R-4.4/x86_64-w64-mingw32".

# Installing packages --------------------------------------------------------
- Installing tidyverse ...                      OK [linked from cache]
Successfully installed 1 package in 21 milliseconds.
library(tidyverse)
Warning: package 'tibble' was built under R version 4.4.3
Warning: package 'tidyr' was built under R version 4.4.3
Warning: package 'readr' was built under R version 4.4.3
Warning: package 'purrr' was built under R version 4.4.3
Warning: package 'dplyr' was built under R version 4.4.3
Warning: package 'forcats' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
install.packages("ggplot2")
The following package(s) will be installed:
- ggplot2 [3.5.2]
These packages will be installed into "C:/Users/mn27712/MADA_NEW/muhammadnasir-mada2025-portofolio/renv/library/windows/R-4.4/x86_64-w64-mingw32".

# Installing packages --------------------------------------------------------
- Installing ggplot2 ...                        OK [linked from cache]
Successfully installed 1 package in 34 milliseconds.
library(ggplot2)
library(here)
here() starts at C:/Users/mn27712/MADA_NEW/muhammadnasir-mada2025-portofolio
install.packages("patchwork")  # This package is to redefine "/" operator for plot arrangement
The following package(s) will be installed:
- patchwork [1.3.0]
These packages will be installed into "C:/Users/mn27712/MADA_NEW/muhammadnasir-mada2025-portofolio/renv/library/windows/R-4.4/x86_64-w64-mingw32".

# Installing packages --------------------------------------------------------
- Installing patchwork ...                      OK [linked from cache]
Successfully installed 1 package in 26 milliseconds.
library(patchwork)
Warning: package 'patchwork' was built under R version 4.4.3
install.packages("writexl")
The following package(s) will be installed:
- writexl [1.5.4]
These packages will be installed into "C:/Users/mn27712/MADA_NEW/muhammadnasir-mada2025-portofolio/renv/library/windows/R-4.4/x86_64-w64-mingw32".

# Installing packages --------------------------------------------------------
- Installing writexl ...                        OK [linked from cache]
Successfully installed 1 package in 21 milliseconds.
library(writexl)
library(haven)

Loading Dataset

This dataset is longitudinal survey. In this analisys, I take three years, 2014, 2016,and 2018. I will read data per year then combine them.

In this part, I will load the dataset using here() function.

Reading data for 2014 and 2016 needs several steps, because the data is in STATA format, and also we need to match with dictionary. ### Data for 2014

## set path  dictionary 
dict.file2014 <- here::here("data-exercise","data", "H14C_R.dct")

file.exists(here::here("data-exercise", "data", "H14C_R.dct")) # If we want to check the working path 
[1] TRUE
# Read the dictionary file 
df.dict2014 <- read.table(dict.file2014, skip = 1, fill = TRUE, stringsAsFactors = FALSE)

# Set column names for dictionary dataframe 
colnames(df.dict2014) <- c("col.num","col.type","col.name","col.width","col.lbl")

# Remove last row which only contains a closing }
df.dict2014 <- df.dict2014[-nrow(df.dict2014),]


# Remove first row 
df.dict2014 <- df.dict2014[-1,]

# Extract numeric value from column width field
df.dict2014$col.width <- as.integer(sapply(df.dict2014$col.width, gsub, pattern = "[^0-9\\.]", replacement = ""))


# Convert column types to format to be used with read_fwf function
df.dict2014$col.type <- sapply(df.dict2014$col.type, function(x) ifelse(x %in% c("int","byte","long"), "i", ifelse(x == "float", "n", ifelse(x == "double", "d", "c"))))



##data file 
data.file2014 <- here::here( "data-exercise","data", "H14C_R.da") 

# Read the data file into a dataframe
df2014 <- read_fwf(
    file = data.file2014,
    fwf_widths(widths = df.dict2014$col.width, col_names = df.dict2014$col.name),
    col_types = paste(df.dict2014$col.type, collapse = ""))

# Add column labels to headers
#attributes(df2014)$variable.labels <- df.dict2014$col.lbl

head(df2014)
# A tibble: 6 × 227
  HHID   PN    OSUBHH NSUBHH OPN_SP  OCSR OFAMR OFINR OC231 OC234 OC235 OC239
  <chr>  <chr> <chr>  <chr>  <chr>  <int> <int> <int> <int> <int> <int> <int>
1 000003 020   0      0      <NA>       1     1     1     0     1     1     1
2 010001 010   0      0      <NA>       1     1     1     0     1     1     1
3 010003 030   0      0      <NA>       1     1     1     0     1     1     1
4 010004 040   0      0      <NA>       1     1     1     0     1     1     1
5 010013 040   1      1      <NA>       1     1     1     0     1     1     1
6 010013 010   2      2      <NA>       1     1     1     0     1     1     1
# ℹ 215 more variables: OC248 <int>, OC185 <int>, OC001 <int>, OC002 <int>,
#   OC005 <int>, OC006 <int>, OC010 <int>, OC214 <int>, OC011 <int>,
#   OC012 <int>, OC236 <int>, OC018 <int>, OC019 <int>, OC020 <int>,
#   OC232u1 <int>, OC021M1 <int>, OC021M2 <int>, OC021M3 <int>, OC021M4 <int>,
#   OC021M5 <int>, OC021M6 <int>, OC021M7 <int>, OC023 <int>, OC024 <int>,
#   OC028 <int>, OC029 <int>, OC030 <int>, OC031 <int>, OC032 <int>,
#   OC033 <int>, OC034 <int>, OC036 <int>, OC037 <int>, OC038 <int>, …

Data for 2016

## set path  dictionary 
dict.file2016 <- here::here("data-exercise","data", "H16C_R.dct")

file.exists(here::here("data-exercise", "data", "H16C_R.dct")) # If we want to check the working path 
[1] TRUE
# Read the dictionary file 
df.dict2016 <- read.table(dict.file2016, skip = 1, fill = TRUE, stringsAsFactors = FALSE)

# Set column names for dictionary dataframe 
colnames(df.dict2016) <- c("col.num","col.type","col.name","col.width","col.lbl")

# Remove last row which only contains a closing }
df.dict2016 <- df.dict2016[-nrow(df.dict2016),]


# Remove first row 
df.dict2016 <- df.dict2016[-1,]

# Extract numeric value from column width field
df.dict2016$col.width <- as.integer(sapply(df.dict2016$col.width, gsub, pattern = "[^0-9\\.]", replacement = ""))


# Convert column types to format to be used with read_fwf function
df.dict2016$col.type <- sapply(df.dict2016$col.type, function(x) ifelse(x %in% c("int","byte","long"), "i", ifelse(x == "float", "n", ifelse(x == "double", "d", "c"))))



##data file 
data.file2016 <- here::here( "data-exercise","data", "H14C_R.da") 

# Read the data file into a dataframe
df2016 <- read_fwf(
    file = data.file2016,
    fwf_widths(widths = df.dict2016$col.width, col_names = df.dict2016$col.name),
    col_types = paste(df.dict2016$col.type, collapse = ""))
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
# Add column labels to headers
#attributes(df2016)$variable.labels <- df.dict2016$col.lbl 
head(df2016)
# A tibble: 6 × 240
  HHID   PN    PSUBHH OSUBHH PPN_SP  PCSR PFAMR PFINR PC231 PC234 PC235 PC239
  <chr>  <chr> <chr>  <chr>  <chr>  <int> <int> <int> <int> <int> <int> <int>
1 000003 020   0      0      <NA>       1     1     1     0     1     1     1
2 010001 010   0      0      <NA>       1     1     1     0     1     1     1
3 010003 030   0      0      <NA>       1     1     1     0     1     1     1
4 010004 040   0      0      <NA>       1     1     1     0     1     1     1
5 010013 040   1      1      <NA>       1     1     1     0     1     1     1
6 010013 010   2      2      <NA>       1     1     1     0     1     1     1
# ℹ 228 more variables: PC248 <int>, PC001 <int>, PC185 <int>, PC002 <int>,
#   PC005 <int>, PC006 <int>, PC010 <int>, PC285 <int>, PC214 <int>,
#   PC011 <int>, PC012 <int>, PC236 <int>, PC018 <int>, PC019 <int>,
#   PC020 <int>, PC232u1 <int>, PC021M1 <int>, PC021M2 <int>, PC021M3 <int>,
#   PC021M4 <int>, PC021M5 <int>, PC021M6 <int>, PC021M7 <int>, PC023 <int>,
#   PC024 <int>, PC028 <int>, PC029 <int>, PC030 <int>, PC031 <int>,
#   PC032 <int>, PC033 <int>, PC034 <int>, PC036 <int>, PC037 <int>, …

Data for 2018

Data for 2018 is simple to read because it has R data format. It does not need many steps needed as data for 2014 and 2016.

df2018 <- here::here("data-exercise","data", "2018_h18c_r.dta")
df2018 <- read_dta(df2018)
head(df2018)
# A tibble: 6 × 258
  hhid   pn    qsubhh psubhh QPN_SP  qcsr qfamr qfinr QC231 QC234 QC235 QC239
  <chr>  <chr> <chr>  <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 010003 030   0      0      ""         1     1     1     0     1     0     1
2 010004 040   0      0      "041"      1     1     1     0     1     0     1
3 010013 040   1      1      ""         1     1     1     0     1     0     1
4 010038 010   0      0      "040"      5     5     1     0     1     0     1
5 010038 040   0      0      "010"      1     1     5     0     1     0     1
6 010059 020   0      0      "030"      1     1     5     0     1     0     1
# ℹ 246 more variables: QC248 <dbl>, QC185 <dbl>, QC001 <dbl>, QC002 <dbl>,
#   QC005 <dbl>, QC006 <dbl>, QC010 <dbl>, QC285 <dbl>, QC011 <dbl>,
#   QC012 <dbl>, QC236 <dbl>, QC214 <dbl>, QC018 <dbl>, QC019 <dbl>,
#   QC020 <dbl>, QC232U1 <dbl>, QC021M1 <dbl>, QC021M2 <dbl>, QC021M3 <dbl>,
#   QC021M4 <dbl>, QC021M5 <dbl>, QC021M6 <dbl>, QC021M7 <dbl>, QC023 <dbl>,
#   QC024 <dbl>, QC028 <dbl>, QC029 <dbl>, QC030 <dbl>, QC031 <dbl>,
#   QC032 <dbl>, QC033 <dbl>, QC034 <dbl>, QC036 <dbl>, QC037 <dbl>, …

Data for 2020

df2020 <- here::here("data-exercise","data", "2020_H20C_R.dta")
df2020 <- read_dta(df2020)
head(df2020)
# A tibble: 6 × 255
  HHID   PN    RSUBHH QSUBHH RPN_SP  RCSR RFAMR RFINR RC231 RC234 RC235 RC239
  <chr>  <chr> <chr>  <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 010003 030   0      0      ""         1     1     1     1     0     1     0
2 010004 040   1      0      "041"      1     1     1     1     0     1     0
3 010013 040   1      1      ""         1     1     1     1     0     1     0
4 010038 010   0      0      "040"      1     5     1     1     0     1     0
5 010038 040   0      0      "010"      5     1     5     1     0     1     0
6 010050 010   0      0      ""         1     1     1     1     0     1     0
# ℹ 243 more variables: RC248 <dbl>, RC185 <dbl>, RC001 <dbl>, RC002 <dbl>,
#   RC005 <dbl>, RC006 <dbl>, RC010 <dbl>, RC285 <dbl>, RC011 <dbl>,
#   RC012 <dbl>, RC236 <dbl>, RC214 <dbl>, RC018 <dbl>, RC019 <dbl>,
#   RC020 <dbl>, RC021M1 <dbl>, RC021M2 <dbl>, RC021M3 <dbl>, RC021M4 <dbl>,
#   RC021M5 <dbl>, RC023 <dbl>, RC024 <dbl>, RC028 <dbl>, RC029 <dbl>,
#   RC030 <dbl>, RC031 <dbl>, RC032 <dbl>, RC033 <dbl>, RC034 <dbl>,
#   RC036 <dbl>, RC037 <dbl>, RC038 <dbl>, RC039 <dbl>, RC257 <dbl>, …

Data for 2022

The data is in STATA format

## set path  dictionary 
dict.file2022 <- here::here("data-exercise","data", "H22C_R.dct")

file.exists(here::here("data-exercise", "data", "H22C_R.dct")) # If we want to check the working path 
[1] TRUE
# Read the dictionary file 
df.dict2022 <- read.table(dict.file2022, skip = 1, fill = TRUE, stringsAsFactors = FALSE)

# Set column names for dictionary dataframe 
colnames(df.dict2022) <- c("col.num","col.type","col.name","col.width","col.lbl")

# Remove last row which only contains a closing }
df.dict2022 <- df.dict2022[-nrow(df.dict2022),]


# Remove first row 
df.dict2022 <- df.dict2022[-1,]

# Extract numeric value from column width field
df.dict2022$col.width <- as.integer(sapply(df.dict2022$col.width, gsub, pattern = "[^0-9\\.]", replacement = ""))


# Convert column types to format to be used with read_fwf function
df.dict2022$col.type <- sapply(df.dict2022$col.type, function(x) ifelse(x %in% c("int","byte","long"), "i", ifelse(x == "float", "n", ifelse(x == "double", "d", "c"))))



##data file 
data.file2022 <- here::here( "data-exercise","data", "H22C_R.da") 

# Read the data file into a dataframe
df2022 <- read_fwf(
    file = data.file2022,
    fwf_widths(widths = df.dict2022$col.width, col_names = df.dict2022$col.name),
    col_types = paste(df.dict2022$col.type, collapse = ""))

# Add column labels to headers
#attributes(df2022)$variable.labels <- df.dict2022$col.lbl #
head(df2022)
# A tibble: 6 × 245
  HHID   PN    SSUBHH RSUBHH SPN_SP  SCSR SFAMR SFINR SC231 SC234 SC235 SC239
  <chr>  <chr> <chr>  <chr>  <chr>  <int> <int> <int> <int> <int> <int> <int>
1 010001 010   0      0      <NA>       1     1     1     0     1     2     1
2 010003 030   0      0      <NA>       1     1     1     0     1     2     1
3 010004 040   1      1      <NA>       1     1     1     0     1     2     1
4 010013 040   1      1      <NA>       1     1     1     0     1     2     1
5 010038 010   0      0      040        1     5     1     0     1     2     1
6 010038 040   0      0      010        5     1     5     0     1     2     1
# ℹ 233 more variables: SC248 <int>, SC185 <int>, SC001 <int>, SC002 <int>,
#   SC005 <int>, SC006 <int>, SC010 <int>, SC285 <int>, SC011 <int>,
#   SC012 <int>, SC214 <int>, SC018 <int>, SC019 <int>, SC020 <int>,
#   SC024 <int>, SC028 <int>, SC029 <int>, SC030 <int>, SC033 <int>,
#   SC036 <int>, SC037 <int>, SC038 <int>, SC257 <int>, SC258 <int>,
#   SC259 <int>, SC274 <int>, SC275 <int>, SC276 <int>, SC277 <int>,
#   SC040 <int>, SC041 <int>, SC043 <int>, SC044 <int>, SC260 <int>, …

Data Cleaning

Select variables needed for each year

Variables that needed in this study include Household Identification number, Respondent Person Identification number, High Cholesterol Status, Years first had heart attack, Month first had heart attack, Ever had heart attack.

#data2014
da2014 <- df2014 %>% 
  select("HHID", "PN", "OC283", "OC040", "OC258", "OC259") %>%
  rename(hhid = HHID, pn = PN, chol= OC283, ha = OC040, year = OC258, month= OC259)


da2014 <- lapply(da2014, function(x) { attributes(x) <- NULL; x }) #delete the column lables from dataset, because their position is incorrect after deleting some variables. 
da2014 <- as.data.frame(da2014)  # Convert back to a dataframe

#data2016
da2016 <- df2016 %>% 
  select("HHID", "PN", "PC283", "PC040", "PC258", "PC259") %>%
  rename(hhid = HHID, pn = PN, chol= PC283, ha = PC040, year = PC258, month= PC259)


da2016 <- lapply(da2016, function(x) { attributes(x) <- NULL; x }) #delete the column lables from dataset, because their position is incorrect after deleting some variables. 
da2016 <- as.data.frame(da2016)  # Convert back to a dataframe

#data2018 
da2018 <- df2018 %>%
  select("hhid", "pn", "QC283", "QC040", "QC258", "QC259") %>%
  rename( chol = QC283, ha= QC040, year = QC258, month = QC259) # I rename the column names because in this year, they use different name. 

da2018 <- lapply(da2018, function(x) { attributes(x) <- NULL; x }) #delete the column lables from dataset, because their position is incorrect after deleting some variables. 
da2018 <- as.data.frame(da2018)  # Convert back to a dataframe

da2020 <- df2020 %>%
  select("HHID", "PN", "RC283", "RC040", "RC258", "RC259") %>%
  rename( hhid = HHID, pn = PN, chol = RC283, ha= RC040, year = RC258, month = RC259) # I rename the column names because in this year, they use different name. 

da2022 <- df2022 %>%
  select("HHID", "PN", "SC283", "SC040", "SC258", "SC259") %>%
  rename( hhid = HHID, pn = PN, chol = SC283, ha= SC040, year = SC258, month = SC259) # I rename the column names because in this year, they use different name. 

Combine dataset

# Perform a full join step-by-step
merged_data <- da2014 %>%
  full_join(da2016, by = c("hhid", "pn"), suffix = c("_2014", "_2016")) %>%
  full_join(da2018, by = c("hhid", "pn"), suffix = c("", "_2018")) %>%
  full_join(da2020, by = c("hhid", "pn"), suffix = c("", "_2020")) %>%
  full_join(da2022, by = c("hhid", "pn"), suffix = c("", "_2022")) 

# View result
head(merged_data)
    hhid  pn chol_2014 ha_2014 year_2014 month_2014 chol_2016 ha_2016 year_2016
1 000003 020         5      NA        NA         NA        NA      NA        NA
2 010001 010         1      NA        NA         NA        NA      NA        NA
3 010003 030         1      NA        NA         NA        NA      NA        NA
4 010004 040         5      NA        NA         NA        NA      NA        NA
5 010013 040         1      NA        NA         NA        NA      NA        NA
6 010013 010         1       5      1995         NA        NA      NA       995
  month_2016 chol ha year month chol_2020 ha_2020 year_2020 month_2020
1         NA   NA NA   NA    NA        NA      NA        NA         NA
2         NA   NA NA   NA    NA        NA      NA        NA         NA
3         NA    1 NA   NA    NA         1      NA        NA         NA
4         NA    5 NA   NA    NA         1      NA        NA         NA
5         NA    1 NA   NA    NA         1      NA        NA         NA
6         NA   NA NA   NA    NA        NA      NA        NA         NA
  chol_2022 ha_2022 year_2022 month_2022
1        NA      NA        NA         NA
2         5      NA        NA         NA
3         5       5        NA         NA
4         1      NA        NA         NA
5         5      NA        NA         NA
6        NA      NA        NA         NA

I realize that we only need cholesterol status in 2014, therefore, I want to delete cholesterol in 2016 and 2018

df <- merged_data[, !(colnames(merged_data) %in% c("chol_2016", "chol", "chol_2020", "chol_2022"))] %>% # deleting cholesterol from 2016 and 2018 
  rename(ha_2018 = ha, year_2018 =year, month_2018 = month) # I rename variables from 2018 

Exploratory data

I want to create pie chat to show persons having high cholesterol in 2014.

table_1 <- table(df$chol_2014)

chol_df <- as.data.frame(table_1) %>% 
  select(1,2)

colnames(chol_df) <- c("Chol_stat", "Count") # rename the column 

chol_df  <- chol_df  %>%
  filter(!Chol_stat %in% c(8, 9)) %>% 
  mutate(Chol_category = case_when(
    Chol_stat == 1 ~ "Normal",  # If Chol_stat is 1, it's "Normal"
    Chol_stat == 5 ~ "High",    # If Chol_stat is 5, it's "High"
    TRUE ~ "Other"              # For any other value, label as "Other" (optional)
  )) %>%
  select(Chol_category, Count) %>%
  rename(Chol_stat = Chol_category)
  

plot1 <- ggplot(chol_df, aes(x = "", y = Count, fill = Chol_stat)) + 
  geom_bar(stat = "identity", width = 1) +  # Create bar chart
  coord_polar(theta = "y") +  # Convert to pie chart
  geom_text(aes(label = paste0(round(Count / sum(Count) * 100, 1), "%")), 
            position = position_stack(vjust = 0.5), color = "white") +  # Add percentage labels
  labs(title = "Cholesterol status in 2014") +
  theme_void()


figure_file = here("data-exercise","result","figures","chol_status.png")
ggsave(filename = figure_file, plot=plot1) 
Saving 7 x 5 in image
print(plot1)

I want to create plot number of person having Heart Attack in each year

table_2 <- table(df$ha_2014)
table_3 <- table(df$ha_2016)
table_4 <- table(df$ha_2018)
table_5 <- table(df$ha_2020)
table_6 <- table(df$ha_2022)


# Create the combined table
table_ha <- data.frame(
  status = c("1"),  # Heart attack status
  `2014` = table_2[c("1")],  # Heart attack status counts for 2014
  `2016` = table_3[c("1")],  # Heart attack status counts for 2016
  `2018` = table_4[c("1")],  # Heart attack status counts for 2018
  `2020` = table_5[c("1")],  # Heart attack status counts for 2020
  `2022` = table_6[c("1")]   # Heart attack status counts for 2022
) %>%
  rename(ha_2014 =X2014, ha_2016=X2016, ha_2018=X2018, ha_2020 =X2020, ha_2022 =X2022) %>%
  mutate(ha_case = ifelse(status==1, "heart attack", "No heart attack")) %>%
  select(ha_case, everything()) %>%  # Move ha_case to the first column
  select(-status) %>%  # Remove the status column
  select(-ha_case) # I delete it because it prevent to create long tabel 

# Create long table 
table_ha <- table_ha %>%
  pivot_longer(cols = starts_with("ha_"),  # Reshape the ha_ columns into long format
               names_to = "year",  # Column for years
               values_to = "count") %>%
  select(year, count)  

# I want to change the value into numeric year
table_ha <- table_ha %>% 
  mutate(years = case_when(
    year == "ha_2014" ~ "2014",
    year == "ha_2016" ~ "2016",
    year == "ha_2018" ~ "2018", 
    year == "ha_2020" ~ "2020",
    year == "ha_2022" ~ "2022"
  )) %>%
  select(years, everything())

table_ha <- table_ha %>% 
  select(-year)
head(table_ha)
# A tibble: 5 × 2
  years count
  <chr> <int>
1 2014    190
2 2016    349
3 2018    153
4 2020    230
5 2022    110

After creating long table, I want to create plot to show the number of heart attack cases each observation year.

plot2 <- ggplot(table_ha, aes(x = years, y = count)) +
  geom_line(size = 1, color = "blue") +
  geom_point(size = 3, color = "red") +
  labs(title = "Heart Attack Cases Over Time Among Retired People in the USA ", 
       x = "Years", 
       y = "Number of Heart Attack Cases ",
       caption= "Sourse: Health and Retirement Study") +
  theme_minimal() + 
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5, face = "bold.italic")) + # to adjust the title position, size, and color
   theme(plot.background = element_rect(color = "black", fill = NA, linewidth = 1)) # to create frame of the plot 
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
figure_file = here("data-exercise","result","figures","heart attack cases.png")
ggsave(filename = figure_file, plot=plot2) 
Saving 7 x 5 in image
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?
print(plot2) 
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

Now, I want to perform inclusion and inclusion criteria. I will include person who had high cholesterol level in 2014, but they did not have heart attack (disease free) in 2014.

# Step 1: Filter the dataset

df_filtered <- df %>%
  filter(chol_2014 == 1, # High cholesterol in 2014
         ha_2014 == 5) # No heart attack in 2014
data <- df_filtered