Tidy Tuesday Exercise

Introduction

In this exercise, I am working on the TidyTuesday dataset for April 8th, 2025. The dataset covers “Timely and Effective Care” in the United States from the Centers for Medicare & Medicaid Services (CMS) and was curated by Jon Harmon of the Data Science Learning Community. The dataset contains several variables, including state, condition, measure ID, measure name, score, footnote, and the measurement start and end dates.

  1. state (character): The two-letter code for the state (or territory, etc) where the hospital is located.

  2. condition (character): The condition for which the patient was admitted. Six categories of conditions are included in the data.

  3. measure_id (character): The ID of the thing being measured. Note that there are 22 unique IDs but only 21 unique names.

  4. measure_name (character): The name of the thing being measured. Note that there are 22 unique IDs but only 21 unique names.

  5. score (numeric): The score of the measure.

  6. footnote (character): Footnotes that apply to this measure: 5 = “Results are not available for this reporting period.”, 25 = “State and national averages include Veterans Health Administration (VHA) hospital data.”, 26 = “State and national averages include Department of Defense (DoD) hospital data.”.

  7. start_date (date): The date on which measurement began for this measure.

  8. end_date (date): The date on which measurement ended for this measure.

Loading Packages

#install.packages("tidyverse")
library(tidyverse)
Warning: package 'tibble' was built under R version 4.4.3
Warning: package 'tidyr' was built under R version 4.4.3
Warning: package 'readr' was built under R version 4.4.3
Warning: package 'purrr' was built under R version 4.4.3
Warning: package 'dplyr' was built under R version 4.4.3
Warning: package 'forcats' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#install.packages("ggplot2")
library(ggplot2)
library(here)
here() starts at C:/Users/mn27712/MADA_NEW/muhammadnasir-mada2025-portofolio
library(dplyr)
library(tidyverse)
library(lubridate)
#install.packages("gt")
library(gt)
#install.packages("gtsummary")
library(gtsummary)   
Warning: package 'gtsummary' was built under R version 4.4.3
#install.packages("cli")
library(cli)
Warning: package 'cli' was built under R version 4.4.3
#install.packages("tidymodels")
library(tidymodels)  
Warning: package 'tidymodels' was built under R version 4.4.3
── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
✔ broom        1.0.8     ✔ rsample      1.3.0
✔ dials        1.4.0     ✔ tune         1.3.0
✔ infer        1.0.7     ✔ workflows    1.2.0
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.3.1     ✔ yardstick    1.3.2
✔ recipes      1.2.1     
Warning: package 'dials' was built under R version 4.4.3
Warning: package 'infer' was built under R version 4.4.3
Warning: package 'modeldata' was built under R version 4.4.3
Warning: package 'parsnip' was built under R version 4.4.3
Warning: package 'recipes' was built under R version 4.4.3
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
#install.packages("broom.mixed")
#install.packages("purrr")
library(skimr)
#install.packages("naniar")
library(naniar)

Attaching package: 'naniar'

The following object is masked from 'package:skimr':

    n_complete
#install.packages("treemapify")
library(treemapify)
#install.packages("usmap")
library(usmap)
#install.packages("car")
library(car)
Warning: package 'car' was built under R version 4.4.3
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Data sources

The dataset was downloaded from https://data.cms.gov/provider-data/dataset/apyc-v239. Information about the dataset was obtained from the TidyTuesday GitHub repository for the week of April 8th, 2025.
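
Alternatively, the same week's data could be pulled directly with the tidytuesdayR package. This is only a sketch: it assumes the package is installed and that this week's release exposes a table named care_state.

# Optional alternative load (assumption: the 2025-04-08 release contains a care_state table)
# tt <- tidytuesdayR::tt_load("2025-04-08")
# care <- tt$care_state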

#Importing the downloaded CSV
care <- read_csv(here("tidytuesday-exercise", "data", "care_state.csv")) 
Rows: 1232 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): state, condition, measure_id, measure_name, footnote
dbl  (1): score
date (2): start_date, end_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#check the structure of the dataset
str(care)
spc_tbl_ [1,232 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ state       : chr [1:1232] "AK" "AK" "AK" "AK" ...
 $ condition   : chr [1:1232] "Healthcare Personnel Vaccination" "Healthcare Personnel Vaccination" "Emergency Department" "Emergency Department" ...
 $ measure_id  : chr [1:1232] "HCP_COVID_19" "IMM_3" "OP_18b" "OP_18b_HIGH_MIN" ...
 $ measure_name: chr [1:1232] "Percentage of healthcare personnel who are up to date with COVID-19 vaccinations" "Healthcare workers given influenza vaccination Higher percentages are better" "Average (median) time patients spent in the emergency department before leaving from the visit A lower number o"| __truncated__ "Average time patients spent in the emergency department before being sent home A lower number of minutes is better (high)" ...
 $ score       : num [1:1232] 7.3 80 140 157 136 136 NA 196 230 182 ...
 $ footnote    : chr [1:1232] NA NA "25, 26" "25, 26" ...
 $ start_date  : Date[1:1232], format: "2024-01-01" "2023-10-01" ...
 $ end_date    : Date[1:1232], format: "2024-03-31" "2024-03-31" ...
 - attr(*, "spec")=
  .. cols(
  ..   state = col_character(),
  ..   condition = col_character(),
  ..   measure_id = col_character(),
  ..   measure_name = col_character(),
  ..   score = col_double(),
  ..   footnote = col_character(),
  ..   start_date = col_date(format = ""),
  ..   end_date = col_date(format = "")
  .. )
 - attr(*, "problems")=<externalptr> 

Let’s see the data summary using skim()

skim(care) # this is the smart version of summary() 
Data summary
Name care
Number of rows 1232
Number of columns 8
_______________________
Column type frequency:
character 5
Date 2
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1.00 2 2 0 56 0
condition 0 1.00 11 35 0 6 0
measure_id 0 1.00 5 20 0 22 0
measure_name 0 1.00 26 172 0 21 0
footnote 168 0.86 1 6 0 3 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
start_date 0 1 2023-01-01 2024-01-01 2023-04-01 4
end_date 0 1 2023-12-31 2024-03-31 2024-03-31 2

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
score 155 0.87 134.04 102.02 1 70 93 193 730 ▇▃▁▁▁
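
The skim output confirms 22 unique measure_id values but only 21 unique measure_name values, as noted in the data dictionary. A quick check (a sketch using the already loaded dplyr) can show which name is shared by two IDs:

# Find the measure_name that maps to more than one measure_id
care %>%
  distinct(measure_id, measure_name) %>%
  count(measure_name, sort = TRUE) %>%
  filter(n > 1)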

Data Exploration

Explore each variable

In this step, I would like to explore each variable to better understand the data distribution and patterns.

state

table(care$state)

AK AL AR AS AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI MN 
22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 
MO MP MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX UT VA VI VT 
22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 
WA WI WV WY 
22 22 22 22 

All states have the same number of entries (22 each).
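
The same point can be confirmed programmatically; a minimal sketch:

# Check that every state has exactly 22 rows
care %>%
  count(state) %>%
  summarise(all_states_have_22 = all(n == 22))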

condition

# 1. Summarize the condition counts
condition_counts <- care %>%
  count(condition, sort = TRUE)

# 2. Create a nice table
condition_counts %>%
  gt() %>%
  tab_header(
    title = "Condition Frequencies",
    subtitle = "Timely and Effective Care Data"
  ) %>%
  cols_label(
    condition = "Condition",
    n = "Count"
  ) %>%
  fmt_number(
    columns = c(n),
    decimals = 0
  ) %>%
  opt_table_font(
    font = list(
      google_font("Roboto"), 
      default_fonts()
    )
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels(everything())
  )
Condition Frequencies
Timely and Effective Care Data
Condition Count
Emergency Department 672
Sepsis Care 280
Healthcare Personnel Vaccination 112
Cataract surgery outcome 56
Colonoscopy care 56
Electronic Clinical Quality Measure 56

Score of the measure

Distribution of care score

ggplot(care, aes(x = score)) +
  geom_density(fill = "lightgreen", color = "black") +
  labs(
    title = "Distribution of Care Scores",
    x = "Score",
    y = "Count"
  ) +
  theme_minimal()
Warning: Removed 155 rows containing non-finite outside the scale range
(`stat_density()`).

I want to explore the score of each measure

ggplot(care, aes(x = score, fill = measure_id)) +
  geom_histogram(color = "white", bins = 30) +  # white borders between bars
  facet_wrap(vars(measure_id), scales = "free") +  # free scales if measures vary a lot
  labs(title = "Distribution of Score by Measure",
       x = "Score",
       y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")
Warning: Removed 155 rows containing non-finite outside the scale range
(`stat_bin()`).

Now I want to explore the average score of each condition.

# 1. Summarize the average score by condition
avg_score_condition <- care %>%
  group_by(condition) %>%
  summarise(avg_score = mean(score, na.rm = TRUE)) %>%
  arrange(desc(avg_score))

# 2. Plot
ggplot(avg_score_condition, aes(x = reorder(condition, avg_score), y = avg_score, fill = condition)) +
  geom_col() +  # bar plot
  coord_flip() +
  labs(title = "Average Score by Condition",
       x = "Condition",
       y = "Average Score") +
  theme_minimal() +
  theme(legend.position = "none")

state_score_avg <- care%>%
  group_by(state) %>%
  summarise(avg_score = mean(score, na.rm = TRUE))

# Plot the map
plot_usmap(data = state_score_avg, values = "avg_score", color = "white") +
  scale_fill_continuous(
    low = "lightgreen", high = "darkgreen", name = "Avg Score", label = scales::comma
  ) +
  labs(
    title = "Average Care Score by State"
  ) +
  theme(legend.position = "right")

Now I want to see the average waiting time in the emergency room by state.

# 1. Filter Emergency Department measures
care_emergency <- care %>%
  filter(
    condition == "Emergency Department",
    grepl("Average time patients spent", measure_name, ignore.case = TRUE)
  )

# 2. Calculate average score (visit time) for each state
state_avg_emergency <- care_emergency %>%
  group_by(state) %>%
  summarise(avg_score = mean(score, na.rm = TRUE))

# 3. Get the map points
state_map_points <- usmap::us_map(regions = "states")

# 4. Convert to sf manually
state_map_sf <- sf::st_as_sf(state_map_points, wkt = "geom", crs = 4326)

# 5. Extract centroid coordinates
state_centroids <- state_map_sf %>%
  sf::st_centroid() %>%
  sf::st_coordinates() %>%
  as.data.frame()
Warning: st_centroid assumes attributes are constant over geometries
# 6. Build center dataset
state_centers <- tibble(
  state = state_map_sf$abbr,
  x = state_centroids$X,
  y = state_centroids$Y
)

# 7. Merge centers with average scores
state_avg_emergency_labels <- left_join(state_avg_emergency, state_centers, by = "state")

# 8. Convert score into hours
state_avg_emergency_labels <- state_avg_emergency_labels %>%
  mutate(
    avg_hours = round(avg_score / 60, 1)  # divide by 60 and round to 1 decimal
  )

# 9. Plot the map (now label with hours)
plot_usmap(data = state_avg_emergency, values = "avg_score", color = "white") +
  geom_text(
    data = state_avg_emergency_labels, 
    aes(x = x, y = y, label = avg_hours),  # <-- use avg_hours instead of avg_score
    inherit.aes = FALSE, 
    size = 2.8, 
    color = "black"
  ) +
  scale_fill_gradient(
    low = "mistyrose", 
    high = "darkred", 
    name = "Avg ER Visit Time (minutes)", 
    label = scales::comma
  ) +
  labs(
    title = "Average Emergency Room waiting Time by State (in Hours)"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_text()`).

Explore missing values

Before doing further analysis, I want to explore the missing values.

gg_miss_var(care) + labs(title = "Number of missing values for each variable")

Now I want to explore the missing values for each variable by measure.

gg_miss_var(care, facet = measure_id) + 
  labs(title = "Number of missing values for each variable by measure_id")

Data Wrangling

In this step, I will create a new dataset that only contains the variables needed to answer my research question. I will convert the data from long format to wide format, which makes the analysis easier: the columns are variables and the rows are U.S. states. I will also drop unneeded variables and delete rows with missing values.

# Load required packages
library(tidyverse)

# 1. Filter the dataset to only include the variables you want
care_selected <- care %>%
  filter(measure_id %in% c(
    "HCP_COVID_19",   # COVID-19 vaccination
    "IMM_3",          # Influenza vaccination
    "OP_18b",         # ED waiting time
    "OP_23",          # Stroke ED brain scan time
    "SAFE_USE_OF_OPIOIDS",  # Safe opioid prescribing
    "SEP_1"           # Sepsis care quality
  )) %>%
  select(state, measure_id, score)

# 2. Pivot to wide format
care_final <- care_selected %>%
  pivot_wider(
    names_from = measure_id,
    values_from = score
  )

# 3. Check the resulting dataset
  glimpse(care_final)
Rows: 56
Columns: 7
$ state               <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO", "CT", "D…
$ HCP_COVID_19        <dbl> 7.3, 5.9, 2.7, NA, 17.4, 22.4, 11.8, 13.2, 6.6, 38…
$ IMM_3               <dbl> 80, 76, 81, NA, 85, 73, 92, 85, 94, 84, 60, 76, NA…
$ OP_18b              <dbl> 140, 145, 133, NA, 168, 184, 133, 193, 310, 217, 1…
$ OP_23               <dbl> 60, 67, 72, NA, 68, 73, 70, 72, 26, 73, 69, 69, NA…
$ SAFE_USE_OF_OPIOIDS <dbl> 16, 14, 16, NA, 11, 14, 15, 18, 11, 15, 16, 15, NA…
$ SEP_1               <dbl> 58, 61, 64, NA, 55, 67, 73, 57, 49, 48, 68, 57, NA…
# 4. (Optional) View it nicely
head(care_final)
# A tibble: 6 × 7
  state HCP_COVID_19 IMM_3 OP_18b OP_23 SAFE_USE_OF_OPIOIDS SEP_1
  <chr>        <dbl> <dbl>  <dbl> <dbl>               <dbl> <dbl>
1 AK             7.3    80    140    60                  16    58
2 AL             5.9    76    145    67                  14    61
3 AR             2.7    81    133    72                  16    64
4 AS            NA      NA     NA    NA                  NA    NA
5 AZ            17.4    85    168    68                  11    55
6 CA            22.4    73    184    73                  14    67
dim(care_final)
[1] 56  7

I want to drop missing values.

care_final <- care_final %>%
  drop_na()

dim(care_final)
[1] 52  7

For modeling purposes, the data is split into training data (75%) and test data (25%).

seed = 4321

set.seed(seed)  # setting seed
split <- initial_split(care_final, prop = 0.75)

care_train <- training(split)
care_test <- testing(split)

More exploration of the primary outcome

Before going further, I want to explore the score distribution of my primary outcome.

care_final %>%
  ggplot(aes(x = OP_18b)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "white", alpha = 0.7) +
  geom_density(color = "darkblue", linewidth = 1) +
  labs(
    title = "Distribution of Emergency Department Waiting Time (OP_18b)",
    x = "Median ED Waiting Time (Minutes)",
    y = "Density"
  ) +
  theme_minimal()

Note: The score shows a moderate right skew. It may need to be transformed in the model for better results.
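
A quick numeric check of the skew can back up the visual impression; a minimal sketch using only base R (the standard moment-based skewness estimator):

# Moment-based skewness of OP_18b; values noticeably above 0 indicate right skew
op18b <- na.omit(care_final$OP_18b)
mean((op18b - mean(op18b))^3) / mean((op18b - mean(op18b))^2)^1.5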

Research Questions

Timely access to emergency department (ED) care is essential for reducing patient morbidity and mortality. However, many hospitals across the United States continue to face challenges related to ED overcrowding and extended waiting times. Maintaining a healthy healthcare workforce is critical to ensuring efficient emergency care delivery. Vaccination coverage among healthcare personnel—specifically for COVID-19 and influenza—has been prioritized as a public health measure to reduce workforce illness and absenteeism. Higher rates of staff vaccination may improve hospital operations by preserving adequate staffing levels and maintaining patient flow, particularly in high-pressure environments such as emergency departments. Despite its importance, the relationship between healthcare personnel vaccination rates and operational outcomes like ED waiting times has not been extensively studied at a population level. Understanding whether improved vaccine coverage translates into more efficient emergency care could inform future workforce and infection control policies.

This exercise examines whether higher COVID-19 and influenza vaccination rates among healthcare workers are associated with shorter emergency department waiting times across U.S. states. Using standardized national measures, including healthcare worker vaccination rates and median ED visit times, this research evaluates potential associations between workforce health protection and patient access to timely care. In addition, hospital-level quality indicators such as safe opioid prescribing practices and sepsis care bundle compliance are included to account for broader institutional quality, ensuring a more accurate interpretation of the findings. By focusing on these key factors, this study seeks to provide evidence on whether strengthening healthcare personnel vaccination efforts could yield operational benefits beyond infection prevention, ultimately improving patient experience and emergency care system performance.

RQ: Is higher COVID-19 and influenza vaccination score among healthcare personnel associated with shorter emergency department (ED) waiting times across U.S. states?

Hypotheses: Higher healthcare personnel vaccination score (COVID-19 and influenza) is associated with shorter emergency department waiting times across U.S. states.

Modeling

Initial Analysis

In this initial analysis, I will fit a linear model to see whether the primary outcome needs to be transformed and to test for multicollinearity.

model1 <- lm(OP_18b ~ HCP_COVID_19 + IMM_3 + OP_23 + SAFE_USE_OF_OPIOIDS + SEP_1, data = care_final)

Multicollinearity test

vif(model1)
       HCP_COVID_19               IMM_3               OP_23 SAFE_USE_OF_OPIOIDS 
           1.557707            1.239288            1.077637            1.754512 
              SEP_1 
           2.022136 

There is no multicollinearity (all VIF values are well below 5), so it is good to go to the next step.
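
For reference, each VIF equals 1/(1 - R²) from regressing that predictor on the other predictors. A minimal sketch reproducing the value for HCP_COVID_19 by hand (an optional check, not part of the original workflow):

# VIF for HCP_COVID_19 computed manually; should match vif(model1)["HCP_COVID_19"]
aux <- lm(HCP_COVID_19 ~ IMM_3 + OP_23 + SAFE_USE_OF_OPIOIDS + SEP_1, data = care_final)
1 / (1 - summary(aux)$r.squared)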

Now I want to examine the residual plots to see whether the model can be fit on the original scale or whether a transformation is needed.

par(mfrow = c(2, 2))  # Arrange 4 plots in 1 window
plot(model1)

  • The residuals look mostly scattered randomly, but there’s a small curve (slight nonlinearity).
  • The normal Q-Q plot shows that most points lie close to the diagonal line.
  • Overall, the diagnostic plots show residuals that are approximately normally distributed, with no strong evidence of nonlinearity, heteroscedasticity, or influential outliers, supporting the use of linear regression for this dataset (an optional formal check follows below).
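
As an optional formal complement to the visual checks, standard tests could be run on model1; a sketch using the already loaded car package and base R:

# Optional formal diagnostics for model1
car::ncvTest(model1)             # score test for non-constant error variance (heteroscedasticity)
shapiro.test(residuals(model1))  # Shapiro-Wilk test for normality of the residuals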

I want to try log transformation.

model1_log <- lm(log(OP_18b) ~ HCP_COVID_19 + IMM_3 + OP_23 + SAFE_USE_OF_OPIOIDS + SEP_1, data = care_final)
par(mfrow = c(2, 2))  # Arrange 4 plots in 1 window
plot(model1_log)

There is not much change after the log transformation. Therefore, I will use the untransformed outcome in my models.

Further Analysis

In this exercise, I will use three different models: multiple linear regression, LASSO regression, and random forest. The outcome variable is the median emergency department waiting time (OP_18b). The predictors are the percentage of healthcare personnel up to date with COVID-19 vaccination (HCP_COVID_19), the percentage of healthcare workers given influenza vaccination (IMM_3), safe use of opioids - concurrent prescribing (SAFE_USE_OF_OPIOIDS), the percentage of patients who came to the emergency department with stroke symptoms and received brain scan results within 45 minutes of arrival (OP_23), and the percentage of patients who received appropriate care for severe sepsis and septic shock (SEP_1).

Linear Regression Model

First, a multiple linear regression model is fitted.

set.seed(seed)
# Define a linear model with all predictors
care_lm <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")

# work flow 
care_lm_wf <- workflow() %>% 
  add_model(care_lm) %>% 
  add_formula(OP_18b ~ HCP_COVID_19 + IMM_3 + SAFE_USE_OF_OPIOIDS + OP_23 + SEP_1)

# Set up cross-validation (15-fold CV)
care_lm_cv_results <- fit_resamples(
  care_lm_wf,
  resamples = vfold_cv(care_train, v = 15),
  metrics = metric_set(rmse, rsq),
  control = control_resamples(save_pred = TRUE)
)

#Print result 
collect_metrics(care_lm_cv_results)
# A tibble: 2 × 6
  .metric .estimator   mean     n std_err .config             
  <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
1 rmse    standard   24.8      15   3.54  Preprocessor1_Model1
2 rsq     standard    0.582    15   0.104 Preprocessor1_Model1

Now I want to fit the model to the full training data.

# Fit model on the training data
care_lm_train <- care_lm_wf %>% fit(care_train)

# Compute predictions on training data
care_lm_preds <- predict(care_lm_train, care_train) %>% bind_cols(care_train)
# Plot observed vs. predicted
ggplot(care_lm_preds, aes(x = OP_18b, y = .pred)) +
  geom_point(color = "purple", alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") +
  labs(title = "Observed vs. Predicted Values - Linear Regression",
       x = "Observed",
       y = "Predicted") +
  theme_minimal()

The multiple linear regression model reasonably captures the general trend between predictors and ED waiting time across states; however, there is evidence of underestimation, particularly at higher observed values, suggesting some model misspecification or nonlinearity not captured.

lm_coefs <- tidy(care_lm_train)
print(lm_coefs)
# A tibble: 6 × 5
  term                estimate std.error statistic p.value
  <chr>                  <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)          129.       88.5       1.46   0.154 
2 HCP_COVID_19           0.973     0.493     1.97   0.0570
3 IMM_3                  0.376     0.511     0.736  0.467 
4 SAFE_USE_OF_OPIOIDS    1.30      2.42      0.536  0.595 
5 OP_23                  0.256     0.702     0.365  0.717 
6 SEP_1                 -0.946     0.845    -1.12   0.271 

Random Forest

The second model is Random Forest.

set.seed(seed)

# Set up the tuning grid
rf_grid <- grid_regular(
  mtry(range = c(1, 10)),  # mtry between 1 and 10 (only 5 predictors exist, so values above 5 trigger the warnings below)
  min_n(range = c(1, 16)), # min_n between 1 and 16
  levels = 4              # 4 levels for each parameter
)

# Define the model specification using the ranger engine, with fixed trees at 300
rf_cv <- rand_forest(
  mode = "regression",
  mtry = tune(),   
  min_n = tune(),  
  trees = 300      
) %>%
  set_engine("ranger", seed = seed, importance = "impurity") 

# Create a workflow
rf_wf_cv <- workflow() %>%
  add_model(rf_cv) %>%
  add_formula(OP_18b ~ HCP_COVID_19 + IMM_3 + SAFE_USE_OF_OPIOIDS + OP_23 + SEP_1)

set.seed(seed)
# Perform tuning with tune_grid()
rf_tune_results_cv <- tune_grid(
  rf_wf_cv,
  resamples = vfold_cv(care_train, v = 15, repeats = 15),
  grid = rf_grid,
  metrics = metric_set(rmse),
  control = control_grid(save_pred = TRUE) 
)
Warning: package 'ranger' was built under R version 4.4.3
→ A | warning: ! 7 columns were requested but there were 5 predictors in the data.
               ℹ 5 predictors will be used.
There were issues with some computations   A: x1
→ B | warning: ! 10 columns were requested but there were 5 predictors in the data.
               ℹ 5 predictors will be used.
There were issues with some computations   A: x900   B: x900
#show_notes(rf_tune_results_cv)
autoplot(rf_tune_results_cv)

best_rf_params <- rf_tune_results_cv %>% select_best(metric = "rmse")

rf_wf_final <- finalize_workflow(rf_wf_cv, best_rf_params)


# Fit the final random forest model on the training dataset
rf_model_final <- fit(rf_wf_final, data = care_train)

# Generate predictions on the training set and combine them with the actual outcomes
rf_preds <- predict(rf_model_final, care_train) %>%
  bind_cols(care_train)

# Create the observed vs predicted plot for the random forest model
ggplot(rf_preds, aes(x = OP_18b, y = .pred)) +
  geom_point(color = "red", alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") +
  labs(title = "Observed vs Predicted: Random Forest Model",
       x = "Observed Quality Score",
       y = "Predicted Quality Score") +
  theme_minimal()

# Summarize the tuning results by RMSE
rf_tune_results_cv %>% 
  collect_metrics() %>% 
  filter(.metric == "rmse") %>% 
  arrange(mean) %>% 
  print(n = Inf)
# A tibble: 16 × 8
    mtry min_n .metric .estimator  mean     n std_err .config              
   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1     1    16 rmse    standard    25.5   225   0.811 Preprocessor1_Model13
 2     1    11 rmse    standard    25.7   225   0.809 Preprocessor1_Model09
 3     4    16 rmse    standard    25.8   225   0.822 Preprocessor1_Model14
 4     1     6 rmse    standard    25.9   225   0.807 Preprocessor1_Model05
 5     7    16 rmse    standard    26.0   225   0.823 Preprocessor1_Model15
 6    10    16 rmse    standard    26.0   225   0.823 Preprocessor1_Model16
 7     4    11 rmse    standard    26.0   225   0.815 Preprocessor1_Model10
 8     1     1 rmse    standard    26.2   225   0.802 Preprocessor1_Model01
 9     7    11 rmse    standard    26.2   225   0.814 Preprocessor1_Model11
10    10    11 rmse    standard    26.2   225   0.814 Preprocessor1_Model12
11     4     6 rmse    standard    26.3   225   0.803 Preprocessor1_Model06
12     7     6 rmse    standard    26.5   225   0.799 Preprocessor1_Model07
13    10     6 rmse    standard    26.5   225   0.799 Preprocessor1_Model08
14     4     1 rmse    standard    26.6   225   0.798 Preprocessor1_Model02
15     7     1 rmse    standard    26.8   225   0.796 Preprocessor1_Model03
16    10     1 rmse    standard    26.8   225   0.796 Preprocessor1_Model04
rf_best_result <- rf_tune_results_cv %>% select_best(metric = "rmse")
rf_best_result
# A tibble: 1 × 3
   mtry min_n .config              
  <int> <int> <chr>                
1     1    16 Preprocessor1_Model13
rf_tune_results_cv %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean) %>%
  slice(1)
# A tibble: 1 × 8
   mtry min_n .metric .estimator  mean     n std_err .config              
  <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     1    16 rmse    standard    25.5   225   0.811 Preprocessor1_Model13
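
Because the ranger engine was set up with importance = "impurity", variable importance could also be inspected from the finalized fit. A minimal sketch using rf_model_final from above (output not shown):

# Pull impurity-based variable importance from the underlying ranger fit
rf_model_final %>%
  workflows::extract_fit_engine() %>%  # the raw ranger object inside the workflow
  ranger::importance() %>%
  sort(decreasing = TRUE)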

LASSO Regression

set.seed(seed)
lasso_grid <- grid_regular(
  penalty(range = c(0.0001, 1)),  # note: penalty() interprets this range on the log10 scale (roughly 1 to 10), not as raw penalties
  levels = 30                # 30 grid points (fine search)
)
#setting up Lasso model specification
lasso_spec <- linear_reg(
  penalty = 0.1,  # fixed regularization strength (not tuned here; see the note below)
  mixture = 1     # 1 = Lasso
) %>% 
  set_engine("glmnet")


#fitting Lasso model
lasso_fit <- lasso_spec %>% 
  fit(OP_18b ~ HCP_COVID_19 + IMM_3 + SAFE_USE_OF_OPIOIDS + OP_23 + SEP_1, 
      data = care_train)

# Example lasso workflow
lasso_wf <- workflow() %>%
  add_model(lasso_spec) %>%   
  add_formula(OP_18b ~ HCP_COVID_19 + IMM_3 + SAFE_USE_OF_OPIOIDS + OP_23 + SEP_1 )   


# Cross-validation
lasso_cv_results <- tune_grid(
  lasso_wf,
  resamples = vfold_cv(care_train, v = 15),
  grid = lasso_grid,                 # if you tuned lambda
  metrics = metric_set(rmse),
  control = control_grid(save_pred = TRUE)
)
Warning: No tuning parameters have been detected, performance will be evaluated
using the resamples with no tuning. Did you want to [tune()] parameters?
Warning: package 'glmnet' was built under R version 4.4.3
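
The warning above appears because penalty is fixed at 0.1 in lasso_spec, so lasso_grid is never actually used and the resamples are evaluated without tuning. If tuning were desired, a sketch along these lines (not run here) would make the grid take effect:

# Hypothetical tuned version: mark penalty as tunable so lasso_grid is used
# lasso_spec_tune <- linear_reg(penalty = tune(), mixture = 1) %>%
#   set_engine("glmnet")
# lasso_wf_tune <- workflow() %>%
#   add_model(lasso_spec_tune) %>%
#   add_formula(OP_18b ~ HCP_COVID_19 + IMM_3 + SAFE_USE_OF_OPIOIDS + OP_23 + SEP_1)
# lasso_tune_results <- tune_grid(
#   lasso_wf_tune,
#   resamples = vfold_cv(care_train, v = 15),
#   grid = lasso_grid,
#   metrics = metric_set(rmse)
# )
# select_best(lasso_tune_results, metric = "rmse")
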
# Get the best RMSE
lasso_cv_results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean) %>%
  slice(1)
# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard    24.8    15    3.51 Preprocessor1_Model1
# coefficients (non-zero ones only)
tidy(lasso_fit) %>% 
  filter(estimate != 0) %>% 
  arrange(desc(abs(estimate)))
# A tibble: 6 × 3
  term                estimate penalty
  <chr>                  <dbl>   <dbl>
1 (Intercept)          131.        0.1
2 SAFE_USE_OF_OPIOIDS    1.25      0.1
3 HCP_COVID_19           0.964     0.1
4 SEP_1                 -0.930     0.1
5 IMM_3                  0.369     0.1
6 OP_23                  0.238     0.1
#making predictions
train_results <- care_train%>% 
  bind_cols(predict(lasso_fit, new_data = care_train))

#evaluating performance
lasso_metrics <- train_results %>% 
  metrics(truth = OP_18b , estimate = .pred)
#Print 
print(lasso_metrics)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      23.8  
2 rsq     standard       0.190
3 mae     standard      18.4  
#Observed vs Predicted plot
ggplot(train_results, aes(x = OP_18b, y = .pred)) +
  geom_point(alpha = 0.5, color = "pink") +
  geom_abline(slope = 1, linetype = "dashed") +
  labs(title = "Lasso Regression Performance",
       subtitle = paste("Penalty =", lasso_fit$spec$args$penalty),
       x = "Observed Scores", 
       y = "Predicted Scores") +
  theme_minimal()

# Linear Regression RMSE
lm_rmse <- care_lm_cv_results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  summarize(mean_rmse = mean(mean)) %>%
  pull(mean_rmse)

# Lasso RMSE
lasso_rmse <- lasso_cv_results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  summarize(mean_rmse = mean(mean)) %>%
  pull(mean_rmse)

# Random Forest RMSE
rf_rmse <- rf_tune_results_cv %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean) %>%
  slice(1) %>%
  pull(mean)
# Create a comparison table
rmse_results <- tibble(
  Model = c("Linear Regression", "Lasso", "Random Forest"),
  RMSE = c(lm_rmse, lasso_rmse, rf_rmse)
)

# Print the table
print(rmse_results)
# A tibble: 3 × 2
  Model              RMSE
  <chr>             <dbl>
1 Linear Regression  24.8
2 Lasso              24.8
3 Random Forest      25.5
ggplot(rmse_results, aes(x = reorder(Model, RMSE), y = RMSE)) +
  geom_segment(aes(xend = Model, y = 0, yend = RMSE), color = "grey") +
  geom_point(size = 4, color = "blue") +
  geom_text(aes(label = round(RMSE, 2)), vjust = -1, size = 3.5) +
  labs(
    title = " RMSE by Model",
    x = "Model",
    y = "RMSE"
  ) +
  theme_minimal()+
  coord_flip()

Based on cross-validated RMSE, linear regression gives the best fit among the three models.

Final Model

Based on RMSE, linear regression outperformed the other models. This model can be used to look at the association between healthcare personnel vaccination and waiting time in the emergency department across the United States.

As a final step, I will apply the fitted linear model to the test data.

# Make predictions on test data
lm_test_preds <- predict(care_lm_train , new_data = care_test) %>%
  bind_cols(care_test)

# Calculate RMSE
lm_test_rmse <- rmse(lm_test_preds, truth = OP_18b, estimate = .pred)

# Print RMSE
cat("Test RMSE:", lm_test_rmse$.estimate, "\n")
Test RMSE: 62.10533 
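
For a fuller picture of test-set performance, the same yardstick metrics() call used on the training data could be applied here as well; a small sketch assuming lm_test_preds from above:

# RMSE, R-squared, and MAE on the held-out test set
lm_test_preds %>%
  metrics(truth = OP_18b, estimate = .pred)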

I want to show the predicted vs. observed values for both the training and test data.

care_lm_preds <- care_lm_preds %>%
  mutate(dataset = "Train")

lm_test_preds <- lm_test_preds %>%
  mutate(dataset = "Test")
# combined predicted values 

combined_preds <- bind_rows(care_lm_preds, lm_test_preds)

# Plot observed vs predicted for train and test
ggplot(combined_preds, aes(x = OP_18b, y = .pred, color = dataset)) +
  geom_point(size = 2) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "black") +
  labs(
    title = "Observed vs Predicted Waiting Time (Train and Test)",
    x = "Observed Waiting Time (OP_18b)",
    y = "Predicted Waiting Time",
    color = "Dataset"
  ) +
  theme_minimal()

Discussion

Three models were fitted in this project: linear regression, Lasso, and random forest. Among them, linear regression achieved the lowest cross-validated RMSE (24.77), closely followed by Lasso (24.79), while random forest had a higher RMSE (25.46). This indicates that linear regression performed best, yielding the most accurate predictions on this dataset, although the difference between linear regression and Lasso was very small.

The scatterplot comparing observed and predicted emergency department (ED) waiting times of the final model (linear model) illustrates the performance of the linear regression model on both training and testing datasets. In general, the predicted values follow the trend of the observed values, although considerable variability remains. Most predictions for mid-range observed waiting times (approximately 120 to 180 minutes) are relatively close to the ideal 45-degree line, indicating reasonable model calibration in this range. However, for states with higher observed waiting times (above 200 minutes), the model consistently underestimates the true waiting times, suggesting limited ability to capture extreme values. Notably, the pattern of errors between the training and testing sets is similar, indicating that the model is not overfitting the training data. Nonetheless, the dispersion around the ideal line and underprediction at high observed values point to opportunities for model improvement, such as using more flexible modeling approaches like Random Forest or incorporating additional predictors to better account for variation in ED waiting times.