# Install package for the data (not needed here)
# install.packages("nlmixr2data")
# library(nlmixr2data)

# Load the dataset
data_loc <- here("fitting-exercise", "data", "Mavoglurant_A2121_nmpk.csv")
data <- read_csv(data_loc)
Rows: 2678 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): ID, CMT, EVID, EVI2, MDV, DV, LNDV, AMT, TIME, DOSE, OCC, RATE, AG...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
SEX: Sex (1 = male, 2 = female) (based on the referenced paper)
WT: Weight (kg)
HT: Height (m)
summary(data)
ID CMT EVID EVI2
Min. :793.0 Min. :1.000 Min. :0.00000 Min. :0.0000
1st Qu.:832.0 1st Qu.:2.000 1st Qu.:0.00000 1st Qu.:0.0000
Median :860.0 Median :2.000 Median :0.00000 Median :0.0000
Mean :858.8 Mean :1.926 Mean :0.07394 Mean :0.1613
3rd Qu.:888.0 3rd Qu.:2.000 3rd Qu.:0.00000 3rd Qu.:0.0000
Max. :915.0 Max. :2.000 Max. :1.00000 Max. :4.0000
MDV DV LNDV AMT
Min. :0.00000 Min. : 0.00 Min. :0.000 Min. : 0.000
1st Qu.:0.00000 1st Qu.: 23.52 1st Qu.:3.158 1st Qu.: 0.000
Median :0.00000 Median : 74.20 Median :4.306 Median : 0.000
Mean :0.09373 Mean : 179.93 Mean :4.085 Mean : 2.763
3rd Qu.:0.00000 3rd Qu.: 283.00 3rd Qu.:5.645 3rd Qu.: 0.000
Max. :1.00000 Max. :1730.00 Max. :7.456 Max. :50.000
TIME DOSE OCC RATE
Min. : 0.000 Min. :25.00 Min. :1.000 Min. : 0.00
1st Qu.: 0.583 1st Qu.:25.00 1st Qu.:1.000 1st Qu.: 0.00
Median : 2.250 Median :37.50 Median :1.000 Median : 0.00
Mean : 5.851 Mean :37.37 Mean :1.378 Mean : 16.55
3rd Qu.: 6.363 3rd Qu.:50.00 3rd Qu.:2.000 3rd Qu.: 0.00
Max. :48.217 Max. :50.00 Max. :2.000 Max. :300.00
AGE SEX RACE WT
Min. :18.0 Min. :1.000 Min. : 1.000 Min. : 56.60
1st Qu.:26.0 1st Qu.:1.000 1st Qu.: 1.000 1st Qu.: 73.30
Median :31.0 Median :1.000 Median : 1.000 Median : 82.60
Mean :32.9 Mean :1.128 Mean : 7.415 Mean : 83.16
3rd Qu.:40.0 3rd Qu.:1.000 3rd Qu.: 2.000 3rd Qu.: 90.60
Max. :50.0 Max. :2.000 Max. :88.000 Max. :115.30
HT
Min. :1.520
1st Qu.:1.710
Median :1.780
Mean :1.762
3rd Qu.:1.820
Max. :1.930
First, it is important to visualize the main variable of interest, which here is Mavoglurant (the response variable, DV). A spaghetti plot shows the individual Mavoglurant levels over time, faceted by dose (25, 37.5, and 50 mg).
# Spaghetti plot
spaghetti_pot <- ggplot(data, aes(x = TIME, y = DV, group = ID, color = as.factor(ID))) +
  geom_line(alpha = 0.6) +               # Adds individual lines with transparency
  facet_wrap(~DOSE) +                    # Facet by dose
  theme_minimal() +                      # Uses a clean theme
  labs(title = "Individual level of Mavoglurant over Time by Dose",
       x = "Time",
       y = "DV",
       color = "Subject ID") +
  theme(legend.position = "none")        # Hides legend since there are many IDs
print(spaghetti_pot)
Occasion Rate
A bar chart of Occasion Rate (OCC) is created to examine its distribution; OCC is one of the variables of interest in this exercise.
plot_occ <- ggplot(data, aes(x = factor(OCC))) +
  geom_bar(aes(fill = factor(OCC)), show.legend = FALSE, width = 0.7) +  # Color bars by OCC
  scale_fill_brewer(palette = "Dark2") +                                 # Use a more vibrant color palette
  labs(title = "Distribution of Occasion Rate (OCC)",
       x = "Occasion Rate (OCC)",
       y = "Count") +
  theme_minimal(base_size = 14) +                                        # Increase text size for readability
  theme(plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),    # Rotate x labels for clarity
        panel.grid.major = element_blank(),                              # Remove major grid lines for a cleaner look
        panel.grid.minor = element_blank())
print(plot_occ)
Data Cleaning
As a first data-cleaning step, we keep only records with OCC = 1 and then compute a per-subject summary of Mavoglurant (Y, the sum of DV over time).
# Keep OCC = 1 only
data1 <- data %>% filter(OCC == 1)   # select rows with OCC = 1 only
Summary_DV <- data1 %>%
  filter(TIME != 0) %>%                    # Remove rows where TIME is 0 (dosing records)
  group_by(ID) %>%
  summarize(Y = sum(DV, na.rm = TRUE))     # Total DV per subject
print(Summary_DV)
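The step that creates `data_combined` (used in the next chunk) is not visible in this rendering. A minimal sketch of the assumed step, joining the per-subject totals with the dosing records (TIME == 0) so that dose and the covariates are carried along:

# Assumed reconstruction: the dosing rows (TIME == 0) carry DOSE and the covariates;
# join them to the per-subject totals computed above
dosing_rows <- data1 %>% filter(TIME == 0)
data_combined <- left_join(Summary_DV, dosing_rows, by = "ID")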
df <- data_combined %>%
  select(Y, DOSE, AGE, SEX, RACE, WT, HT) %>%   # Select the variables of interest
  mutate(RACE = as_factor(RACE),
         SEX  = as_factor(SEX))                 # Convert RACE and SEX to factor variables
str(df)                                         # Check the variable classes
In this section, the data are visualized using tables, charts, or graphics appropriate to each variable type. A good starting point is a big-picture summary table (characteristics table), which makes the data easier to understand.
Create summary table
tbl_summary(df)
Characteristic     N = 120¹
Y                  2,349 (1,689, 3,054)
DOSE
  25               59 (49%)
  37.5             12 (10%)
  50               49 (41%)
AGE                31 (26, 41)
SEX
  1                104 (87%)
  2                16 (13%)
RACE
  1                74 (62%)
  2                36 (30%)
  7                2 (1.7%)
  88               8 (6.7%)
WT                 82 (73, 90)
HT                 1.77 (1.70, 1.82)
¹ Median (Q1, Q3); n (%)
Based on the table above, 120 participants received a single 25 mg, 37.5 mg, or 50 mg dose of MVG (49%, 10%, and 41%, respectively). Most subjects were young (mean age 33 years), of race group 1 (62%), and male (87%), with a mean body weight of 83 kg and a mean height of 1.76 m.
Summary Table for all variables based on sex
# Summary table for all variables by SEX
df %>%
  tbl_summary(by = SEX,
              type = list(where(is.numeric) ~ "continuous"),                  # Treat all numeric variables as continuous
              statistic = list(all_continuous() ~ "{median} ({p25}, {p75})"), # Summarize continuous variables as median (IQR)
              digits = list(all_continuous() ~ 0, HT ~ 2),                    # Round continuous variables to 0 decimals, HT to 2
              label = list(Y ~ "Response", DOSE ~ "Drug dose", AGE ~ "Age",
                           RACE ~ "Race", WT ~ "Weight", HT ~ "Height")) %>%
  add_p(test = list(all_continuous() ~ "wilcox.test",
                    all_categorical() ~ "fisher.test"),                       # Test differences between groups (SEX)
        pvalue_fun = function(x) style_number(x, digits = 3)) %>%
  modify_header(p.value = "*p*-value") %>%
  modify_spanning_header(all_stat_cols() ~ "**Sex**") %>%
  as_gt()
The following errors were returned during `as_gt()`:
✖ For variable `AGE` (`SEX`) and "p.value" statistic: The package "cardx" (>=
0.2.3) is required.
✖ For variable `DOSE` (`SEX`) and "p.value" statistic: The package "cardx" (>=
0.2.3) is required.
✖ For variable `HT` (`SEX`) and "p.value" statistic: The package "cardx" (>=
0.2.3) is required.
✖ For variable `RACE` (`SEX`) and "p.value" statistic: The package "cardx" (>=
0.2.3) is required.
✖ For variable `WT` (`SEX`) and "p.value" statistic: The package "cardx" (>=
0.2.3) is required.
✖ For variable `Y` (`SEX`) and "p.value" statistic: The package "cardx" (>=
0.2.3) is required.
                              Sex
Characteristic     1, N = 104¹            2, N = 16¹             p-value
Response           2,398 (1,722, 3,083)   2,060 (1,478, 2,728)
Drug dose          38 (25, 50)            25 (25, 44)
Age                30 (25, 39)            42 (38, 46)
Race
  1                63 (61%)               11 (69%)
  2                33 (32%)               3 (19%)
  7                1 (1.0%)               1 (6.3%)
  88               7 (6.7%)               1 (6.3%)
Weight             83 (75, 92)            70 (63, 82)
Height             1.78 (1.73, 1.82)      1.63 (1.58, 1.67)
¹ Median (Q1, Q3); n (%)
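The p-value column is empty because the `cardx` package (>= 0.2.3) was not available when the table was rendered, as the error messages above indicate. Installing it and re-rendering should populate the column:

# Install the package gtsummary needs for the add_p() tests
install.packages("cardx")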
For a better understanding of the data, the variables are also visualized with histograms and boxplots.
Histograms of the weight and height distributions
# Create a histogram for WT (Weight)
plot1 <- ggplot(df, aes(x = WT)) +
  geom_histogram(binwidth = 5, fill = "#EEAEEE", color = "black", alpha = 0.7) +  # Pink fill, black border, transparency
  labs(title = "Histogram of Weight (WT)",
       x = "Weight (kg)",
       y = "Count") +
  theme_minimal(base_size = 14) +                                                 # Improves readability
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12)) +
  geom_density(aes(y = ..count.. * 5), color = "red", linetype = "dashed", size = 1)  # Add smooth density curve (scaled to counts)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
# Create a histogram for HT (Height)
plot2 <- ggplot(df, aes(x = HT)) +
  geom_histogram(position = "identity", fill = "#EEB4B4", color = "black", alpha = 0.7) +  # Pink fill, black border, transparency
  labs(title = "Histogram of Height (HT)",
       x = "Height (m)",
       y = "Count") +
  theme_minimal()

# Combine the histograms
plot1 + plot2
Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(count)` instead.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Based on the histograms, weight is slightly skewed to the right, while height is skewed to the left, indicating that neither weight nor height is normally distributed.
Boxplot of Y variable based on Categorical Variables
Boxplots were created to better visualize the response variable across sex, race, and dose.
# Boxplot of Y by SEX
boxplot_sex <- ggplot(df, aes(x = factor(SEX), y = Y, fill = factor(SEX))) +
  geom_boxplot(alpha = 0.7, color = "black") +           # Add transparency and black borders
  labs(title = "Boxplot of Y by SEX",
       x = "Sex",
       y = "Y") +
  theme_minimal(base_size = 14) +                        # Improve readability
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        legend.position = "none",                        # Remove legend since SEX is on the x-axis
        panel.border = element_rect(color = "black", fill = NA, size = 1)) +  # Add frame
  scale_fill_brewer(palette = "Pastel1")                 # Use a nice color palette
Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.
# Boxplot of Y by DOSE
boxplot_dose <- ggplot(df, aes(x = factor(DOSE), y = Y, fill = factor(DOSE))) +
  geom_boxplot(alpha = 0.7, color = "black") +           # Add transparency and black borders
  labs(title = "Boxplot of Y by DOSE",
       x = "Dose",
       y = "Y") +
  theme_minimal(base_size = 14) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        legend.position = "none",                        # Remove legend since DOSE is on the x-axis
        panel.border = element_rect(color = "black", fill = NA, size = 1)) +  # Add frame
  scale_fill_brewer(palette = "Set2")                    # Use a distinct color palette

# Boxplot of Y by RACE
boxplot_race <- ggplot(df, aes(x = factor(RACE), y = Y, fill = factor(RACE))) +
  geom_boxplot(alpha = 0.7, color = "black") +           # Transparent boxes with black borders
  labs(title = "Boxplot of Y by RACE",
       x = "Race",
       y = "Y") +
  theme_minimal(base_size = 14) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        legend.position = "none",                        # Remove legend since RACE is labeled on the x-axis
        panel.border = element_rect(color = "black", fill = NA, size = 1)) +  # Add frame
  scale_fill_brewer(palette = "Set3")                    # Apply a nice color palette

# Combine the boxplots
(boxplot_sex + boxplot_dose) / boxplot_race
Based on the boxplots above, the median response (Y) is higher for sex 1 than for sex 2. For dose, the higher the dose, the higher the median of Y. In addition, race groups 1 and 2 have higher medians of Y than the other two race groups.
Scatter plots of Y against the continuous variables
Scatter plots are a better way to show the relationship between the continuous predictors and the response. A LOESS smoother is used to draw the regression line so that any nonlinear relationship between Y and the predictors is easier to see.
# Scatterplot of Y by AGE
plot_age <- ggplot(df, aes(x = AGE, y = Y)) +
  geom_point(alpha = 0.7, size = 2, fill = "#EE00EE", color = "#EE00EE", stroke = 1, shape = 18) +  # Semi-transparent points for better visibility
  geom_smooth(method = "loess", se = TRUE, color = "black", linetype = "dashed", size = 1) +        # LOESS regression line
  labs(title = "Y vs Age", x = "Age", y = "Y") +
  theme_bw() +
  theme(axis.title = element_text(size = 10, color = "black", face = "bold"),
        axis.text = element_text(size = 8, color = "black"),
        plot.title = element_text(size = 12, color = "black", face = "bold", hjust = 0.5))

# Scatterplot of Y by WT (Weight)
plot_wt <- ggplot(df, aes(x = WT, y = Y)) +
  geom_point(alpha = 0.7, size = 2, fill = "#9ACD32", color = "#9ACD32", stroke = 1, shape = 18) +
  geom_smooth(method = "loess", se = TRUE, color = "black", linetype = "dashed", size = 1) +
  labs(title = "Y vs WT", x = "Weight", y = "Y") +
  theme_bw() +
  theme(axis.title = element_text(size = 10, color = "black", face = "bold"),
        axis.text = element_text(size = 8, color = "black"),
        plot.title = element_text(size = 12, color = "black", face = "bold", hjust = 0.5))

# Scatterplot of Y by HT (Height)
plot_ht <- ggplot(df, aes(x = HT, y = Y)) +
  geom_point(alpha = 0.7, size = 2, fill = "#EE4000", color = "#EE4000", stroke = 1, shape = 18) +
  geom_smooth(method = "loess", se = TRUE, color = "black", linetype = "dashed", size = 1) +
  labs(title = "Y vs HT", x = "Height", y = "Y") +
  theme_bw() +
  theme(axis.title = element_text(size = 10, color = "black", face = "bold"),
        axis.text = element_text(size = 8, color = "black"),
        plot.title = element_text(size = 12, color = "black", face = "bold", hjust = 0.5))

# Combine and output the three scatterplots
ggarrange(plot_age, plot_wt, plot_ht, ncol = 3, nrow = 1, align = "h", heights = c(1, 1, 1))
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
Based on the scatter plots above, there is no clear linear relationship between Y and the predictors. For age, the trend appears roughly flat with fluctuations, indicating a weak or absent association. The confidence band (shaded area) is wide, especially at the edges, suggesting greater uncertainty in predictions at extreme ages, and the large spread of points means the variability of Y is high across ages.
For Weight, the confidence interval is relatively narrow in the middle but widens at lower and higher WT values, indicating more uncertainty at extreme weights. There is a cluster of data points around a moderate WT range, with more variability at lower and higher weights.
For height, the LOESS curve exhibits a U-shaped or fluctuating trend, suggesting a non-linear relationship between Height and Y. Initially, Y decreases with increasing Height, but at certain points, it fluctuates and slightly increases. The confidence interval is wider at extreme heights, suggesting greater uncertainty in predictions. The spread of data is relatively uniform, but there are some extreme Y values.
# Scatterplot Y vs RACE
plot_race <- ggplot(df, aes(x = factor(RACE), y = Y, color = factor(RACE))) +
  geom_jitter(alpha = 0.7, size = 2, width = 0.2) +   # Jitter to avoid overlapping points
  geom_boxplot(outlier.shape = NA, alpha = 0.3) +     # Boxplot for the distribution
  labs(title = "Scatterplot of Y by Race", x = "Race", y = "Outcome Variable (Y)") +
  theme_minimal(base_size = 14) +
  scale_color_brewer(palette = "Set1") +
  theme(panel.border = element_rect(color = "black", fill = NA, size = 1))  # Add frame border

# Scatterplot Y vs SEX
plot_sex <- ggplot(df, aes(x = factor(SEX), y = Y, color = factor(SEX))) +
  geom_jitter(alpha = 0.7, size = 2, width = 0.2) +   # Jitter to separate overlapping points
  geom_boxplot(outlier.shape = NA, alpha = 0.3) +
  labs(title = "Scatterplot of Y by Sex", x = "Sex", y = "Outcome Variable (Y)") +
  theme_minimal(base_size = 14) +
  scale_color_brewer(palette = "Dark2") +             # Different color scheme
  theme(panel.border = element_rect(color = "black", fill = NA, size = 1))

# Scatterplot Y vs DOSE
plot_dose <- ggplot(df, aes(x = factor(DOSE), y = Y, color = factor(DOSE))) +
  geom_jitter(alpha = 0.7, size = 2, width = 0.2) +
  geom_boxplot(outlier.shape = NA, alpha = 0.3) +
  labs(title = "Scatterplot of Y by Dose", x = "Dose", y = "Outcome Variable (Y)") +
  theme_minimal(base_size = 14) +
  scale_color_brewer(palette = "Set3") +
  theme(panel.border = element_rect(color = "black", fill = NA, size = 1))

# Combine and output the three plots
ggarrange(ggarrange(plot_race, plot_sex, ncol = 2, nrow = 1),  # Row 1: two plots side by side
          plot_dose,                                           # Row 2: full-width plot
          ncol = 1, nrow = 2,                                  # 2 rows total
          heights = c(2, 2))                                   # Equal height for both rows
Based on the plots above, race group 1 has a wider spread of Y values with more extreme values (outliers). The spread of Y for sex 1 is slightly larger than for sex 2, and sex 1 also shows a somewhat higher median Y; the interquartile ranges suggest that the distribution of Y differs between the sexes. The median Y increases with dose, and the spread of Y also increases with dose: the 50 mg group shows the widest variability, with more extreme values.
# Select only continuous variables (excluding the factor variables SEX and RACE;
# DOSE has not been converted to a factor yet, so it is still numeric here)
df_cont <- df %>% select(where(is.numeric))

# Compute the correlation matrix (complete cases only)
cor_matrix <- cor(df_cont, use = "complete.obs")

# Visualize the correlation matrix with correlation values
corrplot(cor_matrix,
         method = "color",        # Color-coded visualization
         type = "lower",          # Show only the lower triangle to reduce redundancy
         tl.col = "black",        # Black text labels for variable names
         tl.srt = 45,             # Rotate labels for better readability
         addCoef.col = "black",   # Show correlation values in black
         number.cex = 0.8)        # Adjust text size for the correlation numbers
Based on the correlation matrix, DOSE is the strongest predictor of Y, showing a positive correlation (0.72). AGE does not seem to have a meaningful relationship with any variable. Weight (WT) and Height (HT) are moderately correlated (0.60), which makes sense biologically. There are weak negative correlations of Y with WT and HT, but their impact is likely small.
Model Fitting
Model 1: Y ~ DOSE. For model fitting, I start with a simple model (Y ~ DOSE). tidymodels is used for the modeling.
df$DOSE <- as.factor(df$DOSE)

# Linear regression: Y ~ DOSE
m1 <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Y ~ DOSE, df)

# Output the fitting result
tidy(m1)
Intercept (Baseline: DOSE 25): the estimated mean Y when using DOSE 25 is 1782.67
DOSE 37.5 (681.24): increasing DOSE from 25 to 37.5 leads to an increase in Y by 681.24 on average with p-value = 0.0018 (< 0.05) → This effect is statistically significant.
DOSE 50 (1456.20): increasing DOSE from 25 to 50 leads to an increase in Y by 1456.20 on average. p-value < 0.0001 → Strong evidence that this effect is statistically significant.
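Model 2 (Y against all predictors) is interpreted next, but its fitting chunk is not visible in this rendering. A minimal sketch, assuming the same tidymodels pattern as Model 1 and a formula including all predictors:

# Sketch (formula assumed): linear regression of Y on all predictors
m2 <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Y ~ DOSE + AGE + SEX + RACE + WT + HT, df)

# Output the fitting result
tidy(m2)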
Interpretation:
- The intercept (4890.92) represents the expected value of Y when all predictors are at their reference or zero values.
- Dose 37.5, Dose 50, and Weight are statistically significantly associated with Y: the dose terms have a positive relationship with Y, while Weight also shows a significant effect. The other predictors are not statistically significantly associated with Y.
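The chunk producing the performance metrics below is also not shown. A sketch, assuming yardstick's default regression metrics (RMSE, R-squared, MAE) computed on predictions from the hypothetical `m2` fit above:

# Sketch: predictions from the full model, then RMSE / R-squared / MAE
m2_preds <- predict(m2, df) %>% bind_cols(df %>% select(Y))
metrics(m2_preds, truth = Y, estimate = .pred)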
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 590.
2 rsq standard 0.620
3 mae standard 445.
Based on the RMSE and R-squared, Model 2 (all predictors) fits the data better than Model 1 (DOSE only): RMSE and R-squared are 666.31 and 0.51 for Model 1, versus 590.31 and 0.62 for Model 2.
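The next paragraph interprets Model 3, a logistic regression of SEX on DOSE, whose code chunk is also missing from this rendering. A minimal sketch, assuming the same pattern as Model 4 below:

# Sketch: logistic regression SEX ~ DOSE
m3 <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(SEX ~ DOSE, df)

# Output the fitting result
tidy(m3)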
The intercept is significant, reflecting the imbalanced baseline distribution of SEX (most subjects are male). Neither DOSE 37.5 nor DOSE 50 significantly affects SEX, since their p-values are > 0.05. This suggests that DOSE does not strongly predict SEX.
Model 4: Logistic Regression (Sex ~ All Predictors)
# SEX ~ all predictors
m4 <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(SEX ~ DOSE + AGE + RACE + WT + HT, df)

# Output the fitting result
tidy(m4)
Based on the output, only Height is significantly associated with Sex. The dose levels have small, negative, but non-significant effects, and Age, Race, and Weight are also not significantly associated with Sex.
# Create a data frame holding the metrics for both models
metrics_df <- data.frame(
  metric = rep(c("accuracy", "kap", "mn_log_loss", "roc_auc"), 2),   # Repeating metrics
  estimate = c(0.8667, 0.0000, 0.3843, 0.5919,                       # First model
               0.9500, 0.7716, 0.1334, 0.9784),                      # Second model
  model = rep(c("Model 1", "Model 2"), each = 4)                     # Model labels
)

# Create a grouped bar plot
ggplot(metrics_df, aes(x = metric, y = estimate, fill = model)) +
  geom_bar(stat = "identity", position = "dodge", color = "black", alpha = 0.8) +  # Grouped bars
  labs(title = "Comparison of Model Performance Metrics",
       x = "Metric",
       y = "Score",
       fill = "Model") +
  theme_minimal(base_size = 14) +                                                  # Clean theme
  scale_fill_manual(values = c("Model 1" = "lightpink", "Model 2" = "lightblue")) + # Custom colors
  theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))
Interpretation:
- Model 2 (accuracy 0.95) is more accurate in making correct predictions than Model 1 (accuracy 0.87).
- Kappa measures how well the model's predictions agree with the true labels, adjusting for chance. Model 2 shows a much stronger agreement with the actual data, while Model 1 shows almost no agreement beyond chance.
- Lower log loss indicates better probabilistic predictions. Model 2 has a significantly lower log loss, meaning its probability estimates are more reliable.
- A higher ROC AUC means the model is better at distinguishing between classes. Model 2 is much better than Model 1 at identifying positive vs. negative classes.
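The metric values above were entered by hand. For reference, a sketch of how they could be computed directly with yardstick from the fitted `m4` (and analogously for `m3`); the object and column names here are assumptions:

# Sketch: accuracy, kappa, log loss, and ROC AUC for the all-predictor classifier (m4)
class_metrics <- metric_set(accuracy, kap, mn_log_loss, roc_auc)

m4_preds <- predict(m4, df) %>%                     # class predictions (.pred_class)
  bind_cols(predict(m4, df, type = "prob")) %>%     # class probabilities (.pred_1, .pred_2)
  bind_cols(df %>% select(SEX))                     # observed outcome

class_metrics(m4_preds, truth = SEX, estimate = .pred_class, .pred_1)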
Module 10 Part 1
In this exercise, we will use the data from the previous part.
Y DOSE AGE SEX WT
Min. : 826.4 25 :59 Min. :18.00 1:104 Min. : 56.60
1st Qu.:1700.5 37.5:12 1st Qu.:26.00 2: 16 1st Qu.: 73.17
Median :2349.1 50 :49 Median :31.00 Median : 82.10
Mean :2445.4 Mean :33.00 Mean : 82.55
3rd Qu.:3050.2 3rd Qu.:40.25 3rd Qu.: 90.10
Max. :5606.6 Max. :50.00 Max. :115.30
HT
Min. :1.520
1st Qu.:1.700
Median :1.770
Mean :1.759
3rd Qu.:1.813
Max. :1.930
First, I set a seed (rngseed) and split the data into a 75% train set and a 25% test set.
# Set seed
rngseed <- 1234
set.seed(rngseed)

# Split the dataset into a 75% train set and a 25% test set
splitted_m10 <- initial_split(m10_df, prop = .75)
train_data <- training(splitted_m10)
test_data <- testing(splitted_m10)
Model Fitting
In this part, I fit two models with Y as the outcome: one with a single predictor (DOSE) and one with all predictors, plus a null model that serves as a baseline for comparing RMSE.
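The chunk that actually fits these three models and generates the training-set predictions is not visible in this rendering; the objects it creates (m1_10, m2_10, m3_10, and the *_preds_train data frames) are used in the chunks below. A sketch of the assumed step:

# Sketch: fit the three models on the training data
lm_spec <- linear_reg() %>% set_engine("lm")

m1_10 <- lm_spec %>% fit(Y ~ DOSE, data = train_data)                        # DOSE only
m2_10 <- lm_spec %>% fit(Y ~ DOSE + AGE + SEX + WT + HT, data = train_data)  # all predictors
m3_10 <- null_model() %>%                                                    # null model (predicts the mean)
  set_engine("parsnip") %>%
  set_mode("regression") %>%
  fit(Y ~ 1, data = train_data)

# Training-set predictions for each model
dose_preds_train <- predict(m1_10, train_data) %>% bind_cols(train_data %>% select(Y))
all_preds_train  <- predict(m2_10, train_data) %>% bind_cols(train_data %>% select(Y))
null_preds_train <- predict(m3_10, train_data) %>% bind_cols(train_data %>% select(Y))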
# RMSE for Y ~ DOSE
RMSE_train_DOSE <- rmse(dose_preds_train, truth = Y, estimate = .pred)

# RMSE for Y ~ all predictors
RMSE_train_All <- rmse(all_preds_train, truth = Y, estimate = .pred)

# RMSE for the null model
RMSE_train_Null <- rmse(null_preds_train, truth = Y, estimate = .pred)
Based on the output of the three models, the RMSEs for the DOSE-only, all-predictor, and null models are 702.8, 627.3, and 948.36, respectively. We can conclude that the regression model including all predictors is the best of the three, with the lowest RMSE.
Model Performance Assessment 2
In this part, 10-fold cross-validation (CV) is used to examine the performance of the models.
# Set seed
set.seed(rngseed)

# Establish the CV folds
folds <- vfold_cv(train_data, v = 10)
Now I fit the model with only DOSE as a predictor to each of the 10 analysis sets (9 folds at a time) and calculate the RMSE.
# Y ~ DOSE
# Model specification
m1_10_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Set the workflow: model 1
m1_10_cv <- workflow() %>%
  add_model(m1_10_spec) %>%
  add_formula(Y ~ DOSE)

# Set seed
set.seed(rngseed)

# Fit to the resamples
m1_10_cv_fit <- m1_10_cv %>% fit_resamples(folds)

# Mean and SE of RMSE
collect_metrics(m1_10_cv_fit)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 697. 10 68.1 Preprocessor1_Model1
2 rsq standard 0.500 10 0.0605 Preprocessor1_Model1
Fitting model with all predictors
# Y ~ all predictors
# Model specification
m2_10_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Set the workflow: model 2
m2_10_cv <- workflow() %>%
  add_model(m2_10_spec) %>%
  add_formula(Y ~ DOSE + AGE + SEX + WT + HT)

# Set seed
set.seed(rngseed)

# Fit to the resamples
m2_10_cv_fit <- m2_10_cv %>% fit_resamples(folds)

# Mean and SE of RMSE
collect_metrics(m2_10_cv_fit)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 653. 10 63.6 Preprocessor1_Model1
2 rsq standard 0.561 10 0.0717 Preprocessor1_Model1
Based on the output, Model 2 has a lower RMSE (652.77 vs. 696.71), indicating that it predicts the response variable Y more accurately than Model 1. Its standard error is also slightly lower (63.60 vs. 68.10), suggesting that Model 2 is more stable across CV folds. For R², Model 2 has a higher value (0.5608 vs. 0.5000), meaning it explains 56.08% of the variance in Y, compared to 50.00% for Model 1. The standard error of R² is slightly higher for Model 2, meaning there is somewhat more variation across folds, but the difference is minor.
Validate using a different seed
# New seed
set.seed(3456)

# 10-fold resampling
folds_new_cv <- vfold_cv(train_data, v = 10)

# Y ~ DOSE
set.seed(3456)

# Model specification
m1_10_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Set seed
set.seed(3456)

# Fit to the new resamples
m1_10_cv_fit_new <- m1_10_cv %>% fit_resamples(folds_new_cv)

# Mean and SE of RMSE
collect_metrics(m1_10_cv_fit_new)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 691. 10 72.7 Preprocessor1_Model1
2 rsq standard 0.562 10 0.0638 Preprocessor1_Model1
# Y ~ all predictors
# Fit to the new resamples
m2_10_cv_fit_new <- m2_10_cv %>% fit_resamples(folds_new_cv)

# Mean and SE of RMSE
collect_metrics(m2_10_cv_fit_new)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 643. 10 66.7 Preprocessor1_Model1
2 rsq standard 0.588 10 0.0510 Preprocessor1_Model1
Model 2 performs better than Model 1, as it has a lower RMSE (643.02 vs. 690.78). This suggests that Model 2 has better predictive accuracy and makes smaller errors in predicting Y. For R², Model 2 has a higher value (0.5876 vs. 0.5624), meaning it explains more of the variance in Y and fits the data better than Model 1. These results are consistent with those from the original folds.
Module 10 Part 2 : This section was added by Yufei Wu
Model predictions
# Generate predictions and combine with observed values
predictions_dose <- predict(m1_10, train_data) %>%
  mutate(Model = "Y ~ DOSE") %>%
  bind_cols(train_data %>% select(Y))

predictions_all <- predict(m2_10, train_data) %>%
  mutate(Model = "Y ~ All Predictors") %>%
  bind_cols(train_data %>% select(Y))

predictions_null <- predict(m3_10, train_data) %>%
  mutate(Model = "Null Model") %>%
  bind_cols(train_data %>% select(Y))

# Combine all predictions into one data frame
predictions_df <- bind_rows(predictions_dose, predictions_all, predictions_null) %>%
  rename(Observed = Y, Predicted = .pred)

# Create the scatter plot
ggplot(predictions_df, aes(x = Observed, y = Predicted, color = Model, shape = Model)) +
  geom_point(alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") +  # 45-degree line
  scale_x_continuous(limits = c(0, 5000)) +
  scale_y_continuous(limits = c(0, 5000)) +
  labs(title = "Observed vs Predicted Values",
       x = "Observed Values",
       y = "Predicted Values") +
  theme_minimal() +
  theme(legend.position = "bottom")
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).
From the plot, we can see that the predictions from the null model form a single horizontal line (red), since the same mean is predicted for every observation. For Model 1, which only includes dose, the predictions fall along three horizontal lines (blue), because DOSE takes only three values (25, 37.5, or 50 mg), so there are only three distinct predicted values. Model 2 looks the best, since its points fall along the 45-degree line. However, there is some pattern in the scatter, with predictions lower than the observed values at high observed values.
Now plot predicted versus residuals:
# Compute residuals for Model 2
residuals_df <- predict(m2_10, train_data) %>%
  bind_cols(train_data %>% select(Y)) %>%
  mutate(Residuals = .pred - Y,
         Predicted = .pred)

# Find the max absolute residual for symmetric y-axis limits
residual_limit <- max(abs(residuals_df$Residuals))

# Load ggplot2 for visualization
library(ggplot2)

# Create the residuals plot
ggplot(residuals_df, aes(x = Predicted, y = Residuals)) +
  geom_point(alpha = 0.7, color = "blue") +                          # Scatter plot of residuals
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +   # Reference line at 0
  scale_y_continuous(limits = c(-residual_limit, residual_limit)) +  # Symmetric y-axis
  labs(title = "Predicted vs Residuals (Model 2)",
       x = "Predicted Values",
       y = "Residuals") +
  theme_minimal()
There is a pattern in the residuals: there are more, and larger, negative residuals than positive ones.
Model predictions and uncertainty
# Set seed
set.seed(rngseed)

# Create 100 bootstrap samples
dat_bs <- bootstraps(train_data, times = 100)

# Fit the model to each bootstrap sample and store predictions
pred_bs <- matrix(NA, nrow = nrow(train_data), ncol = 100)
for (i in 1:100) {
  # Extract the bootstrap sample
  dat_sample <- analysis(dat_bs$splits[[i]])

  # Fit the model to the bootstrap sample
  m2_bs <- linear_reg() %>%
    set_engine("lm") %>%
    fit(Y ~ DOSE + AGE + SEX + WT + HT, data = dat_sample)

  # Make predictions for the original training data
  pred_bs[, i] <- predict(m2_bs, train_data)$.pred
}

# Compute the median and 89% interval (5.5th to 94.5th percentile) of the predictions
preds <- apply(pred_bs, 1, quantile, probs = c(0.055, 0.5, 0.945)) |> t()
colnames(preds) <- c("Lower", "Median", "Upper")

# Compute the point estimate from the original model
point_estimate <- predict(m2_10, train_data)$.pred

# Create a data frame with observed values, point estimates, and bootstrap statistics
plot_data <- train_data %>%
  mutate(Point_Estimate = point_estimate,
         Median = preds[, "Median"],
         Lower = preds[, "Lower"],
         Upper = preds[, "Upper"])

# Load ggplot2 for visualization
library(ggplot2)

# Create the plot
ggplot(plot_data, aes(x = Y)) +
  geom_point(aes(y = Point_Estimate), color = "black", shape = 16, alpha = 0.7) +  # Point estimates
  geom_point(aes(y = Median), color = "blue", shape = 16, alpha = 0.7) +           # Bootstrap medians
  geom_errorbar(aes(ymin = Lower, ymax = Upper), color = "red", width = 0.2, alpha = 0.5) +  # Confidence intervals
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") +    # 45-degree line
  scale_x_continuous(limits = c(0, 6000)) +
  scale_y_continuous(limits = c(0, 6000)) +
  labs(title = "Observed vs Bootstrap Confidence Intervals",
       x = "Observed Values",
       y = "Predicted Values") +
  theme_minimal() +
  theme(legend.position = "bottom")
From the figure, we can see that the bootstrap medians align closely with the point estimates, which suggests that the model is fairly stable across the resampled datasets. The error bars show that the confidence intervals are wider at some higher values, indicating greater uncertainty in those predictions. Again, most points fall along the 45-degree line, except for some predictions that are lower than the observed values at the high end.
Module 10 Part 3
# Make predictions on the test data
all_preds_test <- predict(m2_10, test_data) %>%
  bind_cols(test_data)

# Compute RMSE for the test data
all_rmse_test <- rmse(all_preds_test, truth = Y, estimate = .pred)
print(all_rmse_test)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 518.
predictions_train <- all_preds_train %>% mutate(Dataset = "Train")
predictions_test <- all_preds_test %>% mutate(Dataset = "Test")

# Combine train and test predictions
pred_combined <- bind_rows(predictions_train, predictions_test)
ggplot(pred_combined, aes(x = Y, y = .pred, color = Dataset, shape = Dataset)) +
  geom_point(alpha = 0.6) +                                                      # Scatter points for observed vs predicted
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") +  # Reference 45-degree line
  scale_x_continuous(limits = c(0, 5000)) +  # Set x-axis range
  scale_y_continuous(limits = c(0, 5000)) +  # Set y-axis range
  labs(title = "Observed vs Predicted Values (Train vs Test Data)",
       x = "Observed Y",
       y = "Predicted Y",
       color = "Dataset") +
  theme_minimal()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
The plot shows observed vs. predicted values for the training and test datasets: predicted values of the outcome Y from the fitted model are compared against the observed values, separately for the two sets. The dashed 45-degree reference line represents perfect prediction (predicted = observed); points closer to this line indicate better predictive performance. Triangles (blue) represent the training set, and circles (pink) represent the test set. The distribution around the reference line suggests the model captures the general trend in both datasets, though some variability is observed, particularly at the lower and higher ranges of Y, with a slightly wider spread in the test set. Overall, this plot supports the finding that the model with all predictors performs well across different datasets.
Finally, the RMSEs for all models, including the training and test data, are printed below.
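The chunk that assembles this summary is not shown; a sketch that collects the RMSE objects computed earlier into one table:

# Sketch: gather the train and test RMSEs into a single data frame
rmse_summary <- data.frame(
  Model = c("Null Model", "Model 1 (DOSE only)", "Model 2 (All Predictors)", "Model 2 (Test Data)"),
  RMSE  = c(RMSE_train_Null$.estimate, RMSE_train_DOSE$.estimate,
            RMSE_train_All$.estimate, all_rmse_test$.estimate)
)
print(rmse_summary)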
Model RMSE
1 Null Model 948.3526
2 Model 1 (DOSE only) 702.7909
3 Model 2 (All Predictors) 627.2724
4 Model 2 (Test Data) 518.2239
Conclusion
The predictive performance of the models was assessed using root mean squared error (RMSE). The null model, which includes no predictors, had the highest RMSE (948.35), indicating poor predictive accuracy. Model 1, which included only the DOSE variable, showed a substantial improvement, with an RMSE of 702.79. Model 2, incorporating all available predictors, further reduced the RMSE to 627.27 on the training data, suggesting that the additional variables contributed meaningful predictive value. Model 2 achieved its best performance on the test data, with an RMSE of 518.22, indicating good generalization to unseen data. These results highlight the importance of including multiple relevant predictors to enhance model accuracy and reliability.