Retrospective Health-data Analysis

The Data

Description of the Data

Data Definition

Data Preparation and Preprocessing

Explanatory Variable and Dependent Variable

Exploratory Data Analysis

Descriptive Statistics
Correlation Analysis

Outcome Prediction

Data Splitting
Model Building and Tuning
Model Evaluation

Conclusion

Retrospective Health-data Analysis

Retrospective exploratory data analysis of intensive care unit health-data downloaded from kaggle.com already preprocessed by user @saurabhshahane.

NB: The data used is part of a whole data from MIMIC III original database.

Mainly and

The Data

We load the libraries for reading and analyzing the data first.

library("reshape2")

# Load the tidyverse library for data manipulation and visualization
library('tidyverse')

Next we import the data.frame

# Import the data
health_rec <- read_csv('https://raw.githubusercontent.com/xrander/Health-data/master/Health_data.csv')

# Preview the first ten rows of the data
head(health_rec, n = 10)

ABCDEFGHIJ0123456789

group_num <dbl>	ID <dbl>	age <dbl>	gender <dbl>	BMI <dbl>	hypertensive <dbl>	atrialfibrillation <dbl>
1	125047	72	1	37.58818	0	0
1	139812	75	2	NA	0	0
1	109787	83	2	26.57263	0	0
1	130587	43	2	83.26463	0	0
1	138290	75	2	31.82484	1	0
1	154653	76	1	24.26229	1	1
1	194420	72	1	39.66743	1	0
1	153461	83	2	22.31111	1	1
1	113076	61	2	19.99224	1	1
1	147252	67	1	45.03203	1	0

Description of the Data

The MIMIC-III database is publicly a available critical care database containing de-identified data on 46520 patients and 58976 admissions to the ICU of the Beth Israel Deaconess Medical Center, Boston, USA between 2001 and 2008. The data used for this study is limited to a bit more than 1000 patient only.

Data Definition

group_num
ID: The patient ID
outcome: 0 = Alive and 1 = Dead
age: the age of the patient
gender: 1 = Male and 2 = Female
BMI: Body Mass Index
hypertensive: 0 = with and 1 = without
atrialfibrillation: 0 = with and 1 = without
CHD_with_no_MI: Coronary heart disease with Myocardial infarction, 0 = with and 1 without
diabetes: 0 = with and 1 without
deficiencyanemias: 0 = with and 1 = without
depression: 0 = with and 1 = without
Hyperlipemia:this refers to high concentration of fats or lipids in the blood. 0 = with and 1 = without
Renal_failure: 0 = without renal failure and 1 = with renal failure
COPD: chronic obstructive pulmonary disease(A disease causing airflow blockage and breathing related problems). 0 = without and 1 = without
heart_rate: a range of 60 to 100 bpm is considered normal
Systolic_blood_pressure: measure of the pressure in arteries when the heart beats.
Diastolic_blood_pressure: measure of the pressure in the arteries when the heart rests between beats
Respiratory_rate: this is the number of breathes per minute. The normal range is between 12-20, anything under above or below this range is abnormal. 25 is considered too high.
temperature
SP_O2: Oxygen saturation (measure of how much oxygen the blood carries as a percentage of the maximum it could carry). The normal range for healthy individuals is between 96% to 99%
Urine_output
hematocrit: percentage volume of red blood cells in the blood.
RBC: acronym for Red Blood Count.
MCH: Mean Corpuscular Hemoglobin, this is the average amount of hemoglobin in each of the red blood cells
MCHC: Mean Cell Hemoglobin Concentration, this is the average concentration of hemoglobin in a given volume of blood.
MCV: Mean Corpuscular Volume, this is the measure of the average size of the red clood cells
RDW: red cell distribution width. it is the differences in the volume and size of the red blood cells
Leucocyte
Platelets
Neutrophils
Basophils
Lymphocyte
PT: measure of the time it takes the liquid portion of the blood to clot.
INR
NT_proBNP: a measure of heart failure.
Creatine_kinase
Creatinine
Urea_nitrogen
glucose
Blood_potassium
Blood_sodium
Blood_calcium
Chloride
Anion_gap: measures the difference between the negatively and positively charged electrolytes in the blood.
Magnesium_ion
PH
Bicarbonate
Lactic_acid
PCO2: Partial Pressure of Carbon Dioxide, this measures the carbondioxide within the arterial or venous blood.
EF: Ejection Fraction: Used to gauge how healthy the heart is. It is the amount of blood that the heart pumps each time it beats.

Data Preparation and Preprocessing

Note: The PostgreSQL code solution to the descriptive statistic questions asked are here

Explanatory Variable and Dependent Variable

The dependent variable for this analysis is the outcome, while other variables are the explanatory variable.

##Understanding the Data

It is important we understand the data

The Data Structure

# data structure
str(health_rec)

## spc_tbl_ [1,177 × 51] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ group_num               : num [1:1177] 1 1 1 1 1 1 1 1 1 1 ...
##  $ ID                      : num [1:1177] 125047 139812 109787 130587 138290 ...
##  $ outcome                 : num [1:1177] 0 0 0 0 0 0 0 0 0 0 ...
##  $ age                     : num [1:1177] 72 75 83 43 75 76 72 83 61 67 ...
##  $ gender                  : num [1:1177] 1 2 2 2 2 1 1 2 2 1 ...
##  $ BMI                     : num [1:1177] 37.6 NA 26.6 83.3 31.8 ...
##  $ hypertensive            : num [1:1177] 0 0 0 0 1 1 1 1 1 1 ...
##  $ atrialfibrillation      : num [1:1177] 0 0 0 0 0 1 0 1 1 0 ...
##  $ CHD_with_no_MI          : num [1:1177] 0 0 0 0 0 0 0 0 0 0 ...
##  $ diabetes                : num [1:1177] 1 0 0 0 0 0 0 1 1 1 ...
##  $ deficiencyanemias       : num [1:1177] 1 1 1 0 1 1 0 1 0 0 ...
##  $ depression              : num [1:1177] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyperlipemia            : num [1:1177] 1 0 0 0 0 1 1 0 0 0 ...
##  $ Renal_failure           : num [1:1177] 1 0 1 0 1 1 1 0 1 0 ...
##  $ COPD                    : num [1:1177] 0 1 0 0 1 1 1 0 0 0 ...
##  $ heart_rate              : num [1:1177] 68.8 101.4 72.3 94.5 67.9 ...
##  $ Systolic_blood_pressure : num [1:1177] 156 140 135 126 157 ...
##  $ Diastolic_blood_pressure: num [1:1177] 68.3 65 61.4 73.2 58.1 ...
##  $ Respiratory_rate        : num [1:1177] 16.6 20.9 23.6 21.9 21.4 ...
##  $ temperature             : num [1:1177] 36.7 36.7 36.5 36.3 36.8 ...
##  $ SP_O2                   : num [1:1177] 98.4 96.9 95.3 93.8 99.3 ...
##  $ Urine_output            : num [1:1177] 2155 1425 2425 8760 4455 ...
##  $ hematocrit              : num [1:1177] 26.3 30.8 27.7 36.6 29.9 ...
##  $ RBC                     : num [1:1177] 2.96 3.14 2.62 4.28 3.29 ...
##  $ MCH                     : num [1:1177] 28.2 31.1 34.3 26.1 30.7 ...
##  $ MCHC                    : num [1:1177] 31.5 31.7 31.3 30.4 33.7 ...
##  $ MCV                     : num [1:1177] 89.9 98.2 109.8 85.6 91 ...
##  $ RDW                     : num [1:1177] 16.2 14.3 23.8 17 16.3 ...
##  $ Leucocyte               : num [1:1177] 7.65 12.74 5.48 8.22 8.83 ...
##  $ Platelets               : num [1:1177] 305 246 204 216 251 ...
##  $ Neutrophils             : num [1:1177] 74.7 NA 68.1 81.8 NA ...
##  $ Basophils               : num [1:1177] 0.4 NA 0.55 0.15 NA 0.3 0.2 NA 0.55 NA ...
##  $ Lymphocyte              : num [1:1177] 13.3 NA 24.5 14.5 NA ...
##  $ PT                      : num [1:1177] 10.6 NA 11.3 27.1 NA ...
##  $ INR                     : num [1:1177] 1 NA 0.95 2.67 NA ...
##  $ NT_proBNP               : num [1:1177] 1956 2384 4081 668 30802 ...
##  $ Creatine_kinase         : num [1:1177] 148 60.6 16 85 111.7 ...
##  $ Creatinine              : num [1:1177] 1.958 1.122 1.871 0.586 1.95 ...
##  $ Urea_nitrogen           : num [1:1177] 50 20.3 33.9 15.3 43 ...
##  $ glucose                 : num [1:1177] 115 148 149 128 146 ...
##  $ Blood_potassium         : num [1:1177] 4.82 4.45 5.83 4.39 4.78 ...
##  $ Blood_sodium            : num [1:1177] 139 139 141 138 137 ...
##  $ Blood_calcium           : num [1:1177] 7.46 8.16 8.27 9.48 8.73 ...
##  $ Chloride                : num [1:1177] 109.2 98.4 105.9 92.1 104.5 ...
##  $ Anion_gap               : num [1:1177] 13.2 11.4 10 12.4 15.2 ...
##  $ Magnesium_ion           : num [1:1177] 2.62 1.89 2.16 1.94 1.65 ...
##  $ PH                      : num [1:1177] 7.23 7.22 7.27 7.37 7.25 ...
##  $ Bicarbonate             : num [1:1177] 21.2 33.4 30.6 38.6 22 ...
##  $ Lactic_acid             : num [1:1177] 0.5 0.5 0.5 0.6 0.6 ...
##  $ PCO2                    : num [1:1177] 40 78 71.5 75 50 ...
##  $ EF                      : num [1:1177] 55 55 35 55 55 35 55 75 50 55 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   group_num = col_double(),
##   ..   ID = col_double(),
##   ..   outcome = col_double(),
##   ..   age = col_double(),
##   ..   gender = col_double(),
##   ..   BMI = col_double(),
##   ..   hypertensive = col_double(),
##   ..   atrialfibrillation = col_double(),
##   ..   CHD_with_no_MI = col_double(),
##   ..   diabetes = col_double(),
##   ..   deficiencyanemias = col_double(),
##   ..   depression = col_double(),
##   ..   Hyperlipemia = col_double(),
##   ..   Renal_failure = col_double(),
##   ..   COPD = col_double(),
##   ..   heart_rate = col_double(),
##   ..   Systolic_blood_pressure = col_double(),
##   ..   Diastolic_blood_pressure = col_double(),
##   ..   Respiratory_rate = col_double(),
##   ..   temperature = col_double(),
##   ..   SP_O2 = col_double(),
##   ..   Urine_output = col_double(),
##   ..   hematocrit = col_double(),
##   ..   RBC = col_double(),
##   ..   MCH = col_double(),
##   ..   MCHC = col_double(),
##   ..   MCV = col_double(),
##   ..   RDW = col_double(),
##   ..   Leucocyte = col_double(),
##   ..   Platelets = col_double(),
##   ..   Neutrophils = col_double(),
##   ..   Basophils = col_double(),
##   ..   Lymphocyte = col_double(),
##   ..   PT = col_double(),
##   ..   INR = col_double(),
##   ..   NT_proBNP = col_double(),
##   ..   Creatine_kinase = col_double(),
##   ..   Creatinine = col_double(),
##   ..   Urea_nitrogen = col_double(),
##   ..   glucose = col_double(),
##   ..   Blood_potassium = col_double(),
##   ..   Blood_sodium = col_double(),
##   ..   Blood_calcium = col_double(),
##   ..   Chloride = col_double(),
##   ..   Anion_gap = col_double(),
##   ..   Magnesium_ion = col_double(),
##   ..   PH = col_double(),
##   ..   Bicarbonate = col_double(),
##   ..   Lactic_acid = col_double(),
##   ..   PCO2 = col_double(),
##   ..   EF = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

At first glance, all the variables are numeric type. There are 1177 observations and 51 variables. Some of the variables of the data does not follow the data definition. A quick descriptive summary of the data shows this.

# quick descriptive statistics of the data
summary(health_rec)

##    group_num           ID            outcome            age       
##  Min.   :1.000   Min.   :100213   Min.   :0.0000   Min.   :19.00  
##  1st Qu.:1.000   1st Qu.:125603   1st Qu.:0.0000   1st Qu.:65.00  
##  Median :1.000   Median :151901   Median :0.0000   Median :77.00  
##  Mean   :1.299   Mean   :150778   Mean   :0.1352   Mean   :74.06  
##  3rd Qu.:2.000   3rd Qu.:176048   3rd Qu.:0.0000   3rd Qu.:85.00  
##  Max.   :2.000   Max.   :199952   Max.   :1.0000   Max.   :99.00  
##                                   NA's   :1                       
##      gender           BMI          hypertensive    atrialfibrillation
##  Min.   :1.000   Min.   : 13.35   Min.   :0.0000   Min.   :0.0000    
##  1st Qu.:1.000   1st Qu.: 24.33   1st Qu.:0.0000   1st Qu.:0.0000    
##  Median :2.000   Median : 28.31   Median :1.0000   Median :0.0000    
##  Mean   :1.525   Mean   : 30.19   Mean   :0.7179   Mean   :0.4511    
##  3rd Qu.:2.000   3rd Qu.: 33.63   3rd Qu.:1.0000   3rd Qu.:1.0000    
##  Max.   :2.000   Max.   :104.97   Max.   :1.0000   Max.   :1.0000    
##                  NA's   :215                                         
##  CHD_with_no_MI       diabetes      deficiencyanemias   depression    
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.000     Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000     1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.000     Median :0.0000  
##  Mean   :0.08581   Mean   :0.4214   Mean   :0.339     Mean   :0.1189  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.000     3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.000     Max.   :1.0000  
##                                                                       
##   Hyperlipemia    Renal_failure         COPD           heart_rate    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   : 36.00  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.: 72.37  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median : 83.61  
##  Mean   :0.3798   Mean   :0.3653   Mean   :0.07562   Mean   : 84.58  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.: 95.91  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :135.71  
##                                                      NA's   :13      
##  Systolic_blood_pressure Diastolic_blood_pressure Respiratory_rate
##  Min.   : 75.0           Min.   : 24.74           Min.   :11.14   
##  1st Qu.:105.4           1st Qu.: 52.17           1st Qu.:17.93   
##  Median :116.1           Median : 58.46           Median :20.37   
##  Mean   :118.0           Mean   : 59.53           Mean   :20.80   
##  3rd Qu.:128.6           3rd Qu.: 65.46           3rd Qu.:23.39   
##  Max.   :203.0           Max.   :107.00           Max.   :40.90   
##  NA's   :16              NA's   :16               NA's   :13      
##   temperature        SP_O2         Urine_output    hematocrit   
##  Min.   :33.25   Min.   : 75.92   Min.   :   0   Min.   :20.31  
##  1st Qu.:36.29   1st Qu.: 95.00   1st Qu.: 980   1st Qu.:28.16  
##  Median :36.65   Median : 96.45   Median :1675   Median :30.80  
##  Mean   :36.68   Mean   : 96.27   Mean   :1899   Mean   :31.91  
##  3rd Qu.:37.02   3rd Qu.: 97.92   3rd Qu.:2500   3rd Qu.:35.01  
##  Max.   :39.13   Max.   :100.00   Max.   :8820   Max.   :55.42  
##  NA's   :19      NA's   :13       NA's   :36                    
##       RBC             MCH             MCHC            MCV        
##  Min.   :2.030   Min.   :18.12   Min.   :27.82   Min.   : 62.60  
##  1st Qu.:3.120   1st Qu.:28.25   1st Qu.:32.01   1st Qu.: 86.25  
##  Median :3.490   Median :29.75   Median :32.99   Median : 90.00  
##  Mean   :3.575   Mean   :29.54   Mean   :32.86   Mean   : 89.90  
##  3rd Qu.:3.900   3rd Qu.:31.24   3rd Qu.:33.83   3rd Qu.: 93.86  
##  Max.   :6.575   Max.   :40.31   Max.   :37.01   Max.   :116.71  
##                                                                  
##       RDW          Leucocyte       Platelets         Neutrophils   
##  Min.   :12.09   Min.   : 0.10   Min.   :   9.571   Min.   : 5.00  
##  1st Qu.:14.46   1st Qu.: 7.44   1st Qu.: 168.909   1st Qu.:74.78  
##  Median :15.51   Median : 9.68   Median : 222.667   Median :82.47  
##  Mean   :15.95   Mean   :10.71   Mean   : 241.504   Mean   :80.11  
##  3rd Qu.:16.94   3rd Qu.:12.74   3rd Qu.: 304.250   3rd Qu.:87.45  
##  Max.   :29.05   Max.   :64.75   Max.   :1028.200   Max.   :98.00  
##                                                     NA's   :144    
##    Basophils        Lymphocyte            PT             INR        
##  Min.   :0.1000   Min.   : 0.9667   Min.   :10.10   Min.   :0.8714  
##  1st Qu.:0.2000   1st Qu.: 6.6500   1st Qu.:13.16   1st Qu.:1.1400  
##  Median :0.3000   Median :10.4750   Median :14.63   Median :1.3000  
##  Mean   :0.4056   Mean   :12.2330   Mean   :17.48   Mean   :1.6255  
##  3rd Qu.:0.5000   3rd Qu.:15.4625   3rd Qu.:18.80   3rd Qu.:1.7364  
##  Max.   :8.8000   Max.   :83.5000   Max.   :71.27   Max.   :8.3429  
##  NA's   :259      NA's   :145       NA's   :20      NA's   :20      
##    NT_proBNP      Creatine_kinase      Creatinine      Urea_nitrogen    
##  Min.   :    50   Min.   :    8.00   Min.   : 0.2667   Min.   :  5.357  
##  1st Qu.:  2251   1st Qu.:   46.00   1st Qu.: 0.9400   1st Qu.: 20.833  
##  Median :  5840   Median :   89.25   Median : 1.2875   Median : 30.667  
##  Mean   : 11014   Mean   :  246.78   Mean   : 1.6428   Mean   : 36.298  
##  3rd Qu.: 14968   3rd Qu.:  185.19   3rd Qu.: 1.9000   3rd Qu.: 45.250  
##  Max.   :118928   Max.   :42987.50   Max.   :15.5273   Max.   :161.750  
##                   NA's   :165                                           
##     glucose       Blood_potassium  Blood_sodium   Blood_calcium   
##  Min.   : 66.67   Min.   :3.000   Min.   :114.7   Min.   : 6.700  
##  1st Qu.:113.94   1st Qu.:3.900   1st Qu.:136.7   1st Qu.: 8.149  
##  Median :136.40   Median :4.115   Median :139.2   Median : 8.500  
##  Mean   :148.80   Mean   :4.177   Mean   :138.9   Mean   : 8.501  
##  3rd Qu.:169.50   3rd Qu.:4.400   3rd Qu.:141.6   3rd Qu.: 8.869  
##  Max.   :414.10   Max.   :6.567   Max.   :154.7   Max.   :10.950  
##  NA's   :18                                       NA's   :1       
##     Chloride        Anion_gap      Magnesium_ion         PH       
##  Min.   : 80.27   Min.   : 6.636   Min.   :1.400   Min.   :7.090  
##  1st Qu.: 99.00   1st Qu.:12.250   1st Qu.:1.956   1st Qu.:7.335  
##  Median :102.50   Median :13.667   Median :2.092   Median :7.380  
##  Mean   :102.28   Mean   :13.925   Mean   :2.120   Mean   :7.379  
##  3rd Qu.:105.57   3rd Qu.:15.417   3rd Qu.:2.242   3rd Qu.:7.430  
##  Max.   :122.53   Max.   :25.500   Max.   :4.073   Max.   :7.580  
##                                                    NA's   :292    
##   Bicarbonate     Lactic_acid         PCO2             EF       
##  Min.   :12.86   Min.   :0.500   Min.   :18.75   Min.   :15.00  
##  1st Qu.:23.45   1st Qu.:1.200   1st Qu.:37.04   1st Qu.:40.00  
##  Median :26.50   Median :1.600   Median :43.00   Median :55.00  
##  Mean   :26.91   Mean   :1.853   Mean   :45.54   Mean   :48.72  
##  3rd Qu.:29.88   3rd Qu.:2.200   3rd Qu.:50.59   3rd Qu.:55.00  
##  Max.   :47.67   Max.   :8.333   Max.   :98.60   Max.   :75.00  
##                  NA's   :229     NA's   :294

We can expect two data type according to the data description. The first is categorical variables for data which min is 0 and max is 1, and the other is numeric variables.

The descriptive statistics also shows the min and max age as 19 and 99.

Missing values

Next, we investigate the data for missing values. It is alright to have some missing values for some variables, but the outcome should be known, as it is important to see if patients are dead or alive. Outcome will be used to handle the missing data.

# Remove missing data from the outcome variable
health_rec <- health_rec%>%
  filter(!is.na(outcome))

There is no missing data present in the outcome of patients. The data have been reduced from 1177 to dim(health_rec)[1]. To deal with missing values from the other variables other than outcome, we need to know the missing value for the categorical and numeric variable and take the necessary steps.

Categorical Varaibles

categorical <- health_rec %>%
  select(gender, hypertensive:COPD)

unique(is.na(categorical))

##      gender hypertensive atrialfibrillation CHD_with_no_MI diabetes
## [1,]  FALSE        FALSE              FALSE          FALSE    FALSE
##      deficiencyanemias depression Hyperlipemia Renal_failure  COPD
## [1,]             FALSE      FALSE        FALSE         FALSE FALSE

The categorical variables do not have missing values.

Numerical Variables distribution

For the numerical variable, the descriptive statistics carried out earlier shows that some numerical variables contains missing values

numerical <- health_rec %>%
  select(-colnames(categorical), -c(1:4))

str(numerical)

## tibble [1,176 × 37] (S3: tbl_df/tbl/data.frame)
##  $ BMI                     : num [1:1176] 37.6 NA 26.6 83.3 31.8 ...
##  $ heart_rate              : num [1:1176] 68.8 101.4 72.3 94.5 67.9 ...
##  $ Systolic_blood_pressure : num [1:1176] 156 140 135 126 157 ...
##  $ Diastolic_blood_pressure: num [1:1176] 68.3 65 61.4 73.2 58.1 ...
##  $ Respiratory_rate        : num [1:1176] 16.6 20.9 23.6 21.9 21.4 ...
##  $ temperature             : num [1:1176] 36.7 36.7 36.5 36.3 36.8 ...
##  $ SP_O2                   : num [1:1176] 98.4 96.9 95.3 93.8 99.3 ...
##  $ Urine_output            : num [1:1176] 2155 1425 2425 8760 4455 ...
##  $ hematocrit              : num [1:1176] 26.3 30.8 27.7 36.6 29.9 ...
##  $ RBC                     : num [1:1176] 2.96 3.14 2.62 4.28 3.29 ...
##  $ MCH                     : num [1:1176] 28.2 31.1 34.3 26.1 30.7 ...
##  $ MCHC                    : num [1:1176] 31.5 31.7 31.3 30.4 33.7 ...
##  $ MCV                     : num [1:1176] 89.9 98.2 109.8 85.6 91 ...
##  $ RDW                     : num [1:1176] 16.2 14.3 23.8 17 16.3 ...
##  $ Leucocyte               : num [1:1176] 7.65 12.74 5.48 8.22 8.83 ...
##  $ Platelets               : num [1:1176] 305 246 204 216 251 ...
##  $ Neutrophils             : num [1:1176] 74.7 NA 68.1 81.8 NA ...
##  $ Basophils               : num [1:1176] 0.4 NA 0.55 0.15 NA 0.3 0.2 NA 0.55 NA ...
##  $ Lymphocyte              : num [1:1176] 13.3 NA 24.5 14.5 NA ...
##  $ PT                      : num [1:1176] 10.6 NA 11.3 27.1 NA ...
##  $ INR                     : num [1:1176] 1 NA 0.95 2.67 NA ...
##  $ NT_proBNP               : num [1:1176] 1956 2384 4081 668 30802 ...
##  $ Creatine_kinase         : num [1:1176] 148 60.6 16 85 111.7 ...
##  $ Creatinine              : num [1:1176] 1.958 1.122 1.871 0.586 1.95 ...
##  $ Urea_nitrogen           : num [1:1176] 50 20.3 33.9 15.3 43 ...
##  $ glucose                 : num [1:1176] 115 148 149 128 146 ...
##  $ Blood_potassium         : num [1:1176] 4.82 4.45 5.83 4.39 4.78 ...
##  $ Blood_sodium            : num [1:1176] 139 139 141 138 137 ...
##  $ Blood_calcium           : num [1:1176] 7.46 8.16 8.27 9.48 8.73 ...
##  $ Chloride                : num [1:1176] 109.2 98.4 105.9 92.1 104.5 ...
##  $ Anion_gap               : num [1:1176] 13.2 11.4 10 12.4 15.2 ...
##  $ Magnesium_ion           : num [1:1176] 2.62 1.89 2.16 1.94 1.65 ...
##  $ PH                      : num [1:1176] 7.23 7.22 7.27 7.37 7.25 ...
##  $ Bicarbonate             : num [1:1176] 21.2 33.4 30.6 38.6 22 ...
##  $ Lactic_acid             : num [1:1176] 0.5 0.5 0.5 0.6 0.6 ...
##  $ PCO2                    : num [1:1176] 40 78 71.5 75 50 ...
##  $ EF                      : num [1:1176] 55 55 35 55 55 35 55 75 50 55 ...

We are going to impute the missing values in the numerical variables using their mean. We use the e1071 package impute() function to impute the means here.

# Load e1071 to impute missing values, and svm model training
library("e1071")

imputed_means <- numerical %>%
  impute(what = "mean")

Distibution of Numerical Variables

The new data will be tidied to visualize the distribution of the numerical variables.

as_tibble(imputed_means) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(show.legend = F)+
  geom_density(aes(value), col = "tomato2")+
  facet_wrap(~variable, scales = "free")+
  theme_minimal()

###### Normally distributed Variables Variables provides information on the body’s electrolyte and Mineral content such as anion gap, bicarbonate, blood calcium, blood potassium, blood sodium and chloride are normally distributed. Vital signs and Cardiovascular parameters such as temperature, PH, Diastolic blood pressure, and heart rate are also normally distributed. Only three Hematology parameters, MCH, MCHC and MCV are normally distributed.

Right-skewed Variables

Respiratory rate, RBW, RBC, urea nitrogen, urine output, basophils, BMI, creatine-Kinase, creatinine, glucose, hematocrit, INR, lactic acid, leucocyte, lymphocyte, magnesium ion, NT Pro BMP, PCO2, platelets, PT, and systolic blood pressure are all right skewed.

Left-skewed Variables

Only three variables are left skewed. These includes neutrophils, EF, and SP O2.

Imputing Missing Values

Next we replace the old values with new ones in the main data frame. In the process, the group_num and ID column were also removed.

health_rec_clean <- health_rec %>%
  select(-c(1,2),-c(names(numerical))) %>% # remove unwanted variables and old numerical variables
  bind_cols(imputed_means) %>% # add imputed values columns to the data frame
  mutate(id = row_number()) %>% # add row numbers to the data frame
  select(id, everything()) # rearrange variables to make id come first

tail(health_rec_clean, 10)

ABCDEFGHIJ0123456789

id <int>	outcome <dbl>	age <dbl>	gender <dbl>	hypertensive <dbl>	atrialfibrillation <dbl>	CHD_with_no_MI <dbl>	diabetes <dbl>	deficiencyanemias <dbl>
1167	0	83	1	1	1	1	1	0
1168	1	80	1	1	1	0	1	1
1169	0	88	2	1	0	0	1	1
1170	0	79	1	1	1	0	0	0
1171	0	92	1	1	0	0	0	0
1172	0	62	1	1	1	0	1	0
1173	0	78	1	0	1	0	1	0
1174	0	85	2	1	1	0	1	1
1175	0	79	2	0	0	1	1	1
1176	0	47	1	1	0	0	1	0

Now we check to find if there are still missing values.

unique(is.na(health_rec_clean))

##         id outcome   age gender hypertensive atrialfibrillation CHD_with_no_MI
## [1,] FALSE   FALSE FALSE  FALSE        FALSE              FALSE          FALSE
##      diabetes deficiencyanemias depression Hyperlipemia Renal_failure  COPD
## [1,]    FALSE             FALSE      FALSE        FALSE         FALSE FALSE
##        BMI heart_rate Systolic_blood_pressure Diastolic_blood_pressure
## [1,] FALSE      FALSE                   FALSE                    FALSE
##      Respiratory_rate temperature SP_O2 Urine_output hematocrit   RBC   MCH
## [1,]            FALSE       FALSE FALSE        FALSE      FALSE FALSE FALSE
##       MCHC   MCV   RDW Leucocyte Platelets Neutrophils Basophils Lymphocyte
## [1,] FALSE FALSE FALSE     FALSE     FALSE       FALSE     FALSE      FALSE
##         PT   INR NT_proBNP Creatine_kinase Creatinine Urea_nitrogen glucose
## [1,] FALSE FALSE     FALSE           FALSE      FALSE         FALSE   FALSE
##      Blood_potassium Blood_sodium Blood_calcium Chloride Anion_gap
## [1,]           FALSE        FALSE         FALSE    FALSE     FALSE
##      Magnesium_ion    PH Bicarbonate Lactic_acid  PCO2    EF
## [1,]         FALSE FALSE       FALSE       FALSE FALSE FALSE

All numerical variable missing values are replaced by their means.

Exploratory Data Analysis

A majority of the question asked are descriptive statistics questions and answers will be provided using visualization and tables.

Descriptive Statistics

What is the percentage age distribution according to age groups

The age group with the highest frequency: To do this, we have to bin the age variable, creating a class range, then we can create a summary of the age_group and see their frequency

# Create bins
age_categories <- seq(0, 120, by = 20) # We assume the highest age is not above 120 years old

# Create age interval variable
health_rec_clean <- health_rec_clean %>%
  mutate(age_group = cut(age,
                         breaks = age_categories,
                         include.lowest = T, # To indicate if x is the lowest
                         ordered_result = F),#This ensures that the factor variable is not ordered
         age_group =  ifelse(age_group == "[0,20]", "0-20",
                             ifelse(age_group == "(20,40]", "20-40",
                                    ifelse(age_group == "(40,60]", "40-60",
                                           ifelse(age_group == "(60,80]", "60-80", "80-100")))), # Change the group names to ensure consistency
         age_group = factor(age_group, levels = c("0-20", "20-40","40-60", "60-80", "80-100")))

health_rec_clean %>%
  group_by(age_group) %>%
  summarize(frequency = n()) %>%
  ggplot(aes("", frequency, fill = age_group))+
  geom_col(color = "white")+
  geom_text(aes(label = paste0(round(frequency/sum(frequency)*100, 1), "%")),
              position = position_stack(vjust = 0.5))+
  scale_fill_manual(values = c("pink", "coral", "tomato1", "orangered1", "indianred1"))+
  labs(x = NULL,
       y = NULL,
       fill = "Age Group",
       title = "Pie Chart of Age Groups")+
  coord_polar(theta = "y", start = 0)+
  theme(legend.position = "top",
        axis.line = element_blank(),
        plot.background = element_blank(),
        rect = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())

Older people are the highest in the hospital, with people between 60 - 80 years old having being the highest in the clinic and people more than 80 years old being the next most occurring patients. To know the age(not age group) that is most occurring, we can simply group by the age them.

Top Occuring ages in the Hospital

health_rec_clean %>%
  group_by(age) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
  top_n(n = 6)

ABCDEFGHIJ0123456789

age <dbl>	frequency <int>
89	141
84	48
81	43
80	40
78	39
83	39

The highest count of patients in the clinic are aged 89, 84 and 81 years respectively.

What is the number of dead patients for each gender*

health_rec_clean %>%
  filter(outcome == 1) %>%
  group_by(gender, age_group) %>%
  summarize(dead_patients = length(outcome)) %>%
  arrange(desc(dead_patients)) %>%
  ggplot(aes(age_group, dead_patients, fill = factor(gender))) +
  geom_col(position = "dodge")+
  scale_fill_manual(values = c("tomato2", "thistle4"),
                    labels = c("Male", "Female"))+
  labs(x = "Age group",
       y = "Dead patients",
       fill = "Gender",
       title = "Dead counts across the different gender")+
  expand_limits(y = c(0, 40))+
  theme_bw()

The chart shows the dead patients according to age group and gender. The age group 80-100 years for females have the highest count of death.

Gender Frequency

The gender with the highest count in the clinic is the female gender as show below

health_rec_clean %>%
  select(gender) %>%
  mutate(gender = ifelse(gender == 1, "Male", "Female")) %>%
  group_by(gender) %>%
  count() %>%
  ggplot(aes(gender, n))+
  geom_col(fill = c("tan3", "wheat4"))+
  ggtitle("Frequency of Patients According to Their Gender")+
  theme_bw()

#what is the death rate of both gender for a given age group?

health_rec_clean %>%
  mutate(outcome = ifelse(outcome == 0, "Alive", "Dead"),
         gender = ifelse(gender == 1, "Male", "Female")) %>%
  group_by(gender, outcome, age_group) %>%
  count() %>%
  pivot_wider(names_from = outcome,
              values_from = n,
              values_fill = 0) %>%
  mutate(total = sum(Alive, Dead),
         death_per_100 = Dead/100,
         death_ratio = round(Dead/total, 2))

ABCDEFGHIJ0123456789

gender <chr>	age_group <fct>	Alive <int>	Dead <int>	total <int>	death_per_100 <dbl>	death_ratio <dbl>
Female	0-20	1	0	1	0.00	0.00
Female	20-40	5	0	5	0.00	0.00
Female	40-60	76	9	85	0.09	0.11
Female	60-80	231	33	264	0.33	0.12
Female	80-100	226	37	263	0.37	0.14
Male	0-20	1	0	1	0.00	0.00
Male	20-40	11	2	13	0.02	0.15
Male	40-60	76	10	86	0.10	0.12
Male	60-80	219	34	253	0.34	0.13
Male	80-100	171	34	205	0.34	0.17

Disease and Death

Now we will investigate the 159 death occurrences that is present in the clinic and relate it to some of the diseases.

dead_patients <- health_rec_clean %>%
  filter(outcome == 1) %>%
  select(outcome:COPD)

Tidying the data for visualization

dead_patients_tidy <- dead_patients %>%
  pivot_longer(cols = c(hypertensive:COPD),
               names_to = "diseases",
               values_to = "state") %>%
  mutate(outcome = ifelse(outcome == 0, "Alive", "Dead"),
         gender = ifelse(gender == 1, "Male", "Female"),
         state = ifelse(diseases == "Renal_failure", "Not having", "Having"),
         state = ifelse(diseases != "Renal_failure", "Having", "Not Having"))

dead_patients_tidy %>%
  ggplot(aes(state, fill = gender))+
  geom_bar(position = "dodge")+
  scale_fill_manual(values = c("wheat4", "tan3"))+
  facet_wrap(~diseases, scales = "free_y")+
  theme_bw()

All dead patients seems to be having a disease or health condition, while Renal_failure have not been related to death.

Correlation Analysis

# Load corrplot package for correlation matrix
library('corrplot')

## corrplot 0.92 loaded

# Correlation for numerical variables
correlation_plot <- cor(health_rec_clean[,c(1:2, 13:49)])

# Visualize correlation
ggplot(melt(correlation_plot), aes(Var1, Var2, fill = value, label = round(value,2)))+
  geom_tile()+
  geom_text(aes(label= ifelse(value>0.1, as.character(round(value, 1)), "")))+
  scale_fill_gradient(low = "tomato",
                      high = "skyblue")+
  labs(title = "Correlation plot of variables")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Outcome Prediction

Machine Learning Algorithms to be used We are going to predict the outcome of any one patients making use of four machine learning algorithms viz:

Logistic Regression
Random Forest Classifier
Support Vector Machines (SVMs)
Gradient Boosting Algorithms (XGboosts)

Models will be evaluated using:

ROC curve
Confusion Matrix
Precision scores While these are used, Regression related metrics such as MAE, MSE, RMSE and R-squared will also be used based on the algorithm.

Data Splitting

Split the data and set seed.

# Load caret package for data spliting
library("caret")

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

# Set seed to ensure reproducibility
set.seed(1599)

# Split the data to train and test data
training_data_index <- createDataPartition(health_rec_clean$outcome,
                    p = 0.8,
                    list = F)

# Training data
health_rec_train <- health_rec_clean[training_data_index, ]
# Test data
health_rec_test <- health_rec_clean[-training_data_index,]

# Remove id number and age_group columns
health_rec_train <- health_rec_train %>%
  select(-c(1, 51))

health_rec_test <- health_rec_test %>%
  select(-c(1, 51))

Preview split data

head(health_rec_test, n = 5)

ABCDEFGHIJ0123456789

outcome <dbl>	age <dbl>	gender <dbl>	hypertensive <dbl>	atrialfibrillation <dbl>	diabetes <dbl>	deficiencyanemias <dbl>
0	61	2	1	1	1	0
0	70	2	1	0	1	0
0	87	2	1	0	1	0
0	89	2	0	0	0	1
1	89	1	1	1	0	0

head(health_rec_train, n = 5)

ABCDEFGHIJ0123456789

age <dbl>	gender <dbl>	hypertensive <dbl>	diabetes <dbl>	deficiencyanemias <dbl>
72	1	0	1	1
75	2	0	0	1
83	2	0	0	1
43	2	0	0	0
75	2	1	0	1

Model Building and Tuning

Logistic Regression

####Training the model

logistic_model <- glm(formula = outcome ~ .,
                      family = "binomial",
                      data = health_rec_train)

Ten variables blood calcium, urea nitrogen, creatinine, platelets, leucocytes, PCO2, heart rate, diastolic blood pressure, COPD, and renal failure have a significant effect on the outcome.

Making Predictions

logistic_pred <- ifelse(predict(logistic_model, health_rec_test) >= 0.5, 1, 0)

mean(health_rec_test$outcome == logistic_pred)

## [1] 0.8808511

The logistic model is having a 88.1% accuracy.

Random Forest Classifier

Random forest classifier uses multiple decision tree models and try to mitigate the overfitting from a single decision tree by aggregating the decision from multiple decision trees.

# load randomForest library
library("randomForest")

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

Hypertuning parameter for Random Forest

# Create parameter grid for search values

param_grid <- expand.grid(mtry = c(50, 200, 300,500, 700))

# define control function for tuning
control <- trainControl(method = "cv", number = 5) # Use 5 folds cross-validation

model <- train(factor(outcome) ~ ., data = health_rec_train, method = "rf", trControl = control, tuneGrid = param_grid)
print(model)

## Random Forest 
## 
## 941 samples
##  48 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 752, 754, 752, 753, 753 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    50   0.8767351  0.2895012
##   200   0.8746017  0.2697694
##   300   0.8809792  0.3021321
##   500   0.8767351  0.2762276
##   700   0.8767181  0.2821706
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 300.

Training the Random Forest Model

The best parameter for ntree = 300 will be used for the model.

rf_model <- randomForest(outcome ~ .,
                         data = health_rec_train,
                         ntree = 300) # train rf model with best parameter from hypertuning

Predicting the outcome

rf_pred <- ifelse(predict(rf_model, health_rec_test, type = "response") >= 0.5, 1, 0)

mean(health_rec_test$outcome == rf_pred)

## [1] 0.8765957

The model is having 87.6% accuracy a bit lower than the logistic regression model.

Support Vector Machines (SVM)

# mutate outcome to factor type
health_rec_train2 <- health_rec_train %>%
  mutate(outcome = factor(outcome))

health_rec_test2 <- health_rec_test %>%
  mutate(outcome = factor(outcome))
svm_model <- svm(outcome ~ .,
                 data = health_rec_train2,
                 kernel = "linear")

summary(svm_model)

## 
## Call:
## svm(formula = outcome ~ ., data = health_rec_train2, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  261
## 
##  ( 140 121 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Predicting with the outcome with SVM

svm_pred <- predict(svm_model, health_rec_test2)

mean(health_rec_test2$outcome == svm_pred)

## [1] 0.8808511

The model is having 88.1% accuracy, similar to logistic regression

XGBoost

Next we will use the xgboost package

#load the xgboost library
library(xgboost)

XGBoost data Preparation

We create the xgboost data type

# Create XGBosst data structure for test and train data
dtrain <- xgb.DMatrix(data = as.matrix(health_rec_train[, -1]), label = as.matrix(health_rec_train[, 1]))
dtest <- xgb.DMatrix(data = as.matrix(health_rec_test[, -1]))

Training the Data

Next we train the data

# Set parameter structure
params <- list(
  objective = "binary:logistic",
  max_depth = 10,
  eval_metric = "logloss"
)
# Train the XGBoost model
xgb_model <- xgboost(params = params,
                     data = dtrain,
                     nthread = 2,
                     nrounds = 200)

Predicting with the Model

Now we can make predictions

xgboost_pred <- ifelse(predict(xgb_model, dtest, type = "class") >=0.5, 1, 0)

mean(xgboost_pred == health_rec_test$outcome)

## [1] 0.9021277

The model is having a 90% accuracy which is the best performance so far.

Model Evaluation

The confusion matrix is used as the only evaluation method for this study. ### Logistic Regression

table(log_predicted = logistic_pred, actual = health_rec_test$outcome)

##              actual
## log_predicted   0   1
##             0 204  27
##             1   1   3

The model performs poorly at predicting dead individuals. It is having more false positives, predicting that some patients are alive instead dead.

Random Forests Classifier

table(random_forest_predicted = rf_pred, actual = health_rec_test$outcome)

##                        actual
## random_forest_predicted   0   1
##                       0 203  27
##                       1   2   3

This performance is also similar to logistic model, it is also having more false positives, predicting that some individuals are dead instead of alive.

SVM

table(prediction = svm_pred, actual = health_rec_test2$outcome)

##           actual
## prediction   0   1
##          0 204  27
##          1   1   3

XGBoosts

table(xgboost_pred, health_rec_test$outcome)

##             
## xgboost_pred   0   1
##            0 203  21
##            1   2   9

Predictions by XGBoost is the best so far, it performs better in predicting actual death than other models

Conclusion

Four models, XGBoost, logistic regression, SVM and random forest, were tested to see which is having the best outcome. XGBoost is having the best prediction with 90% probability of success.

Back to Homepage

Predicting Health Outcome (Classification Model)

Olamide Adu

2023-03-25

Retrospective Health-data Analysis

The Data

Description of the Data

Data Definition

Data Preparation and Preprocessing

Explanatory Variable and Dependent Variable

The Data Structure

Missing values

Categorical Varaibles

Numerical Variables distribution

Distibution of Numerical Variables

Right-skewed Variables

Left-skewed Variables

Imputing Missing Values

Exploratory Data Analysis

Descriptive Statistics

What is the percentage age distribution according to age groups

Top Occuring ages in the Hospital

What is the number of dead patients for each gender*

Gender Frequency

Disease and Death

Correlation Analysis

Outcome Prediction

Data Splitting

Model Building and Tuning

Logistic Regression

Making Predictions

Random Forest Classifier

Hypertuning parameter for Random Forest

Training the Random Forest Model

Predicting the outcome

Support Vector Machines (SVM)

Predicting with the outcome with SVM

XGBoost

XGBoost data Preparation

Training the Data

Predicting with the Model

Model Evaluation

Random Forests Classifier

SVM

XGBoosts

Conclusion