Retrospective exploratory data analysis of intensive care unit health-data downloaded from kaggle.com already preprocessed by user @saurabhshahane.
NB: The data used is part of a whole data from MIMIC III original database.
Mainly  and 
We load the libraries for reading and analyzing the data first.
library("reshape2")
# Load the tidyverse library for data manipulation and visualization
library('tidyverse')Next we import the data.frame
# Import the data
health_rec <- read_csv('https://raw.githubusercontent.com/xrander/Health-data/master/Health_data.csv')
# Preview the first ten rows of the data
head(health_rec, n = 10)The MIMIC-III database is publicly a available critical care database
containing de-identified data on 46520 patients and 58976 admissions to
the ICU of the Beth Israel Deaconess Medical Center, Boston, USA between
2001 and 2008. The data used for this study is limited to a bit more
than 1000 patient only. 
group_num
ID: The patient ID
outcome: 0 = Alive and 1 = Dead
age: the age of the patient
gender: 1 = Male and 2 = Female
BMI: Body Mass Index
hypertensive: 0 = with and 1 = without
atrialfibrillation: 0 = with and 1 = without
CHD_with_no_MI: Coronary heart disease with Myocardial infarction, 0 = with and 1 without
diabetes: 0 = with and 1 without
deficiencyanemias: 0 = with and 1 = without
depression: 0 = with and 1 = without
Hyperlipemia:this refers to high concentration of fats or lipids in the blood. 0 = with and 1 = without
Renal_failure: 0 = without renal failure and 1 = with renal failure
COPD: chronic obstructive pulmonary disease(A disease causing airflow blockage and breathing related problems). 0 = without and 1 = without
heart_rate: a range of 60 to 100 bpm is considered normal
Systolic_blood_pressure: measure of the pressure in arteries when the heart beats.
Diastolic_blood_pressure: measure of the pressure in the arteries when the heart rests between beats
Respiratory_rate: this is the number of breathes per minute. The normal range is between 12-20, anything under above or below this range is abnormal. 25 is considered too high.
temperature
SP_O2: Oxygen saturation (measure of how much oxygen the blood carries as a percentage of the maximum it could carry). The normal range for healthy individuals is between 96% to 99%
Urine_output
hematocrit: percentage volume of red blood cells in the blood.
RBC: acronym for Red Blood Count.
MCH: Mean Corpuscular Hemoglobin, this is the average amount of hemoglobin in each of the red blood cells
MCHC: Mean Cell Hemoglobin Concentration, this is the average concentration of hemoglobin in a given volume of blood.
MCV: Mean Corpuscular Volume, this is the measure of the average size of the red clood cells
RDW: red cell distribution width. it is the differences in the volume and size of the red blood cells
Leucocyte
Platelets
Neutrophils
Basophils
Lymphocyte
PT: measure of the time it takes the liquid portion of the blood to clot.
INR
NT_proBNP: a measure of heart failure.
Creatine_kinase
Creatinine
Urea_nitrogen
glucose
Blood_potassium
Blood_sodium
Blood_calcium
Chloride
Anion_gap: measures the difference between the negatively and positively charged electrolytes in the blood.
Magnesium_ion
PH
Bicarbonate
Lactic_acid
PCO2: Partial Pressure of Carbon Dioxide, this measures the carbondioxide within the arterial or venous blood.
EF: Ejection Fraction: Used to gauge how healthy the heart is. It is the amount of blood that the heart pumps each time it beats.
Note: The PostgreSQL code solution to the descriptive statistic questions asked are here
The dependent variable for this analysis is the outcome, while other variables are the explanatory variable.
##Understanding the Data
It is important we understand the data
## spc_tbl_ [1,177 × 51] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ group_num               : num [1:1177] 1 1 1 1 1 1 1 1 1 1 ...
##  $ ID                      : num [1:1177] 125047 139812 109787 130587 138290 ...
##  $ outcome                 : num [1:1177] 0 0 0 0 0 0 0 0 0 0 ...
##  $ age                     : num [1:1177] 72 75 83 43 75 76 72 83 61 67 ...
##  $ gender                  : num [1:1177] 1 2 2 2 2 1 1 2 2 1 ...
##  $ BMI                     : num [1:1177] 37.6 NA 26.6 83.3 31.8 ...
##  $ hypertensive            : num [1:1177] 0 0 0 0 1 1 1 1 1 1 ...
##  $ atrialfibrillation      : num [1:1177] 0 0 0 0 0 1 0 1 1 0 ...
##  $ CHD_with_no_MI          : num [1:1177] 0 0 0 0 0 0 0 0 0 0 ...
##  $ diabetes                : num [1:1177] 1 0 0 0 0 0 0 1 1 1 ...
##  $ deficiencyanemias       : num [1:1177] 1 1 1 0 1 1 0 1 0 0 ...
##  $ depression              : num [1:1177] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Hyperlipemia            : num [1:1177] 1 0 0 0 0 1 1 0 0 0 ...
##  $ Renal_failure           : num [1:1177] 1 0 1 0 1 1 1 0 1 0 ...
##  $ COPD                    : num [1:1177] 0 1 0 0 1 1 1 0 0 0 ...
##  $ heart_rate              : num [1:1177] 68.8 101.4 72.3 94.5 67.9 ...
##  $ Systolic_blood_pressure : num [1:1177] 156 140 135 126 157 ...
##  $ Diastolic_blood_pressure: num [1:1177] 68.3 65 61.4 73.2 58.1 ...
##  $ Respiratory_rate        : num [1:1177] 16.6 20.9 23.6 21.9 21.4 ...
##  $ temperature             : num [1:1177] 36.7 36.7 36.5 36.3 36.8 ...
##  $ SP_O2                   : num [1:1177] 98.4 96.9 95.3 93.8 99.3 ...
##  $ Urine_output            : num [1:1177] 2155 1425 2425 8760 4455 ...
##  $ hematocrit              : num [1:1177] 26.3 30.8 27.7 36.6 29.9 ...
##  $ RBC                     : num [1:1177] 2.96 3.14 2.62 4.28 3.29 ...
##  $ MCH                     : num [1:1177] 28.2 31.1 34.3 26.1 30.7 ...
##  $ MCHC                    : num [1:1177] 31.5 31.7 31.3 30.4 33.7 ...
##  $ MCV                     : num [1:1177] 89.9 98.2 109.8 85.6 91 ...
##  $ RDW                     : num [1:1177] 16.2 14.3 23.8 17 16.3 ...
##  $ Leucocyte               : num [1:1177] 7.65 12.74 5.48 8.22 8.83 ...
##  $ Platelets               : num [1:1177] 305 246 204 216 251 ...
##  $ Neutrophils             : num [1:1177] 74.7 NA 68.1 81.8 NA ...
##  $ Basophils               : num [1:1177] 0.4 NA 0.55 0.15 NA 0.3 0.2 NA 0.55 NA ...
##  $ Lymphocyte              : num [1:1177] 13.3 NA 24.5 14.5 NA ...
##  $ PT                      : num [1:1177] 10.6 NA 11.3 27.1 NA ...
##  $ INR                     : num [1:1177] 1 NA 0.95 2.67 NA ...
##  $ NT_proBNP               : num [1:1177] 1956 2384 4081 668 30802 ...
##  $ Creatine_kinase         : num [1:1177] 148 60.6 16 85 111.7 ...
##  $ Creatinine              : num [1:1177] 1.958 1.122 1.871 0.586 1.95 ...
##  $ Urea_nitrogen           : num [1:1177] 50 20.3 33.9 15.3 43 ...
##  $ glucose                 : num [1:1177] 115 148 149 128 146 ...
##  $ Blood_potassium         : num [1:1177] 4.82 4.45 5.83 4.39 4.78 ...
##  $ Blood_sodium            : num [1:1177] 139 139 141 138 137 ...
##  $ Blood_calcium           : num [1:1177] 7.46 8.16 8.27 9.48 8.73 ...
##  $ Chloride                : num [1:1177] 109.2 98.4 105.9 92.1 104.5 ...
##  $ Anion_gap               : num [1:1177] 13.2 11.4 10 12.4 15.2 ...
##  $ Magnesium_ion           : num [1:1177] 2.62 1.89 2.16 1.94 1.65 ...
##  $ PH                      : num [1:1177] 7.23 7.22 7.27 7.37 7.25 ...
##  $ Bicarbonate             : num [1:1177] 21.2 33.4 30.6 38.6 22 ...
##  $ Lactic_acid             : num [1:1177] 0.5 0.5 0.5 0.6 0.6 ...
##  $ PCO2                    : num [1:1177] 40 78 71.5 75 50 ...
##  $ EF                      : num [1:1177] 55 55 35 55 55 35 55 75 50 55 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   group_num = col_double(),
##   ..   ID = col_double(),
##   ..   outcome = col_double(),
##   ..   age = col_double(),
##   ..   gender = col_double(),
##   ..   BMI = col_double(),
##   ..   hypertensive = col_double(),
##   ..   atrialfibrillation = col_double(),
##   ..   CHD_with_no_MI = col_double(),
##   ..   diabetes = col_double(),
##   ..   deficiencyanemias = col_double(),
##   ..   depression = col_double(),
##   ..   Hyperlipemia = col_double(),
##   ..   Renal_failure = col_double(),
##   ..   COPD = col_double(),
##   ..   heart_rate = col_double(),
##   ..   Systolic_blood_pressure = col_double(),
##   ..   Diastolic_blood_pressure = col_double(),
##   ..   Respiratory_rate = col_double(),
##   ..   temperature = col_double(),
##   ..   SP_O2 = col_double(),
##   ..   Urine_output = col_double(),
##   ..   hematocrit = col_double(),
##   ..   RBC = col_double(),
##   ..   MCH = col_double(),
##   ..   MCHC = col_double(),
##   ..   MCV = col_double(),
##   ..   RDW = col_double(),
##   ..   Leucocyte = col_double(),
##   ..   Platelets = col_double(),
##   ..   Neutrophils = col_double(),
##   ..   Basophils = col_double(),
##   ..   Lymphocyte = col_double(),
##   ..   PT = col_double(),
##   ..   INR = col_double(),
##   ..   NT_proBNP = col_double(),
##   ..   Creatine_kinase = col_double(),
##   ..   Creatinine = col_double(),
##   ..   Urea_nitrogen = col_double(),
##   ..   glucose = col_double(),
##   ..   Blood_potassium = col_double(),
##   ..   Blood_sodium = col_double(),
##   ..   Blood_calcium = col_double(),
##   ..   Chloride = col_double(),
##   ..   Anion_gap = col_double(),
##   ..   Magnesium_ion = col_double(),
##   ..   PH = col_double(),
##   ..   Bicarbonate = col_double(),
##   ..   Lactic_acid = col_double(),
##   ..   PCO2 = col_double(),
##   ..   EF = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>At first glance, all the variables are numeric type. There are 1177 observations and 51 variables. Some of the variables of the data does not follow the data definition. A quick descriptive summary of the data shows this.
##    group_num           ID            outcome            age       
##  Min.   :1.000   Min.   :100213   Min.   :0.0000   Min.   :19.00  
##  1st Qu.:1.000   1st Qu.:125603   1st Qu.:0.0000   1st Qu.:65.00  
##  Median :1.000   Median :151901   Median :0.0000   Median :77.00  
##  Mean   :1.299   Mean   :150778   Mean   :0.1352   Mean   :74.06  
##  3rd Qu.:2.000   3rd Qu.:176048   3rd Qu.:0.0000   3rd Qu.:85.00  
##  Max.   :2.000   Max.   :199952   Max.   :1.0000   Max.   :99.00  
##                                   NA's   :1                       
##      gender           BMI          hypertensive    atrialfibrillation
##  Min.   :1.000   Min.   : 13.35   Min.   :0.0000   Min.   :0.0000    
##  1st Qu.:1.000   1st Qu.: 24.33   1st Qu.:0.0000   1st Qu.:0.0000    
##  Median :2.000   Median : 28.31   Median :1.0000   Median :0.0000    
##  Mean   :1.525   Mean   : 30.19   Mean   :0.7179   Mean   :0.4511    
##  3rd Qu.:2.000   3rd Qu.: 33.63   3rd Qu.:1.0000   3rd Qu.:1.0000    
##  Max.   :2.000   Max.   :104.97   Max.   :1.0000   Max.   :1.0000    
##                  NA's   :215                                         
##  CHD_with_no_MI       diabetes      deficiencyanemias   depression    
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.000     Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000     1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.000     Median :0.0000  
##  Mean   :0.08581   Mean   :0.4214   Mean   :0.339     Mean   :0.1189  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.000     3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.000     Max.   :1.0000  
##                                                                       
##   Hyperlipemia    Renal_failure         COPD           heart_rate    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   : 36.00  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.: 72.37  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median : 83.61  
##  Mean   :0.3798   Mean   :0.3653   Mean   :0.07562   Mean   : 84.58  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.: 95.91  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :135.71  
##                                                      NA's   :13      
##  Systolic_blood_pressure Diastolic_blood_pressure Respiratory_rate
##  Min.   : 75.0           Min.   : 24.74           Min.   :11.14   
##  1st Qu.:105.4           1st Qu.: 52.17           1st Qu.:17.93   
##  Median :116.1           Median : 58.46           Median :20.37   
##  Mean   :118.0           Mean   : 59.53           Mean   :20.80   
##  3rd Qu.:128.6           3rd Qu.: 65.46           3rd Qu.:23.39   
##  Max.   :203.0           Max.   :107.00           Max.   :40.90   
##  NA's   :16              NA's   :16               NA's   :13      
##   temperature        SP_O2         Urine_output    hematocrit   
##  Min.   :33.25   Min.   : 75.92   Min.   :   0   Min.   :20.31  
##  1st Qu.:36.29   1st Qu.: 95.00   1st Qu.: 980   1st Qu.:28.16  
##  Median :36.65   Median : 96.45   Median :1675   Median :30.80  
##  Mean   :36.68   Mean   : 96.27   Mean   :1899   Mean   :31.91  
##  3rd Qu.:37.02   3rd Qu.: 97.92   3rd Qu.:2500   3rd Qu.:35.01  
##  Max.   :39.13   Max.   :100.00   Max.   :8820   Max.   :55.42  
##  NA's   :19      NA's   :13       NA's   :36                    
##       RBC             MCH             MCHC            MCV        
##  Min.   :2.030   Min.   :18.12   Min.   :27.82   Min.   : 62.60  
##  1st Qu.:3.120   1st Qu.:28.25   1st Qu.:32.01   1st Qu.: 86.25  
##  Median :3.490   Median :29.75   Median :32.99   Median : 90.00  
##  Mean   :3.575   Mean   :29.54   Mean   :32.86   Mean   : 89.90  
##  3rd Qu.:3.900   3rd Qu.:31.24   3rd Qu.:33.83   3rd Qu.: 93.86  
##  Max.   :6.575   Max.   :40.31   Max.   :37.01   Max.   :116.71  
##                                                                  
##       RDW          Leucocyte       Platelets         Neutrophils   
##  Min.   :12.09   Min.   : 0.10   Min.   :   9.571   Min.   : 5.00  
##  1st Qu.:14.46   1st Qu.: 7.44   1st Qu.: 168.909   1st Qu.:74.78  
##  Median :15.51   Median : 9.68   Median : 222.667   Median :82.47  
##  Mean   :15.95   Mean   :10.71   Mean   : 241.504   Mean   :80.11  
##  3rd Qu.:16.94   3rd Qu.:12.74   3rd Qu.: 304.250   3rd Qu.:87.45  
##  Max.   :29.05   Max.   :64.75   Max.   :1028.200   Max.   :98.00  
##                                                     NA's   :144    
##    Basophils        Lymphocyte            PT             INR        
##  Min.   :0.1000   Min.   : 0.9667   Min.   :10.10   Min.   :0.8714  
##  1st Qu.:0.2000   1st Qu.: 6.6500   1st Qu.:13.16   1st Qu.:1.1400  
##  Median :0.3000   Median :10.4750   Median :14.63   Median :1.3000  
##  Mean   :0.4056   Mean   :12.2330   Mean   :17.48   Mean   :1.6255  
##  3rd Qu.:0.5000   3rd Qu.:15.4625   3rd Qu.:18.80   3rd Qu.:1.7364  
##  Max.   :8.8000   Max.   :83.5000   Max.   :71.27   Max.   :8.3429  
##  NA's   :259      NA's   :145       NA's   :20      NA's   :20      
##    NT_proBNP      Creatine_kinase      Creatinine      Urea_nitrogen    
##  Min.   :    50   Min.   :    8.00   Min.   : 0.2667   Min.   :  5.357  
##  1st Qu.:  2251   1st Qu.:   46.00   1st Qu.: 0.9400   1st Qu.: 20.833  
##  Median :  5840   Median :   89.25   Median : 1.2875   Median : 30.667  
##  Mean   : 11014   Mean   :  246.78   Mean   : 1.6428   Mean   : 36.298  
##  3rd Qu.: 14968   3rd Qu.:  185.19   3rd Qu.: 1.9000   3rd Qu.: 45.250  
##  Max.   :118928   Max.   :42987.50   Max.   :15.5273   Max.   :161.750  
##                   NA's   :165                                           
##     glucose       Blood_potassium  Blood_sodium   Blood_calcium   
##  Min.   : 66.67   Min.   :3.000   Min.   :114.7   Min.   : 6.700  
##  1st Qu.:113.94   1st Qu.:3.900   1st Qu.:136.7   1st Qu.: 8.149  
##  Median :136.40   Median :4.115   Median :139.2   Median : 8.500  
##  Mean   :148.80   Mean   :4.177   Mean   :138.9   Mean   : 8.501  
##  3rd Qu.:169.50   3rd Qu.:4.400   3rd Qu.:141.6   3rd Qu.: 8.869  
##  Max.   :414.10   Max.   :6.567   Max.   :154.7   Max.   :10.950  
##  NA's   :18                                       NA's   :1       
##     Chloride        Anion_gap      Magnesium_ion         PH       
##  Min.   : 80.27   Min.   : 6.636   Min.   :1.400   Min.   :7.090  
##  1st Qu.: 99.00   1st Qu.:12.250   1st Qu.:1.956   1st Qu.:7.335  
##  Median :102.50   Median :13.667   Median :2.092   Median :7.380  
##  Mean   :102.28   Mean   :13.925   Mean   :2.120   Mean   :7.379  
##  3rd Qu.:105.57   3rd Qu.:15.417   3rd Qu.:2.242   3rd Qu.:7.430  
##  Max.   :122.53   Max.   :25.500   Max.   :4.073   Max.   :7.580  
##                                                    NA's   :292    
##   Bicarbonate     Lactic_acid         PCO2             EF       
##  Min.   :12.86   Min.   :0.500   Min.   :18.75   Min.   :15.00  
##  1st Qu.:23.45   1st Qu.:1.200   1st Qu.:37.04   1st Qu.:40.00  
##  Median :26.50   Median :1.600   Median :43.00   Median :55.00  
##  Mean   :26.91   Mean   :1.853   Mean   :45.54   Mean   :48.72  
##  3rd Qu.:29.88   3rd Qu.:2.200   3rd Qu.:50.59   3rd Qu.:55.00  
##  Max.   :47.67   Max.   :8.333   Max.   :98.60   Max.   :75.00  
##                  NA's   :229     NA's   :294We can expect two data type according to the data description. The first is categorical variables for data which min is 0 and max is 1, and the other is numeric variables.
The descriptive statistics also shows the min and max age as 19 and 99.
Next, we investigate the data for missing values. It is alright to have some missing values for some variables, but the outcome should be known, as it is important to see if patients are dead or alive. Outcome will be used to handle the missing data.
There is no missing data present in the outcome of patients. The data
have been reduced from 1177 to dim(health_rec)[1]. To deal
with missing values from the other variables other than outcome, we need
to know the missing value for the categorical and numeric variable and
take the necessary steps.
##      gender hypertensive atrialfibrillation CHD_with_no_MI diabetes
## [1,]  FALSE        FALSE              FALSE          FALSE    FALSE
##      deficiencyanemias depression Hyperlipemia Renal_failure  COPD
## [1,]             FALSE      FALSE        FALSE         FALSE FALSEThe categorical variables do not have missing values.
For the numerical variable, the descriptive statistics carried out earlier shows that some numerical variables contains missing values
## tibble [1,176 × 37] (S3: tbl_df/tbl/data.frame)
##  $ BMI                     : num [1:1176] 37.6 NA 26.6 83.3 31.8 ...
##  $ heart_rate              : num [1:1176] 68.8 101.4 72.3 94.5 67.9 ...
##  $ Systolic_blood_pressure : num [1:1176] 156 140 135 126 157 ...
##  $ Diastolic_blood_pressure: num [1:1176] 68.3 65 61.4 73.2 58.1 ...
##  $ Respiratory_rate        : num [1:1176] 16.6 20.9 23.6 21.9 21.4 ...
##  $ temperature             : num [1:1176] 36.7 36.7 36.5 36.3 36.8 ...
##  $ SP_O2                   : num [1:1176] 98.4 96.9 95.3 93.8 99.3 ...
##  $ Urine_output            : num [1:1176] 2155 1425 2425 8760 4455 ...
##  $ hematocrit              : num [1:1176] 26.3 30.8 27.7 36.6 29.9 ...
##  $ RBC                     : num [1:1176] 2.96 3.14 2.62 4.28 3.29 ...
##  $ MCH                     : num [1:1176] 28.2 31.1 34.3 26.1 30.7 ...
##  $ MCHC                    : num [1:1176] 31.5 31.7 31.3 30.4 33.7 ...
##  $ MCV                     : num [1:1176] 89.9 98.2 109.8 85.6 91 ...
##  $ RDW                     : num [1:1176] 16.2 14.3 23.8 17 16.3 ...
##  $ Leucocyte               : num [1:1176] 7.65 12.74 5.48 8.22 8.83 ...
##  $ Platelets               : num [1:1176] 305 246 204 216 251 ...
##  $ Neutrophils             : num [1:1176] 74.7 NA 68.1 81.8 NA ...
##  $ Basophils               : num [1:1176] 0.4 NA 0.55 0.15 NA 0.3 0.2 NA 0.55 NA ...
##  $ Lymphocyte              : num [1:1176] 13.3 NA 24.5 14.5 NA ...
##  $ PT                      : num [1:1176] 10.6 NA 11.3 27.1 NA ...
##  $ INR                     : num [1:1176] 1 NA 0.95 2.67 NA ...
##  $ NT_proBNP               : num [1:1176] 1956 2384 4081 668 30802 ...
##  $ Creatine_kinase         : num [1:1176] 148 60.6 16 85 111.7 ...
##  $ Creatinine              : num [1:1176] 1.958 1.122 1.871 0.586 1.95 ...
##  $ Urea_nitrogen           : num [1:1176] 50 20.3 33.9 15.3 43 ...
##  $ glucose                 : num [1:1176] 115 148 149 128 146 ...
##  $ Blood_potassium         : num [1:1176] 4.82 4.45 5.83 4.39 4.78 ...
##  $ Blood_sodium            : num [1:1176] 139 139 141 138 137 ...
##  $ Blood_calcium           : num [1:1176] 7.46 8.16 8.27 9.48 8.73 ...
##  $ Chloride                : num [1:1176] 109.2 98.4 105.9 92.1 104.5 ...
##  $ Anion_gap               : num [1:1176] 13.2 11.4 10 12.4 15.2 ...
##  $ Magnesium_ion           : num [1:1176] 2.62 1.89 2.16 1.94 1.65 ...
##  $ PH                      : num [1:1176] 7.23 7.22 7.27 7.37 7.25 ...
##  $ Bicarbonate             : num [1:1176] 21.2 33.4 30.6 38.6 22 ...
##  $ Lactic_acid             : num [1:1176] 0.5 0.5 0.5 0.6 0.6 ...
##  $ PCO2                    : num [1:1176] 40 78 71.5 75 50 ...
##  $ EF                      : num [1:1176] 55 55 35 55 55 35 55 75 50 55 ...We are going to impute the missing values in the numerical variables using their mean. We use the e1071 package impute() function to impute the means here.
The new data will be tidied to visualize the distribution of the numerical variables.
as_tibble(imputed_means) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(show.legend = F)+
  geom_density(aes(value), col = "tomato2")+
  facet_wrap(~variable, scales = "free")+
  theme_minimal()
###### Normally distributed Variables Variables provides information on
the body’s electrolyte and Mineral content such as anion gap,
bicarbonate, blood calcium, blood potassium, blood sodium and chloride
are normally distributed. Vital signs and Cardiovascular parameters such
as temperature, PH, Diastolic blood pressure, and heart rate are also
normally distributed. Only three Hematology parameters, MCH, MCHC and
MCV are normally distributed.
Respiratory rate, RBW, RBC, urea nitrogen, urine output, basophils, BMI, creatine-Kinase, creatinine, glucose, hematocrit, INR, lactic acid, leucocyte, lymphocyte, magnesium ion, NT Pro BMP, PCO2, platelets, PT, and systolic blood pressure are all right skewed.
Only three variables are left skewed. These includes neutrophils, EF, and SP O2.
Next we replace the old values with new ones in the main data frame. In the process, the group_num and ID column were also removed.
health_rec_clean <- health_rec %>%
  select(-c(1,2),-c(names(numerical))) %>% # remove unwanted variables and old numerical variables
  bind_cols(imputed_means) %>% # add imputed values columns to the data frame
  mutate(id = row_number()) %>% # add row numbers to the data frame
  select(id, everything()) # rearrange variables to make id come first
tail(health_rec_clean, 10)Now we check to find if there are still missing values.
##         id outcome   age gender hypertensive atrialfibrillation CHD_with_no_MI
## [1,] FALSE   FALSE FALSE  FALSE        FALSE              FALSE          FALSE
##      diabetes deficiencyanemias depression Hyperlipemia Renal_failure  COPD
## [1,]    FALSE             FALSE      FALSE        FALSE         FALSE FALSE
##        BMI heart_rate Systolic_blood_pressure Diastolic_blood_pressure
## [1,] FALSE      FALSE                   FALSE                    FALSE
##      Respiratory_rate temperature SP_O2 Urine_output hematocrit   RBC   MCH
## [1,]            FALSE       FALSE FALSE        FALSE      FALSE FALSE FALSE
##       MCHC   MCV   RDW Leucocyte Platelets Neutrophils Basophils Lymphocyte
## [1,] FALSE FALSE FALSE     FALSE     FALSE       FALSE     FALSE      FALSE
##         PT   INR NT_proBNP Creatine_kinase Creatinine Urea_nitrogen glucose
## [1,] FALSE FALSE     FALSE           FALSE      FALSE         FALSE   FALSE
##      Blood_potassium Blood_sodium Blood_calcium Chloride Anion_gap
## [1,]           FALSE        FALSE         FALSE    FALSE     FALSE
##      Magnesium_ion    PH Bicarbonate Lactic_acid  PCO2    EF
## [1,]         FALSE FALSE       FALSE       FALSE FALSE FALSEAll numerical variable missing values are replaced by their means.
A majority of the question asked are descriptive statistics questions and answers will be provided using visualization and tables.
The age group with the highest frequency: To do this, we have to bin the age variable, creating a class range, then we can create a summary of the age_group and see their frequency
# Create bins
age_categories <- seq(0, 120, by = 20) # We assume the highest age is not above 120 years old
# Create age interval variable
health_rec_clean <- health_rec_clean %>%
  mutate(age_group = cut(age,
                         breaks = age_categories,
                         include.lowest = T, # To indicate if x is the lowest
                         ordered_result = F),#This ensures that the factor variable is not ordered
         age_group =  ifelse(age_group == "[0,20]", "0-20",
                             ifelse(age_group == "(20,40]", "20-40",
                                    ifelse(age_group == "(40,60]", "40-60",
                                           ifelse(age_group == "(60,80]", "60-80", "80-100")))), # Change the group names to ensure consistency
         age_group = factor(age_group, levels = c("0-20", "20-40","40-60", "60-80", "80-100")))health_rec_clean %>%
  group_by(age_group) %>%
  summarize(frequency = n()) %>%
  ggplot(aes("", frequency, fill = age_group))+
  geom_col(color = "white")+
  geom_text(aes(label = paste0(round(frequency/sum(frequency)*100, 1), "%")),
              position = position_stack(vjust = 0.5))+
  scale_fill_manual(values = c("pink", "coral", "tomato1", "orangered1", "indianred1"))+
  labs(x = NULL,
       y = NULL,
       fill = "Age Group",
       title = "Pie Chart of Age Groups")+
  coord_polar(theta = "y", start = 0)+
  theme(legend.position = "top",
        axis.line = element_blank(),
        plot.background = element_blank(),
        rect = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank())Older people are the highest in the hospital, with people between 60 - 80 years old having being the highest in the clinic and people more than 80 years old being the next most occurring patients. To know the age(not age group) that is most occurring, we can simply group by the age them.
health_rec_clean %>%
  group_by(age) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
  top_n(n = 6)The highest count of patients in the clinic are aged 89, 84 and 81 years respectively.
health_rec_clean %>%
  filter(outcome == 1) %>%
  group_by(gender, age_group) %>%
  summarize(dead_patients = length(outcome)) %>%
  arrange(desc(dead_patients)) %>%
  ggplot(aes(age_group, dead_patients, fill = factor(gender))) +
  geom_col(position = "dodge")+
  scale_fill_manual(values = c("tomato2", "thistle4"),
                    labels = c("Male", "Female"))+
  labs(x = "Age group",
       y = "Dead patients",
       fill = "Gender",
       title = "Dead counts across the different gender")+
  expand_limits(y = c(0, 40))+
  theme_bw()The chart shows the dead patients according to age group and gender. The age group 80-100 years for females have the highest count of death.
The gender with the highest count in the clinic is the female gender as show below
health_rec_clean %>%
  select(gender) %>%
  mutate(gender = ifelse(gender == 1, "Male", "Female")) %>%
  group_by(gender) %>%
  count() %>%
  ggplot(aes(gender, n))+
  geom_col(fill = c("tan3", "wheat4"))+
  ggtitle("Frequency of Patients According to Their Gender")+
  theme_bw()#what is the death rate of both gender for a given age group?
health_rec_clean %>%
  mutate(outcome = ifelse(outcome == 0, "Alive", "Dead"),
         gender = ifelse(gender == 1, "Male", "Female")) %>%
  group_by(gender, outcome, age_group) %>%
  count() %>%
  pivot_wider(names_from = outcome,
              values_from = n,
              values_fill = 0) %>%
  mutate(total = sum(Alive, Dead),
         death_per_100 = Dead/100,
         death_ratio = round(Dead/total, 2))Now we will investigate the 159 death occurrences that is present in the clinic and relate it to some of the diseases.
Tidying the data for visualization
dead_patients_tidy <- dead_patients %>%
  pivot_longer(cols = c(hypertensive:COPD),
               names_to = "diseases",
               values_to = "state") %>%
  mutate(outcome = ifelse(outcome == 0, "Alive", "Dead"),
         gender = ifelse(gender == 1, "Male", "Female"),
         state = ifelse(diseases == "Renal_failure", "Not having", "Having"),
         state = ifelse(diseases != "Renal_failure", "Having", "Not Having"))dead_patients_tidy %>%
  ggplot(aes(state, fill = gender))+
  geom_bar(position = "dodge")+
  scale_fill_manual(values = c("wheat4", "tan3"))+
  facet_wrap(~diseases, scales = "free_y")+
  theme_bw()
All dead patients seems to be having a disease or health condition,
while Renal_failure have not been related to death.
## corrplot 0.92 loaded# Correlation for numerical variables
correlation_plot <- cor(health_rec_clean[,c(1:2, 13:49)])
# Visualize correlation
ggplot(melt(correlation_plot), aes(Var1, Var2, fill = value, label = round(value,2)))+
  geom_tile()+
  geom_text(aes(label= ifelse(value>0.1, as.character(round(value, 1)), "")))+
  scale_fill_gradient(low = "tomato",
                      high = "skyblue")+
  labs(title = "Correlation plot of variables")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))Machine Learning Algorithms to be used We are going to predict the outcome of any one patients making use of four machine learning algorithms viz:
Logistic Regression
Random Forest Classifier
Support Vector Machines (SVMs)
Gradient Boosting Algorithms (XGboosts)
Models will be evaluated using:
ROC curve
Confusion Matrix
Precision scores While these are used, Regression related metrics such as MAE, MSE, RMSE and R-squared will also be used based on the algorithm.
Split the data and set seed.
## Loading required package: lattice## 
## Attaching package: 'caret'## The following object is masked from 'package:purrr':
## 
##     lift# Set seed to ensure reproducibility
set.seed(1599)
# Split the data to train and test data
training_data_index <- createDataPartition(health_rec_clean$outcome,
                    p = 0.8,
                    list = F)
# Training data
health_rec_train <- health_rec_clean[training_data_index, ]
# Test data
health_rec_test <- health_rec_clean[-training_data_index,]# Remove id number and age_group columns
health_rec_train <- health_rec_train %>%
  select(-c(1, 51))
health_rec_test <- health_rec_test %>%
  select(-c(1, 51))Preview split data
####Training the model
Ten variables blood calcium, urea nitrogen, creatinine, platelets, leucocytes, PCO2, heart rate, diastolic blood pressure, COPD, and renal failure have a significant effect on the outcome.
Random forest classifier uses multiple decision tree models and try to mitigate the overfitting from a single decision tree by aggregating the decision from multiple decision trees.
## randomForest 4.7-1.1## Type rfNews() to see new features/changes/bug fixes.## 
## Attaching package: 'randomForest'## The following object is masked from 'package:dplyr':
## 
##     combine## The following object is masked from 'package:ggplot2':
## 
##     margin# Create parameter grid for search values
param_grid <- expand.grid(mtry = c(50, 200, 300,500, 700))
# define control function for tuning
control <- trainControl(method = "cv", number = 5) # Use 5 folds cross-validation
model <- train(factor(outcome) ~ ., data = health_rec_train, method = "rf", trControl = control, tuneGrid = param_grid)
print(model)## Random Forest 
## 
## 941 samples
##  48 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 752, 754, 752, 753, 753 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    50   0.8767351  0.2895012
##   200   0.8746017  0.2697694
##   300   0.8809792  0.3021321
##   500   0.8767351  0.2762276
##   700   0.8767181  0.2821706
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 300.The best parameter for ntree = 300 will be used for the model.
# mutate outcome to factor type
health_rec_train2 <- health_rec_train %>%
  mutate(outcome = factor(outcome))
health_rec_test2 <- health_rec_test %>%
  mutate(outcome = factor(outcome))
svm_model <- svm(outcome ~ .,
                 data = health_rec_train2,
                 kernel = "linear")
summary(svm_model)## 
## Call:
## svm(formula = outcome ~ ., data = health_rec_train2, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  261
## 
##  ( 140 121 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1Next we will use the xgboost package
We create the xgboost data type
Next we train the data
The confusion matrix is used as the only evaluation method for this study. ### Logistic Regression
##              actual
## log_predicted   0   1
##             0 204  27
##             1   1   3The model performs poorly at predicting dead individuals. It is having more false positives, predicting that some patients are alive instead dead.
##                        actual
## random_forest_predicted   0   1
##                       0 203  27
##                       1   2   3This performance is also similar to logistic model, it is also having more false positives, predicting that some individuals are dead instead of alive.
##           actual
## prediction   0   1
##          0 204  27
##          1   1   3Four models, XGBoost, logistic regression, SVM and random forest, were tested to see which is having the best outcome. XGBoost is having the best prediction with 90% probability of success.