Retrospective exploratory data analysis of intensive care unit health-data downloaded from kaggle.com already preprocessed by user @saurabhshahane.
NB: The data used is part of a whole data from MIMIC III original database.
Mainly and
We load the libraries for reading and analyzing the data first.
library("reshape2")
# Load the tidyverse library for data manipulation and visualization
library('tidyverse')
Next we import the data.frame
# Import the data
health_rec <- read_csv('https://raw.githubusercontent.com/xrander/Health-data/master/Health_data.csv')
# Preview the first ten rows of the data
head(health_rec, n = 10)
The MIMIC-III database is publicly a available critical care database containing de-identified data on 46520 patients and 58976 admissions to the ICU of the Beth Israel Deaconess Medical Center, Boston, USA between 2001 and 2008. The data used for this study is limited to a bit more than 1000 patient only.
group_num
ID: The patient ID
outcome: 0 = Alive and 1 = Dead
age: the age of the patient
gender: 1 = Male and 2 = Female
BMI: Body Mass Index
hypertensive: 0 = with and 1 = without
atrialfibrillation: 0 = with and 1 = without
CHD_with_no_MI: Coronary heart disease with Myocardial infarction, 0 = with and 1 without
diabetes: 0 = with and 1 without
deficiencyanemias: 0 = with and 1 = without
depression: 0 = with and 1 = without
Hyperlipemia:this refers to high concentration of fats or lipids in the blood. 0 = with and 1 = without
Renal_failure: 0 = without renal failure and 1 = with renal failure
COPD: chronic obstructive pulmonary disease(A disease causing airflow blockage and breathing related problems). 0 = without and 1 = without
heart_rate: a range of 60 to 100 bpm is considered normal
Systolic_blood_pressure: measure of the pressure in arteries when the heart beats.
Diastolic_blood_pressure: measure of the pressure in the arteries when the heart rests between beats
Respiratory_rate: this is the number of breathes per minute. The normal range is between 12-20, anything under above or below this range is abnormal. 25 is considered too high.
temperature
SP_O2: Oxygen saturation (measure of how much oxygen the blood carries as a percentage of the maximum it could carry). The normal range for healthy individuals is between 96% to 99%
Urine_output
hematocrit: percentage volume of red blood cells in the blood.
RBC: acronym for Red Blood Count.
MCH: Mean Corpuscular Hemoglobin, this is the average amount of hemoglobin in each of the red blood cells
MCHC: Mean Cell Hemoglobin Concentration, this is the average concentration of hemoglobin in a given volume of blood.
MCV: Mean Corpuscular Volume, this is the measure of the average size of the red clood cells
RDW: red cell distribution width. it is the differences in the volume and size of the red blood cells
Leucocyte
Platelets
Neutrophils
Basophils
Lymphocyte
PT: measure of the time it takes the liquid portion of the blood to clot.
INR
NT_proBNP: a measure of heart failure.
Creatine_kinase
Creatinine
Urea_nitrogen
glucose
Blood_potassium
Blood_sodium
Blood_calcium
Chloride
Anion_gap: measures the difference between the negatively and positively charged electrolytes in the blood.
Magnesium_ion
PH
Bicarbonate
Lactic_acid
PCO2: Partial Pressure of Carbon Dioxide, this measures the carbondioxide within the arterial or venous blood.
EF: Ejection Fraction: Used to gauge how healthy the heart is. It is the amount of blood that the heart pumps each time it beats.
Note: The PostgreSQL code solution to the descriptive statistic questions asked are here
The dependent variable for this analysis is the outcome, while other variables are the explanatory variable.
##Understanding the Data
It is important we understand the data
## spc_tbl_ [1,177 × 51] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ group_num : num [1:1177] 1 1 1 1 1 1 1 1 1 1 ...
## $ ID : num [1:1177] 125047 139812 109787 130587 138290 ...
## $ outcome : num [1:1177] 0 0 0 0 0 0 0 0 0 0 ...
## $ age : num [1:1177] 72 75 83 43 75 76 72 83 61 67 ...
## $ gender : num [1:1177] 1 2 2 2 2 1 1 2 2 1 ...
## $ BMI : num [1:1177] 37.6 NA 26.6 83.3 31.8 ...
## $ hypertensive : num [1:1177] 0 0 0 0 1 1 1 1 1 1 ...
## $ atrialfibrillation : num [1:1177] 0 0 0 0 0 1 0 1 1 0 ...
## $ CHD_with_no_MI : num [1:1177] 0 0 0 0 0 0 0 0 0 0 ...
## $ diabetes : num [1:1177] 1 0 0 0 0 0 0 1 1 1 ...
## $ deficiencyanemias : num [1:1177] 1 1 1 0 1 1 0 1 0 0 ...
## $ depression : num [1:1177] 0 0 0 0 0 0 0 0 0 0 ...
## $ Hyperlipemia : num [1:1177] 1 0 0 0 0 1 1 0 0 0 ...
## $ Renal_failure : num [1:1177] 1 0 1 0 1 1 1 0 1 0 ...
## $ COPD : num [1:1177] 0 1 0 0 1 1 1 0 0 0 ...
## $ heart_rate : num [1:1177] 68.8 101.4 72.3 94.5 67.9 ...
## $ Systolic_blood_pressure : num [1:1177] 156 140 135 126 157 ...
## $ Diastolic_blood_pressure: num [1:1177] 68.3 65 61.4 73.2 58.1 ...
## $ Respiratory_rate : num [1:1177] 16.6 20.9 23.6 21.9 21.4 ...
## $ temperature : num [1:1177] 36.7 36.7 36.5 36.3 36.8 ...
## $ SP_O2 : num [1:1177] 98.4 96.9 95.3 93.8 99.3 ...
## $ Urine_output : num [1:1177] 2155 1425 2425 8760 4455 ...
## $ hematocrit : num [1:1177] 26.3 30.8 27.7 36.6 29.9 ...
## $ RBC : num [1:1177] 2.96 3.14 2.62 4.28 3.29 ...
## $ MCH : num [1:1177] 28.2 31.1 34.3 26.1 30.7 ...
## $ MCHC : num [1:1177] 31.5 31.7 31.3 30.4 33.7 ...
## $ MCV : num [1:1177] 89.9 98.2 109.8 85.6 91 ...
## $ RDW : num [1:1177] 16.2 14.3 23.8 17 16.3 ...
## $ Leucocyte : num [1:1177] 7.65 12.74 5.48 8.22 8.83 ...
## $ Platelets : num [1:1177] 305 246 204 216 251 ...
## $ Neutrophils : num [1:1177] 74.7 NA 68.1 81.8 NA ...
## $ Basophils : num [1:1177] 0.4 NA 0.55 0.15 NA 0.3 0.2 NA 0.55 NA ...
## $ Lymphocyte : num [1:1177] 13.3 NA 24.5 14.5 NA ...
## $ PT : num [1:1177] 10.6 NA 11.3 27.1 NA ...
## $ INR : num [1:1177] 1 NA 0.95 2.67 NA ...
## $ NT_proBNP : num [1:1177] 1956 2384 4081 668 30802 ...
## $ Creatine_kinase : num [1:1177] 148 60.6 16 85 111.7 ...
## $ Creatinine : num [1:1177] 1.958 1.122 1.871 0.586 1.95 ...
## $ Urea_nitrogen : num [1:1177] 50 20.3 33.9 15.3 43 ...
## $ glucose : num [1:1177] 115 148 149 128 146 ...
## $ Blood_potassium : num [1:1177] 4.82 4.45 5.83 4.39 4.78 ...
## $ Blood_sodium : num [1:1177] 139 139 141 138 137 ...
## $ Blood_calcium : num [1:1177] 7.46 8.16 8.27 9.48 8.73 ...
## $ Chloride : num [1:1177] 109.2 98.4 105.9 92.1 104.5 ...
## $ Anion_gap : num [1:1177] 13.2 11.4 10 12.4 15.2 ...
## $ Magnesium_ion : num [1:1177] 2.62 1.89 2.16 1.94 1.65 ...
## $ PH : num [1:1177] 7.23 7.22 7.27 7.37 7.25 ...
## $ Bicarbonate : num [1:1177] 21.2 33.4 30.6 38.6 22 ...
## $ Lactic_acid : num [1:1177] 0.5 0.5 0.5 0.6 0.6 ...
## $ PCO2 : num [1:1177] 40 78 71.5 75 50 ...
## $ EF : num [1:1177] 55 55 35 55 55 35 55 75 50 55 ...
## - attr(*, "spec")=
## .. cols(
## .. group_num = col_double(),
## .. ID = col_double(),
## .. outcome = col_double(),
## .. age = col_double(),
## .. gender = col_double(),
## .. BMI = col_double(),
## .. hypertensive = col_double(),
## .. atrialfibrillation = col_double(),
## .. CHD_with_no_MI = col_double(),
## .. diabetes = col_double(),
## .. deficiencyanemias = col_double(),
## .. depression = col_double(),
## .. Hyperlipemia = col_double(),
## .. Renal_failure = col_double(),
## .. COPD = col_double(),
## .. heart_rate = col_double(),
## .. Systolic_blood_pressure = col_double(),
## .. Diastolic_blood_pressure = col_double(),
## .. Respiratory_rate = col_double(),
## .. temperature = col_double(),
## .. SP_O2 = col_double(),
## .. Urine_output = col_double(),
## .. hematocrit = col_double(),
## .. RBC = col_double(),
## .. MCH = col_double(),
## .. MCHC = col_double(),
## .. MCV = col_double(),
## .. RDW = col_double(),
## .. Leucocyte = col_double(),
## .. Platelets = col_double(),
## .. Neutrophils = col_double(),
## .. Basophils = col_double(),
## .. Lymphocyte = col_double(),
## .. PT = col_double(),
## .. INR = col_double(),
## .. NT_proBNP = col_double(),
## .. Creatine_kinase = col_double(),
## .. Creatinine = col_double(),
## .. Urea_nitrogen = col_double(),
## .. glucose = col_double(),
## .. Blood_potassium = col_double(),
## .. Blood_sodium = col_double(),
## .. Blood_calcium = col_double(),
## .. Chloride = col_double(),
## .. Anion_gap = col_double(),
## .. Magnesium_ion = col_double(),
## .. PH = col_double(),
## .. Bicarbonate = col_double(),
## .. Lactic_acid = col_double(),
## .. PCO2 = col_double(),
## .. EF = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
At first glance, all the variables are numeric type. There are 1177 observations and 51 variables. Some of the variables of the data does not follow the data definition. A quick descriptive summary of the data shows this.
## group_num ID outcome age
## Min. :1.000 Min. :100213 Min. :0.0000 Min. :19.00
## 1st Qu.:1.000 1st Qu.:125603 1st Qu.:0.0000 1st Qu.:65.00
## Median :1.000 Median :151901 Median :0.0000 Median :77.00
## Mean :1.299 Mean :150778 Mean :0.1352 Mean :74.06
## 3rd Qu.:2.000 3rd Qu.:176048 3rd Qu.:0.0000 3rd Qu.:85.00
## Max. :2.000 Max. :199952 Max. :1.0000 Max. :99.00
## NA's :1
## gender BMI hypertensive atrialfibrillation
## Min. :1.000 Min. : 13.35 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.: 24.33 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2.000 Median : 28.31 Median :1.0000 Median :0.0000
## Mean :1.525 Mean : 30.19 Mean :0.7179 Mean :0.4511
## 3rd Qu.:2.000 3rd Qu.: 33.63 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :2.000 Max. :104.97 Max. :1.0000 Max. :1.0000
## NA's :215
## CHD_with_no_MI diabetes deficiencyanemias depression
## Min. :0.00000 Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.000 Median :0.0000
## Mean :0.08581 Mean :0.4214 Mean :0.339 Mean :0.1189
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :1.000 Max. :1.0000
##
## Hyperlipemia Renal_failure COPD heart_rate
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. : 36.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.: 72.37
## Median :0.0000 Median :0.0000 Median :0.00000 Median : 83.61
## Mean :0.3798 Mean :0.3653 Mean :0.07562 Mean : 84.58
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.: 95.91
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :135.71
## NA's :13
## Systolic_blood_pressure Diastolic_blood_pressure Respiratory_rate
## Min. : 75.0 Min. : 24.74 Min. :11.14
## 1st Qu.:105.4 1st Qu.: 52.17 1st Qu.:17.93
## Median :116.1 Median : 58.46 Median :20.37
## Mean :118.0 Mean : 59.53 Mean :20.80
## 3rd Qu.:128.6 3rd Qu.: 65.46 3rd Qu.:23.39
## Max. :203.0 Max. :107.00 Max. :40.90
## NA's :16 NA's :16 NA's :13
## temperature SP_O2 Urine_output hematocrit
## Min. :33.25 Min. : 75.92 Min. : 0 Min. :20.31
## 1st Qu.:36.29 1st Qu.: 95.00 1st Qu.: 980 1st Qu.:28.16
## Median :36.65 Median : 96.45 Median :1675 Median :30.80
## Mean :36.68 Mean : 96.27 Mean :1899 Mean :31.91
## 3rd Qu.:37.02 3rd Qu.: 97.92 3rd Qu.:2500 3rd Qu.:35.01
## Max. :39.13 Max. :100.00 Max. :8820 Max. :55.42
## NA's :19 NA's :13 NA's :36
## RBC MCH MCHC MCV
## Min. :2.030 Min. :18.12 Min. :27.82 Min. : 62.60
## 1st Qu.:3.120 1st Qu.:28.25 1st Qu.:32.01 1st Qu.: 86.25
## Median :3.490 Median :29.75 Median :32.99 Median : 90.00
## Mean :3.575 Mean :29.54 Mean :32.86 Mean : 89.90
## 3rd Qu.:3.900 3rd Qu.:31.24 3rd Qu.:33.83 3rd Qu.: 93.86
## Max. :6.575 Max. :40.31 Max. :37.01 Max. :116.71
##
## RDW Leucocyte Platelets Neutrophils
## Min. :12.09 Min. : 0.10 Min. : 9.571 Min. : 5.00
## 1st Qu.:14.46 1st Qu.: 7.44 1st Qu.: 168.909 1st Qu.:74.78
## Median :15.51 Median : 9.68 Median : 222.667 Median :82.47
## Mean :15.95 Mean :10.71 Mean : 241.504 Mean :80.11
## 3rd Qu.:16.94 3rd Qu.:12.74 3rd Qu.: 304.250 3rd Qu.:87.45
## Max. :29.05 Max. :64.75 Max. :1028.200 Max. :98.00
## NA's :144
## Basophils Lymphocyte PT INR
## Min. :0.1000 Min. : 0.9667 Min. :10.10 Min. :0.8714
## 1st Qu.:0.2000 1st Qu.: 6.6500 1st Qu.:13.16 1st Qu.:1.1400
## Median :0.3000 Median :10.4750 Median :14.63 Median :1.3000
## Mean :0.4056 Mean :12.2330 Mean :17.48 Mean :1.6255
## 3rd Qu.:0.5000 3rd Qu.:15.4625 3rd Qu.:18.80 3rd Qu.:1.7364
## Max. :8.8000 Max. :83.5000 Max. :71.27 Max. :8.3429
## NA's :259 NA's :145 NA's :20 NA's :20
## NT_proBNP Creatine_kinase Creatinine Urea_nitrogen
## Min. : 50 Min. : 8.00 Min. : 0.2667 Min. : 5.357
## 1st Qu.: 2251 1st Qu.: 46.00 1st Qu.: 0.9400 1st Qu.: 20.833
## Median : 5840 Median : 89.25 Median : 1.2875 Median : 30.667
## Mean : 11014 Mean : 246.78 Mean : 1.6428 Mean : 36.298
## 3rd Qu.: 14968 3rd Qu.: 185.19 3rd Qu.: 1.9000 3rd Qu.: 45.250
## Max. :118928 Max. :42987.50 Max. :15.5273 Max. :161.750
## NA's :165
## glucose Blood_potassium Blood_sodium Blood_calcium
## Min. : 66.67 Min. :3.000 Min. :114.7 Min. : 6.700
## 1st Qu.:113.94 1st Qu.:3.900 1st Qu.:136.7 1st Qu.: 8.149
## Median :136.40 Median :4.115 Median :139.2 Median : 8.500
## Mean :148.80 Mean :4.177 Mean :138.9 Mean : 8.501
## 3rd Qu.:169.50 3rd Qu.:4.400 3rd Qu.:141.6 3rd Qu.: 8.869
## Max. :414.10 Max. :6.567 Max. :154.7 Max. :10.950
## NA's :18 NA's :1
## Chloride Anion_gap Magnesium_ion PH
## Min. : 80.27 Min. : 6.636 Min. :1.400 Min. :7.090
## 1st Qu.: 99.00 1st Qu.:12.250 1st Qu.:1.956 1st Qu.:7.335
## Median :102.50 Median :13.667 Median :2.092 Median :7.380
## Mean :102.28 Mean :13.925 Mean :2.120 Mean :7.379
## 3rd Qu.:105.57 3rd Qu.:15.417 3rd Qu.:2.242 3rd Qu.:7.430
## Max. :122.53 Max. :25.500 Max. :4.073 Max. :7.580
## NA's :292
## Bicarbonate Lactic_acid PCO2 EF
## Min. :12.86 Min. :0.500 Min. :18.75 Min. :15.00
## 1st Qu.:23.45 1st Qu.:1.200 1st Qu.:37.04 1st Qu.:40.00
## Median :26.50 Median :1.600 Median :43.00 Median :55.00
## Mean :26.91 Mean :1.853 Mean :45.54 Mean :48.72
## 3rd Qu.:29.88 3rd Qu.:2.200 3rd Qu.:50.59 3rd Qu.:55.00
## Max. :47.67 Max. :8.333 Max. :98.60 Max. :75.00
## NA's :229 NA's :294
We can expect two data type according to the data description. The first is categorical variables for data which min is 0 and max is 1, and the other is numeric variables.
The descriptive statistics also shows the min and max age as 19 and 99.
Next, we investigate the data for missing values. It is alright to have some missing values for some variables, but the outcome should be known, as it is important to see if patients are dead or alive. Outcome will be used to handle the missing data.
There is no missing data present in the outcome of patients. The data
have been reduced from 1177 to dim(health_rec)[1]
. To deal
with missing values from the other variables other than outcome, we need
to know the missing value for the categorical and numeric variable and
take the necessary steps.
## gender hypertensive atrialfibrillation CHD_with_no_MI diabetes
## [1,] FALSE FALSE FALSE FALSE FALSE
## deficiencyanemias depression Hyperlipemia Renal_failure COPD
## [1,] FALSE FALSE FALSE FALSE FALSE
The categorical variables do not have missing values.
For the numerical variable, the descriptive statistics carried out earlier shows that some numerical variables contains missing values
## tibble [1,176 × 37] (S3: tbl_df/tbl/data.frame)
## $ BMI : num [1:1176] 37.6 NA 26.6 83.3 31.8 ...
## $ heart_rate : num [1:1176] 68.8 101.4 72.3 94.5 67.9 ...
## $ Systolic_blood_pressure : num [1:1176] 156 140 135 126 157 ...
## $ Diastolic_blood_pressure: num [1:1176] 68.3 65 61.4 73.2 58.1 ...
## $ Respiratory_rate : num [1:1176] 16.6 20.9 23.6 21.9 21.4 ...
## $ temperature : num [1:1176] 36.7 36.7 36.5 36.3 36.8 ...
## $ SP_O2 : num [1:1176] 98.4 96.9 95.3 93.8 99.3 ...
## $ Urine_output : num [1:1176] 2155 1425 2425 8760 4455 ...
## $ hematocrit : num [1:1176] 26.3 30.8 27.7 36.6 29.9 ...
## $ RBC : num [1:1176] 2.96 3.14 2.62 4.28 3.29 ...
## $ MCH : num [1:1176] 28.2 31.1 34.3 26.1 30.7 ...
## $ MCHC : num [1:1176] 31.5 31.7 31.3 30.4 33.7 ...
## $ MCV : num [1:1176] 89.9 98.2 109.8 85.6 91 ...
## $ RDW : num [1:1176] 16.2 14.3 23.8 17 16.3 ...
## $ Leucocyte : num [1:1176] 7.65 12.74 5.48 8.22 8.83 ...
## $ Platelets : num [1:1176] 305 246 204 216 251 ...
## $ Neutrophils : num [1:1176] 74.7 NA 68.1 81.8 NA ...
## $ Basophils : num [1:1176] 0.4 NA 0.55 0.15 NA 0.3 0.2 NA 0.55 NA ...
## $ Lymphocyte : num [1:1176] 13.3 NA 24.5 14.5 NA ...
## $ PT : num [1:1176] 10.6 NA 11.3 27.1 NA ...
## $ INR : num [1:1176] 1 NA 0.95 2.67 NA ...
## $ NT_proBNP : num [1:1176] 1956 2384 4081 668 30802 ...
## $ Creatine_kinase : num [1:1176] 148 60.6 16 85 111.7 ...
## $ Creatinine : num [1:1176] 1.958 1.122 1.871 0.586 1.95 ...
## $ Urea_nitrogen : num [1:1176] 50 20.3 33.9 15.3 43 ...
## $ glucose : num [1:1176] 115 148 149 128 146 ...
## $ Blood_potassium : num [1:1176] 4.82 4.45 5.83 4.39 4.78 ...
## $ Blood_sodium : num [1:1176] 139 139 141 138 137 ...
## $ Blood_calcium : num [1:1176] 7.46 8.16 8.27 9.48 8.73 ...
## $ Chloride : num [1:1176] 109.2 98.4 105.9 92.1 104.5 ...
## $ Anion_gap : num [1:1176] 13.2 11.4 10 12.4 15.2 ...
## $ Magnesium_ion : num [1:1176] 2.62 1.89 2.16 1.94 1.65 ...
## $ PH : num [1:1176] 7.23 7.22 7.27 7.37 7.25 ...
## $ Bicarbonate : num [1:1176] 21.2 33.4 30.6 38.6 22 ...
## $ Lactic_acid : num [1:1176] 0.5 0.5 0.5 0.6 0.6 ...
## $ PCO2 : num [1:1176] 40 78 71.5 75 50 ...
## $ EF : num [1:1176] 55 55 35 55 55 35 55 75 50 55 ...
We are going to impute the missing values in the numerical variables using their mean. We use the e1071 package impute() function to impute the means here.
The new data will be tidied to visualize the distribution of the numerical variables.
as_tibble(imputed_means) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "value") %>%
ggplot(show.legend = F)+
geom_density(aes(value), col = "tomato2")+
facet_wrap(~variable, scales = "free")+
theme_minimal()
###### Normally distributed Variables Variables provides information on the body’s electrolyte and Mineral content such as anion gap, bicarbonate, blood calcium, blood potassium, blood sodium and chloride are normally distributed. Vital signs and Cardiovascular parameters such as temperature, PH, Diastolic blood pressure, and heart rate are also normally distributed. Only three Hematology parameters, MCH, MCHC and MCV are normally distributed.
Respiratory rate, RBW, RBC, urea nitrogen, urine output, basophils, BMI, creatine-Kinase, creatinine, glucose, hematocrit, INR, lactic acid, leucocyte, lymphocyte, magnesium ion, NT Pro BMP, PCO2, platelets, PT, and systolic blood pressure are all right skewed.
Only three variables are left skewed. These includes neutrophils, EF, and SP O2.
Next we replace the old values with new ones in the main data frame. In the process, the group_num and ID column were also removed.
health_rec_clean <- health_rec %>%
select(-c(1,2),-c(names(numerical))) %>% # remove unwanted variables and old numerical variables
bind_cols(imputed_means) %>% # add imputed values columns to the data frame
mutate(id = row_number()) %>% # add row numbers to the data frame
select(id, everything()) # rearrange variables to make id come first
tail(health_rec_clean, 10)
Now we check to find if there are still missing values.
## id outcome age gender hypertensive atrialfibrillation CHD_with_no_MI
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## diabetes deficiencyanemias depression Hyperlipemia Renal_failure COPD
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
## BMI heart_rate Systolic_blood_pressure Diastolic_blood_pressure
## [1,] FALSE FALSE FALSE FALSE
## Respiratory_rate temperature SP_O2 Urine_output hematocrit RBC MCH
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## MCHC MCV RDW Leucocyte Platelets Neutrophils Basophils Lymphocyte
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## PT INR NT_proBNP Creatine_kinase Creatinine Urea_nitrogen glucose
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Blood_potassium Blood_sodium Blood_calcium Chloride Anion_gap
## [1,] FALSE FALSE FALSE FALSE FALSE
## Magnesium_ion PH Bicarbonate Lactic_acid PCO2 EF
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE
All numerical variable missing values are replaced by their means.
A majority of the question asked are descriptive statistics questions and answers will be provided using visualization and tables.
The age group with the highest frequency: To do this, we have to bin the age variable, creating a class range, then we can create a summary of the age_group and see their frequency
# Create bins
age_categories <- seq(0, 120, by = 20) # We assume the highest age is not above 120 years old
# Create age interval variable
health_rec_clean <- health_rec_clean %>%
mutate(age_group = cut(age,
breaks = age_categories,
include.lowest = T, # To indicate if x is the lowest
ordered_result = F),#This ensures that the factor variable is not ordered
age_group = ifelse(age_group == "[0,20]", "0-20",
ifelse(age_group == "(20,40]", "20-40",
ifelse(age_group == "(40,60]", "40-60",
ifelse(age_group == "(60,80]", "60-80", "80-100")))), # Change the group names to ensure consistency
age_group = factor(age_group, levels = c("0-20", "20-40","40-60", "60-80", "80-100")))
health_rec_clean %>%
group_by(age_group) %>%
summarize(frequency = n()) %>%
ggplot(aes("", frequency, fill = age_group))+
geom_col(color = "white")+
geom_text(aes(label = paste0(round(frequency/sum(frequency)*100, 1), "%")),
position = position_stack(vjust = 0.5))+
scale_fill_manual(values = c("pink", "coral", "tomato1", "orangered1", "indianred1"))+
labs(x = NULL,
y = NULL,
fill = "Age Group",
title = "Pie Chart of Age Groups")+
coord_polar(theta = "y", start = 0)+
theme(legend.position = "top",
axis.line = element_blank(),
plot.background = element_blank(),
rect = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
Older people are the highest in the hospital, with people between 60 - 80 years old having being the highest in the clinic and people more than 80 years old being the next most occurring patients. To know the age(not age group) that is most occurring, we can simply group by the age them.
health_rec_clean %>%
group_by(age) %>%
summarize(frequency = n()) %>%
arrange(desc(frequency)) %>%
top_n(n = 6)
The highest count of patients in the clinic are aged 89, 84 and 81 years respectively.
health_rec_clean %>%
filter(outcome == 1) %>%
group_by(gender, age_group) %>%
summarize(dead_patients = length(outcome)) %>%
arrange(desc(dead_patients)) %>%
ggplot(aes(age_group, dead_patients, fill = factor(gender))) +
geom_col(position = "dodge")+
scale_fill_manual(values = c("tomato2", "thistle4"),
labels = c("Male", "Female"))+
labs(x = "Age group",
y = "Dead patients",
fill = "Gender",
title = "Dead counts across the different gender")+
expand_limits(y = c(0, 40))+
theme_bw()
The chart shows the dead patients according to age group and gender. The age group 80-100 years for females have the highest count of death.
The gender with the highest count in the clinic is the female gender as show below
health_rec_clean %>%
select(gender) %>%
mutate(gender = ifelse(gender == 1, "Male", "Female")) %>%
group_by(gender) %>%
count() %>%
ggplot(aes(gender, n))+
geom_col(fill = c("tan3", "wheat4"))+
ggtitle("Frequency of Patients According to Their Gender")+
theme_bw()
#what is the death rate of both gender for a given age group?
health_rec_clean %>%
mutate(outcome = ifelse(outcome == 0, "Alive", "Dead"),
gender = ifelse(gender == 1, "Male", "Female")) %>%
group_by(gender, outcome, age_group) %>%
count() %>%
pivot_wider(names_from = outcome,
values_from = n,
values_fill = 0) %>%
mutate(total = sum(Alive, Dead),
death_per_100 = Dead/100,
death_ratio = round(Dead/total, 2))
Now we will investigate the 159 death occurrences that is present in the clinic and relate it to some of the diseases.
Tidying the data for visualization
dead_patients_tidy <- dead_patients %>%
pivot_longer(cols = c(hypertensive:COPD),
names_to = "diseases",
values_to = "state") %>%
mutate(outcome = ifelse(outcome == 0, "Alive", "Dead"),
gender = ifelse(gender == 1, "Male", "Female"),
state = ifelse(diseases == "Renal_failure", "Not having", "Having"),
state = ifelse(diseases != "Renal_failure", "Having", "Not Having"))
dead_patients_tidy %>%
ggplot(aes(state, fill = gender))+
geom_bar(position = "dodge")+
scale_fill_manual(values = c("wheat4", "tan3"))+
facet_wrap(~diseases, scales = "free_y")+
theme_bw()
All dead patients seems to be having a disease or health condition, while Renal_failure have not been related to death.
## corrplot 0.92 loaded
# Correlation for numerical variables
correlation_plot <- cor(health_rec_clean[,c(1:2, 13:49)])
# Visualize correlation
ggplot(melt(correlation_plot), aes(Var1, Var2, fill = value, label = round(value,2)))+
geom_tile()+
geom_text(aes(label= ifelse(value>0.1, as.character(round(value, 1)), "")))+
scale_fill_gradient(low = "tomato",
high = "skyblue")+
labs(title = "Correlation plot of variables")+
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Machine Learning Algorithms to be used We are going to predict the outcome of any one patients making use of four machine learning algorithms viz:
Logistic Regression
Random Forest Classifier
Support Vector Machines (SVMs)
Gradient Boosting Algorithms (XGboosts)
Models will be evaluated using:
ROC curve
Confusion Matrix
Precision scores While these are used, Regression related metrics such as MAE, MSE, RMSE and R-squared will also be used based on the algorithm.
Split the data and set seed.
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Set seed to ensure reproducibility
set.seed(1599)
# Split the data to train and test data
training_data_index <- createDataPartition(health_rec_clean$outcome,
p = 0.8,
list = F)
# Training data
health_rec_train <- health_rec_clean[training_data_index, ]
# Test data
health_rec_test <- health_rec_clean[-training_data_index,]
# Remove id number and age_group columns
health_rec_train <- health_rec_train %>%
select(-c(1, 51))
health_rec_test <- health_rec_test %>%
select(-c(1, 51))
Preview split data
####Training the model
Ten variables blood calcium, urea nitrogen, creatinine, platelets, leucocytes, PCO2, heart rate, diastolic blood pressure, COPD, and renal failure have a significant effect on the outcome.
Random forest classifier uses multiple decision tree models and try to mitigate the overfitting from a single decision tree by aggregating the decision from multiple decision trees.
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Create parameter grid for search values
param_grid <- expand.grid(mtry = c(50, 200, 300,500, 700))
# define control function for tuning
control <- trainControl(method = "cv", number = 5) # Use 5 folds cross-validation
model <- train(factor(outcome) ~ ., data = health_rec_train, method = "rf", trControl = control, tuneGrid = param_grid)
print(model)
## Random Forest
##
## 941 samples
## 48 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 752, 754, 752, 753, 753
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 50 0.8767351 0.2895012
## 200 0.8746017 0.2697694
## 300 0.8809792 0.3021321
## 500 0.8767351 0.2762276
## 700 0.8767181 0.2821706
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 300.
The best parameter for ntree = 300 will be used for the model.
# mutate outcome to factor type
health_rec_train2 <- health_rec_train %>%
mutate(outcome = factor(outcome))
health_rec_test2 <- health_rec_test %>%
mutate(outcome = factor(outcome))
svm_model <- svm(outcome ~ .,
data = health_rec_train2,
kernel = "linear")
summary(svm_model)
##
## Call:
## svm(formula = outcome ~ ., data = health_rec_train2, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 261
##
## ( 140 121 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
Next we will use the xgboost package
We create the xgboost data type
Next we train the data
The confusion matrix is used as the only evaluation method for this study. ### Logistic Regression
## actual
## log_predicted 0 1
## 0 204 27
## 1 1 3
The model performs poorly at predicting dead individuals. It is having more false positives, predicting that some patients are alive instead dead.
## actual
## random_forest_predicted 0 1
## 0 203 27
## 1 2 3
This performance is also similar to logistic model, it is also having more false positives, predicting that some individuals are dead instead of alive.
## actual
## prediction 0 1
## 0 204 27
## 1 1 3
Four models, XGBoost, logistic regression, SVM and random forest, were tested to see which is having the best outcome. XGBoost is having the best prediction with 90% probability of success.