1 Background 📚

Research studies such as those conducted by Lewis & Ellis, LLC and The Society of Actuaries underscore the critical need for advanced predictive models within health insurance companies. These models are crucial for more accurately aligning insurance costs with individual health risks. By employing predictive modeling, insurers can better address the discrepancies between the costs individuals pay and the services they receive. This not only aids in accurate risk prediction to prevent coverage disparities but also enhances financial planning capabilities for insurance providers, ultimately leading to more equitable health coverage.

The advantages of this approach include the optimization of premium rates, more effective risk management, and the ability to make well-informed decisions about policyholder engagement and fraud prevention. Furthermore, this strategy not only aids in identifying high-risk individuals but also boosts customer satisfaction by customizing insurance products to meet specific needs.

2 Objectives 🎯

Predict the likelihood of new patients developing a certain disease, based on their health features.

Identify the primary risk determinants that significantly elevate the probability of diseases.

Summary: This approach allows for more targeted healthcare interventions and preventative measures, enhancing patient outcomes.

3 Key Concepts🔑

Predictive Risk Modeling
Efective Heatlh Insurance Management
Fair Health Coverage
Premium Rate Optimization
Customer Satisfaction

4 Project Data Source 📁

The data for this project is derived from the BRFSS 2019 survey, which gathers information on health-related risk behaviors among the U.S. population that have cancer.

Important Note 🎯 The dataset used here exemplifies how we can predict specific diseases based on survey data.This project’s methodology is designed with flexibility in mind, allowing for easy adaptation to other health surveys sharing similar features. With only minor modifications, tailored to specific domain knowledge, this framework can be applied by data specialists at health insurance companies to suit a variety of analytical needs. This adaptability ensures that specialists can leverage this approach to analyze data pertinent to their specific requirements, enhancing the project’s utility across different health-related datasets.

5 Methodology📊

Step	Process	Description
1	Data Preparation	Clean the dataset and apply two balancing methods to ensure fairness.
2	Feature Engineering	Utilize two attribute selection techniques to refine features.
3	Model Development	Construct five different classification algorithms.
4	Model Evaluation	Evaluate and compare the performance of 10 model variations to determine the most effective approach.

6 Data Insights 🔍

Interactive Table

Dataset dimensions: The dataset has 66 variables described above:

Variable	Description	Variable	Description	Variable	Description
FMONTH	Month when the interview was completed	PERSDOC2	Primary doctor status	DIABETE4	Diabetes status
IDATE	Interview date	MEDCOST	Medical costs not seen due to expense	HAVARTH4	Arthritis status
IMONTH	Interview month	CHECKUP1	Time since last routine checkup	MARITAL	Marital status
IDAY	Interview day	BPHIGH4	High blood pressure	EDUCA	Education level
IYEAR	Interview year	CHOLCHK2	Cholesterol checked	RENTHOM1	Home ownership
DISPCODE	Disposition code	TOLDHI2	Told have high cholesterol	CPDEMO1B	Telephone usage
SEQNO	Unique sequence number	CVDINFR4	Ever told had heart attack	VETERAN3	Veteran status
SEXVAR	Respondent sex	CVDCRHD4	Ever told had angina or coronary heart disease	EMPLOY1	Employment status
GENHLTH	General health status	CVDSTRK3	Ever told had stroke	CHILDREN	Number of children
PHYSHLTH	Physical health status days	ASTHMA3	Asthma status	INCOME2	Income levels
MENTHLTH	Mental health status days	CHCCOPD2	Chronic obstructive pulmonary disease status	WEIGHT2	Weight in pounds
HLTHPLN1	Health care coverage	ADDEPEV3	Ever diagnosed with depression	HEIGHT3	Height in inches
DEAF	Hearing difficulty	DIFFDRES	Dressing difficulty	FRENCHF1	French fries or chips consumption
BLIND	Vision difficulty	DIFFALON	Living alone difficulty	POTATOE1	Potatoes consumption
DECIDE	Decision making difficulty	SMOKE100	Smoking status	VEGETAB2	Vegetables consumption
DIFFWALK	Walking difficulty	USENOW3	Tobacco use	FLUSHOT7	Flu shot status
CHCKDNY2	Chronic kidney disease	ALCDAY5	Alcohol consumption days	TETANUS1	Tetanus shot status
EXERANY2	Exercise status	STRENGTH	Strength training frequency	PNEUVAC4	Pneumonia vaccine status
FRUIT2	Fruit consumption frequency	FRUITJU2	Fruit juice consumption	HIVTST7	HIV test status
FVGREEN1	Vegetable consumption frequency	HIVRISK5	HIV risk assessment	QSTVER	Questionnaire version
HTIN4	Height in inches (rounded)	HTM4	Height in meters	WTKG3	Weight in kilograms
DRNKANY5	Any drinking status	Class	Patient Cancer Status

7 Data Pre-processing 🧹

7.1 Data Dimension Reduction 📉

7.1.1 Columns that will not be useful for analysis based on domain knowledge 🚮

Columns related to the administration of the interview are removed because they do not provide predictive value and may introduce noise into the model.

Columns: FMONTH, IDATE, IMONTH, IDAY, IYEAR, DISPCODE, SEQNO, MARITAL, EDUCA, RENTHOM1, CPDEMO1B, EMPLOY1, INCOME2, QSTVER, QSTLANG, HTIN4, WEIGHT2, HEIGHT3

df <- df %>%
  select(-FMONTH, -IDATE, -IMONTH, -IDAY, -IYEAR, -SEQNO, -DISPCODE, 
         -CPDEMO1B, -MARITAL, -EDUCA, -RENTHOM1, -EMPLOY1, -INCOME2, 
         -QSTVER, -QSTLANG, -HTIN4, -WEIGHT2, -HEIGHT3)
df <- df %>%
  select(-all_of(nearZeroVar(df, names = TRUE)))
numeric_cols <- df %>% select(where(is.numeric))
highly.correlated.variables <- findCorrelation(cor(numeric_cols, use = "complete.obs"), cutoff = 0.7, names = TRUE)
df <- df %>% select(-all_of(highly.correlated.variables))

7.1.2 Columns with almost 0 variance 🚮

Columns with almost zero variance are removed because they provide no useful information for distinguishing between data points and do not contribute to model prediction accuracy.

Columns: CHCKDNY, DIFFDRES, USENOW3, HIVRISK5

df <- df %>%
  select(-all_of(nearZeroVar(df, names = TRUE)))

7.1.3 Columns with correlation 🚮

Columns with high correlation are often removed to prevent multicollinearity, which can distort the estimated coefficients and reduce model interpretability.

Columns: ALCDAY5

numeric_cols <- df %>% select(where(is.numeric))
highly.correlated.variables <- findCorrelation(cor(numeric_cols, use = "complete.obs"), cutoff = 0.7, names = TRUE)
df <- df %>% select(-all_of(highly.correlated.variables))

7.1.4 Multiple Correspondence Analysis 🚮

Multiple Correspondence Analysis (MCA) is used to reduce dimensionality and uncover underlying patterns in categorical data, enhancing model simplicity and effectiveness.

Columns Retained: ASTHMA3, CHCCOPD2, DIABETE4, CVDINFR4, FLUSHOT7, SMOKE100, STRENGTH, FRUIT2, FRUITJU2, FRENCHF1, FVGREEN1, ADDEPEV3, POTATOE1, EXERANY2, VEGETAB2, HAVARTH4, DEAF, DIFFALON, PNEUVAC4, DRNKANY5, DIFFWALK, DECIDE

df_factors <- df %>% mutate_if(is.numeric, as.factor)

# Perform MCA
mca_results <- MCA(df_factors, graph = FALSE)

# Get the contributions of variables to the first two dimensions
contrib_dim1 <- mca_results$var$contrib[,1]
contrib_dim2 <- mca_results$var$contrib[,2]

# Combine contributions and identify the top contributing variables
contrib_total <- contrib_dim1 + contrib_dim2
top_vars <- names(sort(contrib_total, decreasing = TRUE)[1:40])  # Top 10 variables, adjust as needed
strip_suffix <- function(var_names) {
  # Remove everything after and including the first underscore
  original_names <- gsub("(_|\\.).*$", "", var_names)
  return(original_names)
}

# Apply the function to top_vars
original_vars <- strip_suffix(top_vars)
# Get unique original variable names
unique_original_vars <- unique(original_vars)
# Filter the dataframe to keep only the top contributing variables
df_principal_vars <- df_factors %>% select(all_of(unique_original_vars))
# Filter the dataframe to keep only the top contributing variables
df <- df_factors %>%
  select(Class, all_of(unique_original_vars))

The four graphs, representing the columns with the highest variance after performing MCA, illustrate the dominant dimensions of variability, highlighting key categorical relationships and differences within the dataset.

Plot 1 Plot 2 Plot 3 Plot 4

7.2 Data Normalization 📕

For normalization, recodification based on domain knowledge is essential; values like “Don’t know/Not sure,” “Refused,” and other nonsensical responses are recoded as NA to ensure data quality and consistency, enabling more accurate analysis and comparisons across different variables.

scale <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
df[] <- lapply(names(df), function(column_name) {
  # Access the column by its name
  x <- df[[column_name]]
  
  # Check if the column is not 'Class', is a factor, and needs conversion
  if (column_name != "Class" && is.factor(x)) {
    # Convert factor to numeric by first converting to character
    as.numeric(as.character(x))
  } else {
    # Leave the column as is if it's 'Class', numeric or should not be converted
    x
  }
})
#MENTHLTH
df$MENTHLTH[!(df$MENTHLTH  %in% c(1:30))] <- NA
df$MENTHLTH <- scale(df$MENTHLTH)


#CHOLCHK2 
df$CHOLCHK2[!(df$CHOLCHK2 %in% c(1,2,3,4,5,6))] <- NA
df$CHOLCHK2<- scale(df$CHOLCHK2)

#CVDINFR4 
df$CVDINFR4[!(df$CVDINFR4 %in% c(1,2))] <- NA
df$CVDINFR4<- scale(df$CVDINFR4)

#CVDCRHD4
df$CVDCRHD4[!(df$CVDCRHD4 %in% c(1,2))] <- NA
df$CVDCRHD4<- scale(df$CVDCRHD4)

#ASTHMA3 
df$ASTHMA3[!(df$ASTHMA3 %in% c(1,2))] <- NA
df$ASTHMA3 <- scale(df$ASTHMA3)

#CHCCOPD2
df$CHCCOPD2[!(df$CHCCOPD2 %in% c(1,2))] <- NA
df$CHCCOPD2<- scale(df$CHCCOPD2)

#ADDEPEV3 
df$ADDEPEV3[!(df$ADDEPEV3 %in% c(1,2))] <- NA
df$ADDEPEV3 <- scale(df$ADDEPEV3)

#HAVARTH4 
df$HAVARTH4[!(df$HAVARTH4 %in% c(1,2))] <- NA
df$HAVARTH4<- scale(df$DIABETE4)

#VETERAN3
df$VETERAN3[!(df$VETERAN3 %in% c(1,2))] <- NA
df$VETERAN3<- scale(df$VETERAN3)

#DEAF
df$DEAF[!(df$DEAF %in% c(1,2))] <- NA
df$DEAF<- scale(df$DEAF)

#BLIND
df$BLIND[!(df$BLIND %in% c(1,2))] <- NA
df$BLIND<- scale(df$BLIND)

#DECIDE
df$DECIDE[!(df$DECIDE %in% c(1,2))] <- NA
df$DECIDE <- scale(df$DECIDE)

#DIFFWALK 
df$DIFFWALK[!(df$DIFFWALK %in% c(1,2))] <- NA
df$DIFFWALK <- scale(df$DIFFWALK)

#DIFFALON
df$DIFFALON[!(df$DIFFALON %in% c(1,2))] <- NA
df$DIFFALON<- scale(df$DIFFALON)

#SMOKE100
df$SMOKE100[!(df$SMOKE100 %in% c(1,2))] <- NA
df$SMOKE100<- scale(df$SMOKE100)

#EXERANY2
df$EXERANY2[!(df$EXERANY2 %in% c(1,2))] <- NA
df$EXERANY2<- scale(df$EXERANY2)

#FLUSHOT7
df$FLUSHOT7[!(df$FLUSHOT7 %in% c(1,2))] <- NA
df$FLUSHOT7 <- scale(df$FLUSHOT7)

#PNEUVAC4
df$PNEUVAC4[!(df$PNEUVAC4 %in% c(1,2))] <- NA
df$PNEUVAC4 <- scale(df$PNEUVAC4)

#HIVTST7
df$HIVTST7[!(df$HIVTST7 %in% c(1,2))] <- NA
df$HIVTST7 <- scale(df$HIVTST7)

#DRNKANY5
df$DRNKANY5[!(df$DRNKANY5 %in% c(1,2))] <- NA
df$DRNKANY5 <- scale(df$DRNKANY5)

#TETANUS1
df$TETANUS1[(df$TETANUS1 %in% c(7,9))] <- NA
df$TETANUS1 <- scale(df$TETANUS1)

#DIABETE4
df$DIABETE4 <- ifelse(!(df$DIABETE4 %in% c(1, 2, 3, 4)), NA, df$DIABETE4)
df$DIABETE4 <- recode(df$DIABETE4, `1` = 8, `2` = 7, `3` = 5, `4` = 6)
df$DIABETE4 <- scale(df$DIABETE4)

#STRENGTH
df$STRENGTH[df$STRENGTH %in% c(200, 777, 999)] <- NA
df$STRENGTH[df$STRENGTH == 888] <- 0
df$STRENGTH <- ifelse(df$STRENGTH %in% 101:199, df$STRENGTH %% 100, df$STRENGTH)
df$STRENGTH <- ifelse(df$STRENGTH %in% 201:299, df$STRENGTH %% 200, df$STRENGTH)
df$STRENGTH <- round(scale(df$STRENGTH),1)

#FRUIT2
df$FRUIT2[df$FRUIT2 %in% c(777, 999, 555, 300)] <- NA
df$FRUIT2[df$FRUIT2 %in% c(555, 300)] <- c(0, 0.5)[match(df$FRUIT2, c(555, 300))]
df$FRUIT2 <- ifelse(df$FRUIT2 %in% 101:199, df$FRUIT2 %% 100, df$FRUIT2)
df$FRUIT2 <- ifelse(df$FRUIT2 %in% 201:299, df$FRUIT2 %% 200, df$FRUIT2)
df$FRUIT2 <- ifelse(df$FRUIT2 %in% 301:399, df$FRUIT2 %% 300, df$FRUIT2)
df$FRUIT2 <- round(scale(df$FRUIT2),2)

#FRUITJU2
df$FRUITJU2[df$FRUITJU2 %in% c(777, 999)] <- NA
df$FRUITJU2[df$FRUITJU2 %in% c(555, 300)] <- 0
df$FRUITJU2 <- ifelse(df$FRUITJU2 %in% 101:199, df$FRUITJU2 %% 100, df$FRUITJU2)
df$FRUITJU2 <- ifelse(df$FRUITJU2 %in% 201:299, df$FRUITJU2 %% 200, df$FRUITJU2)
df$FRUITJU2 <- ifelse(df$FRUITJU2 %in% 301:399, df$FRUITJU2 %% 300, df$FRUITJU2)
df$FRUITJU2 <- round(scale(df$FRUITJU2),1)

#FVGREEN1
df$FVGREEN1[df$FVGREEN1 %in% c(777, 999)] <- NA
df$FVGREEN1[df$FVGREEN1 %in% c(555, 300)] <- 0
df$FVGREEN1 <- ifelse(df$FVGREEN1 %in% 101:199, df$FVGREEN1 %% 100 , df$FVGREEN1)
df$FVGREEN1 <- ifelse(df$FVGREEN1 %in% 201:299, df$FVGREEN1 %% 200 , df$FVGREEN1)
df$FVGREEN1 <- ifelse(df$FVGREEN1 %in% 301:399, df$FVGREEN1 %% 300, df$FVGREEN1)
df$FVGREEN1 <- round(scale(df$FVGREEN1),1)

#FRENCHF1
df$FRENCHF1[df$FRENCHF1 %in% c(777, 999, 200)] <- NA
df$FRENCHF1[df$FRENCHF1 %in% c(555, 300)] <- 0
df$FRENCHF1 <- ifelse(df$FRENCHF1 %in% 101:199, df$FRENCHF1 %% 100, df$FRENCHF1)
df$FRENCHF1 <- ifelse(df$FRENCHF1 %in% 201:299, df$FRENCHF1 %% 200, df$FRENCHF1)
df$FRENCHF1 <- ifelse(df$FRENCHF1 %in% 301:399, df$FRENCHF1 %% 300, df$FRENCHF1)
df$FRENCHF1 <- round(scale(df$FRENCHF1),1)

#POTATOE1
df$POTATOE1[df$POTATOE1 %in% c(777, 999)] <- NA
df$POTATOE1[df$POTATOE1 %in% c(555, 300)] <- 0
df$POTATOE1 <- ifelse(df$POTATOE1 %in% 101:199, df$POTATOE1 %% 100, df$POTATOE1)
df$POTATOE1 <- ifelse(df$POTATOE1 %in% 201:299, df$POTATOE1 %% 200, df$POTATOE1)
df$POTATOE1 <- ifelse(df$POTATOE1 %in% 301:399, df$POTATOE1 %% 300, df$POTATOE1)
df$POTATOE1 <- round(scale(df$POTATOE1),1)

#VEGETAB2
df$VEGETAB2[df$VEGETAB2 %in% c(777, 999)] <- NA
df$VEGETAB2[df$VEGETAB2 %in% c(555, 300)] <- 0
df$VEGETAB2 <- ifelse(df$VEGETAB2 %in% 101:199, df$VEGETAB2 %% 100, df$VEGETAB2)
df$VEGETAB2 <- ifelse(df$VEGETAB2 %in% 201:299, df$VEGETAB2 %% 200, df$VEGETAB2)
df$VEGETAB2 <- ifelse(df$VEGETAB2 %in% 301:399, df$VEGETAB2 %% 300, df$VEGETAB2)
df$VEGETAB2 <- round(scale(df$VEGETAB2),1)

#####The code builds a **decision tree** to predict Class using HTM4, extracts split points using **entrophy**. If no splits are found, uses the 33rd and 66th percentiles of  bins HTM4 into categories based on these splits
#HTM4  
tree_model <- rpart(Class ~ HTM4, data = df, method = "class", control=rpart.control(minsplit=1, cp=0.001))
splits <- sort(unique(tree_model$frame$split[!is.na(tree_model$frame$split)]))
if (length(splits) == 0) {
  splits <- quantile(df$HTM4, probs = c(0.33, 0.66), na.rm = TRUE)
}
labels <- as.character(1:(length(splits) + 1))
df$HTM4 <- cut(df$HTM4, breaks = c(-Inf, splits, Inf), include.lowest = TRUE, labels = labels)
df$HTM4 <- as.numeric(df$HTM4)
df$HTM4 <- scale(df$HTM4)

7.3 Filling Missing Values 🤷

We automate missing value imputation in a dataframe using linear regression models that leverage strong variable correlations to maintain data integrity and enhance analysis accuracy.

Step	Description
1	Compute Correlation Matrix: Calculate the correlation matrix only from complete observations.
2	Convert Correlation to Dataframe: Transform the correlation matrix to a dataframe for ease of manipulation.
3	Identify Columns with NAs: Detect all columns that contain missing values.
4	Find Highest Correlated Columns: For each column with missing values, identify the column that has the highest correlation with it.
5	Build Regression Models: Construct a regression model for each pair of highly correlated columns.
6	Predict Missing Values: Use the regression models to estimate and fill in the missing values.

Implementation Details

The correlation matrix is computed using only those rows that do not contain any missing values to ensure accuracy in correlation calculation.
Each column’s strongest correlation is determined by absolute value, meaning both strong positive and negative correlations are considered.
Regression models are built individually for each column with missing data based on its most correlated counterpart, providing tailored imputation.
Predictions from these models directly replace the missing values, thus maintaining the integrity and distribution of the original data.

Correlation heatmap

This graph, a correlation heatmap, visually represents the strength and direction of relationships between variables in a dataset, using color intensity to indicate the degree of correlation.

# 1

fill_missing_with_regression <- function(df) { numeric_cols <- df %>% select(where(is.numeric))
correlation_matrix <- cor(numeric_cols, use = "complete.obs") # 1

# 2
correlation_df <- as.data.frame(correlation_matrix) %>%
  rownames_to_column(var = "Row") %>%
  pivot_longer(cols = -Row, names_to = "Column", values_to = "Correlation") %>%
  filter(Row != Column, !is.na(Correlation)) %>%
  arrange(desc(abs(Correlation)))

# 3
cols_with_na <- names(df)[colSums(is.na(df)) > 0]
print(paste("Columns with NA:", toString(cols_with_na)))

for (col in cols_with_na) {
  top_correlation <- correlation_df %>%
    filter(Row == col | Column == col) %>%
    slice_max(abs(Correlation), n = 1) 
  # 4
  if (nrow(top_correlation) > 0) {
    predictor_col <- ifelse(top_correlation$Row[1] == col, top_correlation$Column[1], top_correlation$Row[1])
    print(paste("Processing:", col, "using", predictor_col))
    missing_indices <- which(is.na(df[[col]]) & !is.na(df[[predictor_col]]))
    if (length(missing_indices) == 0) {
      print(paste("No valid data to impute for column:", col))
      next} 
    # 5 
    formula <- as.formula(paste(col, "~", predictor_col))
    model <- lm(formula, data = df, na.action = na.exclude) # 5
    
    # 6
    predictions <- predict(model, newdata = df[missing_indices, , drop = FALSE], type = "response")
    df[[col]][missing_indices] <- predictions
  } else {
    print(paste("No correlations found for column:", col))}} 
return(df)}
df <- fill_missing_with_regression(df)
df[df < 0] <- 0

df <- df[rowSums(is.na(df)) <= 3, ]

7.4 Pre-processed Data Snapshot 👻

This table showcases five observations from the dataset, following the completion of the pre-processing steps. It serves as an example of how the data has been cleaned and structured, ready for further analysis or modeling.

8 Feature Engineering

In this phase, we construct predictive models using various machine learning algorithms. The goal is to develop models that can accurately predict outcomes based on input data. We experiment with different algorithms, tune hyperparameters, and evaluate model performance to select the best-performing models. This step involves iterative testing and refinement to ensure robustness and reliability in predictions. ## Spliting The data 📏 Data splitting is a crucial step in model building where the dataset is divided into training and testing subsets. The training set is used to train the models, while the testing set is reserved for evaluating model performance on unseen data.

8.1 Create Balanced Dataset

To ensure the avoidance of data leakage, we carefully split our balanced dataset. Initially, the entire dataset is divided into separate sets for training and testing. We make sure that the same data points are not present in both sets, thereby preventing the model from having prior knowledge of the test data, which can lead to overfitting and unrealistic performance estimates.

8.2 Data Dimension Reduction Method 1: Information Gain🧩

Information Gain is a method used to identify and select the most relevant features in a dataset. By measuring how well a feature splits the data, it helps in reducing dimensionality and focusing on the variables that contribute the most to the model’s predictive power.

8.3 Data Dimension Reduction Method 2: Boruta🧩

“Boruta is an all relevant feature selection method, while most other are minimal optimal; this means it tries to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some classifier has a minimal error.”

8.4 Data Dimension Reduction Method 3: Recursive Feature Elimination 🧩

“RFE iteratively removes less important features, creating a subset that maximizes predictive accuracy. By leveraging a machine learning algorithm and an importance-ranking metric, RFE evaluates each feature’s impact on model performance.”

9 Model Development 🤖

9.1 AdaBoost Classifier 🧠

We use Information Gain to select the most informative features for the AdaBoost model. This enhances AdaBoost’s efficiency and accuracy by focusing on features that significantly reduce uncertainty, improving overall model performance

library(dplyr)
library(knitr)

calculate_measures_per_class_with_averages <- function(cm) {
  results <- data.frame(
    Class = character(),
    TPR = numeric(),
    FPR = numeric(),
    Precision = numeric(),
    Recall = numeric(),
    F_measure = numeric(),
    MCC = numeric(),
    Kappa = numeric(),
    stringsAsFactors = FALSE
  )
  
  total = sum(cm)
  po = (cm[1,1] + cm[2,2]) / total
  pe = ((cm[1,1] + cm[1,2]) * (cm[1,1] + cm[2,1]) + (cm[2,1] + cm[2,2]) * (cm[1,2] + cm[2,2])) / total^2
  
  for (i in 1:2) {
    if (i == 1) {
      tp = cm[1, 1]
      fp = cm[2, 1]
      fn = cm[1, 2]
      tn = cm[2, 2]
      class_label = "Class 0"
    } else {
      tp = cm[2, 2]
      fp = cm[1, 2]
      fn = cm[2, 1]
      tn = cm[1, 1]
      class_label = "Class 1"
    }
    
    tpr = round(tp / (tp + fn), 1)
    fpr = round(fp / (fp + tn), 1)
    precision = round(tp / (tp + fp), 1)
    recall = round(tpr, 1)  # recall is the same as TPR
    f_measure = round(ifelse((precision + recall) == 0, 0, (2 * precision * recall) / (precision + recall)), 1)
    mcc = round(ifelse((sqrt((tp+fp) * (tp+fn) * (tn+fp) * (tn+fn))) == 0, 0,
                 (tp * tn - fp * fn) / sqrt((tp+fp) * (tp+fn) * (tn+fp) * (tn+fn))), 1)
    kappa = round((po - pe) / (1 - pe), 1)
    
    results <- rbind(results, data.frame(
      Class = class_label,
      TPR = tpr,
      FPR = fpr,
      Precision = precision,
      Recall = recall,
      F_measure = f_measure,
      MCC = mcc,
      Kappa = kappa
    ))
  }
  
  supports = rowSums(cm)
  total_support = sum(supports)
  weighted_avgs <- sapply(results[, -1], function(x) sum(x * supports) / total_support, simplify = FALSE)
  
  weighted_row <- c("Wt. Average", sapply(unlist(weighted_avgs), function(x) round(x, 1)))
  results <- rbind(results, setNames(as.list(weighted_row), names(results)))
  
  results <- na.omit(results) # Remove rows with NA values
  results <- arrange(results, desc(Class)) # Optional: Sort by Class in descending order
  
  # Use knitr::kable() to create a nicely formatted table
  kable(results, format = "html", caption = "Class Performance Metrics with Weighted Averages")
}

9.1.1 Model 1 - Data splitting method 1 - Information Gain - ADA

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC	Kappa
Wt. Average	0.6	0.4	0.8	0.6	0.6	0.2	0.2
Class 1	0.6	0.4	0.3	0.6	0.4	0.2	0.2
Class 0	0.6	0.4	0.9	0.6	0.7	0.2	0.2

9.1.2 Model 2 - Data splitting method 1 - Boruta - ADA

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC	Kappa
Wt. Average	0.7	0.5	0.6	0.7	0.6	0.2	0.2
Class 1	0.3	0.1	0.6	0.3	0.4	0.2	0.2
Class 0	0.9	0.7	0.6	0.9	0.7	0.2	0.2

9.1.3 Model 3 - Data splitting method 1 - RFE - ADA

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC	Kappa
Wt. Average	0.7	0.5	0.6	0.7	0.6	0.2	0.2
Class 1	0.3	0.1	0.6	0.3	0.4	0.2	0.2
Class 0	0.9	0.7	0.6	0.9	0.7	0.2	0.2

9.1.4 Model 4 - Data splitting method 2 - Information Gain - ADA

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC	Kappa
Wt. Average	0.6	0.5	0.6	0.6	0.5	0.2	0.1
Class 1	0.2	0.1	0.6	0.2	0.3	0.2	0.1
Class 0	0.9	0.8	0.6	0.9	0.7	0.2	0.1

9.1.5 Model 5 - Data splitting method 2 - Boruta - ADA

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC	Kappa
Wt. Average	0.7	0.5	0.6	0.7	0.6	0.2	0.2
Class 1	0.3	0.1	0.6	0.3	0.4	0.2	0.2
Class 0	0.9	0.7	0.6	0.9	0.7	0.2	0.2

9.1.6 Model 6 - Data splitting method 2 - RFE - ADA

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC	Kappa
Wt. Average	0.8	0.4	1	0.8	0.9	0.1	0.1
Class 1	0.6	0.2	0.1	0.6	0.2	0.1	0.1
Class 0	0.8	0.4	1	0.8	0.9	0.1	0.1

9.2 Random Forest Classifier 🧪

Forest reduces overfitting and improves generalization. The RF model can effectively handle large datasets with higher dimensionality and provides insights into feature importance, which enhances its interpretability and performance. This approach enhances the classifier’s accuracy by leveraging the strengths of multiple decision trees and their collective prediction power.

9.2.1 Model 7 - Data splitting method 1 - Information Gain - Random Forest

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.7	1	0.8	0.9	0.1
Class 1	0.3	0.2	0	0.3	0	0.1
Class 0	0.8	0.7	1	0.8	0.9	0.1

9.2.2 Model 8 - Data splitting method 1 - Boruta - Random Forest

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure
Wt. Average	0.8	1	1	0.8	0.9
Class 1	0	0.2	0	0	0
Class 0	0.8	1	1	0.8	0.9

9.2.3 Model 9 - Data splitting method 1 - RFE - Random Forest

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.6	1	0.8	0.9	0.1
Class 1	0.4	0.2	0	0.4	0	0.1
Class 0	0.8	0.6	1	0.8	0.9	0.1

9.2.4 Model 10 - Data splitting method 2 - Information Gain - Random Forest

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.3	1	0.8	0.9	0.1
Class 1	0.7	0.2	0	0.7	0	0.1
Class 0	0.8	0.3	1	0.8	0.9	0.1

9.2.5 Model 11 - Data splitting method 2 - BORUTA - Random Forest

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure
Wt. Average	0.8	1	1	0.8	0.9
Class 1	0	0.2	0	0	0
Class 0	0.8	1	1	0.8	0.9

9.2.6 Model 12 - Data splitting method 2 - RFE - Random Forest

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC	Kappa
Wt. Average	0.8	0.4	1	0.8	0.9	0.1	0.1
Class 1	0.6	0.2	0.1	0.6	0.2	0.1	0.1
Class 0	0.8	0.4	1	0.8	0.9	0.1	0.1

9.3 Generalized Linear Model 🎲

The Generalized Linear Model (GLM) extends linear regression to accommodate non-normal error distributions and diverse response types, such as binary or count data. It adapts to various scenarios using link functions, enhancing its versatility for modeling different outcomes.

9.3.1 Model 13 - Data splitting method 1 - Information Gain - GLM

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure
Wt. Average	0.8	0.7	1	0.8	0.9
Class 1	0.3	0.2	0	0.3	0
Class 0	0.8	0.7	1	0.8	0.9

9.3.2 Model 14 - Data splitting method 1 - BORUTA - GLM

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.6	1	0.8	0.9	0.1
Class 1	0.4	0.2	0	0.4	0	0.1
Class 0	0.8	0.6	1	0.8	0.9	0.1

9.3.3 Model 15 - Data splitting method 1 - RFE - GLM

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.5	1	0.8	0.9	0.1
Class 1	0.5	0.2	0	0.5	0	0.1
Class 0	0.8	0.5	1	0.8	0.9	0.1

9.3.4 Model 16 - Data splitting method 2 - Information Gain - GLM

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC	Kappa
Wt. Average	0.8	0.4	1	0.8	0.9	0.1	0.1
Class 1	0.6	0.2	0	0.6	0	0.1	0.1
Class 0	0.8	0.4	1	0.8	0.9	0.1	0.1

9.3.5 Model 17 - Data splitting method 2 - Boruta - GLM

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC	Kappa
Wt. Average	0.8	0.6	1	0.8	0.9	0.1	0.1
Class 1	0.4	0.2	0.1	0.4	0.2	0.1	0.1
Class 0	0.8	0.6	1	0.8	0.9	0.1	0.1

9.3.6 Model 18 - Data splitting method 2 - RFE - GLM

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.5	1	0.8	0.9	0.1
Class 1	0.5	0.2	0	0.5	0	0.1
Class 0	0.8	0.5	1	0.8	0.9	0.1

9.4 SVM Classifier 🔢

The SVM (Support Vector Machine) classifier identifies the optimal boundary between classes by maximizing the margin of separation in high-dimensional space. It excels in binary classification tasks and can handle nonlinear relationships using kernel functions.

9.4.1 Model 19 - Data splitting method 1 - INFORMATION GAIN - SVM

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.7	1	0.8	0.9	0.1
Class 1	0.3	0.2	0	0.3	0	0.1
Class 0	0.8	0.7	1	0.8	0.9	0.1

9.4.2 Model 20 - Data splitting method 1 - BOR - SVM

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.6	1	0.8	0.9	0.1
Class 1	0.4	0.2	0	0.4	0	0.1
Class 0	0.8	0.6	1	0.8	0.9	0.1

9.4.3 Model 21 - Data splitting method 1 - RFE - SVM

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.6	1	0.8	0.9	0.1
Class 1	0.4	0.2	0	0.4	0	0.1
Class 0	0.8	0.6	1	0.8	0.9	0.1

9.4.4 Model 22 - Data splitting method 2 - INFORMATION GAIN - SVM

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC	Kappa
Wt. Average	0.8	0.8	0.7	0.8	0.7	0.1	0.1
Class 1	0	0	0.5	0	0	0.1	0.1
Class 0	1	1	0.8	1	0.9	0.1	0.1

9.4.5 Model 23 - Data splitting method 2 - BORUTA - SVM

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.8	0.7	0.8	0.7	0.1
Class 1	0	0	0.4	0	0	0.1
Class 0	1	1	0.8	1	0.9	0.1

9.4.6 Model 24 - Data splitting method 2 - RFE - SVM

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC	Kappa
Wt. Average	0.8	0.8	0.8	0.8	0.7	0.1	0.1
Class 1	0	0	0.6	0	0	0.1	0.1
Class 0	1	1	0.8	1	0.9	0.1	0.1

9.5 Ctree Classifier 🔢

The ctree (Conditional Inference Tree) classifier constructs decision trees based on statistically significant associations. It uses permutation tests for independence between response and predictors to make unbiased splits, minimizing overfitting. This method provides a robust approach to modeling non-linear relationships and interactions among features.

9.5.1 Model 25 - Data splitting method 1 - INFORMATION GAIN - ctree

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC	Kappa
Wt. Average	0.8	0.7	0.7	0.8	0.8	0.1	0.1
Class 1	0.1	0	0.4	0.1	0.2	0.1	0.1
Class 0	1	0.9	0.8	1	0.9	0.1	0.1

9.5.2 Model 26 - Data splitting method 1 - Boruta - ctree

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure
Wt. Average	0.8	0.8	0.7	0.8	0.7
Class 1	0	0	0.3	0	0
Class 0	1	1	0.8	1	0.9

9.5.3 Model 27 - Data splitting method 1 - RFE - ctree

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure
Wt. Average	0.8	0.8	0.7	0.8	0.7
Class 1	0	0	0.3	0	0
Class 0	1	1	0.8	1	0.9

9.5.4 Model 28 - Data splitting method 2 - INFORMATION GAIN - ctree

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.8	0.7	0.8	0.7	0.1
Class 1	0	0	0.4	0	0	0.1
Class 0	1	1	0.8	1	0.9	0.1

9.5.5 Model 29 - Data splitting method 2 - Boruta - ctree

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure
Wt. Average	0.8	0.7	0.7	0.8	0.8
Class 1	0.1	0	0.3	0.1	0.1
Class 0	1	0.9	0.8	1	0.9

9.5.6 Model 30 - Data splitting method 2 - RFE - ctree

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.8	0.7	0.8	0.7	0.1
Class 1	0	0	0.5	0	0	0.1
Class 0	1	1	0.8	1	0.9	0.1

9.6 GcvEarth Classifier 🔢

The gcvEarth classifier utilizes multivariate adaptive regression splines (MARS) to model complex nonlinear relationships in data. It employs generalized cross-validation (GCV) to optimize model complexity, focusing on essential interactions and trends for improved predictive accuracy. This approach allows it to handle varying data structures effectively.

9.6.1 Model 31 - Data splitting method 1 - INFORMATION GAIN - gcvEarth

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure
Wt. Average	0.8	0.7	1	0.8	0.9
Class 1	0.3	0.2	0	0.3	0
Class 0	0.8	0.7	1	0.8	0.9

9.6.2 Model 32 - Data splitting method 1 - BORUTA - gcvEarth

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC	Kappa
Wt. Average	0.8	0.6	1	0.8	0.9	0.1	0.1
Class 1	0.4	0.2	0.1	0.4	0.2	0.1	0.1
Class 0	0.8	0.6	1	0.8	0.9	0.1	0.1

9.6.3 Model 33 - Data splitting method 1 - RFE - gcvEarth

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.6	1	0.8	0.9	0.1
Class 1	0.4	0.2	0	0.4	0	0.1
Class 0	0.8	0.6	1	0.8	0.9	0.1

9.6.4 Model 34 - Data splitting method 2 - INFORMATION GAIN - gcvEarth

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.6	1	0.8	0.9	0.1
Class 1	0.4	0.2	0	0.4	0	0.1
Class 0	0.8	0.6	1	0.8	0.9	0.1

9.6.5 Model 35 - Data splitting method 2 - BORUTA - gcvEarth

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.6	1	0.8	0.9	0.1
Class 1	0.4	0.2	0	0.4	0	0.1
Class 0	0.8	0.6	1	0.8	0.9	0.1

9.6.6 Model 36 - Data splitting method 2 - RFE - gcvEarth

Class Performance Metrics with Weighted Averages
Class	TPR	FPR	Precision	Recall	F_measure	MCC
Wt. Average	0.8	0.5	1	0.8	0.9	0.1
Class 1	0.5	0.2	0	0.5	0	0.1
Class 0	0.8	0.5	1	0.8	0.9	0.1

9.7 Summary of Model Performance Metrics:

Model 6 (ADA with RFE - Data splitting method 2) shows the highest F-measure of 0.9, which suggests a very strong predictive performance. Additionally, its Kappa and MCC, though modest at 0.1, are relatively good considering the spread of metrics in other models. This model achieves a good balance between TPR (True Positive Rate) and Precision, indicating effective handling of both positive class predictions and overall accuracy.

9.8 Recommendation:

Based on the overview of the provided metrics, Model 6 stands out with its high F-measure and balanced performance across other metrics. It shows strong ability to correctly classify cases with high reliability in predictions. The use of SVM with RFE (Recursive Feature Elimination) and Data Splitting Method 2 appears to provide a robust model capable of managing the intricacies of the dataset effectively.

9.9 Conclusion:

The recommended Model 6 should be considered for further testing and validation on a separate dataset to confirm its effectiveness outside the provided data. The high performance across key metrics suggests that it could be reliable for practical applications, assuming the underlying data characteristics do not differ significantly from those seen in training and testing.

PREDICTIVE DISEASE MACHINE LEARNING MODEL FOR EFFECTIVE HEALTH INSURANCE MANAGEMENT

Arturo Chavez

2024-06-23