Research studies such as those conducted by Lewis & Ellis, LLC and The Society of Actuaries underscore the critical need for advanced predictive models within health insurance companies. These models are crucial for more accurately aligning insurance costs with individual health risks. By employing predictive modeling, insurers can better address the discrepancies between the costs individuals pay and the services they receive. This not only aids in accurate risk prediction to prevent coverage disparities but also enhances financial planning capabilities for insurance providers, ultimately leading to more equitable health coverage.
The advantages of this approach include the optimization of premium rates, more effective risk management, and the ability to make well-informed decisions about policyholder engagement and fraud prevention. Furthermore, this strategy not only aids in identifying high-risk individuals but also boosts customer satisfaction by customizing insurance products to meet specific needs.
Predict the likelihood of new patients developing a certain disease, based on their health features.
Identify the primary risk determinants that significantly elevate the probability of diseases.
Summary: This approach allows for more targeted healthcare interventions and preventative measures, enhancing patient outcomes.
The data for this project comes from the BRFSS 2019 survey, which collects information on health-related risk behaviors among the U.S. population. The target variable, Class, records each respondent’s cancer status.
Important Note 🎯 The dataset used here exemplifies how specific diseases can be predicted from survey data. The methodology is designed to be flexible, so it can be adapted to other health surveys that share similar features. With minor modifications informed by domain knowledge, data specialists at health insurance companies can apply this framework to their own datasets and analytical needs.
Step | Process | Description |
---|---|---|
1 | Data Preparation | Clean the dataset and apply two balancing methods to ensure fairness. |
2 | Feature Engineering | Utilize two attribute selection techniques to refine features. |
3 | Model Development | Construct five different classification algorithms. |
4 | Model Evaluation | Evaluate and compare the performance of 10 model variations to determine the most effective approach. |
Dataset dimensions: The dataset has 66 variables, described below:
Variable | Description | Variable | Description | Variable | Description |
---|---|---|---|---|---|
FMONTH | Month when the interview was completed | PERSDOC2 | Primary doctor status | DIABETE4 | Diabetes status |
IDATE | Interview date | MEDCOST | Could not see a doctor due to cost | HAVARTH4 | Arthritis status |
IMONTH | Interview month | CHECKUP1 | Time since last routine checkup | MARITAL | Marital status |
IDAY | Interview day | BPHIGH4 | High blood pressure | EDUCA | Education level |
IYEAR | Interview year | CHOLCHK2 | Cholesterol checked | RENTHOM1 | Home ownership |
DISPCODE | Disposition code | TOLDHI2 | Told have high cholesterol | CPDEMO1B | Telephone usage |
SEQNO | Unique sequence number | CVDINFR4 | Ever told had heart attack | VETERAN3 | Veteran status |
SEXVAR | Respondent sex | CVDCRHD4 | Ever told had angina or coronary heart disease | EMPLOY1 | Employment status |
GENHLTH | General health status | CVDSTRK3 | Ever told had stroke | CHILDREN | Number of children |
PHYSHLTH | Physical health status days | ASTHMA3 | Asthma status | INCOME2 | Income levels |
MENTHLTH | Mental health status days | CHCCOPD2 | Chronic obstructive pulmonary disease status | WEIGHT2 | Weight in pounds |
HLTHPLN1 | Health care coverage | ADDEPEV3 | Ever diagnosed with depression | HEIGHT3 | Height in inches |
DEAF | Hearing difficulty | DIFFDRES | Dressing difficulty | FRENCHF1 | French fries or chips consumption |
BLIND | Vision difficulty | DIFFALON | Living alone difficulty | POTATOE1 | Potatoes consumption |
DECIDE | Decision making difficulty | SMOKE100 | Smoking status | VEGETAB2 | Vegetables consumption |
DIFFWALK | Walking difficulty | USENOW3 | Tobacco use | FLUSHOT7 | Flu shot status |
CHCKDNY2 | Chronic kidney disease | ALCDAY5 | Alcohol consumption days | TETANUS1 | Tetanus shot status |
EXERANY2 | Exercise status | STRENGTH | Strength training frequency | PNEUVAC4 | Pneumonia vaccine status |
FRUIT2 | Fruit consumption frequency | FRUITJU2 | Fruit juice consumption | HIVTST7 | HIV test status |
FVGREEN1 | Vegetable consumption frequency | HIVRISK5 | HIV risk assessment | QSTVER | Questionnaire version |
HTIN4 | Height in inches (rounded) | HTM4 | Height in meters | WTKG3 | Weight in kilograms |
DRNKANY5 | Any drinking status | Class | Patient Cancer Status | | |
Columns related to the administration of the interview are removed because they do not provide predictive value and may introduce noise into the model.
Columns: FMONTH, IDATE, IMONTH, IDAY, IYEAR, DISPCODE, SEQNO, MARITAL, EDUCA, RENTHOM1, CPDEMO1B, EMPLOY1, INCOME2, QSTVER, QSTLANG, HTIN4, WEIGHT2, HEIGHT3
library(dplyr)
library(caret)  # provides nearZeroVar() and findCorrelation()
# Drop the columns listed above
df <- df %>%
  select(-FMONTH, -IDATE, -IMONTH, -IDAY, -IYEAR, -SEQNO, -DISPCODE,
         -CPDEMO1B, -MARITAL, -EDUCA, -RENTHOM1, -EMPLOY1, -INCOME2,
         -QSTVER, -QSTLANG, -HTIN4, -WEIGHT2, -HEIGHT3)
# Drop columns with near-zero variance
df <- df %>%
  select(-all_of(nearZeroVar(df, names = TRUE)))
# Drop numeric columns whose absolute pairwise correlation exceeds 0.7
numeric_cols <- df %>% select(where(is.numeric))
highly.correlated.variables <- findCorrelation(cor(numeric_cols, use = "complete.obs"), cutoff = 0.7, names = TRUE)
df <- df %>% select(-all_of(highly.correlated.variables))
Columns with almost zero variance are removed because they provide no useful information for distinguishing between data points and do not contribute to model prediction accuracy.
Columns: CHCKDNY2, DIFFDRES, USENOW3, HIVRISK5
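To make the near-zero-variance criterion concrete, here is a minimal base R sketch. The helper `near_zero_var` and the toy data are hypothetical; the thresholds (a frequency ratio above 95/5 and at most 10% distinct values) mirror the documented defaults of `caret::nearZeroVar()`, but this is an illustration, not the library's implementation.

```r
# Hypothetical helper: flags a column when one value dominates
# and the column has very few distinct values.
near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)            # a single value: zero variance
  freq_ratio <- counts[[1]] / counts[[2]]          # most common / second most common
  pct_unique <- 100 * length(counts) / length(x)   # distinct values per 100 rows
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

# Toy data (not BRFSS): 'rare' is almost constant, 'mixed' is evenly split.
toy <- data.frame(
  rare  = c(rep(2, 98), 1, 1),
  mixed = rep(c(1, 2), 50)
)
flags <- vapply(toy, near_zero_var, logical(1))
print(flags)
```

In this toy case `rare` has a frequency ratio of 49 and is flagged, while `mixed` has a ratio of 1 and is kept.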
Columns with high correlation are often removed to prevent multicollinearity, which can distort the estimated coefficients and reduce model interpretability.
Columns: ALCDAY5
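The correlation filter can be illustrated with a small base R sketch on invented data. The variable names and the toy data frame are assumptions; the 0.7 cutoff matches the one used in this project. Note that `caret::findCorrelation()` goes one step further and decides which member of each pair to drop, which this sketch does not attempt.

```r
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1),  # nearly a copy of a
  c = rnorm(n)                 # independent noise
)

cutoff <- 0.7
cm <- abs(cor(toy, use = "complete.obs"))
cm[upper.tri(cm, diag = TRUE)] <- 0        # keep each pair only once
pairs <- which(cm > cutoff, arr.ind = TRUE)

# List every pair whose absolute correlation exceeds the cutoff
high_pairs <- data.frame(
  var1 = rownames(cm)[pairs[, "row"]],
  var2 = colnames(cm)[pairs[, "col"]],
  abs_cor = cm[pairs]
)
print(high_pairs)
```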
Multiple Correspondence Analysis (MCA) is used to reduce dimensionality and uncover underlying patterns in categorical data, enhancing model simplicity and effectiveness.
Columns Retained: ASTHMA3, CHCCOPD2, DIABETE4, CVDINFR4, FLUSHOT7, SMOKE100, STRENGTH, FRUIT2, FRUITJU2, FRENCHF1, FVGREEN1, ADDEPEV3, POTATOE1, EXERANY2, VEGETAB2, HAVARTH4, DEAF, DIFFALON, PNEUVAC4, DRNKANY5, DIFFWALK, DECIDE
df_factors <- df %>% mutate_if(is.numeric, as.factor)
# Perform MCA
mca_results <- MCA(df_factors, graph = FALSE)
# Get the contributions of variables to the first two dimensions
contrib_dim1 <- mca_results$var$contrib[,1]
contrib_dim2 <- mca_results$var$contrib[,2]
# Combine contributions and identify the top contributing variables
contrib_total <- contrib_dim1 + contrib_dim2
top_vars <- names(sort(contrib_total, decreasing = TRUE)[1:40]) # Top 40 categories, adjust as needed
strip_suffix <- function(var_names) {
  # Remove the first underscore or dot and everything after it
  original_names <- gsub("(_|\\.).*$", "", var_names)
  return(original_names)
}
# Apply the function to top_vars
original_vars <- strip_suffix(top_vars)
# Get unique original variable names
unique_original_vars <- unique(original_vars)
# Filter the dataframe to keep only the top contributing variables
df_principal_vars <- df_factors %>% select(all_of(unique_original_vars))
# Keep the target column Class together with the top contributing variables
df <- df_factors %>%
  select(Class, all_of(unique_original_vars))
The four graphs, representing the columns with the highest variance after performing MCA, illustrate the dominant dimensions of variability, highlighting key categorical relationships and differences within the dataset.
For normalization, recodification based on domain knowledge is essential; values like “Don’t know/Not sure,” “Refused,” and other nonsensical responses are recoded as NA to ensure data quality and consistency, enabling more accurate analysis and comparisons across different variables.
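As a small illustration of the recode-then-normalize pattern, the sketch below uses an invented yes/no item. The helper name `min_max` and the toy vector are assumptions made for this example; the coding 7 = "Don't know/Not sure" and 9 = "Refused" is assumed to follow the usual BRFSS convention for yes/no questions.

```r
# Hypothetical min-max helper: maps the observed range onto [0, 1], ignoring NA.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy item: 1 = yes, 2 = no, 7 = don't know, 9 = refused (assumed coding)
answers <- c(1, 2, 7, 2, 9, 1)
answers[!(answers %in% c(1, 2))] <- NA   # non-substantive codes become NA
scaled <- min_max(answers)
print(scaled)  # 0 1 NA 1 NA 0
```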
scale <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
df[] <- lapply(names(df), function(column_name) {
  # Access the column by its name
  x <- df[[column_name]]
  # Check if the column is not 'Class', is a factor, and needs conversion
  if (column_name != "Class" && is.factor(x)) {
    # Convert factor to numeric by first converting to character
    as.numeric(as.character(x))
  } else {
    # Leave the column as is if it's 'Class', numeric or should not be converted
    x
  }
})
#MENTHLTH
df$MENTHLTH[!(df$MENTHLTH %in% c(1:30))] <- NA
df$MENTHLTH <- scale(df$MENTHLTH)
#CHOLCHK2
df$CHOLCHK2[!(df$CHOLCHK2 %in% c(1,2,3,4,5,6))] <- NA
df$CHOLCHK2<- scale(df$CHOLCHK2)
#CVDINFR4
df$CVDINFR4[!(df$CVDINFR4 %in% c(1,2))] <- NA
df$CVDINFR4<- scale(df$CVDINFR4)
#CVDCRHD4
df$CVDCRHD4[!(df$CVDCRHD4 %in% c(1,2))] <- NA
df$CVDCRHD4<- scale(df$CVDCRHD4)
#ASTHMA3
df$ASTHMA3[!(df$ASTHMA3 %in% c(1,2))] <- NA
df$ASTHMA3 <- scale(df$ASTHMA3)
#CHCCOPD2
df$CHCCOPD2[!(df$CHCCOPD2 %in% c(1,2))] <- NA
df$CHCCOPD2<- scale(df$CHCCOPD2)
#ADDEPEV3
df$ADDEPEV3[!(df$ADDEPEV3 %in% c(1,2))] <- NA
df$ADDEPEV3 <- scale(df$ADDEPEV3)
#HAVARTH4
df$HAVARTH4[!(df$HAVARTH4 %in% c(1,2))] <- NA
df$HAVARTH4 <- scale(df$HAVARTH4)
#VETERAN3
df$VETERAN3[!(df$VETERAN3 %in% c(1,2))] <- NA
df$VETERAN3<- scale(df$VETERAN3)
#DEAF
df$DEAF[!(df$DEAF %in% c(1,2))] <- NA
df$DEAF<- scale(df$DEAF)
#BLIND
df$BLIND[!(df$BLIND %in% c(1,2))] <- NA
df$BLIND<- scale(df$BLIND)
#DECIDE
df$DECIDE[!(df$DECIDE %in% c(1,2))] <- NA
df$DECIDE <- scale(df$DECIDE)
#DIFFWALK
df$DIFFWALK[!(df$DIFFWALK %in% c(1,2))] <- NA
df$DIFFWALK <- scale(df$DIFFWALK)
#DIFFALON
df$DIFFALON[!(df$DIFFALON %in% c(1,2))] <- NA
df$DIFFALON<- scale(df$DIFFALON)
#SMOKE100
df$SMOKE100[!(df$SMOKE100 %in% c(1,2))] <- NA
df$SMOKE100<- scale(df$SMOKE100)
#EXERANY2
df$EXERANY2[!(df$EXERANY2 %in% c(1,2))] <- NA
df$EXERANY2<- scale(df$EXERANY2)
#FLUSHOT7
df$FLUSHOT7[!(df$FLUSHOT7 %in% c(1,2))] <- NA
df$FLUSHOT7 <- scale(df$FLUSHOT7)
#PNEUVAC4
df$PNEUVAC4[!(df$PNEUVAC4 %in% c(1,2))] <- NA
df$PNEUVAC4 <- scale(df$PNEUVAC4)
#HIVTST7
df$HIVTST7[!(df$HIVTST7 %in% c(1,2))] <- NA
df$HIVTST7 <- scale(df$HIVTST7)
#DRNKANY5
df$DRNKANY5[!(df$DRNKANY5 %in% c(1,2))] <- NA
df$DRNKANY5 <- scale(df$DRNKANY5)
#TETANUS1
df$TETANUS1[(df$TETANUS1 %in% c(7,9))] <- NA
df$TETANUS1 <- scale(df$TETANUS1)
#DIABETE4
df$DIABETE4 <- ifelse(!(df$DIABETE4 %in% c(1, 2, 3, 4)), NA, df$DIABETE4)
df$DIABETE4 <- recode(df$DIABETE4, `1` = 8, `2` = 7, `3` = 5, `4` = 6)
df$DIABETE4 <- scale(df$DIABETE4)
#STRENGTH
df$STRENGTH[df$STRENGTH %in% c(200, 777, 999)] <- NA
df$STRENGTH[df$STRENGTH == 888] <- 0
df$STRENGTH <- ifelse(df$STRENGTH %in% 101:199, df$STRENGTH %% 100, df$STRENGTH)
df$STRENGTH <- ifelse(df$STRENGTH %in% 201:299, df$STRENGTH %% 200, df$STRENGTH)
df$STRENGTH <- round(scale(df$STRENGTH),1)
#FRUIT2
df$FRUIT2[df$FRUIT2 %in% c(777, 999)] <- NA
df$FRUIT2[df$FRUIT2 %in% 555] <- 0
df$FRUIT2[df$FRUIT2 %in% 300] <- 0.5
df$FRUIT2 <- ifelse(df$FRUIT2 %in% 101:199, df$FRUIT2 %% 100, df$FRUIT2)
df$FRUIT2 <- ifelse(df$FRUIT2 %in% 201:299, df$FRUIT2 %% 200, df$FRUIT2)
df$FRUIT2 <- ifelse(df$FRUIT2 %in% 301:399, df$FRUIT2 %% 300, df$FRUIT2)
df$FRUIT2 <- round(scale(df$FRUIT2),2)
#FRUITJU2
df$FRUITJU2[df$FRUITJU2 %in% c(777, 999)] <- NA
df$FRUITJU2[df$FRUITJU2 %in% c(555, 300)] <- 0
df$FRUITJU2 <- ifelse(df$FRUITJU2 %in% 101:199, df$FRUITJU2 %% 100, df$FRUITJU2)
df$FRUITJU2 <- ifelse(df$FRUITJU2 %in% 201:299, df$FRUITJU2 %% 200, df$FRUITJU2)
df$FRUITJU2 <- ifelse(df$FRUITJU2 %in% 301:399, df$FRUITJU2 %% 300, df$FRUITJU2)
df$FRUITJU2 <- round(scale(df$FRUITJU2),1)
#FVGREEN1
df$FVGREEN1[df$FVGREEN1 %in% c(777, 999)] <- NA
df$FVGREEN1[df$FVGREEN1 %in% c(555, 300)] <- 0
df$FVGREEN1 <- ifelse(df$FVGREEN1 %in% 101:199, df$FVGREEN1 %% 100 , df$FVGREEN1)
df$FVGREEN1 <- ifelse(df$FVGREEN1 %in% 201:299, df$FVGREEN1 %% 200 , df$FVGREEN1)
df$FVGREEN1 <- ifelse(df$FVGREEN1 %in% 301:399, df$FVGREEN1 %% 300, df$FVGREEN1)
df$FVGREEN1 <- round(scale(df$FVGREEN1),1)
#FRENCHF1
df$FRENCHF1[df$FRENCHF1 %in% c(777, 999, 200)] <- NA
df$FRENCHF1[df$FRENCHF1 %in% c(555, 300)] <- 0
df$FRENCHF1 <- ifelse(df$FRENCHF1 %in% 101:199, df$FRENCHF1 %% 100, df$FRENCHF1)
df$FRENCHF1 <- ifelse(df$FRENCHF1 %in% 201:299, df$FRENCHF1 %% 200, df$FRENCHF1)
df$FRENCHF1 <- ifelse(df$FRENCHF1 %in% 301:399, df$FRENCHF1 %% 300, df$FRENCHF1)
df$FRENCHF1 <- round(scale(df$FRENCHF1),1)
#POTATOE1
df$POTATOE1[df$POTATOE1 %in% c(777, 999)] <- NA
df$POTATOE1[df$POTATOE1 %in% c(555, 300)] <- 0
df$POTATOE1 <- ifelse(df$POTATOE1 %in% 101:199, df$POTATOE1 %% 100, df$POTATOE1)
df$POTATOE1 <- ifelse(df$POTATOE1 %in% 201:299, df$POTATOE1 %% 200, df$POTATOE1)
df$POTATOE1 <- ifelse(df$POTATOE1 %in% 301:399, df$POTATOE1 %% 300, df$POTATOE1)
df$POTATOE1 <- round(scale(df$POTATOE1),1)
#VEGETAB2
df$VEGETAB2[df$VEGETAB2 %in% c(777, 999)] <- NA
df$VEGETAB2[df$VEGETAB2 %in% c(555, 300)] <- 0
df$VEGETAB2 <- ifelse(df$VEGETAB2 %in% 101:199, df$VEGETAB2 %% 100, df$VEGETAB2)
df$VEGETAB2 <- ifelse(df$VEGETAB2 %in% 201:299, df$VEGETAB2 %% 200, df$VEGETAB2)
df$VEGETAB2 <- ifelse(df$VEGETAB2 %in% 301:399, df$VEGETAB2 %% 300, df$VEGETAB2)
df$VEGETAB2 <- round(scale(df$VEGETAB2),1)
# Build a decision tree that predicts Class from HTM4 and use its split points,
# chosen by entropy (information gain), as bin boundaries for HTM4.
# If the tree has no splits, fall back to the 33rd and 66th percentiles.
#HTM4
tree_model <- rpart(Class ~ HTM4, data = df, method = "class",
                    parms = list(split = "information"),
                    control = rpart.control(minsplit = 1, cp = 0.001))
# Split thresholds are stored in the "index" column of tree_model$splits (NULL when the tree has no splits)
splits <- if (is.null(tree_model$splits)) numeric(0) else sort(unique(unname(tree_model$splits[, "index"])))
if (length(splits) == 0) {
  splits <- quantile(df$HTM4, probs = c(0.33, 0.66), na.rm = TRUE)
}
labels <- as.character(1:(length(splits) + 1))
df$HTM4 <- cut(df$HTM4, breaks = c(-Inf, splits, Inf), include.lowest = TRUE, labels = labels)
df$HTM4 <- as.numeric(df$HTM4)
df$HTM4 <- scale(df$HTM4)
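The percentile fallback used for HTM4 can be sketched on its own in base R. The simulated heights below are invented for illustration and are not BRFSS values; only the 33rd/66th percentile cut points come from the code above.

```r
set.seed(42)
# Toy heights (assumed to be in metres), not taken from the survey
height <- round(rnorm(300, mean = 1.70, sd = 0.10), 2)

# Two cut points give three ordered bins
breaks <- quantile(height, probs = c(0.33, 0.66), na.rm = TRUE)
bins <- cut(height, breaks = c(-Inf, breaks, Inf), labels = c("1", "2", "3"))

bin_counts <- table(bins)
print(bin_counts)
```

Each bin holds roughly a third of the observations, which is why this fallback is a reasonable default when the tree finds no informative split.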
We automate missing value imputation in a dataframe using linear regression models that leverage strong variable correlations to maintain data integrity and enhance analysis accuracy.
Step | Description |
---|---|
1 | Compute Correlation Matrix: Calculate the correlation matrix only from complete observations. |
2 | Convert Correlation to Dataframe: Transform the correlation matrix to a dataframe for ease of manipulation. |
3 | Identify Columns with NAs: Detect all columns that contain missing values. |
4 | Find Highest Correlated Columns: For each column with missing values, identify the column that has the highest correlation with it. |
5 | Build Regression Models: Construct a regression model for each pair of highly correlated columns. |
6 | Predict Missing Values: Use the regression models to estimate and fill in the missing values. |
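The core of steps 5 and 6 can be sketched with a single pair of columns. The toy data frame and its column names `x` and `y` are assumptions for illustration; the full procedure additionally selects the predictor column from the correlation matrix.

```r
set.seed(7)
x <- runif(50, 0, 1)
toy <- data.frame(x = x, y = 0.2 + 0.6 * x + rnorm(50, sd = 0.02))
toy$y[c(3, 10, 25)] <- NA   # knock out three values to impute

# Rows where y is missing but the predictor is available
missing_rows <- which(is.na(toy$y) & !is.na(toy$x))

# lm() drops the incomplete rows when fitting
fit <- lm(y ~ x, data = toy)

# Fill the gaps with the model's predictions
toy$y[missing_rows] <- predict(fit, newdata = toy[missing_rows, , drop = FALSE])
print(toy[c(3, 10, 25), ])
```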
This graph, a correlation heatmap, visually represents the strength and direction of relationships between variables in a dataset, using color intensity to indicate the degree of correlation.
fill_missing_with_regression <- function(df) {
  # 1
  numeric_cols <- df %>% select(where(is.numeric))
  correlation_matrix <- cor(numeric_cols, use = "complete.obs")
  # 2
  correlation_df <- as.data.frame(correlation_matrix) %>%
    rownames_to_column(var = "Row") %>%
    pivot_longer(cols = -Row, names_to = "Column", values_to = "Correlation") %>%
    filter(Row != Column, !is.na(Correlation)) %>%
    arrange(desc(abs(Correlation)))
  # 3
  cols_with_na <- names(df)[colSums(is.na(df)) > 0]
  print(paste("Columns with NA:", toString(cols_with_na)))
  for (col in cols_with_na) {
    # 4
    top_correlation <- correlation_df %>%
      filter(Row == col | Column == col) %>%
      slice_max(abs(Correlation), n = 1)
    if (nrow(top_correlation) > 0) {
      predictor_col <- ifelse(top_correlation$Row[1] == col, top_correlation$Column[1], top_correlation$Row[1])
      print(paste("Processing:", col, "using", predictor_col))
      missing_indices <- which(is.na(df[[col]]) & !is.na(df[[predictor_col]]))
      if (length(missing_indices) == 0) {
        print(paste("No valid data to impute for column:", col))
        next
      }
      # 5
      formula <- as.formula(paste(col, "~", predictor_col))
      model <- lm(formula, data = df, na.action = na.exclude)
      # 6
      predictions <- predict(model, newdata = df[missing_indices, , drop = FALSE], type = "response")
      df[[col]][missing_indices] <- predictions
    } else {
      print(paste("No correlations found for column:", col))
    }
  }
  return(df)
}
df <- fill_missing_with_regression(df)
df[df < 0] <- 0
df <- df[rowSums(is.na(df)) <= 3, ]
This table showcases five observations from the dataset, following the completion of the pre-processing steps. It serves as an example of how the data has been cleaned and structured, ready for further analysis or modeling.
In this phase, we construct predictive models using various machine learning algorithms. The goal is to develop models that can accurately predict outcomes based on input data. We experiment with different algorithms, tune hyperparameters, and evaluate model performance to select the best-performing models. This step involves iterative testing and refinement to ensure robustness and reliability in predictions.
## Splitting the Data 📏
Data splitting is a crucial step in model building where the dataset is divided into training and testing subsets. The training set is used to train the models, while the testing set is reserved for evaluating model performance on unseen data.
To avoid data leakage, we split our balanced dataset carefully. The entire dataset is first divided into separate training and testing sets, and no data point appears in both. This prevents the model from having prior knowledge of the test data, which can lead to overfitting and unrealistic performance estimates.
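A split of this kind can be sketched in base R as follows. The 70/30 ratio, the toy data frame, and the choice to stratify by Class are assumptions for illustration; the project's actual splitting methods may differ.

```r
set.seed(123)
toy <- data.frame(
  feature = rnorm(100),
  Class   = factor(rep(c(0, 1), times = c(80, 20)))
)

# Sample 70% of the row indices within each class (assumed ratio),
# so both subsets keep the original class proportions.
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$Class), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# No row may appear in both subsets
overlap <- intersect(rownames(train), rownames(test))
print(c(train = nrow(train), test = nrow(test), overlap = length(overlap)))
```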
Information Gain is a method used to identify and select the most relevant features in a dataset. By measuring how well a feature splits the data, it helps in reducing dimensionality and focusing on the variables that contribute the most to the model’s predictive power.
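For intuition, information gain is the entropy of the class labels minus the weighted entropy that remains after splitting on a feature. The sketch below computes it by hand in base R; the helper names `entropy` and `info_gain` and the toy vectors are hypothetical and are not the functions used in this project.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a categorical feature with respect to the labels
info_gain <- function(feature, labels) {
  weights <- table(feature) / length(feature)
  conditional <- sum(vapply(names(weights), function(level) {
    weights[[level]] * entropy(labels[feature == level])
  }, numeric(1)))
  entropy(labels) - conditional
}

labels        <- c(1, 1, 1, 1, 0, 0, 0, 0)
informative   <- c("a", "a", "a", "a", "b", "b", "b", "b")  # separates the classes perfectly
uninformative <- c("a", "b", "a", "b", "a", "b", "a", "b")  # unrelated to the classes

gains <- c(informative = info_gain(informative, labels),
           uninformative = info_gain(uninformative, labels))
print(gains)  # informative = 1, uninformative = 0
```

A feature that separates the classes perfectly recovers the full 1 bit of label entropy, while an unrelated feature gains nothing.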
“Boruta is an all relevant feature selection method, while most other are minimal optimal; this means it tries to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some classifier has a minimal error.”
“RFE iteratively removes less important features, creating a subset that maximizes predictive accuracy. By leveraging a machine learning algorithm and an importance-ranking metric, RFE evaluates each feature’s impact on model performance.”
We use Information Gain to select the most informative features for the AdaBoost model. This enhances AdaBoost’s efficiency and accuracy by focusing on features that significantly reduce uncertainty, improving overall model performance.
library(dplyr)
library(knitr)
calculate_measures_per_class_with_averages <- function(cm) {
  # cm is a 2x2 confusion matrix; rows are treated as actual classes, columns as predicted
  results <- data.frame(
    Class = character(),
    TPR = numeric(),
    FPR = numeric(),
    Precision = numeric(),
    Recall = numeric(),
    F_measure = numeric(),
    MCC = numeric(),
    Kappa = numeric(),
    stringsAsFactors = FALSE
  )
  # Observed (po) and chance (pe) agreement for Cohen's kappa
  total <- sum(cm)
  po <- (cm[1, 1] + cm[2, 2]) / total
  pe <- ((cm[1, 1] + cm[1, 2]) * (cm[1, 1] + cm[2, 1]) + (cm[2, 1] + cm[2, 2]) * (cm[1, 2] + cm[2, 2])) / total^2
  for (i in 1:2) {
    if (i == 1) {
      tp <- cm[1, 1]
      fp <- cm[2, 1]
      fn <- cm[1, 2]
      tn <- cm[2, 2]
      class_label <- "Class 0"
    } else {
      tp <- cm[2, 2]
      fp <- cm[1, 2]
      fn <- cm[2, 1]
      tn <- cm[1, 1]
      class_label <- "Class 1"
    }
    tpr <- round(tp / (tp + fn), 1)
    fpr <- round(fp / (fp + tn), 1)
    precision <- round(tp / (tp + fp), 1)
    recall <- round(tpr, 1) # recall is the same as TPR
    f_measure <- round(ifelse((precision + recall) == 0, 0, (2 * precision * recall) / (precision + recall)), 1)
    mcc <- round(ifelse((sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))) == 0, 0,
                        (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))), 1)
    kappa <- round((po - pe) / (1 - pe), 1)
    results <- rbind(results, data.frame(
      Class = class_label,
      TPR = tpr,
      FPR = fpr,
      Precision = precision,
      Recall = recall,
      F_measure = f_measure,
      MCC = mcc,
      Kappa = kappa
    ))
  }
  # Weight each class by its number of actual observations
  supports <- rowSums(cm)
  total_support <- sum(supports)
  weighted_avgs <- sapply(results[, -1], function(x) sum(x * supports) / total_support, simplify = FALSE)
  weighted_row <- c("Wt. Average", sapply(unlist(weighted_avgs), function(x) round(x, 1)))
  results <- rbind(results, setNames(as.list(weighted_row), names(results)))
  results <- na.omit(results) # Remove rows with NA values
  results <- arrange(results, desc(Class)) # Optional: Sort by Class in descending order
  # Use knitr::kable() to create a nicely formatted table
  kable(results, format = "html", caption = "Class Performance Metrics with Weighted Averages")
}
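To show how Kappa and MCC are derived from a confusion matrix, here is a small worked example in base R. The counts are invented, and the layout (rows = actual, columns = predicted) is an assumption chosen to match the indexing used in the function above.

```r
# Toy confusion matrix with invented counts
cm <- matrix(c(70, 10,
               10, 10), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Counts from the point of view of Class 1
tp <- cm["1", "1"]; fn <- cm["1", "0"]
fp <- cm["0", "1"]; tn <- cm["0", "0"]

total <- sum(cm)
po <- (tp + tn) / total                                               # observed agreement
pe <- ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total^2       # agreement expected by chance
cohen_kappa <- (po - pe) / (1 - pe)

mcc <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(c(accuracy = po, kappa = cohen_kappa, mcc = mcc))  # 0.8, 0.375, 0.375
```

Even with 80% accuracy, both chance-corrected measures stay at 0.375 here, which illustrates why Kappa and MCC are reported alongside TPR and Precision on imbalanced data.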
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.6 | 0.4 | 0.8 | 0.6 | 0.6 | 0.2 | 0.2 |
Class 1 | 0.6 | 0.4 | 0.3 | 0.6 | 0.4 | 0.2 | 0.2 |
Class 0 | 0.6 | 0.4 | 0.9 | 0.6 | 0.7 | 0.2 | 0.2 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.7 | 0.5 | 0.6 | 0.7 | 0.6 | 0.2 | 0.2 |
Class 1 | 0.3 | 0.1 | 0.6 | 0.3 | 0.4 | 0.2 | 0.2 |
Class 0 | 0.9 | 0.7 | 0.6 | 0.9 | 0.7 | 0.2 | 0.2 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.7 | 0.5 | 0.6 | 0.7 | 0.6 | 0.2 | 0.2 |
Class 1 | 0.3 | 0.1 | 0.6 | 0.3 | 0.4 | 0.2 | 0.2 |
Class 0 | 0.9 | 0.7 | 0.6 | 0.9 | 0.7 | 0.2 | 0.2 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.6 | 0.5 | 0.6 | 0.6 | 0.5 | 0.2 | 0.1 |
Class 1 | 0.2 | 0.1 | 0.6 | 0.2 | 0.3 | 0.2 | 0.1 |
Class 0 | 0.9 | 0.8 | 0.6 | 0.9 | 0.7 | 0.2 | 0.1 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.7 | 0.5 | 0.6 | 0.7 | 0.6 | 0.2 | 0.2 |
Class 1 | 0.3 | 0.1 | 0.6 | 0.3 | 0.4 | 0.2 | 0.2 |
Class 0 | 0.9 | 0.7 | 0.6 | 0.9 | 0.7 | 0.2 | 0.2 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.4 | 1 | 0.8 | 0.9 | 0.1 | 0.1 |
Class 1 | 0.6 | 0.2 | 0.1 | 0.6 | 0.2 | 0.1 | 0.1 |
Class 0 | 0.8 | 0.4 | 1 | 0.8 | 0.9 | 0.1 | 0.1 |
Random Forest reduces overfitting and improves generalization. The RF model can effectively handle large datasets with higher dimensionality and provides insights into feature importance, which enhances its interpretability and performance. This approach enhances the classifier’s accuracy by leveraging the strengths of multiple decision trees and their collective prediction power.
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.7 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class 1 | 0.3 | 0.2 | 0 | 0.3 | 0 | 0.1 | 0 |
Class 0 | 0.8 | 0.7 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 1 | 1 | 0.8 | 0.9 | 0 | 0 |
Class 1 | 0 | 0.2 | 0 | 0 | 0 | 0 | 0 |
Class 0 | 0.8 | 1 | 1 | 0.8 | 0.9 | 0 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class 1 | 0.4 | 0.2 | 0 | 0.4 | 0 | 0.1 | 0 |
Class 0 | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.3 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class 1 | 0.7 | 0.2 | 0 | 0.7 | 0 | 0.1 | 0 |
Class 0 | 0.8 | 0.3 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 1 | 1 | 0.8 | 0.9 | 0 | 0 |
Class 1 | 0 | 0.2 | 0 | 0 | 0 | 0 | 0 |
Class 0 | 0.8 | 1 | 1 | 0.8 | 0.9 | 0 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.4 | 1 | 0.8 | 0.9 | 0.1 | 0.1 |
Class 1 | 0.6 | 0.2 | 0.1 | 0.6 | 0.2 | 0.1 | 0.1 |
Class 0 | 0.8 | 0.4 | 1 | 0.8 | 0.9 | 0.1 | 0.1 |
The Generalized Linear Model (GLM) extends linear regression to accommodate non-normal error distributions and diverse response types, such as binary or count data. It adapts to various scenarios using link functions, enhancing its versatility for modeling different outcomes.
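For a binary outcome such as Class, the usual GLM is logistic regression with a logit link. The sketch below fits one on simulated data with base R's `glm()`; the variable `risk`, the coefficients, and the sample size are invented for illustration and do not reflect the project's fitted model.

```r
set.seed(2024)
n <- 500
risk <- runif(n)                    # toy risk score in [0, 1]
p <- plogis(-2 + 3 * risk)          # assumed true probability of Class = 1
toy <- data.frame(risk = risk, Class = rbinom(n, 1, p))

# Binomial family with logit link: models the log-odds of Class = 1
fit <- glm(Class ~ risk, data = toy, family = binomial(link = "logit"))

# Predicted probabilities for a low-risk and a high-risk respondent
prob <- predict(fit, newdata = data.frame(risk = c(0.1, 0.9)), type = "response")
print(round(prob, 2))
```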
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.7 | 1 | 0.8 | 0.9 | 0 | 0 |
Class 1 | 0.3 | 0.2 | 0 | 0.3 | 0 | 0 | 0 |
Class 0 | 0.8 | 0.7 | 1 | 0.8 | 0.9 | 0 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class 1 | 0.4 | 0.2 | 0 | 0.4 | 0 | 0.1 | 0 |
Class 0 | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.5 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class 1 | 0.5 | 0.2 | 0 | 0.5 | 0 | 0.1 | 0 |
Class 0 | 0.8 | 0.5 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.4 | 1 | 0.8 | 0.9 | 0.1 | 0.1 |
Class 1 | 0.6 | 0.2 | 0 | 0.6 | 0 | 0.1 | 0.1 |
Class 0 | 0.8 | 0.4 | 1 | 0.8 | 0.9 | 0.1 | 0.1 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0.1 |
Class 1 | 0.4 | 0.2 | 0.1 | 0.4 | 0.2 | 0.1 | 0.1 |
Class 0 | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0.1 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.5 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class 1 | 0.5 | 0.2 | 0 | 0.5 | 0 | 0.1 | 0 |
Class 0 | 0.8 | 0.5 | 1 | 0.8 | 0.9 | 0.1 | 0 |
The SVM (Support Vector Machine) classifier identifies the optimal boundary between classes by maximizing the margin of separation in high-dimensional space. It excels in binary classification tasks and can handle nonlinear relationships using kernel functions.
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.7 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class 1 | 0.3 | 0.2 | 0 | 0.3 | 0 | 0.1 | 0 |
Class 0 | 0.8 | 0.7 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class 1 | 0.4 | 0.2 | 0 | 0.4 | 0 | 0.1 | 0 |
Class 0 | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class 1 | 0.4 | 0.2 | 0 | 0.4 | 0 | 0.1 | 0 |
Class 0 | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.8 | 0.7 | 0.8 | 0.7 | 0.1 | 0.1 |
Class 1 | 0 | 0 | 0.5 | 0 | 0 | 0.1 | 0.1 |
Class 0 | 1 | 1 | 0.8 | 1 | 0.9 | 0.1 | 0.1 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.8 | 0.7 | 0.8 | 0.7 | 0.1 | 0 |
Class 1 | 0 | 0 | 0.4 | 0 | 0 | 0.1 | 0 |
Class 0 | 1 | 1 | 0.8 | 1 | 0.9 | 0.1 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.8 | 0.8 | 0.8 | 0.7 | 0.1 | 0.1 |
Class 1 | 0 | 0 | 0.6 | 0 | 0 | 0.1 | 0.1 |
Class 0 | 1 | 1 | 0.8 | 1 | 0.9 | 0.1 | 0.1 |
The ctree (Conditional Inference Tree) classifier constructs decision trees based on statistically significant associations. It uses permutation tests for independence between response and predictors to make unbiased splits, minimizing overfitting. This method provides a robust approach to modeling non-linear relationships and interactions among features.
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.7 | 0.7 | 0.8 | 0.8 | 0.1 | 0.1 |
Class 1 | 0.1 | 0 | 0.4 | 0.1 | 0.2 | 0.1 | 0.1 |
Class 0 | 1 | 0.9 | 0.8 | 1 | 0.9 | 0.1 | 0.1 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.8 | 0.7 | 0.8 | 0.7 | 0 | 0 |
Class 1 | 0 | 0 | 0.3 | 0 | 0 | 0 | 0 |
Class 0 | 1 | 1 | 0.8 | 1 | 0.9 | 0 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.8 | 0.7 | 0.8 | 0.7 | 0 | 0 |
Class 1 | 0 | 0 | 0.3 | 0 | 0 | 0 | 0 |
Class 0 | 1 | 1 | 0.8 | 1 | 0.9 | 0 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.8 | 0.7 | 0.8 | 0.7 | 0.1 | 0 |
Class 1 | 0 | 0 | 0.4 | 0 | 0 | 0.1 | 0 |
Class 0 | 1 | 1 | 0.8 | 1 | 0.9 | 0.1 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.7 | 0.7 | 0.8 | 0.8 | 0 | 0 |
Class 1 | 0.1 | 0 | 0.3 | 0.1 | 0.1 | 0 | 0 |
Class 0 | 1 | 0.9 | 0.8 | 1 | 0.9 | 0 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.8 | 0.7 | 0.8 | 0.7 | 0.1 | 0 |
Class 1 | 0 | 0 | 0.5 | 0 | 0 | 0.1 | 0 |
Class 0 | 1 | 1 | 0.8 | 1 | 0.9 | 0.1 | 0 |
The gcvEarth classifier utilizes multivariate adaptive regression splines (MARS) to model complex nonlinear relationships in data. It employs generalized cross-validation (GCV) to optimize model complexity, focusing on essential interactions and trends for improved predictive accuracy. This approach allows it to handle varying data structures effectively.
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.7 | 1 | 0.8 | 0.9 | 0 | 0 |
Class 1 | 0.3 | 0.2 | 0 | 0.3 | 0 | 0 | 0 |
Class 0 | 0.8 | 0.7 | 1 | 0.8 | 0.9 | 0 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0.1 |
Class 1 | 0.4 | 0.2 | 0.1 | 0.4 | 0.2 | 0.1 | 0.1 |
Class 0 | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0.1 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class 1 | 0.4 | 0.2 | 0 | 0.4 | 0 | 0.1 | 0 |
Class 0 | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class 1 | 0.4 | 0.2 | 0 | 0.4 | 0 | 0.1 | 0 |
Class 0 | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class 1 | 0.4 | 0.2 | 0 | 0.4 | 0 | 0.1 | 0 |
Class 0 | 0.8 | 0.6 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class | TPR | FPR | Precision | Recall | F_measure | MCC | Kappa |
---|---|---|---|---|---|---|---|
Wt. Average | 0.8 | 0.5 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Class 1 | 0.5 | 0.2 | 0 | 0.5 | 0 | 0.1 | 0 |
Class 0 | 0.8 | 0.5 | 1 | 0.8 | 0.9 | 0.1 | 0 |
Model 6 (ADA with RFE - Data splitting method 2) shows a weighted-average F-measure of 0.9, the highest value observed, which suggests strong overall predictive performance. Additionally, its Kappa and MCC, though modest at 0.1, are relatively good considering the spread of metrics in other models. This model achieves a good balance between TPR (True Positive Rate) and Precision, indicating effective handling of both positive class predictions and overall accuracy.
Based on the overview of the provided metrics, Model 6 stands out with its high F-measure and balanced performance across other metrics. It shows a strong ability to classify cases correctly and reliably. The use of AdaBoost with RFE (Recursive Feature Elimination) and Data Splitting Method 2 appears to provide a robust model capable of managing the intricacies of the dataset effectively.
The recommended Model 6 should be considered for further testing and validation on a separate dataset to confirm its effectiveness outside the provided data. The high performance across key metrics suggests that it could be reliable for practical applications, assuming the underlying data characteristics do not differ significantly from those seen in training and testing.