!pip install keras-tuner
!pip install xgboost
!pip install shap
Successfully installed keras-tuner-1.4.7 kt-legacy-1.0.5
Requirement already satisfied: xgboost in /usr/local/lib/python3.11/dist-packages (2.1.4)
Requirement already satisfied: shap in /usr/local/lib/python3.11/dist-packages (0.46.0)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.metrics import AUC, Precision, Recall
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l1_l2
from tensorflow import keras
from tensorflow.keras.optimizers import Adam, SGD, RMSprop
import keras_tuner as kt
import xgboost as xgb
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, roc_auc_score
from scipy.stats import uniform, randint
import shap
from sklearn.impute import KNNImputer
# Stage 1 file URL
file_url = "https://drive.google.com/uc?id=1pA8DDYmQuaLyxADCOZe1QaSQwF16q1J6"
data = pd.read_csv(file_url)
def dropout_data_check(data):
    print("Missing values:\n", round((data.isnull().sum() / len(data)) * 100, 0))
    print("Checking for duplicate values:\n", data.duplicated().sum())
    print("Data Shape:\n", data.shape)
dropout_data_check(data)
Missing values:
CentreName                0.0
LearnerCode               0.0
BookingType               0.0
LeadSource                0.0
DiscountType             70.0
DateofBirth               0.0
Gender                    0.0
Nationality               0.0
HomeState                64.0
HomeCity                 14.0
CourseLevel               0.0
CourseName                0.0
IsFirstIntake             0.0
CompletedCourse           0.0
ProgressionDegree         3.0
ProgressionUniversity     0.0
dtype: float64
Checking for duplicate values: 0
Data Shape: (25059, 16)
data["CompletedCourse"].value_counts().plot(kind = "bar")
plt.xlabel('Target Classes')
plt.ylabel('Frequency')
plt.title('Distribution of Target Classes')
plt.show()
print(round((data["CompletedCourse"].value_counts() / len(data)) * 100))
CompletedCourse
Yes    85.0
No     15.0
Name: count, dtype: float64
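The target is skewed roughly 85/15 toward completers. One optional mitigation, not applied in this notebook, is a stratified split plus class weights; a minimal sketch, assuming the feature matrix X and target y constructed further below:

# Hypothetical mitigation for the 85/15 class skew (not used in this notebook).
# stratify=y preserves the class ratio across train/test; "balanced" weights
# up-weight the minority class and could be passed to Keras via model.fit(class_weight=...).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
weights = compute_class_weight("balanced", classes=np.unique(y_tr), y=y_tr)
class_weight = dict(zip(np.unique(y_tr), weights))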
def initial_preprocessing(data, threshold, cardinality):
    data = data.copy()
    # Lowercase all string values for uniformity in categorical features
    data = data.map(lambda x: x.lower() if isinstance(x, str) else x)
    # Missing-data percentage per column
    missing_percentage = (data.isnull().sum() / len(data)) * 100
    # Cardinality per column
    cardinality_cal = data.nunique()
    # Age calculation from date of birth
    data["DateofBirth"] = pd.to_datetime(data["DateofBirth"], dayfirst=True)
    current_date = pd.to_datetime("today")
    data["Age"] = (current_date - data["DateofBirth"]).dt.days // 365
    # Drop columns with high missingness or high cardinality
    high_missing_percentage = missing_percentage[missing_percentage > threshold].index.tolist()
    high_cardinality = cardinality_cal[cardinality_cal > cardinality].index.tolist()
    feature_to_drop = pd.concat([pd.Series(high_missing_percentage), pd.Series(high_cardinality)]).drop_duplicates().tolist()
    return data.drop(feature_to_drop, axis=1), feature_to_drop
data, dropped_features = initial_preprocessing(data, 50, 200)
dropout_data_check(data)
dropped_features
Missing values:
CentreName               0.0
BookingType              0.0
LeadSource               0.0
Gender                   0.0
Nationality              0.0
CourseLevel              0.0
CourseName               0.0
IsFirstIntake            0.0
CompletedCourse          0.0
ProgressionUniversity    0.0
Age                      0.0
dtype: float64
Checking for duplicate values: 13490
Data Shape: (25059, 11)
['DiscountType', 'HomeState', 'LearnerCode', 'DateofBirth', 'HomeCity', 'ProgressionDegree']
def feature_encoding(data, one_hot, binary):
    data = data.copy()
    label = LabelEncoder()
    # One-hot encoding
    data = pd.get_dummies(data, columns=one_hot, drop_first=True)
    # Binary (label) encoding
    for i in binary:
        data[i] = label.fit_transform(data[i])
    return data
one_hot = ["CentreName", "BookingType", "LeadSource", "Gender", "Nationality", "CourseLevel", "CourseName", "ProgressionUniversity"]
binary = ["IsFirstIntake", "CompletedCourse"]
data = feature_encoding(data, one_hot, binary)
# dataset split and scaling of the age feature
X = data.drop("CompletedCourse", axis = 1)
y = data["CompletedCourse"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)
standardscaler = StandardScaler()
X_train["Age"] = standardscaler.fit_transform(X_train[["Age"]])
X_test["Age"] = standardscaler.transform(X_test[["Age"]])
dropout_data_check(data)
Missing values:
IsFirstIntake             0.0
CompletedCourse           0.0
Age                       0.0
CentreName_isc_cardiff    0.0
CentreName_isc_dublin     0.0
...
ProgressionUniversity_university of sheffield international college    0.0
ProgressionUniversity_university of strathclyde                        0.0
ProgressionUniversity_university of surrey                             0.0
ProgressionUniversity_university of sussex                             0.0
ProgressionUniversity_vu amsterdam                                     0.0
Length: 392, dtype: float64
Checking for duplicate values: 13490
Data Shape: (25059, 392)
def model_builder(hp, data):
    inputs = keras.Input(shape=(data.shape[1],))
    hidden = inputs
    # Number of hidden layers: between 1 and 3, stepping by 1 per tuning trial
    num_layers = hp.Int("num_layers", min_value=1, max_value=3, step=1)
    units = hp.Int("units", min_value=32, max_value=128, step=32)
    activation = hp.Choice("activation", values=["relu", "tanh", "swish"])
    dropout_rate = hp.Float("dropout_rate", min_value=0.2, max_value=0.5, step=0.1)
    reg = hp.Float("reg", min_value=1e-4, max_value=1e-2, sampling="log")
    optimizer = hp.Choice("optimizer", values=["Adam", "SGD", "RMSprop"])
    for _ in range(num_layers):
        hidden = Dense(units=units, activation=activation, kernel_regularizer=l1_l2(reg))(hidden)
        hidden = Dropout(dropout_rate)(hidden)
    output = Dense(units=1, activation="sigmoid")(hidden)
    model = keras.Model(inputs=inputs, outputs=output)
    model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy", AUC(), Recall(), Precision()])
    return model
tuner = kt.RandomSearch(
    lambda hp: model_builder(hp, X_train),
    objective="val_accuracy",
    max_trials=10,
    executions_per_trial=1,
    directory="dir",
    project_name="model_x",
)
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
tuner.search(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stop], verbose=1)
Trial 10 Complete [00h 03m 38s] val_accuracy: 0.8955112099647522 Best val_accuracy So Far: 0.8960099816322327 Total elapsed time: 00h 26m 43s
# best model vs. third-best baseline
def model_comparison(tuner, hyperparameters, X_train, y_train, X_test, y_test):
    early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
    model = tuner.hypermodel.build(hyperparameters)
    history = model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stop], verbose=0)
    loss, accuracy, auc, recall, precision = model.evaluate(X_test, y_test)
    y_pred = model.predict(X_test)
    y_pred_class = (y_pred > 0.5).astype(int)
    cm = confusion_matrix(y_test, y_pred_class)
    print(f"Accuracy: {accuracy:.2f}, \n"
          f"AUC: {auc:.2f}, \n"
          f"Precision: {precision:.2f}, \n"
          f"Recall: {recall:.2f}, \n"
          f"Confusion Matrix: \n{cm}")
    # Training vs. validation loss curves
    train_loss = history.history['loss']
    val_loss = history.history['val_loss']
    plt.figure(figsize=(10, 6))
    plt.plot(train_loss, label='Training Loss', color='blue')
    plt.plot(val_loss, label='Validation Loss', color='orange')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.title('Training and Validation Loss Curves')
    plt.legend()
    plt.show()
# Best model
print("\n\033[1m Best performing Model after tuning\033[0m\n")
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]
model_comparison(tuner, best_hyperparameters, X_train, y_train, X_test, y_test)
# Initial configuration model (third-best trial, used as the baseline)
print("\n\033[1m Initial configuration Model\033[0m\n")
baseline_hyperparameters = tuner.get_best_hyperparameters(num_trials=10)[2]
model_comparison(tuner, baseline_hyperparameters, X_train, y_train, X_test, y_test)
Best performing Model after tuning
157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.9010 - auc_3: 0.8871 - loss: 0.3002 - precision_3: 0.9211 - recall_3: 0.9653
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.90,
AUC: 0.88,
Precision: 0.92,
Recall: 0.96,
Confusion Matrix:
[[ 418  357]
 [ 152 4085]]

Initial configuration Model
157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9002 - auc_4: 0.8910 - loss: 0.2844 - precision_4: 0.9188 - recall_4: 0.9671
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.90,
AUC: 0.88,
Precision: 0.92,
Recall: 0.97,
Confusion Matrix:
[[ 408  367]
 [ 145 4092]]
val_loss: Validation loss was chosen as the early-stopping criterion because it gives a more nuanced view of model performance than accuracy: it reflects how well the model predicts probabilities rather than just the thresholded class labels. Given that the dataset is imbalanced, this makes it a better signal of how well the model generalizes.
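A quick worked illustration of that point, using made-up probabilities rather than outputs from this model: two classifiers can have identical accuracy while one is far better calibrated, and only the loss can tell them apart.

# Both prediction sets classify every example correctly at the 0.5 threshold,
# so accuracy is 1.0 for both; log loss separates them (~0.55 vs ~0.08).
from sklearn.metrics import accuracy_score, log_loss
y_true = [1, 1, 0, 0]
p_barely = [0.55, 0.60, 0.45, 0.40]      # hypothetical, barely-correct probabilities
p_confident = [0.95, 0.90, 0.05, 0.10]   # hypothetical, well-calibrated probabilities
for p in (p_barely, p_confident):
    preds = [int(q > 0.5) for q in p]
    print(accuracy_score(y_true, preds), round(log_loss(y_true, p), 3))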
Model Selection: The third-best performing model from the random search is used as the baseline, without additional hyperparameter tuning. Unlike ensemble models, a Functional-API neural network comes with no predefined default values, so many hyperparameters must be defined explicitly; the third-best trial therefore stands in as the initial configuration, retrieved as sketched below.
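For reference, the ranked hyperparameter sets can be inspected directly from the tuner; a minimal sketch using the tuner defined above:

# get_best_hyperparameters returns trials sorted by the tuning objective,
# so index 0 is the best trial and index 2 the third best (the baseline here).
ranked = tuner.get_best_hyperparameters(num_trials=10)
print(ranked[0].values)   # best configuration as a plain dict
print(ranked[2].values)   # third best, used as the initial configuration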
def xgb_model_builder(X_train, y_train, X_test, y_test):
    count = 2
    param_dist = {
        'learning_rate': uniform(0.01, 0.5),
        'max_depth': randint(3, 15),
        'n_estimators': randint(50, 200),
    }
    for i in range(count):
        if i == 0:
            # First pass: default hyperparameters as the baseline
            xgbModel = xgb.XGBClassifier()
            xgbModel.fit(X_train, y_train)
            print("\n\033[1mModel trained with default parameters\033[0m\n")
        else:
            # Second pass: randomized search over the distributions above
            random_search = RandomizedSearchCV(
                estimator=xgbModel,
                param_distributions=param_dist,
                n_iter=10,
                cv=3,
                scoring='roc_auc',
                verbose=1,
                n_jobs=-1,
                random_state=42
            )
            random_search.fit(X_train, y_train)
            xgbModel = random_search.best_estimator_
            print("\n\033[1mModel trained with hyperparameter tuning using RandomizedSearchCV\033[0m\n")
        y_pred = xgbModel.predict(X_test)
        print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}, \n"
              f"AUC: {roc_auc_score(y_test, xgbModel.predict_proba(X_test)[:, 1]):.2f}, \n"
              f"Precision: {precision_score(y_test, y_pred):.2f}, \n"
              f"Recall: {recall_score(y_test, y_pred):.2f}, \n"
              f"Confusion Matrix: \n{confusion_matrix(y_test, y_pred)}")
    # Feature importance: shap.Explainer auto-selects TreeExplainer for XGBoost.
    # summary_plot creates its own figure, so no empty plt.figure() is needed.
    explainer = shap.Explainer(xgbModel)
    shap_values = explainer(X_train)
    shap.summary_plot(shap_values, X_train, max_display=10)
Model Choice: The XGBoost classifier is used for this project, as specified in the project outline. It is known for its efficiency and strong performance, especially on structured/tabular data.
Initial Model: In the first iteration, the model is trained using default hyperparameters. This provides a baseline performance before any optimization is done.
Hyperparameter Tuning: In the second iteration, RandomizedSearchCV tunes three key hyperparameters: learning_rate, max_depth, and n_estimators.
RandomizedSearchCV is used for its efficiency in searching large hyperparameter spaces. The search runs for 10 iterations with 3-fold cross-validation, optimizing for ROC AUC.
Model Evaluation: The final model is evaluated using accuracy, ROC AUC, precision, recall, and the confusion matrix.
This process ensures a well-tuned model while adhering to the specified evaluation criteria.
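If the winning configuration itself is of interest, RandomizedSearchCV exposes it after fitting; a minimal sketch, assuming the random_search object from xgb_model_builder were returned instead of kept local:

# best_params_ holds the sampled values that maximized mean CV ROC AUC;
# cv_results_ contains per-candidate scores for a fuller picture.
print(random_search.best_params_)
print(f"Best CV ROC AUC: {random_search.best_score_:.3f}")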
xgb_model_builder(X_train, y_train, X_test, y_test)
Model trained with default parameters
Accuracy: 0.90,
AUC: 0.89,
Precision: 0.93,
Recall: 0.96,
Confusion Matrix:
[[ 446 329]
[ 159 4078]]
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Model trained with hyperparameter tuning using RandomizedSearchCV
Accuracy: 0.90,
AUC: 0.89,
Precision: 0.92,
Recall: 0.96,
Confusion Matrix:
[[ 441 334]
[ 157 4080]]
[SHAP summary plot: ten most influential features]
# Stage 2 file URL
file_url2 = "https://drive.google.com/uc?id=1vy1JFQZva3lhMJQV69C43AB1NTM4W-DZ"
data2 = pd.read_csv(file_url2)
data2.head(2)
| | CentreName | LearnerCode | BookingType | LeadSource | DiscountType | DateofBirth | Gender | Nationality | HomeState | HomeCity | CourseLevel | CourseName | IsFirstIntake | CompletedCourse | ProgressionDegree | ProgressionUniversity | AuthorisedAbsenceCount | UnauthorisedAbsenceCount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ISC_Aberdeen | 2284932 | Agent | Standard Agent Booking | NaN | 13/01/1998 | Male | Chinese | Jianye District; Jiangsu Province | Nanjing | Pre-Masters | Business and Law Pre-Masters | True | Yes | Msc Econ Accounting and Investment Management | University of Aberdeen | NaN | NaN |
| 1 | ISC_Aberdeen | 2399500 | Agent | Standard Agent Booking | NaN | 12/2/1998 | Male | Chinese | NaN | Xi'an | Foundation | Life Sciences Undergraduate Foundation Programme | False | Yes | BSc Biological Sciences | University of Aberdeen | 93.0 | 5.0 |
# Separating the two new features
new_features = data2[["AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]]
before_clean = data2.drop(columns=new_features.columns)
# Initial preprocessing with the function defined in Stage One
after_clean, dropped_features = initial_preprocessing(before_clean, 50, 200)
# Reintegrating the new features after preprocessing
after_clean = pd.concat([after_clean, new_features], axis=1)
# Checking the processed data
dropout_data_check(after_clean)
Missing values:
CentreName                  0.0
BookingType                 0.0
LeadSource                  0.0
Gender                      0.0
Nationality                 0.0
CourseLevel                 0.0
CourseName                  0.0
IsFirstIntake               0.0
CompletedCourse             0.0
ProgressionUniversity       0.0
Age                         0.0
AuthorisedAbsenceCount      1.0
UnauthorisedAbsenceCount    1.0
dtype: float64
Checking for duplicate values: 1346
Data Shape: (25059, 13)
one_hot = ["CentreName", "BookingType", "LeadSource", "Gender", "Nationality", "CourseLevel", "CourseName", "ProgressionUniversity"]
binary = ["IsFirstIntake", "CompletedCourse"]
after_clean = feature_encoding(after_clean, one_hot, binary)
def missing_values(dataset, features):
    # Median imputation for the listed numeric features
    for i in features:
        dataset[i] = dataset[i].fillna(dataset[i].median())
    return dataset

after_clean = missing_values(after_clean, new_features.columns)
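As an aside, KNNImputer is imported at the top of this notebook but never used; a minimal sketch of how it could replace the median fill above, assuming it is applied to the two absence-count columns only:

# KNN imputation estimates each missing value from the k nearest rows,
# which can respect correlations that a per-column median ignores.
# In practice the imputer should be fit on the training split only to avoid leakage.
imputer = KNNImputer(n_neighbors=5)
cols = ["AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]
after_clean[cols] = imputer.fit_transform(after_clean[cols])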
# dataset split and scaling of the numeric features
X2 = after_clean.drop("CompletedCourse", axis = 1)
y2 = after_clean["CompletedCourse"]
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, random_state=1, test_size=0.2)
standardscaler = StandardScaler()
X_train2[["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]] = standardscaler.fit_transform(X_train2[["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]])
X_test2[["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]] = standardscaler.transform(X_test2[["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]])
tuner = kt.RandomSearch(
    lambda hp: model_builder(hp, X_train2),
    objective="val_accuracy",
    max_trials=10,
    executions_per_trial=1,
    directory="dir",
    project_name="model_4",
)
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
tuner.search(X_train2, y_train2, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stop], verbose=1)
Trial 10 Complete [00h 03m 09s] val_accuracy: 0.8862842917442322 Best val_accuracy So Far: 0.908229410648346 Total elapsed time: 00h 14m 28s
# Best model
print("\n\033[1m Best performing Model after tuning\033[0m\n")
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]
model_comparison(tuner, best_hyperparameters, X_train2, y_train2, X_test2, y_test2)
# Initial configuration model (third-best trial, used as the baseline)
print("\n\033[1m Initial configuration Model\033[0m\n")
baseline_hyperparameters = tuner.get_best_hyperparameters(num_trials=10)[2]
model_comparison(tuner, baseline_hyperparameters, X_train2, y_train2, X_test2, y_test2)
Best performing Model after tuning
157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9090 - auc_3: 0.9149 - loss: 0.2646 - precision_3: 0.9255 - recall_3: 0.9701
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.91,
AUC: 0.91,
Precision: 0.93,
Recall: 0.97,
Confusion Matrix:
[[ 449  326]
 [ 132 4105]]

Initial configuration Model
157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.9047 - auc_4: 0.9080 - loss: 0.2953 - precision_4: 0.9221 - recall_4: 0.9688
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.90,
AUC: 0.90,
Precision: 0.92,
Recall: 0.97,
Confusion Matrix:
[[ 425  350]
 [ 130 4107]]
xgb_model_builder(X_train2, y_train2, X_test2, y_test2)
Model trained with default parameters
Accuracy: 0.91,
AUC: 0.92,
Precision: 0.93,
Recall: 0.97,
Confusion Matrix:
[[ 471 304]
[ 126 4111]]
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Model trained with hyperparameter tuning using RandomizedSearchCV
Accuracy: 0.91,
AUC: 0.92,
Precision: 0.93,
Recall: 0.97,
Confusion Matrix:
[[ 469 306]
[ 122 4115]]
[SHAP summary plot: ten most influential features]
# Stage 3 file URL
file_url3 = "https://drive.google.com/uc?id=18oyu-RQotQN6jaibsLBoPdqQJbj_cV2-"
data3 = pd.read_csv(file_url3)
data3.head(2)
| | CentreName | LearnerCode | BookingType | LeadSource | DiscountType | DateofBirth | Gender | Nationality | HomeState | HomeCity | ... | CourseName | IsFirstIntake | CompletedCourse | AssessedModules | PassedModules | FailedModules | ProgressionDegree | ProgressionUniversity | AuthorisedAbsenceCount | UnauthorisedAbsenceCount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ISC_Aberdeen | 2284932 | Agent | Standard Agent Booking | NaN | 13/01/1998 | Male | Chinese | Jianye District; Jiangsu Province | Nanjing | ... | Business and Law Pre-Masters | True | Yes | 4.0 | 4.0 | 0.0 | Msc Econ Accounting and Investment Management | University of Aberdeen | NaN | NaN |
| 1 | ISC_Aberdeen | 2399500 | Agent | Standard Agent Booking | NaN | 12/2/1998 | Male | Chinese | NaN | Xi'an | ... | Life Sciences Undergraduate Foundation Programme | False | Yes | 7.0 | 7.0 | 0.0 | BSc Biological Sciences | University of Aberdeen | 93.0 | 5.0 |

2 rows × 21 columns
# Separating the five new features
new_features3 = data3[["AuthorisedAbsenceCount", "UnauthorisedAbsenceCount", "AssessedModules", "PassedModules", "FailedModules"]]
before_clean3 = data3.drop(columns=new_features3.columns)
# Initial preprocessing with the function defined in Stage One
after_clean3, dropped_features3 = initial_preprocessing(before_clean3, 50, 200)
# Reintegrating the new features after preprocessing
after_clean3 = pd.concat([after_clean3, new_features3], axis=1)
# Checking the processed data
dropout_data_check(after_clean3)
Missing values:
CentreName                  0.0
BookingType                 0.0
LeadSource                  0.0
Gender                      0.0
Nationality                 0.0
CourseLevel                 0.0
CourseName                  0.0
IsFirstIntake               0.0
CompletedCourse             0.0
ProgressionUniversity       0.0
Age                         0.0
AuthorisedAbsenceCount      1.0
UnauthorisedAbsenceCount    1.0
AssessedModules             9.0
PassedModules               9.0
FailedModules               9.0
dtype: float64
Checking for duplicate values: 1070
Data Shape: (25059, 16)
one_hot = ["CentreName", "BookingType", "LeadSource", "Gender", "Nationality", "CourseLevel", "CourseName", "ProgressionUniversity"]
binary = ["IsFirstIntake", "CompletedCourse"]
after_clean3 = feature_encoding(after_clean3, one_hot, binary)
def missing_values3(dataset, features, nine_percent):
    # Drop rows where the ~9%-missing module features are NaN,
    # then median-impute the remaining numeric features.
    dataset = dataset.dropna(subset=nine_percent).copy()
    for i in features:
        dataset[i] = dataset[i].fillna(dataset[i].median())
    return dataset

nine_percent = ["AssessedModules", "PassedModules", "FailedModules"]
after_clean3 = missing_values3(after_clean3, new_features3.columns, nine_percent)
# dataset split and scaling of the numeric features
X3 = after_clean3.drop("CompletedCourse", axis = 1)
y3 = after_clean3["CompletedCourse"]
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, random_state=1, test_size=0.2)
scale = ["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount","AssessedModules", "PassedModules", "FailedModules"]
standardscaler = StandardScaler()
X_train3[scale] = standardscaler.fit_transform(X_train3[scale])
X_test3[scale] = standardscaler.transform(X_test3[scale])
tuner3 = kt.RandomSearch(
    lambda hp: model_builder(hp, X_train3),
    objective="val_accuracy",
    max_trials=10,
    executions_per_trial=1,
    directory="dir",
    project_name="model_5",
)
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
tuner3.search(X_train3, y_train3, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stop], verbose=1)
Trial 10 Complete [00h 01m 10s] val_accuracy: 0.9830276370048523 Best val_accuracy So Far: 0.986038863658905 Total elapsed time: 00h 15m 06s
# Best model
print("\n\033[1m Best performing Model after tuning\033[0m\n")
best_hyperparameters3 = tuner3.get_best_hyperparameters(num_trials=1)[0]
model_comparison(tuner3, best_hyperparameters3, X_train3, y_train3, X_test3, y_test3)
# Initial configuration model (third-best trial, used as the baseline)
print("\n\033[1m Initial configuration Model\033[0m\n")
baseline_hyperparameters3 = tuner3.get_best_hyperparameters(num_trials=10)[2]
model_comparison(tuner3, baseline_hyperparameters3, X_train3, y_train3, X_test3, y_test3)
Best performing Model after tuning
143/143 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.9889 - auc_3: 0.9980 - loss: 0.0588 - precision_3: 0.9908 - recall_3: 0.9973
143/143 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step
Accuracy: 0.99,
AUC: 1.00,
Precision: 0.99,
Recall: 1.00,
Confusion Matrix:
[[ 268   38]
 [  14 4246]]

Initial configuration Model
143/143 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.9859 - auc_4: 0.9962 - loss: 0.0730 - precision_4: 0.9910 - recall_4: 0.9938
143/143 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.99,
AUC: 1.00,
Precision: 0.99,
Recall: 0.99,
Confusion Matrix:
[[ 271   35]
 [  33 4227]]
xgb_model_builder(X_train3, y_train3, X_test3, y_test3)
Model trained with default parameters
Accuracy: 0.99,
AUC: 1.00,
Precision: 0.99,
Recall: 1.00,
Confusion Matrix:
[[ 283 23]
[ 20 4240]]
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Model trained with hyperparameter tuning using RandomizedSearchCV
Accuracy: 0.99,
AUC: 1.00,
Precision: 0.99,
Recall: 1.00,
Confusion Matrix:
[[ 280 26]
[ 15 4245]]
[SHAP summary plot: ten most influential features]