!pip install keras-tuner
!pip install xgboost
!pip install shap

Collecting keras-tuner
  Downloading keras_tuner-1.4.7-py3-none-any.whl.metadata (5.4 kB)
Requirement already satisfied: keras in /usr/local/lib/python3.11/dist-packages (from keras-tuner) (3.8.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.11/dist-packages (from keras-tuner) (24.2)
Requirement already satisfied: requests in /usr/local/lib/python3.11/dist-packages (from keras-tuner) (2.32.3)
Collecting kt-legacy (from keras-tuner)
  Downloading kt_legacy-1.0.5-py3-none-any.whl.metadata (221 bytes)
Requirement already satisfied: absl-py in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (1.4.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (1.26.4)
Requirement already satisfied: rich in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (13.9.4)
Requirement already satisfied: namex in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (0.0.8)
Requirement already satisfied: h5py in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (3.12.1)
Requirement already satisfied: optree in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (0.14.0)
Requirement already satisfied: ml-dtypes in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (0.4.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests->keras-tuner) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests->keras-tuner) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests->keras-tuner) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests->keras-tuner) (2025.1.31)
Requirement already satisfied: typing-extensions>=4.5.0 in /usr/local/lib/python3.11/dist-packages (from optree->keras->keras-tuner) (4.12.2)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.11/dist-packages (from rich->keras->keras-tuner) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.11/dist-packages (from rich->keras->keras-tuner) (2.18.0)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.11/dist-packages (from markdown-it-py>=2.2.0->rich->keras->keras-tuner) (0.1.2)
Downloading keras_tuner-1.4.7-py3-none-any.whl (129 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.1/129.1 kB 3.4 MB/s eta 0:00:00
Downloading kt_legacy-1.0.5-py3-none-any.whl (9.6 kB)
Installing collected packages: kt-legacy, keras-tuner
Successfully installed keras-tuner-1.4.7 kt-legacy-1.0.5
Requirement already satisfied: xgboost in /usr/local/lib/python3.11/dist-packages (2.1.4)
Requirement already satisfied: numpy in /usr/local/lib/python3.11/dist-packages (from xgboost) (1.26.4)
Requirement already satisfied: nvidia-nccl-cu12 in /usr/local/lib/python3.11/dist-packages (from xgboost) (2.21.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.11/dist-packages (from xgboost) (1.13.1)
Requirement already satisfied: shap in /usr/local/lib/python3.11/dist-packages (0.46.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.11/dist-packages (from shap) (1.26.4)
Requirement already satisfied: scipy in /usr/local/lib/python3.11/dist-packages (from shap) (1.13.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.11/dist-packages (from shap) (1.6.1)
Requirement already satisfied: pandas in /usr/local/lib/python3.11/dist-packages (from shap) (2.2.2)
Requirement already satisfied: tqdm>=4.27.0 in /usr/local/lib/python3.11/dist-packages (from shap) (4.67.1)
Requirement already satisfied: packaging>20.9 in /usr/local/lib/python3.11/dist-packages (from shap) (24.2)
Requirement already satisfied: slicer==0.0.8 in /usr/local/lib/python3.11/dist-packages (from shap) (0.0.8)
Requirement already satisfied: numba in /usr/local/lib/python3.11/dist-packages (from shap) (0.61.0)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.11/dist-packages (from shap) (3.1.1)
Requirement already satisfied: llvmlite<0.45,>=0.44.0dev0 in /usr/local/lib/python3.11/dist-packages (from numba->shap) (0.44.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas->shap) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas->shap) (2025.1)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas->shap) (2025.1)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn->shap) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn->shap) (3.5.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->pandas->shap) (1.17.0)


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.metrics import AUC, Precision, Recall
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l1_l2
from tensorflow import keras
from tensorflow.keras.optimizers import Adam, SGD, RMSprop
import keras_tuner as kt
import xgboost as xgb
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, roc_auc_score
from scipy.stats import uniform, randint
import shap
from sklearn.impute import KNNImputer


# stage 1 File URL
file_url = "https://drive.google.com/uc?id=1pA8DDYmQuaLyxADCOZe1QaSQwF16q1J6"
data = pd.read_csv(file_url)


def dropout_data_check(data):

    print("Missing values:\n", round((data.isnull().sum() / len(data)) * 100, 0))
    print("Checking for duplicate values:\n", data.duplicated().sum())
    print("Data Shape:\n", data.shape)


dropout_data_check(data)

Missing values:
 CentreName                0.0
LearnerCode               0.0
BookingType               0.0
LeadSource                0.0
DiscountType             70.0
DateofBirth               0.0
Gender                    0.0
Nationality               0.0
HomeState                64.0
HomeCity                 14.0
CourseLevel               0.0
CourseName                0.0
IsFirstIntake             0.0
CompletedCourse           0.0
ProgressionDegree         3.0
ProgressionUniversity     0.0
dtype: float64
Checking for duplicate values:
 0
Data Shape:
 (25059, 16)


data["CompletedCourse"].value_counts().plot(kind = "bar")
plt.xlabel('Target Classes')
plt.ylabel('Frequency')
plt.title('Distribution of Target Classes')

plt.show()
print(round((data["CompletedCourse"].value_counts() / len(data)) * 100), 1)

CompletedCourse
Yes    85.0
No     15.0
Name: count, dtype: float64 1


def initial_preprocessing(data, threshold, cardinality):

  data = data.copy()
  #applying lowercase for uniformity in cat features
  data = data.map(lambda x: x.lower() if isinstance(x, str) else x)

  # Calculating Missing data percentage
  missing_percentage =  (data.isnull().sum() / len(data)) * 100
  #high cardinality
  cardinality_cal = data.nunique()

  # age Calculation
  data["DateofBirth"] = pd.to_datetime(data["DateofBirth"], dayfirst=True)
  current_date = pd.to_datetime("today")
  data["Age"] = (current_date - data["DateofBirth"]).dt.days //365

  high_missing_percentage =  missing_percentage[missing_percentage > threshold].index.tolist()
  high_cardinality = cardinality_cal[cardinality_cal > cardinality].index.tolist()

  feature_to_drop = pd.concat([pd.Series(high_missing_percentage), pd.Series(high_cardinality)]).drop_duplicates().tolist()

  return data.drop(feature_to_drop, axis= 1), feature_to_drop


data, dropped_features = initial_preprocessing(data, 50, 200)


dropout_data_check(data)
dropped_features

Missing values:
 CentreName               0.0
BookingType              0.0
LeadSource               0.0
Gender                   0.0
Nationality              0.0
CourseLevel              0.0
CourseName               0.0
IsFirstIntake            0.0
CompletedCourse          0.0
ProgressionUniversity    0.0
Age                      0.0
dtype: float64
Checking for duplicate values:
 13490
Data Shape:
 (25059, 11)

['DiscountType',
 'HomeState',
 'LearnerCode',
 'DateofBirth',
 'HomeCity',
 'ProgressionDegree']


def feature_encoding(data, one_hot, binary):

  data = data.copy()
  label = LabelEncoder()

  #One hot encoding
  data = pd.get_dummies(data, columns= one_hot, drop_first= True)

  #Binary encoding
  for i in binary:
    data[i] = label.fit_transform(data[i])

  return data


one_hot = ["CentreName", "BookingType", "LeadSource", "Gender", "Nationality", "CourseLevel", "CourseName", "ProgressionUniversity"]
binary = ["IsFirstIntake", "CompletedCourse"]
data = feature_encoding(data, one_hot, binary)


#dataset split and and scaling of age feature
X = data.drop("CompletedCourse", axis = 1)
y = data["CompletedCourse"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)

standardscaler = StandardScaler()
X_train["Age"] = standardscaler.fit_transform(X_train[["Age"]])
X_test["Age"] = standardscaler.transform(X_test[["Age"]])


dropout_data_check(data)

Missing values:
 IsFirstIntake                                                          0.0
CompletedCourse                                                        0.0
Age                                                                    0.0
CentreName_isc_cardiff                                                 0.0
CentreName_isc_dublin                                                  0.0
                                                                      ... 
ProgressionUniversity_university of sheffield international college    0.0
ProgressionUniversity_university of strathclyde                        0.0
ProgressionUniversity_university of surrey                             0.0
ProgressionUniversity_university of sussex                             0.0
ProgressionUniversity_vu amsterdam                                     0.0
Length: 392, dtype: float64
Checking for duplicate values:
 13490
Data Shape:
 (25059, 392)


def model_builder(hp, data):

  input = keras.Input(shape = (data.shape[1], ))
  hidden = input

  num_layers = hp.Int("num_layers", min_value = 1, max_value =3, step = 1) #number of hidden layers with possible min of 1 and max of 3 and increase of 1 per tunning
  units = hp.Int("units", min_value = 32, max_value = 128, step = 32)
  activation = hp.Choice("activation", values = ["relu", "tanh", "swish"])
  dropout_rate = hp.Float("dropout_rate", min_value = 0.2, max_value = 0.5, step = 0.1)
  reg = hp.Float('reg', min_value=1e-4, max_value=1e-2, sampling='log')
  optimizer = hp.Choice("optimizer", values = ["Adam", "SGD", "RMSprop"])

  for _ in range(num_layers):

    hidden = Dense(units = units, activation= activation, kernel_regularizer=l1_l2(reg))(hidden)
    hidden = Dropout(dropout_rate)(hidden)

  output = Dense(units = 1, activation= "sigmoid")(hidden)

  model = keras.Model(inputs = input, outputs = output)
  model.compile(optimizer = optimizer, loss = "binary_crossentropy", metrics= ["accuracy", AUC(), Recall(), Precision()])

  return model


tuner = kt.RandomSearch(lambda hp: model_builder(hp, X_train),
                        objective = "val_accuracy",
                        max_trials=10,
                        executions_per_trial=1,
                        directory='dir',
                        project_name='model_x')

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
tuner.search(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stop], verbose=1)

Trial 10 Complete [00h 03m 38s]
val_accuracy: 0.8955112099647522

Best val_accuracy So Far: 0.8960099816322327
Total elapsed time: 00h 26m 43s


#best vs worst model
def model_comparism(tuner, best_hyperparameters, X_train, y_train, X_test, y_test):

  early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
  model = tuner.hypermodel.build(best_hyperparameters)
  history = model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stop], verbose=0)
  loss, accuracy, auc, recall, precision = model.evaluate(X_test, y_test)
  y_pred = model.predict(X_test)
  y_pred_class = (y_pred > 0.5).astype(int)
  cm = confusion_matrix(y_test, y_pred_class)

  print(f"Accuracy: {accuracy:.2f}, \n"
            f"AUC: {auc:.2f}, \n"
            f"Precision: {precision:.2f}, \n"
            f"Recall: {recall:.2f}, \n"
            f"Confusion Matrix: \n{cm}")

  train_loss = history.history['loss']
  val_loss = history.history['val_loss']

  plt.figure(figsize=(10, 6))
  plt.plot(train_loss, label='Training Loss', color='blue')
  plt.plot(val_loss, label='Validation Loss', color='orange')

  plt.xlabel('Epochs')
  plt.ylabel('Loss')
  plt.title('Training and Validation Loss Curves')

  plt.legend()
  plt.show()


#best model
print("\n\033[1m Best performing Model after tuning\033[0m\n")
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]
model_comparism(tuner, best_hyperparameters, X_train, y_train, X_test, y_test)

#Initial configuration Model
print("\n\033[1m Initial configuration Model\033[0m\n")
worst_hyperparameters = tuner.get_best_hyperparameters(num_trials=10)[2]
model_comparism(tuner, worst_hyperparameters, X_train, y_train, X_test, y_test)

 Best performing Model after tuning

157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.9010 - auc_3: 0.8871 - loss: 0.3002 - precision_3: 0.9211 - recall_3: 0.9653
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.90, 
AUC: 0.88, 
Precision: 0.92, 
Recall: 0.96, 
Confusion Matrix: 
[[ 418  357]
 [ 152 4085]]

 Initial configuration Model

157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9002 - auc_4: 0.8910 - loss: 0.2844 - precision_4: 0.9188 - recall_4: 0.9671
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.90, 
AUC: 0.88, 
Precision: 0.92, 
Recall: 0.97, 
Confusion Matrix: 
[[ 408  367]
 [ 145 4092]]


def xgb_model_builder(X_train, y_train, X_test, y_test):

  count = 2
  param_dist = {
    'learning_rate': uniform(0.01, 0.5),
    'max_depth': randint(3, 15),
    'n_estimators': randint(50, 200),
  }

  for i in range(count):

    if i == 0:

      xgbModel = xgb.XGBClassifier()
      xgbModel.fit(X_train, y_train)
      print("\n\033[1mModel trained with default parameters\033[0m\n")
    else:

      random_search = RandomizedSearchCV(
          estimator= xgbModel,
          param_distributions=param_dist,
          n_iter=10,
          cv=3,
          scoring='roc_auc',
          verbose=1,
          n_jobs=-1,
          random_state=42
        )
      random_search.fit(X_train, y_train)
      xgbModel = random_search.best_estimator_
      print("\n\033[1mModel trained with hyperparameter tuning using RandomizedSearchCV\033[0m\n")

    y_pred = xgbModel.predict(X_test)
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}, \n"
            f"AUC: {roc_auc_score(y_test, xgbModel.predict_proba(X_test)[:, 1]):.2f}, \n"
            f"Precision: {precision_score(y_test, y_pred):.2f}, \n"
            f"Recall: {recall_score(y_test, y_pred):.2f}, \n"
            f"Confusion Matrix: \n{confusion_matrix(y_test, y_pred)}")

    #feature importance
    explainer = shap.Explainer(xgbModel)
    shap_values = explainer(X_train)
    plt.figure(figsize=(14, 10))
    shap.summary_plot(shap_values, X_train, max_display=10)
    plt.tight_layout(pad=3.0)


xgb_model_builder(X_train, y_train, X_test, y_test)

Model trained with default parameters

Accuracy: 0.90, 
AUC: 0.89, 
Precision: 0.93, 
Recall: 0.96, 
Confusion Matrix: 
[[ 446  329]
 [ 159 4078]]

Fitting 3 folds for each of 10 candidates, totalling 30 fits

Model trained with hyperparameter tuning using RandomizedSearchCV

Accuracy: 0.90, 
AUC: 0.89, 
Precision: 0.92, 
Recall: 0.96, 
Confusion Matrix: 
[[ 441  334]
 [ 157 4080]]

<Figure size 640x480 with 0 Axes>

<Figure size 640x480 with 0 Axes>


# File URL
file_url2 = "https://drive.google.com/uc?id=1vy1JFQZva3lhMJQV69C43AB1NTM4W-DZ"
data2 = pd.read_csv(file_url2)


data2.head(2)


# Separating the two new features
new_features = data2[["AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]]
before_clean = data2.drop(columns=new_features)

# Performing initial preprocessing, similar to Stage One, using previously defined functions in stage one
after_clean, dropped_features = initial_preprocessing(before_clean, 50, 200)

# Reintegrating the new features after preprocessing
after_clean = pd.concat([after_clean, new_features], axis=1)

# Checking the processed data
dropout_data_check(after_clean)

Missing values:
 CentreName                  0.0
BookingType                 0.0
LeadSource                  0.0
Gender                      0.0
Nationality                 0.0
CourseLevel                 0.0
CourseName                  0.0
IsFirstIntake               0.0
CompletedCourse             0.0
ProgressionUniversity       0.0
Age                         0.0
AuthorisedAbsenceCount      1.0
UnauthorisedAbsenceCount    1.0
dtype: float64
Checking for duplicate values:
 1346
Data Shape:
 (25059, 13)


one_hot = ["CentreName", "BookingType", "LeadSource", "Gender", "Nationality", "CourseLevel", "CourseName", "ProgressionUniversity"]
binary = ["IsFirstIntake", "CompletedCourse"]
after_clean = feature_encoding(after_clean, one_hot, binary)


def missing_values(dataset, features):

  for i in features:

    dataset[i] = dataset[i].fillna(dataset[i].median())

  return dataset


after_clean = missing_values(after_clean, new_features)


#dataset split and and scaling of age feature
X2 = after_clean.drop("CompletedCourse", axis = 1)
y2 = after_clean["CompletedCourse"]

X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, random_state=1, test_size=0.2)

standardscaler = StandardScaler()
X_train2[["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]] = standardscaler.fit_transform(X_train2[["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]])
X_test2[["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]] = standardscaler.transform(X_test2[["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]])


tuner = kt.RandomSearch(lambda hp: model_builder(hp, X_train2),
                        objective = "val_accuracy",
                        max_trials=10,
                        executions_per_trial=1,
                        directory='dir',
                        project_name='model_4')

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
tuner.search(X_train2, y_train2, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stop], verbose=1)

Trial 10 Complete [00h 03m 09s]
val_accuracy: 0.8862842917442322

Best val_accuracy So Far: 0.908229410648346
Total elapsed time: 00h 14m 28s


#best model
print("\n\033[1m Best performing Model after tuning\033[0m\n")
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]
model_comparism(tuner, best_hyperparameters, X_train2, y_train2, X_test2, y_test2)

#Initial configuration Model
print("\n\033[1m Initial configuration Model\033[0m\n")
worst_hyperparameters = tuner.get_best_hyperparameters(num_trials=10)[2]
model_comparism(tuner, worst_hyperparameters, X_train2, y_train2, X_test2, y_test2)

 Best performing Model after tuning

157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9090 - auc_3: 0.9149 - loss: 0.2646 - precision_3: 0.9255 - recall_3: 0.9701
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.91, 
AUC: 0.91, 
Precision: 0.93, 
Recall: 0.97, 
Confusion Matrix: 
[[ 449  326]
 [ 132 4105]]

 Initial configuration Model

157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.9047 - auc_4: 0.9080 - loss: 0.2953 - precision_4: 0.9221 - recall_4: 0.9688
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.90, 
AUC: 0.90, 
Precision: 0.92, 
Recall: 0.97, 
Confusion Matrix: 
[[ 425  350]
 [ 130 4107]]


xgb_model_builder(X_train2, y_train2, X_test2, y_test2)

Model trained with default parameters

Accuracy: 0.91, 
AUC: 0.92, 
Precision: 0.93, 
Recall: 0.97, 
Confusion Matrix: 
[[ 471  304]
 [ 126 4111]]

Fitting 3 folds for each of 10 candidates, totalling 30 fits

Model trained with hyperparameter tuning using RandomizedSearchCV

Accuracy: 0.91, 
AUC: 0.92, 
Precision: 0.93, 
Recall: 0.97, 
Confusion Matrix: 
[[ 469  306]
 [ 122 4115]]

<Figure size 640x480 with 0 Axes>

<Figure size 640x480 with 0 Axes>


# File URL
file_url3 = "https://drive.google.com/uc?id=18oyu-RQotQN6jaibsLBoPdqQJbj_cV2-"
data3 = pd.read_csv(file_url3)


data3.head(2)


# Separating the five new features
new_features3 = data3[["AuthorisedAbsenceCount", "UnauthorisedAbsenceCount", "AssessedModules", "PassedModules", "FailedModules"]]
before_clean3 = data3.drop(columns=new_features)

# Performing initial preprocessing, similar to Stage One, using previously defined function in stage one
after_clean3, dropped_features3 = initial_preprocessing(before_clean, 50, 200)

# Reintegrating the new features after preprocessing
after_clean3 = pd.concat([after_clean3, new_features3], axis=1)

# Checking the processed data
dropout_data_check(after_clean3)

Missing values:
 CentreName                  0.0
BookingType                 0.0
LeadSource                  0.0
Gender                      0.0
Nationality                 0.0
CourseLevel                 0.0
CourseName                  0.0
IsFirstIntake               0.0
CompletedCourse             0.0
ProgressionUniversity       0.0
Age                         0.0
AuthorisedAbsenceCount      1.0
UnauthorisedAbsenceCount    1.0
AssessedModules             9.0
PassedModules               9.0
FailedModules               9.0
dtype: float64
Checking for duplicate values:
 1070
Data Shape:
 (25059, 16)


one_hot = ["CentreName", "BookingType", "LeadSource", "Gender", "Nationality", "CourseLevel", "CourseName", "ProgressionUniversity"]
binary = ["IsFirstIntake", "CompletedCourse"]
after_clean3 = feature_encoding(after_clean3, one_hot, binary)


def missing_values3(dataset, features, nine_percent):

  dataset = dataset.dropna(subset= nine_percent).copy()
  for i in features:

    dataset[i] = dataset[i].fillna(dataset[i].median())

  return dataset


nine_percent = ["AssessedModules", "PassedModules", "FailedModules"]
after_clean3 = missing_values3(after_clean3, new_features3, nine_percent)


#dataset split and and scaling of age feature
X3 = after_clean3.drop("CompletedCourse", axis = 1)
y3 = after_clean3["CompletedCourse"]

X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, random_state=1, test_size=0.2)

scale = ["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount","AssessedModules", "PassedModules", "FailedModules"]
standardscaler = StandardScaler()
X_train3[scale] = standardscaler.fit_transform(X_train3[scale])
X_test3[scale] = standardscaler.transform(X_test3[scale])


tuner3 = kt.RandomSearch(lambda hp: model_builder(hp, X_train3),
                        objective = "val_accuracy",
                        max_trials=10,
                        executions_per_trial=1,
                        directory='dir',
                        project_name='model_5')

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
tuner3.search(X_train3, y_train3, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stop], verbose=1)

Trial 10 Complete [00h 01m 10s]
val_accuracy: 0.9830276370048523

Best val_accuracy So Far: 0.986038863658905
Total elapsed time: 00h 15m 06s


#best model
print("\n\033[1m Best performing Model after tuning\033[0m\n")
best_hyperparameters3 = tuner3.get_best_hyperparameters(num_trials=1)[0]
model_comparism(tuner3, best_hyperparameters3, X_train3, y_train3, X_test3, y_test3)

#Initial configuration Model
print("\n\033[1m Initial configuration Model\033[0m\n")
worst_hyperparameters3 = tuner.get_best_hyperparameters(num_trials=10)[2]
model_comparism(tuner3, worst_hyperparameters3, X_train3, y_train3, X_test3, y_test3)

 Best performing Model after tuning

143/143 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.9889 - auc_3: 0.9980 - loss: 0.0588 - precision_3: 0.9908 - recall_3: 0.9973
143/143 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step
Accuracy: 0.99, 
AUC: 1.00, 
Precision: 0.99, 
Recall: 1.00, 
Confusion Matrix: 
[[ 268   38]
 [  14 4246]]

 Initial configuration Model

143/143 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.9859 - auc_4: 0.9962 - loss: 0.0730 - precision_4: 0.9910 - recall_4: 0.9938
143/143 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.99, 
AUC: 1.00, 
Precision: 0.99, 
Recall: 0.99, 
Confusion Matrix: 
[[ 271   35]
 [  33 4227]]


xgb_model_builder(X_train3, y_train3, X_test3, y_test3)

Model trained with default parameters

Accuracy: 0.99, 
AUC: 1.00, 
Precision: 0.99, 
Recall: 1.00, 
Confusion Matrix: 
[[ 283   23]
 [  20 4240]]

Fitting 3 folds for each of 10 candidates, totalling 30 fits

Model trained with hyperparameter tuning using RandomizedSearchCV

Accuracy: 0.99, 
AUC: 1.00, 
Precision: 0.99, 
Recall: 1.00, 
Confusion Matrix: 
[[ 280   26]
 [  15 4245]]

<Figure size 640x480 with 0 Axes>

<Figure size 640x480 with 0 Axes>

	CentreName	LearnerCode	BookingType	LeadSource	DiscountType	DateofBirth	Gender	Nationality	HomeState	HomeCity	CourseLevel	CourseName	IsFirstIntake	CompletedCourse	ProgressionDegree	ProgressionUniversity	AuthorisedAbsenceCount	UnauthorisedAbsenceCount
0	ISC_Aberdeen	2284932	Agent	Standard Agent Booking	NaN	13/01/1998	Male	Chinese	Jianye District; Jiangsu Province	Nanjing	Pre-Masters	Business and Law Pre-Masters	True	Yes	Msc Econ Accounting and Investment Management	University of Aberdeen	NaN	NaN
1	ISC_Aberdeen	2399500	Agent	Standard Agent Booking	NaN	12/2/1998	Male	Chinese	NaN	Xi'an	Foundation	Life Sciences Undergraduate Foundation Programme	False	Yes	BSc Biological Sciences	University of Aberdeen	93.0	5.0

Table of Content¶

1. Required Libraries Installation¶

2. Required Libraries and Dataset¶

3. Stage One¶

3.1. DataFrame and EDA¶

Observations¶

3.1.1 Pre-processing and Feature Engineering¶

Observations¶

3.2. Feature Encoding and Scaling¶

Observations¶

3.3 Neural Network Model¶

Design Reasoning¶

3.3.1. Model Comparism¶

Design Reasoning¶

3.4. Xgboost Model¶

Model Design and Hyperparameter Tuning¶

4. Stage Two¶

4.2. Pre-processing and Feature Engineering¶

Design Reasoning¶

Observations¶

4.2.1. Dealing with Missing Values¶

4.3. Neural Network Model¶

4.3. Xgboost Model¶

5. Stage Three¶