Table of Content¶

  • 1. Required Libraries installation
  • 2. Required Libraries and Dataset
  • 3. Stage One
    • 3.1. DataFrame and EDA
      • 3.1.1. Pre-processing and Feature Engineering
    • 3.2. Feature Encoding and Scaling
    • 3.3 Neural Network Model
      • 3.3.1 Model Comparism
    • 3.4. Xgboost Model
  • 4. Stage Two
    • 4.1. Pre-processing and Feature Engineering
      • 4.1.1. Dealing with Missing Values
    • 4.2. Neural Network Model
    • 4.3. Xgboost Model
  • 5. Stage Three
    • 5.1. Pre-processing and Feature Engineering
      • 5.1.1. Dealing with Missing Values
    • 5.2. Neural Network Model
    • 5.3. Xgboost Model

1. Required Libraries Installation¶

In [1]:
!pip install keras-tuner
!pip install xgboost
!pip install shap
Collecting keras-tuner
  Downloading keras_tuner-1.4.7-py3-none-any.whl.metadata (5.4 kB)
Requirement already satisfied: keras in /usr/local/lib/python3.11/dist-packages (from keras-tuner) (3.8.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.11/dist-packages (from keras-tuner) (24.2)
Requirement already satisfied: requests in /usr/local/lib/python3.11/dist-packages (from keras-tuner) (2.32.3)
Collecting kt-legacy (from keras-tuner)
  Downloading kt_legacy-1.0.5-py3-none-any.whl.metadata (221 bytes)
Requirement already satisfied: absl-py in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (1.4.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (1.26.4)
Requirement already satisfied: rich in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (13.9.4)
Requirement already satisfied: namex in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (0.0.8)
Requirement already satisfied: h5py in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (3.12.1)
Requirement already satisfied: optree in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (0.14.0)
Requirement already satisfied: ml-dtypes in /usr/local/lib/python3.11/dist-packages (from keras->keras-tuner) (0.4.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests->keras-tuner) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests->keras-tuner) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests->keras-tuner) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests->keras-tuner) (2025.1.31)
Requirement already satisfied: typing-extensions>=4.5.0 in /usr/local/lib/python3.11/dist-packages (from optree->keras->keras-tuner) (4.12.2)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.11/dist-packages (from rich->keras->keras-tuner) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.11/dist-packages (from rich->keras->keras-tuner) (2.18.0)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.11/dist-packages (from markdown-it-py>=2.2.0->rich->keras->keras-tuner) (0.1.2)
Downloading keras_tuner-1.4.7-py3-none-any.whl (129 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.1/129.1 kB 3.4 MB/s eta 0:00:00
Downloading kt_legacy-1.0.5-py3-none-any.whl (9.6 kB)
Installing collected packages: kt-legacy, keras-tuner
Successfully installed keras-tuner-1.4.7 kt-legacy-1.0.5
Requirement already satisfied: xgboost in /usr/local/lib/python3.11/dist-packages (2.1.4)
Requirement already satisfied: numpy in /usr/local/lib/python3.11/dist-packages (from xgboost) (1.26.4)
Requirement already satisfied: nvidia-nccl-cu12 in /usr/local/lib/python3.11/dist-packages (from xgboost) (2.21.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.11/dist-packages (from xgboost) (1.13.1)
Requirement already satisfied: shap in /usr/local/lib/python3.11/dist-packages (0.46.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.11/dist-packages (from shap) (1.26.4)
Requirement already satisfied: scipy in /usr/local/lib/python3.11/dist-packages (from shap) (1.13.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.11/dist-packages (from shap) (1.6.1)
Requirement already satisfied: pandas in /usr/local/lib/python3.11/dist-packages (from shap) (2.2.2)
Requirement already satisfied: tqdm>=4.27.0 in /usr/local/lib/python3.11/dist-packages (from shap) (4.67.1)
Requirement already satisfied: packaging>20.9 in /usr/local/lib/python3.11/dist-packages (from shap) (24.2)
Requirement already satisfied: slicer==0.0.8 in /usr/local/lib/python3.11/dist-packages (from shap) (0.0.8)
Requirement already satisfied: numba in /usr/local/lib/python3.11/dist-packages (from shap) (0.61.0)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.11/dist-packages (from shap) (3.1.1)
Requirement already satisfied: llvmlite<0.45,>=0.44.0dev0 in /usr/local/lib/python3.11/dist-packages (from numba->shap) (0.44.0)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas->shap) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas->shap) (2025.1)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas->shap) (2025.1)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn->shap) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.11/dist-packages (from scikit-learn->shap) (3.5.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->pandas->shap) (1.17.0)

2. Required Libraries and Dataset¶

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.metrics import AUC, Precision, Recall
from sklearn.preprocessing import StandardScaler, LabelEncoder
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l1_l2
from tensorflow import keras
from tensorflow.keras.optimizers import Adam, SGD, RMSprop
import keras_tuner as kt
import xgboost as xgb
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, roc_auc_score
from scipy.stats import uniform, randint
import shap
from sklearn.impute import KNNImputer

3. Stage One¶

3.1. DataFrame and EDA¶

In [3]:
# stage 1 File URL
file_url = "https://drive.google.com/uc?id=1pA8DDYmQuaLyxADCOZe1QaSQwF16q1J6"
data = pd.read_csv(file_url)
In [4]:
def dropout_data_check(data):

    print("Missing values:\n", round((data.isnull().sum() / len(data)) * 100, 0))
    print("Checking for duplicate values:\n", data.duplicated().sum())
    print("Data Shape:\n", data.shape)
In [5]:
dropout_data_check(data)
Missing values:
 CentreName                0.0
LearnerCode               0.0
BookingType               0.0
LeadSource                0.0
DiscountType             70.0
DateofBirth               0.0
Gender                    0.0
Nationality               0.0
HomeState                64.0
HomeCity                 14.0
CourseLevel               0.0
CourseName                0.0
IsFirstIntake             0.0
CompletedCourse           0.0
ProgressionDegree         3.0
ProgressionUniversity     0.0
dtype: float64
Checking for duplicate values:
 0
Data Shape:
 (25059, 16)
In [6]:
data["CompletedCourse"].value_counts().plot(kind = "bar")
plt.xlabel('Target Classes')
plt.ylabel('Frequency')
plt.title('Distribution of Target Classes')

plt.show()
print(round((data["CompletedCourse"].value_counts() / len(data)) * 100), 1)
CompletedCourse
Yes    85.0
No     15.0
Name: count, dtype: float64 1

Observations¶

  • Four features in the dataset contain missing values, specifically:
    • DiscountType: 70%
    • HomeState: 64%
    • HomeCity: 14%
    • ProgressionDegree: 3%
  • The dataset is imbalanced (yes: 85% and no: 15% )
  • However, the missing values will not be addressed, as the associated features may not be essential for the completion of this project.
  • Data size: 25,059 and Features: 16
  • The dataset has no duplicate rows.
  • The next step in dataset preprocessing will involve cleaning and feature engineering. This includes removing columns with high cardinality (>200) and columns with more than 50% missing values.
  • If, after this process, any features still have missing values, they will be addressed.

3.1.1 Pre-processing and Feature Engineering¶

In [7]:
def initial_preprocessing(data, threshold, cardinality):

  data = data.copy()
  #applying lowercase for uniformity in cat features
  data = data.map(lambda x: x.lower() if isinstance(x, str) else x)

  # Calculating Missing data percentage
  missing_percentage =  (data.isnull().sum() / len(data)) * 100
  #high cardinality
  cardinality_cal = data.nunique()

  # age Calculation
  data["DateofBirth"] = pd.to_datetime(data["DateofBirth"], dayfirst=True)
  current_date = pd.to_datetime("today")
  data["Age"] = (current_date - data["DateofBirth"]).dt.days //365

  high_missing_percentage =  missing_percentage[missing_percentage > threshold].index.tolist()
  high_cardinality = cardinality_cal[cardinality_cal > cardinality].index.tolist()

  feature_to_drop = pd.concat([pd.Series(high_missing_percentage), pd.Series(high_cardinality)]).drop_duplicates().tolist()

  return data.drop(feature_to_drop, axis= 1), feature_to_drop
In [8]:
data, dropped_features = initial_preprocessing(data, 50, 200)
In [9]:
dropout_data_check(data)
dropped_features
Missing values:
 CentreName               0.0
BookingType              0.0
LeadSource               0.0
Gender                   0.0
Nationality              0.0
CourseLevel              0.0
CourseName               0.0
IsFirstIntake            0.0
CompletedCourse          0.0
ProgressionUniversity    0.0
Age                      0.0
dtype: float64
Checking for duplicate values:
 13490
Data Shape:
 (25059, 11)
Out[9]:
['DiscountType',
 'HomeState',
 'LearnerCode',
 'DateofBirth',
 'HomeCity',
 'ProgressionDegree']

Observations¶

  • Dataset no longer has missing values
  • New dataset shape(25,059, 11)
  • The ProgressionDegree feature, which was dropped, may be revisited and engineered by extracting details like MSc and BSc to assess whether it impacts model accuracy.
  • The next step will focus on encoding the features

3.2. Feature Encoding and Scaling¶

In [10]:
def feature_encoding(data, one_hot, binary):

  data = data.copy()
  label = LabelEncoder()

  #One hot encoding
  data = pd.get_dummies(data, columns= one_hot, drop_first= True)

  #Binary encoding
  for i in binary:
    data[i] = label.fit_transform(data[i])

  return data
In [11]:
one_hot = ["CentreName", "BookingType", "LeadSource", "Gender", "Nationality", "CourseLevel", "CourseName", "ProgressionUniversity"]
binary = ["IsFirstIntake", "CompletedCourse"]
data = feature_encoding(data, one_hot, binary)
In [12]:
#dataset split and and scaling of age feature
X = data.drop("CompletedCourse", axis = 1)
y = data["CompletedCourse"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)

standardscaler = StandardScaler()
X_train["Age"] = standardscaler.fit_transform(X_train[["Age"]])
X_test["Age"] = standardscaler.transform(X_test[["Age"]])
In [13]:
dropout_data_check(data)
Missing values:
 IsFirstIntake                                                          0.0
CompletedCourse                                                        0.0
Age                                                                    0.0
CentreName_isc_cardiff                                                 0.0
CentreName_isc_dublin                                                  0.0
                                                                      ... 
ProgressionUniversity_university of sheffield international college    0.0
ProgressionUniversity_university of strathclyde                        0.0
ProgressionUniversity_university of surrey                             0.0
ProgressionUniversity_university of sussex                             0.0
ProgressionUniversity_vu amsterdam                                     0.0
Length: 392, dtype: float64
Checking for duplicate values:
 13490
Data Shape:
 (25059, 392)

Observations¶

  • No missing values.
  • The feature age has been scaled because of NN.
  • New dataset shape(25,059, 392)

3.3 Neural Network Model¶

In [14]:
def model_builder(hp, data):

  input = keras.Input(shape = (data.shape[1], ))
  hidden = input

  num_layers = hp.Int("num_layers", min_value = 1, max_value =3, step = 1) #number of hidden layers with possible min of 1 and max of 3 and increase of 1 per tunning
  units = hp.Int("units", min_value = 32, max_value = 128, step = 32)
  activation = hp.Choice("activation", values = ["relu", "tanh", "swish"])
  dropout_rate = hp.Float("dropout_rate", min_value = 0.2, max_value = 0.5, step = 0.1)
  reg = hp.Float('reg', min_value=1e-4, max_value=1e-2, sampling='log')
  optimizer = hp.Choice("optimizer", values = ["Adam", "SGD", "RMSprop"])

  for _ in range(num_layers):

    hidden = Dense(units = units, activation= activation, kernel_regularizer=l1_l2(reg))(hidden)
    hidden = Dropout(dropout_rate)(hidden)

  output = Dense(units = 1, activation= "sigmoid")(hidden)

  model = keras.Model(inputs = input, outputs = output)
  model.compile(optimizer = optimizer, loss = "binary_crossentropy", metrics= ["accuracy", AUC(), Recall(), Precision()])

  return model
In [15]:
tuner = kt.RandomSearch(lambda hp: model_builder(hp, X_train),
                        objective = "val_accuracy",
                        max_trials=10,
                        executions_per_trial=1,
                        directory='dir',
                        project_name='model_x')

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
tuner.search(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stop], verbose=1)
Trial 10 Complete [00h 03m 38s]
val_accuracy: 0.8955112099647522

Best val_accuracy So Far: 0.8960099816322327
Total elapsed time: 00h 26m 43s

Design Reasoning¶

  • Keras Tuner was selected for its user-friendly interface and the seamless access it provides to various hyperparameter optimization techniques, including Grid Search and Random Search.
  • The hyperparameter values were chosen randomly, without relying on any specific research paper. This approach was primarily influenced by time constraints.
  • Random Search was chosen for this project due to the large number of hyperparameters. It offers significant computational efficiency, especially with the limited resources available, and is set to a maximum of 10 possible combinations. Additionally, it helps mitigate the risk of overfitting that can arise from more exhaustive search methods.

3.3.1. Model Comparism¶

In [16]:
#best vs worst model
def model_comparism(tuner, best_hyperparameters, X_train, y_train, X_test, y_test):

  early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
  model = tuner.hypermodel.build(best_hyperparameters)
  history = model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stop], verbose=0)
  loss, accuracy, auc, recall, precision = model.evaluate(X_test, y_test)
  y_pred = model.predict(X_test)
  y_pred_class = (y_pred > 0.5).astype(int)
  cm = confusion_matrix(y_test, y_pred_class)

  print(f"Accuracy: {accuracy:.2f}, \n"
            f"AUC: {auc:.2f}, \n"
            f"Precision: {precision:.2f}, \n"
            f"Recall: {recall:.2f}, \n"
            f"Confusion Matrix: \n{cm}")

  train_loss = history.history['loss']
  val_loss = history.history['val_loss']

  plt.figure(figsize=(10, 6))
  plt.plot(train_loss, label='Training Loss', color='blue')
  plt.plot(val_loss, label='Validation Loss', color='orange')

  plt.xlabel('Epochs')
  plt.ylabel('Loss')
  plt.title('Training and Validation Loss Curves')

  plt.legend()
  plt.show()
In [17]:
#best model
print("\n\033[1m Best performing Model after tuning\033[0m\n")
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]
model_comparism(tuner, best_hyperparameters, X_train, y_train, X_test, y_test)

#Initial configuration Model
print("\n\033[1m Initial configuration Model\033[0m\n")
worst_hyperparameters = tuner.get_best_hyperparameters(num_trials=10)[2]
model_comparism(tuner, worst_hyperparameters, X_train, y_train, X_test, y_test)
 Best performing Model after tuning

157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.9010 - auc_3: 0.8871 - loss: 0.3002 - precision_3: 0.9211 - recall_3: 0.9653
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.90, 
AUC: 0.88, 
Precision: 0.92, 
Recall: 0.96, 
Confusion Matrix: 
[[ 418  357]
 [ 152 4085]]
 Initial configuration Model

157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9002 - auc_4: 0.8910 - loss: 0.2844 - precision_4: 0.9188 - recall_4: 0.9671
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.90, 
AUC: 0.88, 
Precision: 0.92, 
Recall: 0.97, 
Confusion Matrix: 
[[ 408  367]
 [ 145 4092]]

Design Reasoning¶

  • val_loss: The val_loss metric has been chosen because it provides a more nuanced evaluation of model performance. Unlike accuracy, it considers how well the model predicts probabilities rather than just the final predictions. Given that our dataset is imbalanced, this metric is particularly useful in assessing the model’s ability to generalize.

  • Model Selection: I have chosen to use the third-best performing model from the Random Search as the baseline, without additional hyperparameter tuning. The reason for this decision is that, unlike ensemble models, the Neural Network Functional API does not come with predefined default values, meaning many hyperparameters need to be explicitly defined. Therefore, I’m starting with the third-best model as my initial configuration.

3.4. Xgboost Model¶

In [18]:
def xgb_model_builder(X_train, y_train, X_test, y_test):

  count = 2
  param_dist = {
    'learning_rate': uniform(0.01, 0.5),
    'max_depth': randint(3, 15),
    'n_estimators': randint(50, 200),
  }

  for i in range(count):

    if i == 0:

      xgbModel = xgb.XGBClassifier()
      xgbModel.fit(X_train, y_train)
      print("\n\033[1mModel trained with default parameters\033[0m\n")
    else:

      random_search = RandomizedSearchCV(
          estimator= xgbModel,
          param_distributions=param_dist,
          n_iter=10,
          cv=3,
          scoring='roc_auc',
          verbose=1,
          n_jobs=-1,
          random_state=42
        )
      random_search.fit(X_train, y_train)
      xgbModel = random_search.best_estimator_
      print("\n\033[1mModel trained with hyperparameter tuning using RandomizedSearchCV\033[0m\n")

    y_pred = xgbModel.predict(X_test)
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}, \n"
            f"AUC: {roc_auc_score(y_test, xgbModel.predict_proba(X_test)[:, 1]):.2f}, \n"
            f"Precision: {precision_score(y_test, y_pred):.2f}, \n"
            f"Recall: {recall_score(y_test, y_pred):.2f}, \n"
            f"Confusion Matrix: \n{confusion_matrix(y_test, y_pred)}")

    #feature importance
    explainer = shap.Explainer(xgbModel)
    shap_values = explainer(X_train)
    plt.figure(figsize=(14, 10))
    shap.summary_plot(shap_values, X_train, max_display=10)
    plt.tight_layout(pad=3.0)

Model Design and Hyperparameter Tuning¶

  • Model Choice: The XGBoost classifier is used for this project as specified in the project outline. It's known for its efficiency and strong performance, especially with structured/tabular data.

  • Initial Model: In the first iteration, the model is trained using default hyperparameters. This provides a baseline performance before any optimization is done.

  • Hyperparameter Tuning: In the second iteration, RandomizedSearchCV is applied to tune key hyperparameters:

    • learning_rate: Controls the step size at each iteration.
    • max_depth: Limits the depth of the trees, balancing model complexity.
    • n_estimators: Defines the number of boosting rounds (trees).

    RandomizedSearchCV is used for its efficiency in searching large hyperparameter spaces. The search runs for 10 iterations and uses 3-fold cross-validation, optimizing for ROC AUC.

  • Model Evaluation: The final model is evaluated using the following metrics:

    • Accuracy: Proportion of correct predictions.
    • Precision: Proportion of positive predictions that are correct.
    • Recall: Proportion of actual positives identified by the model.
    • Confusion Matrix: A detailed breakdown of true positives, false positives, true negatives, and false negatives.

This process ensures a well-tuned model while adhering to the specified evaluation criteria.

In [19]:
xgb_model_builder(X_train, y_train, X_test, y_test)
Model trained with default parameters

Accuracy: 0.90, 
AUC: 0.89, 
Precision: 0.93, 
Recall: 0.96, 
Confusion Matrix: 
[[ 446  329]
 [ 159 4078]]
Fitting 3 folds for each of 10 candidates, totalling 30 fits

Model trained with hyperparameter tuning using RandomizedSearchCV

Accuracy: 0.90, 
AUC: 0.89, 
Precision: 0.92, 
Recall: 0.96, 
Confusion Matrix: 
[[ 441  334]
 [ 157 4080]]
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>

4. Stage Two¶

4.2. Pre-processing and Feature Engineering¶

In [20]:
# File URL
file_url2 = "https://drive.google.com/uc?id=1vy1JFQZva3lhMJQV69C43AB1NTM4W-DZ"
data2 = pd.read_csv(file_url2)
In [21]:
data2.head(2)
Out[21]:
CentreName LearnerCode BookingType LeadSource DiscountType DateofBirth Gender Nationality HomeState HomeCity CourseLevel CourseName IsFirstIntake CompletedCourse ProgressionDegree ProgressionUniversity AuthorisedAbsenceCount UnauthorisedAbsenceCount
0 ISC_Aberdeen 2284932 Agent Standard Agent Booking NaN 13/01/1998 Male Chinese Jianye District; Jiangsu Province Nanjing Pre-Masters Business and Law Pre-Masters True Yes Msc Econ Accounting and Investment Management University of Aberdeen NaN NaN
1 ISC_Aberdeen 2399500 Agent Standard Agent Booking NaN 12/2/1998 Male Chinese NaN Xi'an Foundation Life Sciences Undergraduate Foundation Programme False Yes BSc Biological Sciences University of Aberdeen 93.0 5.0

Design Reasoning¶

  • All functions declared in Stage One will be reused in Stage Two to ensure consistency and avoid redundancy.
In [22]:
# Separating the two new features
new_features = data2[["AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]]
before_clean = data2.drop(columns=new_features)

# Performing initial preprocessing, similar to Stage One, using previously defined functions in stage one
after_clean, dropped_features = initial_preprocessing(before_clean, 50, 200)

# Reintegrating the new features after preprocessing
after_clean = pd.concat([after_clean, new_features], axis=1)

# Checking the processed data
dropout_data_check(after_clean)
Missing values:
 CentreName                  0.0
BookingType                 0.0
LeadSource                  0.0
Gender                      0.0
Nationality                 0.0
CourseLevel                 0.0
CourseName                  0.0
IsFirstIntake               0.0
CompletedCourse             0.0
ProgressionUniversity       0.0
Age                         0.0
AuthorisedAbsenceCount      1.0
UnauthorisedAbsenceCount    1.0
dtype: float64
Checking for duplicate values:
 1346
Data Shape:
 (25059, 13)

Observations¶

  • The dataset contains missing values in the following features:
    • AuthorisedAbsenceCount: 1% missing
    • UnauthorisedAbsenceCount: 1% missing
  • The next step will focus on handling these missing values to ensure data quality.

4.2.1. Dealing with Missing Values¶

In [23]:
one_hot = ["CentreName", "BookingType", "LeadSource", "Gender", "Nationality", "CourseLevel", "CourseName", "ProgressionUniversity"]
binary = ["IsFirstIntake", "CompletedCourse"]
after_clean = feature_encoding(after_clean, one_hot, binary)
In [24]:
def missing_values(dataset, features):

  for i in features:

    dataset[i] = dataset[i].fillna(dataset[i].median())

  return dataset
In [25]:
after_clean = missing_values(after_clean, new_features)

4.3. Neural Network Model¶

In [26]:
#dataset split and and scaling of age feature
X2 = after_clean.drop("CompletedCourse", axis = 1)
y2 = after_clean["CompletedCourse"]

X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, random_state=1, test_size=0.2)

standardscaler = StandardScaler()
X_train2[["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]] = standardscaler.fit_transform(X_train2[["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]])
X_test2[["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]] = standardscaler.transform(X_test2[["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount"]])
In [27]:
tuner = kt.RandomSearch(lambda hp: model_builder(hp, X_train2),
                        objective = "val_accuracy",
                        max_trials=10,
                        executions_per_trial=1,
                        directory='dir',
                        project_name='model_4')

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
tuner.search(X_train2, y_train2, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stop], verbose=1)
Trial 10 Complete [00h 03m 09s]
val_accuracy: 0.8862842917442322

Best val_accuracy So Far: 0.908229410648346
Total elapsed time: 00h 14m 28s
In [28]:
#best model
print("\n\033[1m Best performing Model after tuning\033[0m\n")
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]
model_comparism(tuner, best_hyperparameters, X_train2, y_train2, X_test2, y_test2)

#Initial configuration Model
print("\n\033[1m Initial configuration Model\033[0m\n")
worst_hyperparameters = tuner.get_best_hyperparameters(num_trials=10)[2]
model_comparism(tuner, worst_hyperparameters, X_train2, y_train2, X_test2, y_test2)
 Best performing Model after tuning

157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.9090 - auc_3: 0.9149 - loss: 0.2646 - precision_3: 0.9255 - recall_3: 0.9701
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.91, 
AUC: 0.91, 
Precision: 0.93, 
Recall: 0.97, 
Confusion Matrix: 
[[ 449  326]
 [ 132 4105]]
 Initial configuration Model

157/157 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.9047 - auc_4: 0.9080 - loss: 0.2953 - precision_4: 0.9221 - recall_4: 0.9688
157/157 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.90, 
AUC: 0.90, 
Precision: 0.92, 
Recall: 0.97, 
Confusion Matrix: 
[[ 425  350]
 [ 130 4107]]

4.3. Xgboost Model¶

In [29]:
xgb_model_builder(X_train2, y_train2, X_test2, y_test2)
Model trained with default parameters

Accuracy: 0.91, 
AUC: 0.92, 
Precision: 0.93, 
Recall: 0.97, 
Confusion Matrix: 
[[ 471  304]
 [ 126 4111]]
Fitting 3 folds for each of 10 candidates, totalling 30 fits

Model trained with hyperparameter tuning using RandomizedSearchCV

Accuracy: 0.91, 
AUC: 0.92, 
Precision: 0.93, 
Recall: 0.97, 
Confusion Matrix: 
[[ 469  306]
 [ 122 4115]]
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>

5. Stage Three¶

In [30]:
# File URL
file_url3 = "https://drive.google.com/uc?id=18oyu-RQotQN6jaibsLBoPdqQJbj_cV2-"
data3 = pd.read_csv(file_url3)
In [31]:
data3.head(2)
Out[31]:
CentreName LearnerCode BookingType LeadSource DiscountType DateofBirth Gender Nationality HomeState HomeCity ... CourseName IsFirstIntake CompletedCourse AssessedModules PassedModules FailedModules ProgressionDegree ProgressionUniversity AuthorisedAbsenceCount UnauthorisedAbsenceCount
0 ISC_Aberdeen 2284932 Agent Standard Agent Booking NaN 13/01/1998 Male Chinese Jianye District; Jiangsu Province Nanjing ... Business and Law Pre-Masters True Yes 4.0 4.0 0.0 Msc Econ Accounting and Investment Management University of Aberdeen NaN NaN
1 ISC_Aberdeen 2399500 Agent Standard Agent Booking NaN 12/2/1998 Male Chinese NaN Xi'an ... Life Sciences Undergraduate Foundation Programme False Yes 7.0 7.0 0.0 BSc Biological Sciences University of Aberdeen 93.0 5.0

2 rows × 21 columns

In [32]:
# Separating the five new features
new_features3 = data3[["AuthorisedAbsenceCount", "UnauthorisedAbsenceCount", "AssessedModules", "PassedModules", "FailedModules"]]
before_clean3 = data3.drop(columns=new_features)

# Performing initial preprocessing, similar to Stage One, using previously defined function in stage one
after_clean3, dropped_features3 = initial_preprocessing(before_clean, 50, 200)

# Reintegrating the new features after preprocessing
after_clean3 = pd.concat([after_clean3, new_features3], axis=1)

# Checking the processed data
dropout_data_check(after_clean3)
Missing values:
 CentreName                  0.0
BookingType                 0.0
LeadSource                  0.0
Gender                      0.0
Nationality                 0.0
CourseLevel                 0.0
CourseName                  0.0
IsFirstIntake               0.0
CompletedCourse             0.0
ProgressionUniversity       0.0
Age                         0.0
AuthorisedAbsenceCount      1.0
UnauthorisedAbsenceCount    1.0
AssessedModules             9.0
PassedModules               9.0
FailedModules               9.0
dtype: float64
Checking for duplicate values:
 1070
Data Shape:
 (25059, 16)
In [33]:
one_hot = ["CentreName", "BookingType", "LeadSource", "Gender", "Nationality", "CourseLevel", "CourseName", "ProgressionUniversity"]
binary = ["IsFirstIntake", "CompletedCourse"]
after_clean3 = feature_encoding(after_clean3, one_hot, binary)
In [34]:
def missing_values3(dataset, features, nine_percent):

  dataset = dataset.dropna(subset= nine_percent).copy()
  for i in features:

    dataset[i] = dataset[i].fillna(dataset[i].median())

  return dataset
In [35]:
nine_percent = ["AssessedModules", "PassedModules", "FailedModules"]
after_clean3 = missing_values3(after_clean3, new_features3, nine_percent)
In [36]:
#dataset split and and scaling of age feature
X3 = after_clean3.drop("CompletedCourse", axis = 1)
y3 = after_clean3["CompletedCourse"]

X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, random_state=1, test_size=0.2)

scale = ["Age", "AuthorisedAbsenceCount", "UnauthorisedAbsenceCount","AssessedModules", "PassedModules", "FailedModules"]
standardscaler = StandardScaler()
X_train3[scale] = standardscaler.fit_transform(X_train3[scale])
X_test3[scale] = standardscaler.transform(X_test3[scale])
In [37]:
tuner3 = kt.RandomSearch(lambda hp: model_builder(hp, X_train3),
                        objective = "val_accuracy",
                        max_trials=10,
                        executions_per_trial=1,
                        directory='dir',
                        project_name='model_5')

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
tuner3.search(X_train3, y_train3, epochs=100, batch_size=32, validation_split=0.2, callbacks=[early_stop], verbose=1)
Trial 10 Complete [00h 01m 10s]
val_accuracy: 0.9830276370048523

Best val_accuracy So Far: 0.986038863658905
Total elapsed time: 00h 15m 06s
In [38]:
#best model
print("\n\033[1m Best performing Model after tuning\033[0m\n")
best_hyperparameters3 = tuner3.get_best_hyperparameters(num_trials=1)[0]
model_comparism(tuner3, best_hyperparameters3, X_train3, y_train3, X_test3, y_test3)

#Initial configuration Model
print("\n\033[1m Initial configuration Model\033[0m\n")
worst_hyperparameters3 = tuner.get_best_hyperparameters(num_trials=10)[2]
model_comparism(tuner3, worst_hyperparameters3, X_train3, y_train3, X_test3, y_test3)
 Best performing Model after tuning

143/143 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - accuracy: 0.9889 - auc_3: 0.9980 - loss: 0.0588 - precision_3: 0.9908 - recall_3: 0.9973
143/143 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step
Accuracy: 0.99, 
AUC: 1.00, 
Precision: 0.99, 
Recall: 1.00, 
Confusion Matrix: 
[[ 268   38]
 [  14 4246]]
 Initial configuration Model

143/143 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.9859 - auc_4: 0.9962 - loss: 0.0730 - precision_4: 0.9910 - recall_4: 0.9938
143/143 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Accuracy: 0.99, 
AUC: 1.00, 
Precision: 0.99, 
Recall: 0.99, 
Confusion Matrix: 
[[ 271   35]
 [  33 4227]]
In [39]:
xgb_model_builder(X_train3, y_train3, X_test3, y_test3)
Model trained with default parameters

Accuracy: 0.99, 
AUC: 1.00, 
Precision: 0.99, 
Recall: 1.00, 
Confusion Matrix: 
[[ 283   23]
 [  20 4240]]
Fitting 3 folds for each of 10 candidates, totalling 30 fits

Model trained with hyperparameter tuning using RandomizedSearchCV

Accuracy: 0.99, 
AUC: 1.00, 
Precision: 0.99, 
Recall: 1.00, 
Confusion Matrix: 
[[ 280   26]
 [  15 4245]]
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>