Applying supervised learning to predict student dropout

Student retention is critical for educational institutions, impacting financial sustainability and
academic success. High dropout rates can lead to revenue losses and reputational damage.
Study Group, a global education provider, aims to enhance student success by identifying
at-risk students early and implementing proactive interventions. This study applies supervised
machine learning techniques to predict dropout risks, enabling Study Group to refine its support
strategies and improve student retention.

Note: I can’t make the data available due to a non-disclosure agreement (NDA).

  • Project Definition
  • Jupyter Notebook
  • Report

📊 Predicting Student Dropout with Supervised Learning

Client: Study Group (Global Education Provider)

Tools & Technologies: Python, Pandas, Scikit-learn, Keras, XGBoost, Data Visualization, Supervised Machine Learning

🧠 Problem

Study Group sought to proactively reduce student dropout rates across its International Study Centres in the UK and Dublin. High dropout rates not only impacted student outcomes but also risked significant revenue loss and reputational harm.

🎯 Objective

Develop a data-driven model to accurately predict student dropout risks using demographic, engagement, and academic performance data, enabling early interventions and targeted student support.

📌 Approach

1. Data Preprocessing & Feature Selection

  • Processed 25,060 learner records from three datasets covering different stages of the learner journey.
  • Cleaned the data and reduced dimensionality by dropping high-cardinality and low-completeness features.
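
Since the real data is under NDA, here is a minimal sketch of that kind of feature filtering on a tiny synthetic frame. The column names, the helper `drop_weak_features`, and both thresholds (`max_unique`, `min_completeness`) are assumptions for illustration, not the values used in the project.

```python
import pandas as pd


def drop_weak_features(df, max_unique=50, min_completeness=0.5):
    """Drop columns that are mostly missing or have too many distinct categories.

    Both thresholds are illustrative assumptions, not the project's values.
    """
    to_drop = []
    for col in df.columns:
        completeness = df[col].notna().mean()  # share of non-missing values
        is_categorical = not pd.api.types.is_numeric_dtype(df[col])
        too_sparse = completeness < min_completeness
        too_many_levels = is_categorical and df[col].nunique() > max_unique
        if too_sparse or too_many_levels:
            to_drop.append(col)
    return df.drop(columns=to_drop), to_drop


# Tiny synthetic frame; every column name here is hypothetical.
learners = pd.DataFrame({
    "learner_id": [f"L{i}" for i in range(100)],         # unique per row
    "course_level": ["Foundation", "Pre-Masters"] * 50,  # low cardinality
    "agent_notes": [None] * 90 + ["text"] * 10,          # only 10% complete
    "age": list(range(18, 68)) * 2,                      # numeric, complete
})

cleaned, dropped = drop_weak_features(learners)
print(dropped)                 # ['learner_id', 'agent_notes']
print(list(cleaned.columns))   # ['course_level', 'age']
```

Identifier-like columns are dropped because one-hot encoding them would add one feature per learner, and mostly empty columns carry too little signal to justify imputing them.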

2. Stage-wise Model Development

  • Stage 1 – Demographic & Course Info:
    Created baseline models using data such as gender, course level, nationality, and course completion status.
    ✦ Achieved up to 90% accuracy but found limited predictive power using demographics alone.
  • Stage 2 – Engagement Data:
    Introduced Authorised and Unauthorised Absence metrics.
    ✦ Slight improvement: AUC rose from 0.89 to 0.92, demonstrating the predictive value of real-time student behavior.
  • Stage 3 – Academic Performance Data:
    Integrated academic results, including number of passed and failed modules.
    ✦ Achieved 99% accuracy and 1.00 AUC, indicating that academic performance is the strongest dropout predictor.
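
The stage-wise comparison can be sketched roughly as below: fit the same kind of model on progressively richer feature sets and score each on the same held-out learners. Everything here is an assumption for illustration: the data is synthetic, the feature names are hypothetical, and logistic regression stands in for the XGBoost and neural network models used in the project, so the printed AUCs will not match the figures above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for the three feature groups (all hypothetical).
demographic = rng.normal(size=(n, 2))                       # e.g. encoded course level
absences = rng.poisson(3, size=(n, 1)).astype(float)        # e.g. unauthorised absences
failed_modules = rng.poisson(1, size=(n, 1)).astype(float)  # e.g. failed module count

# Synthetic label: dropout driven mostly by failed modules, then absences.
logit = (0.3 * demographic[:, 0] + 0.4 * absences[:, 0]
         + 1.5 * failed_modules[:, 0] - 3.5)
dropout = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Each stage adds one feature group on top of the previous stage.
stages = {
    "stage_1_demographic": demographic,
    "stage_2_plus_engagement": np.hstack([demographic, absences]),
    "stage_3_plus_academic": np.hstack([demographic, absences, failed_modules]),
}

# One fixed split so every stage is scored on the same held-out learners.
train_idx, test_idx = train_test_split(
    np.arange(n), test_size=0.25, random_state=0, stratify=dropout
)

auc_by_stage = {}
for name, X in stages.items():
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], dropout[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    auc_by_stage[name] = roc_auc_score(dropout[test_idx], scores)

for name, auc in auc_by_stage.items():
    print(f"{name}: AUC = {auc:.3f}")
```

Reusing one train/test split across stages keeps the comparison fair: any change in AUC comes from the added features rather than from a different sample of learners.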

3. Modeling Techniques

  • Applied and tuned XGBoost and Neural Network models.
  • Conducted hyperparameter tuning with RandomizedSearchCV and Keras Tuner.
  • Used Confusion Matrix, AUC, Precision & Recall for model evaluation.
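
A minimal sketch of the tuning and evaluation loop, under stated assumptions: the data is synthetic, the search ranges are made up, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so the example runs with scikit-learn and SciPy alone. `xgboost.XGBClassifier` follows the scikit-learn estimator interface, so it could be passed to `RandomizedSearchCV` in the same way. The Keras Tuner side is not shown.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the learner data (roughly 15% positives).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           weights=[0.85], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Search space: these ranges are illustrative assumptions.
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": randint(2, 5),
    "learning_rate": uniform(0.01, 0.2),  # samples from [0.01, 0.21]
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,           # number of random configurations to try
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

best = search.best_estimator_
y_pred = best.predict(X_test)
y_score = best.predict_proba(X_test)[:, 1]

cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
print("Best params:", search.best_params_)
print(cm)
print("AUC:", roc_auc_score(y_test, y_score))
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:", recall_score(y_test, y_pred))
```

Scoring the search on AUC rather than accuracy is a deliberate choice for imbalanced dropout data, where a model that predicts "stays" for everyone can still look accurate.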

💡 Key Insights

  • Early-stage demographic data on its own is insufficient for accurate predictions.
  • Engagement data enhances early detection, but academic data is critical for precision intervention.
  • A recall of 1.00 means no at-risk students in the evaluation data went undetected, supporting proactive academic counseling.
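
To make the recall figure concrete, here is a small worked example with made-up labels (not project data). Recall is TP / (TP + FN), so it reaches 1.00 exactly when there are no false negatives. Precision, TP / (TP + FP), can still fall below 1.00 if some learners who stayed are flagged as well.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class (1 = dropped out)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 3 learners dropped out, 5 stayed.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
# The model flags all 3 dropouts plus 1 learner who actually stayed.
y_pred = [1, 1, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.75 1.0
```

In this toy case every dropout is caught (recall 1.0) at the cost of one unnecessary intervention (precision 0.75).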

πŸ† Outcome & Business Impact

  • Delivered a clear, actionable report to stakeholders highlighting intervention points.
  • Empowered Study Group with an evidence-backed roadmap to:
    • Improve student retention
    • Enhance academic outcomes
    • Save on lost tuition and institutional costs
