Customer segmentation enables a business to group customers based on demographics (e.g., age, gender, education, occupation, marital status, and family size), geographics (e.g., country, time zone, language, and location), psychographics (e.g., lifestyle, values, personality, and attitudes), behaviour (e.g., purchase history, brand loyalty, and response to marketing activities), technographics (e.g., device type, browser type, and original source), and needs (e.g., product features, service needs, and delivery method). My challenge in this project was to apply critical thinking and machine learning concepts to design and implement clustering models that segment customers and improve marketing efforts.
Business context
I was provided with an e-commerce dataset from a real-world organization (SAS, 2024) and asked to segment its customers with clustering models to improve marketing efforts. It is a transnational dataset with customers from 47 countries across five continents (Oceania, North America, Europe, Africa, and Asia).
The dataset contains 951,668 rows, each representing a product a customer ordered. It includes details about the customer (e.g., location, product type, and loyalty membership) and the order (e.g., days to delivery, delivery date, order date, cost, quantity ordered, and profit) for orders placed between 1 January 2012 and 30 December 2016.
As each customer is unique, it was critical for me to identify and/or create new features for customer segmentation to inform marketing efforts. The dataset had 20 features that I could choose from:
- Quantity: The quantity the customer orders (e.g., 1, 2, or 3).
- City: Name of the city where the customer resides (e.g., Leinster, Berowra, or Northbridge).
- Continent: Name of the continent where the customer resides (e.g., Oceania, North America).
- Postal code: Where the customer resides (e.g., 6437, 2081, or 2063).
- State province: State or province where the customer resides (e.g., Western Australia, Quebec, or New South Wales).
- Order date: The date the order was placed (e.g., 1 January 2012 or 20 June 2014).
- Delivery date: The date the order was delivered (e.g., 12 April 2014 or 19 November 2016).
- Total revenue: Total revenue based on ordered items in USD (e.g., $123.80 or $85.10).
- Unit cost: Cost per unit ordered in USD (e.g., $9.10 or $56.90).
- Discount: Percentage of the normal total retail price (e.g., 50% or 30%).
- Order type label: Method in which the order was placed (e.g., internet sale or retail sale).
- Customer country label: The country where the customer resides (e.g., Australia, Canada, or Switzerland).
- Customer birthdate: The date the customer was born (e.g., 8 May 1978 or 18 December 1978).
- Customer group: Loyalty member group (e.g., internet/catalogue customers or Orion club gold members).
- Customer type: Loyalty member level (e.g., internet/catalogue customers or Orion club gold members high activity).
- Order ID: Unique order identifier (e.g., 1230000033).
- Profit: Total profit in USD, calculated as total revenue − (unit cost × quantity) (e.g., $1.20 or $0.40).
- Days to delivery: The number of days taken to deliver the order, calculated as delivery date − order date (e.g., 6, 3, or 2).
- Loyalty number: Loyal customer (99) versus non-loyal customer (0).
- Customer ID: A unique identifier for the customer (e.g., 8818, 47793).
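The two derived features in the list above (profit and days to delivery) follow directly from their stated formulas. A minimal pandas sketch, assuming hypothetical column names (`order_date`, `delivery_date`, `total_revenue`, `unit_cost`, `quantity`) and a couple of made-up toy rows rather than the real data:

```python
import pandas as pd

# Toy rows with hypothetical column names; the real dataset's headers
# and values may differ.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2012-01-01", "2014-06-20"]),
    "delivery_date": pd.to_datetime(["2012-01-07", "2014-06-23"]),
    "total_revenue": [123.80, 85.10],
    "unit_cost": [9.10, 56.90],
    "quantity": [3, 1],
})

# Profit = total revenue - (unit cost x quantity)
orders["profit"] = orders["total_revenue"] - orders["unit_cost"] * orders["quantity"]

# Days to delivery = delivery date - order date, expressed in whole days
orders["days_to_delivery"] = (orders["delivery_date"] - orders["order_date"]).dt.days

print(orders[["profit", "days_to_delivery"]])
```

Recomputing these columns is also a cheap consistency check against the values already stored in the dataset.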
Because the dataset is transnational, with customers spread across several continents, I needed a small set of metrics that would support segmentation for targeted marketing. From a marketing perspective, I focused on metrics that capture the nuances of the customer base: buying behavior, preferences, and value to the business.
- Frequency: I analyzed how often a customer made purchases over a given period. High frequency suggested satisfaction, trust, and brand loyalty, as well as effective marketing. Frequency guided my evaluation of targeted marketing campaigns and highlighted opportunities to engage less active customers.
- Recency: I assessed how recently customers made purchases or placed orders, since a long gap since the last order is a signal of churn (turnover) risk. Recent activity helped me infer satisfaction and overall engagement levels.
- Customer lifetime value (CLV): I calculated the total value each customer contributed to the business over the course of the relationship. This metric helped prioritize marketing efforts and focus on retaining high-value customers.
- Average unit cost: I examined whether customers preferred low-cost or high-cost items. Understanding these preferences helped me assess profitability and tailor marketing strategies to each price tier.
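The four metrics above can be derived per customer with a single group-by. The sketch below is illustrative only: the column names and toy rows are hypothetical, frequency is assumed to mean the number of distinct orders, CLV is approximated as summed revenue, and recency is measured in days from an assumed snapshot date one day after the last order in the data.

```python
import pandas as pd

# Toy order lines with hypothetical column names (one row per product ordered).
orders = pd.DataFrame({
    "customer_id": [8818, 8818, 47793, 47793, 47793],
    "order_id": [1, 2, 3, 3, 4],
    "order_date": pd.to_datetime(
        ["2012-01-01", "2016-12-01", "2014-06-20", "2014-06-20", "2015-03-05"]
    ),
    "total_revenue": [100.0, 50.0, 20.0, 30.0, 10.0],
    "unit_cost": [10.0, 20.0, 5.0, 15.0, 4.0],
})

# Assumed reference point for recency: the day after the latest order.
snapshot = orders["order_date"].max() + pd.Timedelta(days=1)

features = orders.groupby("customer_id").agg(
    frequency=("order_id", "nunique"),    # distinct orders per customer
    last_order=("order_date", "max"),     # most recent order date
    clv=("total_revenue", "sum"),         # CLV approximated as total revenue
    avg_unit_cost=("unit_cost", "mean"),  # mean unit cost of items bought
)

# Recency = days between the snapshot and the customer's last order.
features["recency"] = (snapshot - features["last_order"]).dt.days
features = features.drop(columns="last_order")

print(features)
```

Counting distinct order IDs rather than rows matters here because each row is a product line, so a single multi-item order would otherwise inflate frequency.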
Throughout this project, I encountered data science challenges, which I approached thoughtfully:
- Data quality and management: I addressed issues of inaccuracy, inconsistency, and incompleteness. I ensured clear definitions of customer segments and incorporated rigorous feature engineering and data preprocessing.
- Relevant segmentation: I carefully selected relevant criteria to avoid diluted clustering and overlapping cluster characteristics.
- Dynamic customer behavior: I considered seasonal trends and rapidly changing customer preferences influenced by economic factors.
- Privacy and ethical concerns: I navigated legal and ethical implications when analyzing customer data, ensuring unbiased segmentation.
- Actionability: I balanced the creation of broad and narrow segments to maintain marketing efficiency.
My task was to develop robust customer segmentation to help the e-commerce company understand and serve its customers better. I explored the data, performed preprocessing and feature engineering, applied dimensionality reduction techniques, and used clustering models for segmentation.
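The workflow described above (preprocessing, dimensionality reduction, clustering) can be chained in a single scikit-learn pipeline. This is a sketch under stated assumptions, not the project's actual configuration: it uses standard scaling, PCA, and k-means as stand-ins for the unspecified techniques, and runs on synthetic per-customer features rather than the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for per-customer features, in the order:
# frequency, recency, CLV, average unit cost. Two artificial groups.
X = np.vstack([
    rng.normal(loc=[2, 300, 100, 20], scale=[1, 30, 20, 5], size=(50, 4)),
    rng.normal(loc=[20, 30, 2000, 80], scale=[3, 10, 200, 10], size=(50, 4)),
])

# Scale first so that large-valued features such as CLV do not dominate,
# then project to two components, then cluster.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

print(np.bincount(labels))
```

Scaling before PCA is the key design choice: without it, the component directions would mostly reflect whichever feature has the largest raw variance.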
I successfully prepared a comprehensive report for stakeholders, presenting key insights derived from my analysis. My findings demonstrated how my solution could drive cost savings and strengthen stakeholder trust. Specifically, I:
- Extracted valuable insights from the dimensionality reduction analysis.
- Identified patterns in the data and provided actionable recommendations to the company.
- Determined the most effective statistical or ML technique for identifying the optimal number of clusters (k).
- Compared clusters based on frequency, recency, CLV, and average unit cost to reveal meaningful trends.
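The last two points, choosing k and comparing clusters, can be illustrated together. The sketch below assumes the silhouette score as the selection criterion, which is only one common option (the elbow method is another) and not necessarily the technique the project settled on. The data are synthetic, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
cols = ["frequency", "recency", "clv", "avg_unit_cost"]

# Three artificial customer groups (synthetic data, for illustration only).
centres = [[2, 300, 100, 20], [10, 100, 800, 50], [25, 20, 3000, 90]]
spreads = [[0.5, 20, 20, 3], [1, 15, 80, 5], [2, 5, 200, 8]]
X = pd.DataFrame(
    np.vstack([rng.normal(c, s, size=(60, 4)) for c, s in zip(centres, spreads)]),
    columns=cols,
)
X_scaled = StandardScaler().fit_transform(X)

# Score each candidate k; a higher silhouette means tighter, better-separated clusters.
scores = {}
for k in range(2, 7):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, candidate)

best_k = max(scores, key=scores.get)

# Refit with the chosen k and compare clusters on the four metrics.
X["cluster"] = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_scaled)
profile = X.groupby("cluster")[cols].mean()

print(best_k)
print(profile.round(1))
```

The per-cluster means in `profile` are the kind of table that makes segments comparable on frequency, recency, CLV, and average unit cost.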