In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
In [29]:
loc = "CUSTOMERS_CLEAN.csv"
retail = pd.read_csv(loc)

Efficiency in Programming: ColumnTransformer and Pipeline¶


Objective¶

To explore the feasibility of incorporating ColumnTransformer and Pipeline for more efficient programming during dimensional reduction and clustering processes.


Findings¶

  1. Current Approach:

    • I utilized functions effectively to streamline repetitive tasks in the project.
    • Core components such as t-SNE, PCA, and StandardScaler were implemented only once, ensuring consistency and avoiding redundancy.
  2. Limitations of ColumnTransformer and Pipeline:

    • The ColumnTransformer and Pipeline frameworks, while designed for efficient pre-processing and modeling workflows, do not align well with the specific structure and methodology I have adopted in this project.
    • The sequential and exploratory nature of this analysis, where each step (e.g., scaling, dimensional reduction, clustering) was iteratively refined, is better suited to the current approach.
  3. Trade-offs:

    • While ColumnTransformer and Pipeline offer modularity and reusable code, the current method already provides clarity and control over individual processes.
    • Attempting to retrofit these tools would add unnecessary complexity without providing significant benefits for this project.

Conclusion¶

  • The current implementation is efficient and avoids redundancy by leveraging functions to centralize operations for t-SNE, PCA, and scaling.
  • For this project, ColumnTransformer and Pipeline are not well-suited due to their rigid workflow structure, which conflicts with the iterative and exploratory process undertaken.
In [4]:
retail.head()
Out[4]:
Quantity City Continent Postal_Code State_Province Order_Date Delivery_Date Total Revenue Unit Cost Discount OrderTypeLabel CustomerCountryLabel Customer_BirthDate Customer_Group Customer_Type Order ID Profit Days to Delivery Loyalty Num Customer ID
0 3 Leinster Oceania 6437 Western Australia 01JAN2012 07JAN2012 $28.50 $9.10 . Internet Sale Australia 08MAY1978 Internet/Catalog Customers Internet/Catalog Customers 1230000033 $1.20 6 99 8818
1 2 Berowra Oceania 2081 New South Wales 01JAN2012 04JAN2012 $113.40 $56.90 . Internet Sale Australia 13DEC1978 Orion Club Gold members Orion Club Gold members high activity 1230000204 ($0.40) 3 99 47793
2 2 Berowra Oceania 2081 New South Wales 01JAN2012 04JAN2012 $41.00 $18.50 . Internet Sale Australia 13DEC1978 Orion Club Gold members Orion Club Gold members high activity 1230000204 $4.00 3 99 47793
3 1 Northbridge Oceania 2063 New South Wales 01JAN2012 03JAN2012 $35.20 $29.60 . Internet Sale Australia 22JUN1997 Orion Club Gold members Orion Club Gold members high activity 1230000268 $5.60 2 0 71727
4 1 Montréal North America NaN Quebec 01JAN2012 04JAN2012 $24.70 $23.60 . Internet Sale Canada 28JAN1978 Orion Club Gold members Orion Club Gold members medium activity 1230000487 $1.10 3 99 74503

The initial data preprocessing will focus on cleaning the features that will be utilized to create five new derived features. Subsequent analysis and clustering will be performed on these five features.

In [35]:
def initial_preprocessing(retail):

    retail = retail.copy()
    print("Missing values:\n", retail.isnull().sum())
    print("Checking for duplicate values:\n", retail.duplicated().sum())
    print("Data Shape:\n", retail.shape)
    # dropping duplicate rows
    retail = retail.drop_duplicates()
    print("Checking for duplicate values after dropping:\n", retail.duplicated().sum())

    return retail
In [37]:
retial = initial_preprocessing(retail)
Missing values:
 Quantity                     0
City                       135
Continent                    0
Postal_Code               3716
State_Province          117192
Order_Date                   0
Delivery_Date                0
Total Revenue                0
Unit Cost                    0
Discount                     0
OrderTypeLabel               0
CustomerCountryLabel         0
Customer_BirthDate           0
Customer_Group               0
Customer_Type                0
Order ID                     0
Profit                       0
Days to Delivery             0
Loyalty Num                  0
Customer ID                  0
dtype: int64
Checking for duplicate values:
 21
Data Shape:
 (951669, 20)
Checking for duplicate values after dropping:
 0

Observations¶

  • Three features in the dataset contain missing values, specifically:
    • City: 135
    • Postal_Code: 3716
    • State_Province: 117192
  • However, the missing values will not be addressed, as the associated features are not essential for the completion of this project.
  • Data size: 951669 and Shape: 20
  • The dataset originally included 21 duplicate rows, which have been identified and removed.
  • The next step in dataset preprocessing will involve cleaning key features essential and features engineering for the successful completion of this project. These features are:
    • Customer ID
    • Delivery_Date
    • Order ID
    • Total Revenue
    • Unit Cost
    • Customer_BirthDate
In [48]:
def initial_preprocessing2(retail, columns):

    retail = retail.copy()

    today = pd.to_datetime("today")
    retail["Delivery_Date"] = pd.to_datetime(retail["Delivery_Date"], format="%d%b%Y")
    retail["Customer_BirthDate"] = pd.to_datetime(retail["Customer_BirthDate"], format="%d%b%Y")
    
    for i in columns:
        retail[i] = retail[i].replace("[\$,]", "", regex=True).astype(float)

    aggregated = retail.groupby("Customer ID").agg(
        frequency = ("Order ID", "count"), # 
        recency = ("Delivery_Date", lambda x: (today - x.max()).days),  # Days since most recent order
        CLV = ("Total Revenue", "sum"),  # Total Revenue as CLV
        Avg_unit_cost = ("Unit Cost", "mean"), # average unit cost
        age = ("Customer_BirthDate", lambda x: round((today - x.max()).days/365, 1))  # cal age
    ).reset_index()

    return aggregated
In [50]:
columns = ["Total Revenue", "Unit Cost"]
project_data = initial_preprocessing2(retail, columns)
In [53]:
project_data = initial_preprocessing(project_data)
Missing values:
 Customer ID      0
frequency        0
recency          0
CLV              0
Avg_unit_cost    0
age              0
dtype: int64
Checking for duplicate values:
 0
Data Shape:
 (68300, 6)
Checking for duplicate values after dropping:
 0

The next step in our analysis will focus on performing exploratory data analysis (EDA) on the newly created features and DataFrame.

In [56]:
plotting_data = project_data.drop("Customer ID", axis = 1)
plotting_data.describe()
Out[56]:
frequency recency CLV Avg_unit_cost age
count 68300.000000 68300.000000 68300.000000 68300.000000 68300.000000
mean 13.933660 3270.399971 1950.168370 78.894765 51.592201
std 11.329121 414.289931 1719.939245 38.004880 17.486454
min 1.000000 2884.000000 0.630000 0.500000 27.000000
25% 6.000000 2948.000000 696.000000 57.579792 36.200000
50% 11.000000 3092.000000 1497.450000 73.449286 51.400000
75% 19.000000 3442.000000 2709.845000 92.515260 66.600000
max 121.000000 4728.000000 18860.960000 1463.500000 82.000000
In [60]:
fig, axes = plt.subplots(1, len(plotting_data.columns), figsize=(18, 6))

for i, column in enumerate(plotting_data.columns):
    sns.histplot(plotting_data[column], ax=axes[i], kde = True, bins = 20, color = "blue")
    axes[i].set_title(f'histogram of {column}')  # Title for each subplot

# Adjust the layout to prevent overlap
plt.tight_layout()
plt.show()
In [62]:
fig, axes = plt.subplots(1, len(plotting_data.columns), figsize=(18, 6))

for i, column in enumerate(plotting_data.columns):
    sns.boxplot(data = plotting_data[column], ax=axes[i])
    axes[i].set_title(f'Boxplot of {column}')  # Title for each subplot

# Adjust the layout to prevent overlap
plt.tight_layout()
plt.show()
In [64]:
sns.scatterplot(data = plotting_data, x = "frequency", y = "CLV") #investigating for linear relationship 
plt.show()
In [66]:
sns.scatterplot(data = plotting_data, x = "Avg_unit_cost", y = "CLV") #investigating for linear relationship 
plt.show()

Data Preprocessing Findings and Exploratory Data Analysis¶

Data Preprocessing Findings¶

The data preprocessing phase 2 focused on ensuring the dataset's quality and reliability for effective customer segmentation. The key metrics—Frequency, Recency, Customer Lifetime Value (CLV), Average Unit Cost, and Age—were prioritized to align with business objectives. Below are the summarized findings:


1. Missing Values¶

  • No missing values remain after aggregating the dataset to one row per customer.

2. Distribution Analysis¶

  • All features exhibit medians lower than their means, indicating potential right-skewness. Histograms confirm this observation.
  • The Age feature demonstrates a multimodal distribution.

3. Outliers¶

  • Significant outliers exist above the typical ranges for all features except Age, which has no outliers.
  • Average Unit Cost has an additional outlier below its typical range.

4. Feature Relationships¶

  • Frequency has a linear relationship with Customer Lifetime Value (CLV), suggesting that customers who purchase more frequently also contribute higher lifetime value.
  • There is no observable relationship between Frequency and Average Unit Cost.

Next Steps¶

To ensure robust segmentation, outlier detection using Isolation Forest will be conducted to better understand and manage the influence of extreme values.


In [69]:
# Sclar 
scaler = StandardScaler()
retail_scaled = scaler.fit_transform(plotting_data)
retail_scaled = pd.DataFrame(retail_scaled, columns = plotting_data.columns)
In [71]:
# 11. dimension reduction with PCA and t-SNE to reduce the data to 2D
# PCA
pca = PCA(n_components = 2)
pca_result = pca.fit_transform(retail_scaled)
X_pca = pd.DataFrame(pca_result, columns=[f'PC{i+1}' for i in range(pca_result.shape[1])])
In [73]:
# t-SNE
TSNE_model = TSNE(n_components=2, perplexity=10)
TSNE_transformed_data = TSNE_model.fit_transform(retail_scaled)
X_tsne = pd.DataFrame(TSNE_transformed_data, columns=[f'tsne{i+1}' for i in range(TSNE_transformed_data.shape[1])])
In [75]:
def plot_pca(X_pca, y):
   
    X_pca["anomaly"] = y
    
    custom_palette = {1: "blue", -1: "red"}
    sns.scatterplot(data= X_pca, x = "PC1", y = "PC2", hue = "anomaly", palette = custom_palette)
    plt.legend(labels=['Normal ', 'Anomaly'], loc='upper right') 
    plt.show()
In [77]:
def iso_forest(retail):

    retail = retail.copy()

    model = IsolationForest(n_estimators = 100, contamination =  0.025, random_state = 42)
    model.fit(retail)

    y_pred = model.predict(retail)
    print(pd.Series(y_pred).value_counts(normalize=True) * 100)

    return y_pred
In [79]:
iso_retail_pred = iso_forest(plotting_data)
 1    97.499268
-1     2.500732
Name: proportion, dtype: float64
In [81]:
plot_pca(X_pca, iso_retail_pred)

Outlier Detection: Observations and Model Performance¶


Isolation Forest Model¶

Model Parameters¶

  • Contamination Factor: Set at 0.025, within the standard guideline range of 1–5%, to identify approximately 2.5% of the data as anomalies.
  • Number of Estimators (n_estimators): Set to 100 for robust detection of outliers.

Model Observations¶

  • The model effectively identified anomalies, yielding results consistent with the target range of 1–5% (specifically 2.5%).
  • The identified outliers represent extreme deviations in one or more of the key metrics: Frequency, Recency, CLV, Average Unit Cost, and Age.

PCA-Based Visualization¶

Visualization Insights¶

  • 2D PCA Scatterplot: Used to visualize the data in a reduced-dimensionality space.
    • Normal data points are represented in one color (e.g., blue).
    • Outliers are highlighted distinctly in another color (e.g., red).
  • The identified outliers are concentrated at the periphery of the scatterplot, suggesting they deviate significantly from the central distribution of the data.
  • Clustering of outliers is observed at both the top and the tail of the scatterplot, indicating regions where anomalous values in the metrics are more pronounced.

Conclusion¶

The Isolation Forest model, combined with PCA visualization, provides a robust approach to identifying and validating outliers in the dataset. The results highlight potential areas of concern or interest, such as extreme values in customer behavior or product preferences. These outliers will be carefully examined and, where necessary, managed to ensure they do not unduly influence the customer segmentation process.


Next Steps¶

Select the optimum value of clusters ( k ) with the Elbow and Silhouette score methods


In [84]:
k_range = range(2, 11)
def k_means(retail, k_range):

    wss = []
    silhouette_scores = []
    retail = retail.copy()
    
    for k in k_range:
        
        kmeans = KMeans(n_clusters = k, init = 'k-means++', random_state = 42, n_init = 10)
        kmeans.fit(retail)
        score = silhouette_score(retail, kmeans.labels_)
        
        wss.append(kmeans.inertia_)
        silhouette_scores.append(score)

    return wss, silhouette_scores
In [86]:
wss, silhouette_score = k_means(retail_scaled, k_range)
In [88]:
# with scaled dataset
plt.figure(figsize=(8, 6))
plt.plot(k_range, wss)
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.show()
In [90]:
plt.figure(figsize=(8,6))
plt.plot(k_range, silhouette_score)
plt.title('Silhouette Score for Different Values of k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()
In [92]:
for k, score in zip(k_range, silhouette_score):
    print(f"Silhouette Score for k={k}: {score}")
Silhouette Score for k=2: 0.2608448975514255
Silhouette Score for k=3: 0.2525993965818325
Silhouette Score for k=4: 0.25319115183306273
Silhouette Score for k=5: 0.26694482763779953
Silhouette Score for k=6: 0.2525431591824519
Silhouette Score for k=7: 0.2351264830470789
Silhouette Score for k=8: 0.23495134858551955
Silhouette Score for k=9: 0.24030792886693111
Silhouette Score for k=10: 0.22471383902722936

Determining the Optimal Number of Clusters ( ( k ) )¶


1. Elbow Method for Determining ( k )¶

Overview¶

The Elbow Method was applied to identify the optimal number of clusters (( k )) for customer segmentation by analyzing the within-cluster sum of squares (WSS). This method helps pinpoint the ( k ) value where adding more clusters yields diminishing returns in clustering performance.

Findings¶

  • The WSS was evaluated for ( k ) values ranging from 2 to 10.
  • Based on the elbow plot, the optimal ( k ) lies between 5 and 6:
    • The WSS decreases significantly up to ( k = 5 ).
    • After ( k = 5 ), the rate of improvement slows, indicating marginal gains with additional clusters.

Next Steps¶

To confirm the optimal ( k ) and assess clustering quality, the Silhouette Score method was used as a secondary evaluation metric.


2. Optimal ( k ) Based on the Silhouette Score¶

Overview¶

The Silhouette Score measures clustering quality by evaluating:

  • Cohesion: How similar points are within a cluster.
  • Separation: How distinct clusters are from one another.

Findings¶

  • The Silhouette Scores were calculated for various ( k ) values, with the highest score observed at ( k = 5 ).
  • This result indicates that a clustering structure with 5 clusters provides the best balance between intra-cluster similarity and inter-cluster dissimilarity.

3. Conclusion¶

Perform K-Means Clustering¶

With ( k = 5 ) determined as the optimal number of clusters:

  • K-Means Clustering will be performed to segment customers.
  • The analysis will focus on understanding the characteristics of each cluster based on:
    • Frequency
    • Recency
    • Customer Lifetime Value (CLV)
    • Average Unit Cost
    • Age
  • These insights will inform targeted marketing strategies and customer retention initiatives.
In [95]:
# performing K means with k = 5
def k_mean(retail_scaled, k):

    retail_scaled = retail_scaled.copy()
    kmeans = KMeans(n_clusters = 5, init = "k-means++", random_state = 42, n_init = 10)

    y_cluster = kmeans.fit_predict(retail_scaled)
    print("\nCluster Labels for each data point:\n", pd.Series(y_cluster).value_counts())

    return y_cluster
In [ ]:
k = 5
y_cluster = k_mean(retail_scaled, k)
In [99]:
# Viewing the cluster number associated with each customer_ID
cluster  = project_data[["Customer ID"]].copy()
cluster["class"] = y_cluster
cluster.head()
Out[99]:
Customer ID class
0 1 3
1 3 4
2 4 0
3 5 4
4 6 1

Performing K-Means Clustering and Viewing Cluster Labels¶


1. K-Means Clustering¶

After determining that the optimal number of clusters (( k = 5 )) was based on the Elbow Method and Silhouette Score, we performed K-Means Clustering on the dataset to segment the customers.

Cluster Assignment¶

At Run Time The K-Means algorithm assigned each customer to one of the five clusters. Below are the cluster labels assigned to each customer based on the clustering process:

  • Cluster 1: 22,592 customers
  • Cluster 3: 20,296 customers
  • Cluster 4: 11,111 customers
  • Cluster 2: 10,622 customers
  • Cluster 0: 3,679 customers

3. Next Step: Create Boxplots¶

The next step involves creating boxplots to display the clusters with regard to the following key metrics:

  • Frequency
  • Recency
  • Customer Lifetime Value (CLV)
  • Average Unit Cost
  • Customer Age

These visualizations will help us understand the distribution and variability of these metrics within each cluster, highlighting key differences and trends. The boxplots will also help identify any outliers that might influence cluster interpretations.

Objective¶

Using boxplots, we will compare and contrast clusters based on the metrics to gain actionable insights for targeted marketing strategies and customer retention efforts.

In [102]:
# Create boxplots to display the clusters with regard to frequency, recency, CLV, average unit cost, and customer age.
boxplotting = plotting_data.copy()
boxplotting["class"] = y_cluster

fig, axes = plt.subplots(1, len(boxplotting.columns) - 1, figsize=(18, 6))

for i, column in enumerate(boxplotting.columns):
    if column != "class":
        sns.boxplot(data = boxplotting, x = "class", y = column, ax=axes[i])
        axes[i-1].set_title(f'Boxplot of {column}')  # Title for each subplot

# Adjust the layout to prevent overlap
plt.tight_layout()
plt.show()

Boxplot Analysis of Clusters Based on Key Metrics¶


Objective¶

To understand the differences in customer behavior across the five clusters, we created boxplots for the following metrics:

  • Frequency
  • Recency
  • Customer Lifetime Value (CLV)
  • Average Unit Cost
  • Customer Age

These visualizations provide insights into the distribution and variability of each metric within each cluster, helping us better interpret customer segmentation.


Findings from Boxplots¶

1. Outlier Analysis¶

  • All features exhibit a significant number of outliers above their typical ranges, except for frequency, which shows a more stable distribution across clusters.
  • For age, clusters 1 and 4 display outliers below their typical ranges, indicating the presence of some unusually younger customers in these segments.

2. Cluster-Specific Observations¶

  • Frequency: This feature demonstrates a relatively stable distribution across all clusters, suggesting that purchase frequency is consistent and not a strong differentiator for segmenting customers.
  • Recency: Clusters vary significantly, indicating differences in customer engagement. For example, customers in some clusters tend to make more recent purchases, highlighting potentially higher engagement or loyalty.
  • Customer Lifetime Value (CLV): Larger variability across clusters reflects that some groups contribute significantly more revenue over their relationship with the business, potentially due to high-value or repeated purchases.
  • Average Unit Cost: Wide variability and outliers in this metric suggest that certain clusters include customers purchasing either very low-cost or very high-cost items, possibly indicative of different purchasing power or product preferences.
  • Age: Variability and the presence of younger outliers in clusters 1 and 4 suggest that these groups may include a higher proportion of younger customers, which could be valuable for marketing campaigns targeting this demographic.

Interpretation in Business Context¶

  1. Outliers and Variability:
    The significant number of outliers, especially in CLV and Average Unit Cost, indicates a diverse customer base with varying purchasing power and product preferences. Marketing strategies should focus on identifying and catering to high-value customers while considering the needs of price-sensitive segments.

  2. Cluster Differences:

    • Clusters with higher recency values may represent engaged customers, providing an opportunity for upselling or loyalty programs.
    • Clusters with higher CLV should be prioritized for retention efforts, as they are the most valuable to the business over time.
    • Clusters with younger customers (based on age) can benefit from targeted campaigns promoting products that appeal to this demographic, such as trendy or technology-focused offerings.
  3. Frequency as a Stable Feature:
    The stability in frequency suggests it might be less effective for differentiation. However, it remains crucial for measuring overall engagement and identifying loyal customers who make frequent purchases.


In [105]:
# 2D visualisation to display the clusters with different colours. Use the output from the PCA and t-SNE.
X_tsne_plot = X_tsne.copy()
X_tsne_plot["class"] = y_cluster

sns.scatterplot(data = X_tsne_plot, x = "tsne1", y = "tsne2", hue = "class", palette="Set1")
plt.legend(loc='upper right') 
plt.title("T-SNE Plot")
plt.show()
In [107]:
X_pca_plot = X_pca.copy()
X_pca_plot["class"] = y_cluster

sns.scatterplot(data = X_pca_plot, x = "PC1", y = "PC2", hue = "class", palette="Set1")
plt.legend(loc='upper right') 
plt.title("PCA Plot")
plt.show()

Dimension Reduction with PCA and t-SNE¶


Objective¶

To visualize the customer clusters in a 2D space, we performed dimension reduction using both PCA (Principal Component Analysis) and t-SNE (t-distributed Stochastic Neighbor Embedding). These methods help uncover patterns in high-dimensional data by projecting it into two dimensions while retaining as much meaningful structure as possible.


Findings from PCA and t-SNE¶

1. PCA Visualization¶

  • The PCA-based 2D plot shows clear separations between most clusters.
  • However, there is some overlap involving Cluster 4, which suggests that customers in this group may share similarities with customers in other clusters.

2. t-SNE Visualization¶

  • The t-SNE-based 2D visualization also demonstrates clear separations between clusters, with similar patterns to the PCA plot.
  • Adjusting the perplexity parameter (tested at values of 30, 15, and 10) did not significantly impact the visual output, confirming the stability of the t-SNE results.
  • As in PCA, Cluster 4 overlaps with other clusters, reinforcing that this cluster may represent a transitional or mixed group of customers.

Interpretation in Business Context¶

  1. Clear Cluster Separations:
    The distinct separations observed in both PCA and t-SNE visualizations validate the robustness of the clustering process. Each cluster represents a unique segment of customers with specific behavioral patterns, making them actionable for tailored marketing strategies.

  2. Cluster 1 Overlap:
    The overlap of Cluster 4 with other clusters suggests that this group may include customers whose purchasing behavior, engagement, or demographics share characteristics with multiple segments. This transitional nature may require a nuanced approach:

    • Conduct further analysis of Cluster 4 to identify sub-groups or shared traits.
    • Develop generalized marketing strategies that appeal to broader behaviors while targeting key similarities within this group.
  3. Validation of Clustering with Dimensional Reduction:
    The alignment between PCA and t-SNE results strengthens the confidence in the identified clusters and supports their use for actionable customer segmentation.


Visual Representations¶

  1. PCA-Based 2D Visualization:
    The PCA plot provides a straightforward representation of the cluster structure, highlighting the general separations and overlaps between groups.

  2. t-SNE-Based 2D Visualization:
    The t-SNE plot emphasizes local relationships between data points, revealing nuanced structures and validating the robustness of the clustering process.


In [120]:
# 7. Perform hierarchical clustering and create a dendogram.

retail_sample = retail_scaled.sample(30000)

agglo_cluster = AgglomerativeClustering(n_clusters = 5, metric='euclidean', linkage='ward')
y_agglo_cluster = agglo_cluster.fit_predict(retail_sample)

linked = linkage(retail_sample, method='ward')
plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.title('Dendrogram for Hierarchical Clustering')
plt.show()

The dendrogram revealed a clear cutoff point at 5 clusters, confirmed by the distinct separation of branches at this level. This aligns with the results of the Elbow and Silhouette analyses from K-Means.