Application Use Cases
Unraveling Recommender Systems: From Popularity to Personalization
Introduction

In today’s digital age, personalized recommendations drive engagement. From Netflix suggesting your next watch to Amazon’s “Customers who bought this also bought…”, recommender systems shape our choices. But how do they work? By analyzing user interactions, they uncover hidden patterns. In this article, we’ll explore their mechanics and implement them using real-world movie data.

Types of Recommender Systems

Recommender systems can be broadly categorized into four main types: Popularity-based, Content-based filtering, Collaborative filtering, and Hybrid approaches, with Context-aware recommendation as a further refinement. Each method has its own strengths and challenges, which we will explore in detail below.

Popularity-based approach

Recommends popular items based on overall engagement, not individual preferences. Ideal for suggesting trending movies, best-selling books, or top-streamed songs where personalization isn’t needed.

Collaborative filtering approach

Suggests items based on user or item similarities (user-based or item-based CF). However, it struggles with the cold start problem when new users or items lack data. Netflix addresses this by asking new users to rate movies upfront, enabling better initial recommendations.

Content-based filtering approach

Analyzes item attributes like genres and features to recommend similar content based on user preferences. This helps with cold start issues but may cause over-specialization, limiting diversity. For example, a sci-fi fan might only get sci-fi recommendations. Spotify avoids this with features like ‘Discover Weekly,’ which introduces fresh yet relevant content.

Hybrid approach

Combines multiple techniques to leverage their respective strengths. For instance, a hybrid model might:
- Recommend trending movies based on overall engagement (popularity-based).
- Offer highly rated movies enjoyed by users with similar preferences (collaborative filtering).
- Suggest movies with similar genres to those a user has previously liked (content-based).

This ensures recommendations are both personalized and diverse, avoiding issues like over-specialization or cold start problems.

Context-aware approach

Uses external factors like time, location, device, or mood to refine recommendations in real time. For example, a streaming service might suggest upbeat music during a morning commute and relaxing shows at home in the evening, making recommendations more dynamic and relevant.
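The walkthrough below implements the first four approaches; the context-aware idea can be sketched as a simple re-ranking step on top of any of them. The following is a minimal, illustrative sketch and is not part of the original framework code: it assumes the MovieLens movie table loaded in the EDA step below, a final_recs table of titles and scores like the one produced by the hybrid recommender at the end of this article, and an invented context-to-genre mapping.

import pandas as pd

# Hypothetical mapping from context (time of day) to preferred genres - an assumption for illustration only
CONTEXT_GENRE_BOOST = {
    'morning': ['Comedy', 'Animation'],
    'evening': ['Drama', 'Romance'],
}

def rerank_with_context(final_recs, movie, time_of_day, boost=1.2):
    # Boost the score of titles whose genres match the current context
    recs = final_recs.merge(movie[['title', 'genres']], on='title', how='left')
    preferred = CONTEXT_GENRE_BOOST.get(time_of_day, [])
    matches = recs['genres'].fillna('').apply(lambda g: any(p in g for p in preferred))
    recs['score'] = recs['score'] * matches.map({True: boost, False: 1.0})
    return recs.sort_values(by='score', ascending=False)[['title', 'score']]

# Example usage (once final_recs and movie exist):
# print(rerank_with_context(final_recs, movie, time_of_day='evening'))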
Building a Hybrid Recommender System

We’ll develop a hybrid recommender system using the MovieLens dataset, progressing from basic to advanced techniques. First, we perform Exploratory Data Analysis (EDA) to examine data structure, detect missing values, and identify trends. Visualizing ratings reveals user preferences and biases, ensuring a solid foundation for model selection.

import pandas as pd
import seaborn as sns

rating = pd.read_csv('/kaggle/input/movielens-20m-dataset/rating.csv')
movie = pd.read_csv('/kaggle/input/movielens-20m-dataset/movie.csv')

print(rating.info())
print(movie.info())

sns.histplot(rating['rating'], bins=10, kde=True)
sns.histplot(rating['movieId'].value_counts(), bins=50, kde=True)

Popularity-based approach

To implement this, we first aggregate user ratings to compute the average rating and number of ratings for each movie. Since movies with very few ratings may not be reliable indicators of popularity, we apply a popularity threshold—for example, only considering movies with at least 50 ratings. Finally, we rank movies based on the number of ratings and average score, ensuring that frequently rated and highly appreciated titles appear at the top.

# Aggregate ratings
movie_ratings = rating.groupby('movieId').agg(
    avg_rating=('rating', 'mean'),
    num_ratings=('rating', 'count')
).reset_index()

# Merge with movie titles
movie_ratings = movie_ratings.merge(movie[['movieId', 'title']], on='movieId', how='left')

# Set a popularity threshold
threshold = 50  # Change as needed
popular_movies = movie_ratings[movie_ratings['num_ratings'] >= threshold]

# Rank movies by popularity
popular_movies = popular_movies.sort_values(by=['num_ratings', 'avg_rating'], ascending=[False, False])

# Display top N popular movies
def get_top_popular_movies(n=10):
    return popular_movies[['title', 'avg_rating', 'num_ratings']].head(n)

# Example usage
print(get_top_popular_movies(10))

Content-based filtering approach

We preprocess genre data into a TF-IDF (Term Frequency-Inverse Document Frequency) matrix, a technique that assigns importance to words (or genres) based on their frequency in a dataset. TF (Term Frequency) measures how often a term appears in a movie’s genre list, while IDF (Inverse Document Frequency) reduces the weight of genres that are common across all movies, ensuring that rare genres are emphasized. Next, we compute cosine similarity to find movies with similar genre profiles. For example, if a user likes Toy Story (1995), content-based filtering will recommend movies with similar genres. While this method personalizes suggestions, it can create a recommendation bubble, where users receive overly similar recommendations.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fill NaN values in genres
movie['genres'] = movie['genres'].fillna('')

# Convert genres to a TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(movie['genres'])

# Compute cosine similarity between movies
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Create a mapping of movie titles to indices
indices = pd.Series(movie.index, index=movie['title']).drop_duplicates()

# Function to get similar movies
def get_content_based_recommendations(title, n=10):
    if title not in indices:
        return "Movie not found."
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:n+1]
    movie_indices = [i[0] for i in sim_scores]
    return movie[['title']].iloc[movie_indices]

# Example usage
print(get_content_based_recommendations("Toy Story (1995)", 10))

Collaborative filtering approach

Collaborative filtering personalizes recommendations by analyzing user-item interactions. Singular Value Decomposition (SVD) helps uncover hidden preference patterns by factorizing the user-item matrix. Using the MovieLens dataset, we train an SVD model with the Surprise library, splitting data into training and test sets. The trained model predicts ratings for unseen movies, generating personalized recommendations. Finally, we merge seen and recommended movies, providing a complete, tailored experience.

Besides SVD, collaborative filtering can use k-Nearest Neighbors (k-NN) for user-item similarity, ALS (Alternating Least Squares) and NMF (Non-negative Matrix Factorization) for matrix factorization, and deep learning-based methods like Neural Collaborative Filtering (NCF) and Autoencoders for capturing complex, non-linear user-item interactions.
Graph-based approaches such as Random Walk with Restart (RWR) are particularly effective in social network-based recommendations, while Bayesian Personalized Ranking (BPR) is useful for optimizing implicit feedback models by learning from pairwise ranking comparisons.

from surprise import Dataset, Reader, SVD
from surprise.accuracy import rmse
from surprise.model_selection import train_test_split

# Loading and splitting data
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)
trainset, testset = train_test_split(data, test_size=0.2)

# Train SVD model
svd = SVD()
svd.fit(trainset)

# Evaluate model
predictions = svd.test(testset)
print("RMSE:", rmse(predictions))

# Function to get movies already seen by a user
def get_seen_movies(user_id):
    seen_movies = rating[rating['userId'] == user_id].merge(movie[['movieId', 'title']], on='movieId', how='left')
    seen_movies['type'] = 'seen'
    return seen_movies[['title', 'rating', 'type']].sort_values(by='rating', ascending=False)

# Function to get movie recommendations for a user
def get_collaborative_recommendations(user_id, n=10):
    movie_ids = rating['movieId'].unique()
    unseen_movies = [movie_id for movie_id in movie_ids if movie_id not in rating[rating['userId'] == user_id]['movieId'].values]
    predictions = [(movie_id, svd.predict(user_id, movie_id).est) for movie_id in unseen_movies]
    recommendations = sorted(predictions, key=lambda x: x[1], reverse=True)[:n]
    recommended_movies = pd.DataFrame(recommendations, columns=['movieId', 'predicted_rating'])
    recommended_movies = recommended_movies.merge(movie[['movieId', 'title']], on='movieId', how='left')
    recommended_movies['type'] = 'recommended'
    return recommended_movies[['title', 'predicted_rating', 'type']]

# Merge the seen and recommended datasets
seen_movies = get_seen_movies(1)
recommended_movies = get_collaborative_recommendations(1, n=10)
merged_movies = pd.concat([recommended_movies, seen_movies], ignore_index=True)
display(merged_movies.head(20))

Hybrid approach

To finalize our hybrid approach, we strategically combine popularity-based, content-based, and collaborative filtering methods, assigning weighted scores to each. By blending trending selections, personalized suggestions, and collaborative insights, we create a well-rounded recommendation system. This fusion mitigates over-specialization, enhances diversity, and ensures users receive both familiar favorites and fresh discoveries—striking the perfect balance between relevance and exploration.
# Hybrid recommender with weighted scoring
def get_hybrid_recommendations_with_weights(user_id, user_favorite_movie, popularity_weight=0.33, content_weight=0.33, collaborative_weight=0.33, n=10):
    # Get top popular movies
    popular_movies = get_top_popular_movies(n)
    popular_movies = popular_movies[['title', 'avg_rating']].copy()
    popular_movies['score'] = popular_movies['avg_rating'] * popularity_weight  # Weighted score

    # Get content-based recommendations (assuming a fixed title, or adapt dynamically)
    content_recs = get_content_based_recommendations(user_favorite_movie, n)
    content_recs = content_recs[['title']].copy()
    content_recs['score'] = content_weight  # Assign content-based weight

    # Get collaborative-based recommendations
    collaborative_recs = get_collaborative_recommendations(user_id, n)
    collaborative_recs = collaborative_recs[['title', 'predicted_rating']].copy()
    collaborative_recs['score'] = collaborative_recs['predicted_rating'] * collaborative_weight  # Weighted score

    # Concatenate all recommendations; titles appearing in more than one source are combined below
    all_recs = pd.concat([popular_movies, content_recs, collaborative_recs], ignore_index=True)

    # Group by movie title and sum scores from different recommendation sources
    final_recs = all_recs.groupby('title', as_index=False).agg({'score': 'sum'})

    # Sort by final weighted score
    final_recs = final_recs.sort_values(by='score', ascending=False)

    # Return the top n recommendations
    return final_recs.head(n)

# Example usage
user_id = 1
user_favorite_movie = "Toy Story (1995)"
print(get_hybrid_recommendations_with_weights(user_id, user_favorite_movie, popularity_weight=0.33, content_weight=0.33, collaborative_weight=0.33, n=10))

Closing Thoughts

Recommender systems are shaping digital experiences across industries—be it personalized movie suggestions, e-commerce recommendations, or even online learning. While basic models like popularity-based recommendations serve general audiences, advanced hybrid approaches provide tailored experiences that adapt to user behavior in real time. As recommendation technologies evolve, emerging trends like deep learning-based recommenders (e.g., Transformers, Autoencoders) are pushing personalization even further. How do you see these systems shaping the future of your industry?

Framework code can be found here: Kaggle

Originally published on LinkedIn
Application Use Cases · 2025-03-14
Implementing ML Based MTA Solution
Introduction

Are you wasting marketing dollars on the wrong channels? In today’s fast-evolving digital marketing landscape, understanding the customer journey is more critical than ever. When I implemented a Multi-Touch Attribution (MTA) solution for a leading bank, I saw firsthand how precise attribution can transform marketing performance. By reallocating ad spend from low-converting display ads to high-performing email retargeting, we reduced cost per acquisition (CPA) by 20%. Additionally, we discovered that webinar attendees convert at twice the rate of generic ad viewers, allowing for smarter targeting and higher ROI. In this article, I’ll share key insights from my experience and outline how to successfully implement an MTA solution.

What is Multi-Touch Attribution?

A customer’s journey refers to the series of interactions and touchpoints a person experiences with a brand before making a purchase. These touchpoints can include website visits, emails, social media engagements, and more. MTA is a method used to determine the contribution of each touchpoint in this journey toward conversion.

Different Paths to MTA Success

There are several approaches to MTA, each with its own method for distributing credit across touchpoints. The linear attribution model gives equal credit to all touchpoints, while the time decay model assigns more credit to interactions closer to the conversion, reflecting their higher influence. Another model, U-shaped attribution, emphasizes the first and last touchpoints, with the rest of the credit spread across the middle. However, the approach we will focus on is data-driven attribution (DDA). DDA uses machine learning to analyze historical data and determine the true contribution of each touchpoint based on customer behavior. Unlike predefined models, DDA adapts to individual customer journeys and provides more precise insights.

Where Does the Data Come From?

MTA relies on granular, user-level data collected from various sources to map the customer journey. The data typically includes a User ID (unique identifier for each customer), a Timestamp (time of interaction), the Channel (e.g., email, social media, search ads), the Campaign (specific campaign tied to the interaction), and information on Conversion (whether a conversion occurred and its value); a small example table follows the list of sources below.

Figure: Detailed view of user and timestamp data

Key data sources include:
- Web: Tools like Adobe Analytics and Google Analytics track user interactions and conversions via tracking pixels and URL tagging, helping monitor campaign performance.
- Webinar: Platforms like Zoom and On24 provide attendee engagement and registration data, while post-webinar surveys capture feedback and journey insights.
- Email: Platforms like Mailchimp track open rates, click-through rates, and conversions, using unique links to attribute results to specific campaigns.
- Social Media: Facebook Insights and LinkedIn Analytics measure engagement and conversions, while tools like Hootsuite track brand mentions and ad performance.
- Search: SEM (e.g., Google Ads) drives traffic, while SEO improves organic visibility. Search data also reveals customer intent for strategy refinement.
- Call: Tools like CallRail track call sources and link them to campaigns, with post-call surveys offering additional journey insights.
- CRM & Marketing Automation: Systems like Salesforce centralize customer data, while platforms like Marketo analyze cross-channel journeys.
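Pulling these fields together, a unified touchpoint table might look like the small, invented example below. The column names simply follow the fields described above; the values are made up for illustration.

import pandas as pd

# Illustrative touchpoint-level data (values are invented for this example)
touchpoints = pd.DataFrame({
    'user_id':    [101, 101, 101, 202, 202],
    'timestamp':  pd.to_datetime(['2024-05-01 09:15', '2024-05-03 18:40', '2024-05-05 20:05',
                                  '2024-05-02 11:00', '2024-05-06 14:30']),
    'channel':    ['search', 'email', 'web', 'social', 'webinar'],
    'campaign':   ['brand_search', 'retarget_may', 'organic', 'spring_promo', 'product_demo'],
    'conversion': [0, 0, 1, 0, 0],   # 1 on the touchpoint where the user converted
})

# One row per interaction; a user's journey is the time-ordered set of their rows
print(touchpoints.sort_values(['user_id', 'timestamp']))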
Preparing the Data

MTA data originates from multiple sources, so the first step is to ingest and store it in a data lake using tools like AWS Glue or Databricks. We then perform mapping and transformation to consolidate disparate data into a unified view within the data warehouse, leveraging the same tools.

Figure: Mapping data sources, infrastructure, and tools for seamless processing

With this unified dataset, we begin pre-processing to ensure data quality. Handling missing values is crucial; this may involve filling gaps with statistical measures (mean, median, or mode) or removing incomplete records. We also encode categorical variables, such as channels and campaigns, into numerical representations (e.g., one-hot or label encoding) to make them suitable for machine learning.

Next, feature engineering enhances model performance by extracting meaningful patterns. Time-based features (e.g., day of the week, hour of the day) reveal optimal interaction periods, while engagement metrics (e.g., time since last interaction, total interaction time) help quantify user behavior. Additional features, such as channel and campaign diversity or conversion-based metrics (e.g., previous conversions, conversion rate), further refine the dataset. Finally, the processed data is split into training and test sets, ready for modelling.

The Machine Learning Approach

With the data processed, the next step is to apply machine learning models to predict conversions and assess the impact of different marketing touchpoints.

Figure: Aggregated user insights from timestamp data

Logistic Regression is an ideal starting point since our target variable, Conversion (1 or 0), represents a binary classification problem. In this model, each feature contributes a weighted effect toward the probability of conversion. After training, the model’s coefficients provide valuable insights into feature importance. For example, if the coefficient for Interaction_Count is 0.83, it means that for each additional interaction, the odds of conversion increase by a factor of e^0.83 ≈ 2.3, highlighting the strong influence of user engagement (a short sketch of this calculation follows the model code below).

# Model development
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Logistic Regression
model = LogisticRegression()    # Initialize the logistic regression model
model.fit(X_train, y_train)     # Train the model
y_pred = model.predict(X_test)  # Make predictions

# XGBoost
model = XGBClassifier()         # Initialize the XGBoost model
model.fit(X_train, y_train)     # Train the model
y_pred = model.predict(X_test)  # Make predictions

While Logistic Regression offers interpretability, XGBoost (Extreme Gradient Boosting) is a more advanced approach capable of capturing non-linear relationships and feature interactions. Unlike linear models, XGBoost builds decision trees sequentially, with each tree learning from the errors of the previous one. This makes it particularly effective for identifying complex patterns in marketing data, such as how different channels, campaigns, and interaction frequencies contribute to conversions.
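The coefficient interpretation above can be computed directly from the fitted baseline. This is a minimal sketch, assuming the X_train and y_train arrays produced by the preprocessing step and a feature_names list holding the corresponding column names (log_reg and feature_names are placeholder names, not taken from the original notebook):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Keep the baseline model in its own variable (the snippet above reuses 'model' for both models)
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# exp(coefficient) is the multiplicative change in conversion odds for a one-unit
# increase in a feature, e.g. exp(0.83) ≈ 2.3 as discussed above.
odds_ratios = pd.Series(np.exp(log_reg.coef_[0]), index=feature_names)
print(odds_ratios.sort_values(ascending=False))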
# Model validation
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Display confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Visualizing confusion matrix with a heatmap
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=[0, 1], yticklabels=[0, 1])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

To ensure the reliability of these models, validation on historical data is essential. The dataset is typically split into training and test sets to evaluate performance on unseen data. Metrics such as AUC-ROC, precision-recall, and accuracy help measure how well the models generalize. Additionally, a time-based validation approach, where past data is used for training and more recent data for testing, ensures the models can adapt to evolving marketing trends.

By leveraging both models (Logistic Regression for explainability and XGBoost for predictive accuracy) and validating them on historical data, businesses can build a robust, data-driven attribution strategy. To complement this process, the linked Kaggle notebook provides a hands-on example of MTA implementation, covering the complete workflow from data preparation to model evaluation. It also includes practical Python code snippets for encoding categorical data, building machine learning models, and assessing their performance.

Closing Thoughts

To wrap up, here are some questions to consider:
- How does your organization currently measure attribution?
- What are the challenges you face in implementing MTA?
- How can machine learning improve your current attribution approach?

Multi-touch attribution is an evolving field, and machine learning provides powerful tools to analyze and optimize marketing performance. Start exploring MTA today to gain deeper insights into your customer journeys!

Framework code can be found here: Kaggle

Originally published on LinkedIn
Application Use Cases · 2025-02-14
Implementing ML Based Churn Solution
Introduction

In today’s fast-paced business landscape, customer expectations are higher than ever, and competition is relentless. Losing customers isn’t just about lost revenue; it impacts brand equity and opportunities for cross-selling other services. But what if businesses could predict churn before it happens? Advanced analytics is transforming customer retention strategies by enabling companies to identify at-risk customers before they leave. By leveraging churn prediction, marketing teams can optimize retention efforts, enhance Customer Lifetime Value (CLV), and personalize campaigns. This leads to more efficient resource allocation, stronger customer relationships, and long-term profitability through data-driven decision-making.

Case Study: Real-World Application in Telecommunications

In the past, I worked with a client facing significant customer attrition despite offering competitive products. Leveraging their customers’ demographics and behaviour, touchpoint interactions, and product usage data, I built a machine learning model to predict which customers were most likely to leave. This enabled the client’s marketing team to implement targeted retention strategies, such as personalized offers, proactive customer engagement, and customer segmentation, to improve customer engagement and reduce churn.

While the original analysis was conducted for a financial institution, the techniques used are applicable across industries. To demonstrate the approach in a more accessible way, I will simulate a similar analysis using a publicly available telecom dataset. This will allow us to walk through the key steps (data preparation, model selection, and evaluation) while showcasing how machine learning can be leveraged for churn prediction.

Defining the Problem and Preparing the Data

Every advanced analytics project starts with a clear definition of the problem. In this case, we aim to predict which customers are at risk of churning. This involves analyzing customer data, including demographics, service usage, and payment history, to identify patterns indicative of churn.

Figure: Model development lifecycle

Data in organizations is often scattered across multiple systems, so consolidating it into a unified format is the first step. We then clean the data, ensuring it has no missing values. Missing numerical values can be imputed using statistical techniques, while categorical values require encoding methods such as one-hot encoding or label encoding. Next, we scale numerical features using standardization or normalization to ensure uniformity.

import pandas as pd

# Standardization (z-score method)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['TotalCharges']] = scaler.fit_transform(df[['TotalCharges']])

# Label encoding (variable has two levels)
from sklearn.preprocessing import LabelEncoder
for col in label_enc_cols:
    df[col] = LabelEncoder().fit_transform(df[col])

# One-hot encoding (variable has more than two levels)
df = pd.get_dummies(df, columns=ohe_cols, drop_first=True).astype(int)

After preparing the data, we assess class balance. Since churned customers are often underrepresented, we apply oversampling techniques to create a more balanced dataset. Finally, we select key predictive features by analyzing correlations and removing redundant variables to avoid multicollinearity. If the dataset has too many features, we may use dimensionality reduction techniques like Principal Component Analysis (PCA).
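The rebalancing step above is described without naming a specific technique; the following is a minimal sketch using SMOTE from the imbalanced-learn library, which is an assumption for illustration rather than the method used in the original analysis.

from imblearn.over_sampling import SMOTE

X_bal = df.drop(columns=['Churn'])
y_bal = df['Churn']

print(y_bal.value_counts(normalize=True))            # inspect the class imbalance

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_bal, y_bal)

print(y_resampled.value_counts(normalize=True))      # classes are now balanced

# Note: in practice, oversampling is often applied to the training split only,
# so that synthetic samples never leak into the test set.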
# Split data into predictor and target variables
X = df.drop(columns=['Churn']).values  # Predictor variables
y = df['Churn'].values                 # Target variable

# Split data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

For training, we split the data into an 80:20 ratio: 80% for training and 20% for testing.

Building and Evaluating Machine Learning Models

Churn prediction is a classification problem, and we experiment with several models, each offering different advantages. Below is a brief comparison:
- Logistic Regression: a great baseline model due to its simplicity and interpretability, making it easy to understand the relationship between features and the target variable.
- Random Forest: a robust ensemble method that handles non-linearity well and provides insights into feature importance, helping to identify which variables are most influential in predicting churn.
- XGBoost: an advanced boosting algorithm known for its high performance and efficiency, often outperforming traditional models by optimizing for both speed and accuracy.
- Artificial Neural Networks (ANNs): capable of learning complex patterns in data, ANNs can capture intricate relationships but typically require larger datasets and more computational resources to train effectively.

Using Scikit-learn, we implement logistic regression as our baseline model. It’s easy to set up, requiring just a few lines of code (a short sketch follows at the end of this section). We fit the model to training data and use it to predict churn on test data, evaluating performance based on accuracy and precision.

Model Validation and Performance Comparison

To compare model performance, we use key evaluation metrics such as:
- Confusion Matrix: a table that summarizes the performance of a classification model by showing the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These values are obtained by comparing the model’s predictions with the actual labels from the test dataset.
- F1-Score: a harmonic mean of precision and recall, giving a balanced performance measure.
- Accuracy Score: indicates the overall correctness of the model.

These metrics help us determine the best-performing model. In this case, while the ANN achieves the highest accuracy, logistic regression remains valuable due to its interpretability, making it useful for business decision-making.
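To make the baseline and the metrics above concrete, here is a minimal sketch using scikit-learn. It assumes the X_train, X_test, y_train, y_test arrays from the split above; the variable name baseline is a placeholder.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, classification_report

# Fit the baseline model on the training split
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

# Predict churn on the held-out test split
y_pred = baseline.predict(X_test)

# Evaluation metrics discussed above
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))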
Conclusion: Turning Data into Business Strategy

Predicting customer churn isn’t just about building sophisticated models; it’s about understanding customer behaviour and translating insights into actionable strategies. By systematically preparing data, selecting the right models, and validating their performance, businesses gain invaluable insights into churn drivers. This analysis highlights the need to balance accuracy and interpretability. While deep learning models provide high accuracy, traditional models like logistic regression remain relevant for businesses that prioritize explainability.

For example, actionable strategies include:
- Encouraging longer-term contracts with discounts for annual or two-year subscriptions.
- Implementing early engagement programs such as loyalty bonuses for the first six months.

As companies navigate competitive landscapes, adopting data-driven approaches is essential, not just for reducing churn, but for fostering long-term customer loyalty and sustainable growth. By integrating predictive analytics into business strategies, organizations can turn churn risks into opportunities, ensuring a more resilient and customer-centric future.

What strategies have you implemented in your organization to predict and reduce customer churn, and how have they impacted your customer relationships? Share your thoughts in the comments below!

Framework code can be found here: Kaggle

Originally published on LinkedIn
Application Use Cases · 2025-02-10
Customer Segmentation
What is Customer Segmentation?

Customer segmentation is a critical component of marketing that helps businesses understand their customers better and tailor their marketing strategies to their specific needs. One popular technique for customer segmentation is k-means clustering, which groups customers based on their similarities in various attributes. In this article, we’ll discuss how you can use k-means clustering to segment your customers and extract valuable insights from your data.

Step 1: Gather and Prepare Your Data

The first step in customer segmentation using k-means clustering is to gather your data. This includes all relevant customer information, such as demographic data, purchase history, and behavioral data. Once you have your data, you’ll need to clean and prepare it for clustering. This may involve normalizing your data, removing outliers, and transforming your data into a form that can be used in k-means clustering.

Step 2: Determine the Number of Clusters

The next step is to determine the optimal number of clusters for your data. This can be done using various methods, such as the elbow method or the silhouette method. The elbow method involves plotting the sum of squared distances between data points and their assigned cluster center for various cluster numbers. The optimal number of clusters is where the plot starts to level off, forming an “elbow.” The silhouette method, on the other hand, measures how well each data point fits into its assigned cluster and provides a score between -1 and 1. The optimal number of clusters is where the silhouette score is the highest. (A short sketch of both checks follows the steps below.)

Step 3: Run K-means Clustering

Once you have determined the optimal number of clusters, you can run k-means clustering on your data. K-means clustering works by assigning each data point to the nearest cluster center and then updating the cluster centers based on the new assignments. This process is repeated until the cluster centers no longer move significantly.

Step 4: Interpret the Results

After running k-means clustering, you will have a set of customer segments based on the attributes you used in the clustering. You can then analyze these segments to extract valuable insights about your customers. For example, you may find that one segment has a high average purchase value, while another segment has a high purchase frequency. This information can be used to tailor your marketing strategies to each segment.

Step 5: Refine and Iterate

Customer segmentation is an ongoing process, and you may need to refine and iterate your clusters over time. As your business evolves, your customer segments may change, and you may need to adjust your clustering approach to reflect these changes. It’s important to continue to gather data, refine your clustering approach, and use your customer segments to inform your marketing strategies.
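Here is a minimal sketch of the two checks described in Step 2. It assumes a feature matrix X like the one built in the implementation below; the range of candidate cluster counts is arbitrary.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

inertias, silhouettes = [], []
k_values = range(2, 11)  # candidate numbers of clusters (arbitrary range)

for k in k_values:
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias.append(km.inertia_)                          # sum of squared distances (elbow method)
    silhouettes.append(silhouette_score(X, km.labels_))   # average silhouette score

# Elbow plot: look for the point where the curve starts to level off
plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances')
plt.show()

# Silhouette plot: pick the k with the highest score
plt.plot(k_values, silhouettes, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()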
Basic implementation of customer segmentation using k-means clustering in Python

In this example, customer_data.csv is a file containing the customer data with three features: feature1, feature2, and feature3. We extract these features and perform k-means clustering with 5 clusters. We then add the cluster labels to the original dataframe and visualize the clusters using a scatter plot of feature1 and feature2, with each point colored according to its assigned cluster.

# Import necessary libraries
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load customer data
df = pd.read_csv('customer_data.csv')

# Extract relevant features for clustering
X = df[['feature1', 'feature2', 'feature3']]

# Perform k-means clustering with 5 clusters
kmeans = KMeans(n_clusters=5, random_state=0).fit(X)

# Add cluster labels to the original dataframe
df['cluster'] = kmeans.labels_

# Visualize the clusters
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=kmeans.labels_, cmap='rainbow')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

In conclusion, k-means clustering is a powerful tool for customer segmentation that can help you extract valuable insights from your data. By following the steps outlined above, you can use k-means clustering to group your customers into meaningful segments and tailor your marketing strategies to their specific needs.

Comments welcome!
Application Use Cases · 2023-03-04
Analyzing Website Traffic Using Google Analytics and AWS
What is Web Analytics?

Web analytics refers to the collection, measurement, analysis, and reporting of web data to understand and optimize web usage. It involves gathering data on user behavior on websites, such as pageviews, time spent on a page, clickthrough rates, and conversion rates, and analyzing this data to gain insights into user behavior and website performance. These insights can be used to make informed decisions about website design, content, and marketing strategies to improve user engagement, increase traffic, and drive conversions. Web analytics tools, such as Google Analytics, provide a range of metrics and reports to track and analyze website performance. In this blog entry I will explore how to do web analytics using a combination of Google Analytics and AWS (Amazon Web Services).

What are Google Analytics and AWS, and what is the high-level process for implementing a web analytics solution?

Before we dive into the article, let’s talk about what Google Analytics and AWS are.

Google Analytics

It is a web analytics service offered by Google that tracks and reports website traffic. It provides insights into visitor behavior, including the number of visitors, their demographics, the pages they visit, and the actions they take on the website. With Google Analytics, website owners can monitor and analyze the performance of their website, gain insights to optimize their marketing strategies, and improve their website’s user experience.

AWS (Amazon Web Services)

It is a cloud computing platform that offers a wide range of services, including compute power, storage, databases, analytics, machine learning, security, and more. AWS allows businesses to operate their IT infrastructure in the cloud, providing them with flexibility, scalability, and reliability. AWS provides a pay-as-you-go pricing model, which allows businesses to pay only for the services they use, without any upfront costs or long-term commitments. AWS is one of the most popular cloud computing platforms, with millions of active customers worldwide.

High-level outline of the process

To analyze website traffic using Google Analytics and AWS, you can follow these high-level steps:
1. Set up a Google Analytics account and obtain the tracking code for your website.
2. Set up an S3 bucket in AWS to store the data from Google Analytics.
3. Set up an AWS Lambda function to pull data from the Google Analytics API and store it in the S3 bucket.
4. Set up an AWS Glue crawler to crawl the S3 bucket and create a data catalog in the AWS Glue Data Catalog.
5. Set up an Amazon Athena query to analyze the data in the S3 bucket using SQL-like queries.
High-level example code for implementing a small-scale web analytics solution

Here’s some sample code to get started:

Importing required packages

import boto3
import datetime
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

Set up the Google Analytics API credentials

credentials = service_account.Credentials.from_service_account_file('/path/to/credentials.json')

Set up the S3 bucket

s3 = boto3.client('s3')
bucket_name = 'my-bucket-name'

Set up the Lambda function to pull data from the Google Analytics API and store it in the S3 bucket

def lambda_handler(event, context):
    try:
        service = build('analyticsreporting', 'v4', credentials=credentials)

        # Query the Google Analytics API for website traffic data
        response = service.reports().batchGet(
            body={
                'reportRequests': [
                    {
                        'viewId': '12345678',
                        'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
                        'metrics': [{'expression': 'ga:sessions'}],
                        'dimensions': [{'name': 'ga:date'}, {'name': 'ga:hour'}]
                    }
                ]
            }
        ).execute()

        # Store the website traffic data in the S3 bucket
        now = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
        filename = f'{now}-website-traffic.csv'
        data = response['reports'][0]['data']['rows']
        s3.put_object(Body=str(data), Bucket=bucket_name, Key=filename)
    except HttpError as error:
        print(f'An error occurred: {error}')
        data = None

Set up the Glue crawler to crawl the S3 bucket and create a data catalog

glue = boto3.client('glue')
response = glue.create_crawler(
    Name='my-crawler',
    Role='my-glue-role',
    DatabaseName='my-database',
    Targets={
        'S3Targets': [
            {
                'Path': f's3://{bucket_name}/'
            }
        ]
    }
)

Set up the Athena query to analyze the website traffic data in the S3 bucket

athena = boto3.client('athena')
response = athena.start_query_execution(
    QueryString='SELECT * FROM my_database.my_table WHERE sessions > 100',
    QueryExecutionContext={
        'Database': 'my_database'
    },
    ResultConfiguration={
        'OutputLocation': f's3://{bucket_name}/query_results/'
    }
)

Note that this is just sample code to get started, and you will need to customize it to match your specific use case. Also note that there will be additional configuration and setup required, such as setting up the IAM roles and permissions for the Lambda function, Glue crawler, and Athena query, and configuring the Google Analytics API to allow access to your website data.

Benefits of implementing a web analytics solution

Web analytics provide many benefits for website owners, marketers, and business analysts, including:
- Tracking website traffic: Web analytics tools allow you to track website traffic, including the number of visitors, unique visitors, page views, bounce rate, and session duration.
- Understanding user behavior: Web analytics provide insights into user behavior, including where users are coming from, which pages they are visiting, how long they are staying on the site, and where they are dropping off.
- Improving website performance: With web analytics, you can identify which pages are performing well and which ones need improvement. This helps you make data-driven decisions to optimize your website for better user experience and engagement.
- Measuring marketing campaigns: Web analytics tools allow you to track the performance of your marketing campaigns, including the effectiveness of your ads, social media posts, and email campaigns.
- Identifying business opportunities: By analyzing website data, you can identify new business opportunities, such as new markets, product or service offerings, and potential partnerships.

Overall, web analytics provide valuable insights into website performance and user behavior, enabling website owners and marketers to make data-driven decisions that can improve business outcomes.

Challenges in implementing a web analytics solution

Implementing a web analytics solution can present several challenges. Some of the most common challenges include:
- Data Accuracy: Ensuring the accuracy of the data collected can be a major challenge. Issues can arise due to multiple domains, ad blockers, and third-party scripts. It is important to verify that the data collected is accurate, and to identify and address any issues that arise.
- Data Volume: The volume of data can be a significant challenge when implementing a web analytics solution. The data collected can be quite extensive, and processing and storing this data can be costly.
- Data Privacy: Maintaining data privacy and complying with regulations such as GDPR can be a major challenge. It is important to be transparent with users about the data being collected and how it is being used, and to take steps to ensure the data is kept secure and used only for the intended purposes.
- Technical Challenges: Implementing a web analytics solution can present technical challenges, particularly for organizations with limited technical resources. It is important to ensure the implementation is properly configured and optimized, and that the organization has the necessary resources to manage and maintain the system.
- Analysis and Action: Collecting data is only the first step in the web analytics process. The real value comes from analyzing the data and taking action to improve the user experience and achieve business goals. This can be a significant challenge for organizations that lack the necessary resources or expertise.

Comments welcome!
Application Use Cases · 2023-02-04
Burnout in Analytics Teams
In today’s fast-paced business environment, analytics teams are playing an increasingly critical role in driving decision-making and business strategy. However, the high pressure and demands placed on analytics teams can lead to burnout, which can negatively impact both individual team members and the overall success of the team.

Burnout is a state of emotional, mental, and physical exhaustion caused by prolonged and excessive stress. In analytics teams, burnout can arise due to a variety of factors such as unrealistic deadlines, long working hours, repetitive tasks, and high expectations from stakeholders. This can lead to team members feeling overwhelmed, unmotivated, and disengaged from their work.

To prevent burnout in analytics teams, it is important to identify the root causes of stress and implement strategies to mitigate these factors. One approach is to foster a culture of open communication and support, where team members feel comfortable discussing their workload and potential sources of stress. Managers can also provide training and resources to help team members manage their workload more effectively, such as time management techniques and prioritization strategies.

Another effective approach is to provide opportunities for team members to take breaks and recharge. This can include offering flexible working hours, encouraging regular breaks, and promoting a healthy work-life balance. Additionally, managers can promote team-building activities and recognize the contributions of team members, which can help to boost morale and foster a positive team dynamic.

Finally, it is important to monitor the well-being of team members and identify early warning signs of burnout. This can include changes in behavior, decreased productivity, and increased absenteeism. By identifying and addressing these issues early on, managers can help to prevent burnout and promote a healthy and productive work environment.

In conclusion, burnout in analytics teams is a real and pressing issue that can negatively impact both individual team members and the overall success of the team. By identifying the root causes of stress and implementing strategies to mitigate these factors, managers can help to prevent burnout and promote a positive and productive work environment.

Comments welcome!
Application Use Cases · 2023-01-07
An Agile Approach to Analytics
Scrum is an agile framework for software development, but it can also be applied to other types of projects, including analytics. Scrum emphasizes collaboration, continuous improvement, and flexibility. It is designed to help teams work together to deliver high-quality results quickly and efficiently. In this article, we’ll discuss how to use Scrum in analytics teams.

Waterfall Methodology

Traditionally, analytics teams have followed a waterfall methodology for project management. This involves dividing the project into sequential steps involving requirement gathering, development, testing, delivery, and maintenance. The benefit of this approach is that budgeting is easy; on the flip side, there is little to no contact with the customer after the requirement gathering stage up until delivery.

Agile Philosophy

Agile is an alternative philosophy that strives to make the production process more efficient and manageable. Agile staggers the project into consecutive iterative sprints. There are many different project management methodologies used to implement the Agile philosophy. Some of the most common include Kanban, Extreme Programming (XP), and Scrum. In this article, we will be focusing on the Scrum methodology.

Scrum Methodology

The Scrum methodology is characterized by short phases or “sprints” when project work occurs. Sprints typically take two weeks, have clear deliverables, and are focused on improving the results based on feedback from business users. Each sprint ends with a working, tested, ready-to-ship product. The benefit of this approach is that the customer gets to see the minimum viable product (MVP) quite early. A sprint can be cancelled (only by the Product Owner) if the sprint goal becomes obsolete due to a change in laws or regulations, a change in direction of the company, or the technology becoming outdated.

Scrum Artifacts

Product Backlog Items (PBI): these are user stories or epics. User stories are a way to represent a product feature or functionality in an agile project. They could be about new ideas, features, technical requirements, or bugs. They are usually small enough that the dev team can develop half a dozen of them in one sprint. They are always written from a user perspective, for example, “As a university student I want to access my marksheet online”. Large user stories are called epics; they are too big to be handled in one sprint and need to be broken down. User stories are not a replacement for requirements documentation.

Product Backlog (PB): collectively, user stories and epics form the PB. This is a prioritized list of PBIs; items at the top will be implemented soon. User stories are supposed to be DEEP:
- Detailed appropriately: PBIs to be implemented soon have detailed specifications.
- Estimated: how much time it will take to complete a PBI. More detailed PBIs usually have more detailed estimates. Some estimation techniques are planning poker, the team estimation game, and ideal hours.
- Emergent: as the project progresses, user stories are added, removed, or rearranged in the PB.
- Prioritized: PBIs are arranged so that those to be implemented soon are at the top.

Scrum Roles

Product Owner (PO): only focuses on one product. They negotiate and communicate with the stakeholders and evaluate the product from a user perspective. Additionally, they manage the PB, ensuring that the features currently being developed are the best choice given current circumstances.
They need to be available to answer questions during the sprint and act as the encyclopedia on the product. Lastly, they review the result at the end of the sprint (during the sprint review meeting) and assess whether the product is ready or needs more work before being delivered to the user. POs usually possess extensive domain knowledge and excellent interpersonal skills, and they are responsible and decisive.

Scrum Master (SM): can work with multiple dev teams; they support but do not manage the team. Their primary role is to ensure that everyone in the team understands the Scrum framework and how it is to be applied. They don’t plan the dev team’s work or verify the progress they are making, but promote cross-functionality and self-organization. Additionally, they eliminate obstacles for the dev team during sprints. Further, they strive to maintain effective communication between the dev team and the PO, and also work closely with the PO to define and organize the PB. SMs are usually courageous, responsible, cool-headed, but most of all adept at influencing. They observe and draw meaningful conclusions, and look for new techniques to improve dev team effectiveness.

Development Team: is responsible for delivering the sprint goal. In order to do that, they decide how many PBIs can be delivered in a sprint and then decompose them into SMART tasks (specific, measurable, achievable, relevant to the sprint goal, time-boxed and trackable). They communicate progress to the SM during the daily Scrum meeting. They are good at self-organization and cross-functionality, and know something about everything and everything about one thing.

Scrum Events

Sprint planning: the project team identifies a small part of the scope to be completed during the upcoming sprint.

Development team work: this is where the development team actually works on the product backlog items. A burndown chart can be used to monitor sprint progress. The chart shows the number of backlog items identified for the sprint on the Y axis and the number of days passed since the start of the sprint on the X axis.

Daily scrum: a 15-minute meeting held at the same place and time every day. The idea is to prepare a plan for the next 24 hours, specifically identifying what was done yesterday, what will be done today, and any impediments that could hinder the sprint goal. The Scrum Master is not required to attend this meeting, but must ensure that it happens.

Sprint review: conducted by the Product Owner, this meeting marks the end of a sprint. The Scrum team gathers to check the result and get feedback from invited stakeholders. The Scrum team and stakeholders collaborate on how to increase the value of the product in the following iteration. It should not last more than 4 hours for a month-long sprint, and proportionately shorter for shorter sprints.

Sprint retrospective: led by the Scrum Master, this meeting is a review of the sprint. The goal is to discuss problems related to the process, people, relationships, and tools, as well as the things that went well and helped the Scrum team. It should not last more than 3 hours for a month-long sprint, and proportionately shorter for shorter sprints. Some techniques that can help gather insights during this meeting are the 5 whys, cause-and-effect diagrams, a perfect sprint, the worst sprint ever, one wish, speed dating, undercover boss, a written brainstorm, pessimize, drawing a poster, and political party manifesto.
The sprint retrospective is arguably one of the most important meetings led by the SM and deserves an article of its own, which I might write later.

Backlog refinement: done after a sprint, not as part of the sprint. Facilitated by the Scrum Master, the basic aim of this meeting is to arrive at a list of well-defined, ready-to-implement PBIs for the following iteration. It should not last more than 5-10% of the total sprint duration.

To conclude, in today’s day and age it is hard to come by a pure waterfall approach to project management. Due to telecommuting there is always some sort of unorganized agile approach being followed. So what I usually do is figure out early on who is playing which Scrum roles within my project team and try to streamline and organize the development process using the agile framework. This usually leaves me with a mix of waterfall and agile approaches that works best and requires the least amount of training within the team to implement. Hope this article helps you apply the agile philosophy to your projects!

Comments welcome!
Application Use Cases · 2020-04-04
Optimizing Retention through Machine Learning
Acquiring a new customer in the financial services sector can be as much as five to 25 times more expensive than retaining an existing one. Therefore, preventing customer churn is of paramount importance for the business. Advances in the area of machine learning, the availability of large amounts of customer data, and more sophisticated methods for predicting churn can help devise a data-backed strategy to prevent customers from churning.

Imagine that you are a large bank facing a challenge in this area. You are witnessing an increasing amount of customer churn, which has started hitting your profit margin. You establish a team of analysts to review your current customer development and retention program. The analysts quickly uncover that the current program is a patchwork of mostly reactive strategies applied in various silos within the bank. However, the upside is that the bank has already collected rich data on customer interactions that could help build a deeper understanding of the reasons for churn. Based on this initial assessment, the team recommends a data-driven retention solution which uses machine learning to identify the reasons for churn and possible measures to prevent it. The solution consists of an array of sub-solutions focused on specific areas of retention.

The first level of sub-solutions consists of insights that can be directly derived from the existing customer data, answering for example the following business questions:
- Churn History Analysis: What are the characteristics of churning customers? Are there any events that indicate an increased probability of churn, like long periods without contact with the customer, several months of default on a credit product, etc.?
- Customer Segmentation: Are there groups of customers that have similar behavior and characteristics? Do any of these groups show higher churn rates?
- Customer Profitability: How much profit is the business generating from different customers? What are the characteristics of profitable customers?

First results can be drawn from these analyses. Additional insights are generated by combining them with data points such as the historical monthly profit that a business loses due to churn. Further, the data can be used to train supervised machine learning models which allow predicting future months or help classify customers for which rich data is not available yet. This is the idea behind the second level of sub-solutions:
- Customer Lifetime Value: What is the expected profitability of a given customer in the future?
- Churn Prediction: Which customers are at risk of churn? For which customers can a quick intervention improve retention?

The early detection of customers at risk of churn is crucial for improving retention. However, it is beneficial to know not only the churn likelihood but also the expected profit loss associated with each customer in case of churn. Constant and fast advances in the area of machine learning help to improve these results. Being able to process large amounts of data allows for more customized results that are focused on the individuality of each customer. This is an important point, as every customer has different preferences when it comes to contact with the bank, different reactions when it comes to offers, and different needs and goals. Combining the previously mentioned analyses with a large amount of customer data provides the third level of sub-solutions, which allows for individualized prescriptive solutions for at-risk customers.
The idea behind this prescriptive retention solution is the simulation of alternative paths combined with optimization techniques along different parameters, like how many days have passed since the last contact of the client with the bank. The first set of descriptive or diagnostic solutions can be implemented relatively quickly, as siloed analytics teams within the bank are already exploring them on their own. The second set of solutions, which is more predictive in nature, could take up to a year to implement. Built atop these, the prescriptive solution utilizes the outcome of the previous analyses to suggest improved and individualized retention strategies. As a result, the bank can now take different preventive retention measures for each customer.

Comments welcome!
Application Use Cases · 2020-03-07
Customer Lifecycle Analytics
How important is it to align your analytics efforts with the customer lifecycle?

Imagine you are a credit card department within the consumer banking branch of a large bank. You are sending periodic mailers offering credit cards to your customers. Before sending these mail offers you do a minimum screening, in that you only offer these to customers that have been with the bank for at least 2 years and have maintained a balance above a certain threshold. However, you notice that the acceptance of your mail offers remains low even after a few campaigns. Why do you think that is?

The answer lies in a simple concept, but one that is often overlooked by analytics teams. Are you trying to identify which life stage the customer is in? Are you trying to synchronize your sales effort with the customer lifecycle?

What is the customer lifecycle, you ask? The customer lifecycle can be understood as a framework to track the relationship between a customer and a bank. It starts off with the Acquisition stage, where your primary focus is to figure out ways to identify and bring on board customers with which a mutually beneficial relationship can be created. After this comes the Development stage, where the customer is encouraged to expand his portfolio with your products through cross-sell efforts, etc. Finally comes the Retention stage, where the customer has been with you for more than a decade, so you try to enhance the relationship and monitor customer satisfaction so that the customer can act as a good ambassador for you. These are the three basic stages: Acquire > Develop > Retain.

You could break down these stages further to target any pain points you might be facing in a particular stage. For example, your acquisition through campaigns this year has not been as fruitful as in previous years. So you break down Acquisition into Awareness > Consideration > Purchase to pinpoint the root cause. Data suggests that the advertising budget is the same as in previous years. Marketing campaigns to tip consumers in the consideration stage into the purchase stage are also being sent in a timely manner. However, you are still losing prospective customers in the purchase stage. You sanction a study to identify any changes that might have happened in the way you on-board a customer. Voilà! You identify that the on-boarding form has been appended with two new sections seeking a little more information about the customer before on-boarding. You weigh the necessity of collecting this information while on-boarding and decide to drop these additional sections. A few months later, acquisition metrics start to return to the previous years’ ballpark.

Perhaps the most important aspect in the world of data-driven decision making is to align the reporting and analytical efforts with the customer lifecycle. For example, during the Acquisition stage your primary aim is to provide the right product just when the prospective customer needs it. This could be achieved through an analysis such as the Best Next Offer, where you use machine learning techniques to match your products with profiles of prospects created using demographic, psychographic, and other factors. Similarly, during the Development stage you focus on meticulously reporting and driving cross-sell efforts to increase your product presence in the customer portfolio. Lastly, during the Retention stage your focus should be on minimizing churn through customer satisfaction, and this can be achieved through churn analysis on the quality data you have collected in this regard.
To close, I will re-emphasize the importance of collecting good data, building analytics on top of it, and aligning both closely with the customer lifecycle for optimal data-driven decision making.

Comments welcome!
Application Use Cases · 2020-02-01