
Count Data Model

(Poisson Distribution)

Project Overview

This project centers around a detailed analysis of how various factors within a customer database influence the frequency of customer service calls and chatbot interactions. The primary aim is to gain insights into customer behavior and preferences, particularly in relation to the adoption of solar panels by long-time customers of an energy retailer.

Data Descriptions

The data consists of 2,000 customer records of the energy supplier firm, with 64 variables capturing customer information and behaviors, containing:

user_id - Unique user id.
name - Name of customer
customer_since - Date since when they became a customer
customer_since_year - Year they became a customer
av_gas - Average annual household gas consumption
av_elec - Average annual household electricity consumption.
av_bill - Average annual energy bill
hh_size - Number of people in household
Urban - 1 if household is located in an urban area
solar_panels - 1 if household adopted solar panels (since becoming a customer)
solar_panels_since - Year in which household adopted solar panels
nr_solar_panels - Number of solar panels adopted
first_service - First call with customer service
sec_service - Second call with customer service
last_service - Last interaction with customer service
nrservice - Total number of calls with customer service
av_service_length - Average length of customer service call in minutes
service_xxxx - Number of calls with customer service in the year xxxx
pos_service - Number of positive calls with customer service (i.e. compliment)
neg_service - Number of negative calls with customer service (i.e. complaint)
satisfaction - Customer satisfaction score
first_chatbot - First interaction with the online chatbot
sec_chatbot - Second interaction with the online chatbot
last_chatbot - Last interaction with the online chatbot
nrchatbot - Total number of interactions with online chatbot
av_chatbot_length - Average number of messages in chatbot conversation
chatbot_xxxx - Number of interactions with online chatbot in the year xxxx
pos_chatbot - Number of positive interactions with online chatbot (i.e. compliment)
neg_chatbot - Number of negative interactions with online chatbot (i.e. complaint)
email_pers_solar - Number of personalized emails received about the adoption of solar panels
email_gen_solar - Number of general (non-personalized) emails received about the adoption of solar panels
email_newsletter - Number of newsletter emails sent to the customer
email_coupon - Number of emails with coupons sent to the customer
email_sustainability - Number of emails with sustainability tips sent to the customer
email_information - Number of purely informational emails sent to the customer
email_loyalty - Number of emails about the retailer’s loyalty program sent to the customer
email_save - Number of emails with energy saving tips sent to the customer

Preview Dataset : First 10 rows with 64 columns

1.1 Dealing with Missing Values (NAs), Outliers, and Irregularities

A total of 4,503 missing values were found in the dataset, in variables including “hh_size” (n=139), “solar_panels_since” (n=1402), “sec_service” (n=871), and “sec_chatbot” (n=689).

These missing values are treated according to the purpose of each analysis. Missing values in solar_panels_since, sec_service, and sec_chatbot are generally set to “0”, as they indicate that the customer has not adopted solar panels or has not contacted the company through that channel again after the first interaction. However, “solar_panels_since” may be set to other values for specific purposes.

For example, to analyze how long it takes a customer to adopt solar panels, the NAs in “solar_panels_since” are set to “2022” instead of “0”, so that the variable captures the duration from becoming a customer until solar panel adoption.

For other steps, this variable is set to “0” to capture the duration (in years) for which a customer has owned solar panels. Furthermore, outliers in the “nrservice” and “nrchatbot” variables are treated by winsorization (R function Winsorize from the DescTools library), a data transformation technique that mitigates the impact of extreme values by replacing them with less extreme ones.
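The two preprocessing steps described above can be sketched as follows. This is an illustrative Python sketch, not the project's own code (the project refers to R's Winsorize). The function names are hypothetical, and the 5th and 95th percentile caps are an assumption, since the cut-offs actually used are not stated.

```python
def fill_missing(values, fill):
    """Replace missing entries (None) with a purpose-specific value, e.g. 0 or 2022."""
    return [fill if v is None else v for v in values]


def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in [0, 1]) of an already sorted list."""
    if not sorted_vals:
        raise ValueError("percentile() needs at least one value")
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def winsorize(values, lower=0.05, upper=0.95):
    """Cap values below/above the chosen percentiles (assumed cut-offs)."""
    ordered = sorted(values)
    lo_cap = percentile(ordered, lower)
    hi_cap = percentile(ordered, upper)
    return [min(max(v, lo_cap), hi_cap) for v in values]


# Made-up toy data: one extreme call count among otherwise small counts.
calls = [1, 2, 2, 3, 3, 4, 4, 5, 6, 207]
capped = winsorize(calls)

# Non-adopters get 2022 when measuring time until adoption.
adoption_year = fill_missing([2015, None, 2019], 2022)
```

Because the caps are interpolated, they can be non-integer; rounding them would be one option if the capped variable must remain a count.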

1.2 Descriptive Statistics

Part 1: Data Exploration

Energy consumption:

The average gas consumption appears to increase with the size of the household. However, the standard deviations show high variability in gas consumption within each household size, reflecting a diverse range of gas consumption habits among customers. The average electricity consumption also increases with household size, with five-person households consuming the most electricity on average, and again with high variability within each household size. Urban customers have higher average energy bills than rural customers, implying that urban households consume more energy, although both groups show considerable variability in consumption patterns. The variable "av_bill" represents the average annual bill of customers. Its values range from 312.4 to 7160.1, with a mean of around $1876.7.

Customer Base:

The majority of customers joined the company in 2013 and 2014, signifying significant growth in these years. Whether a household is located in an urban area is indicated by the "urban" variable, a binary variable where 0 denotes a non-urban region and 1 denotes an urban area. The size of the household is indicated by the variable "hh_size", which ranges from 1 to 5 with a mean of 2.192. Most customers are single-person households, and these are almost twice as common in rural areas as in urban areas.

Customer Engagement:

The variable "nrservice" measures the number of calls with customer service. The values range from 1 to 207, with a mean of 4.392. The number of calls increased substantially between 2009 and 2014, with the highest total in 2014. Customer satisfaction is represented by the variable "satisfaction", which ranges from 1.000 to 5.000 with a mean of 3.755. Most satisfaction scores fall between 3.5 and 4.5, with a median of 3.810, which denotes a moderately high degree of satisfaction overall.

Solar Panels Adoption:

The presence of solar panels is indicated by the variable "solar_panels" (0 = No, 1 = Yes). With a mean of 0.29, this means that almost 30% of households have solar panels.

Email Marketing:

The company uses a variety of email types, including personalized and general emails about solar panels, newsletters, coupons, sustainability tips, etc. The firm most frequently sends out newsletters, which make up approximately 29.4% of all emails sent, followed by emails with coupons (14.5%) and general emails about solar panels (12.6%).

Part 2: Data Analysis on the “Variables that impact Customer Service Calls and Chatbot Interactions”

In this project, we explore the complex dynamics of customer engagement within the company, focusing on two key points of contact: customer service calls and chatbot interactions, represented as 'nrservice' and 'nrchatbot' in the dataset, respectively.

Selected Model Variables: (see below code)

The dependent variables of the analysis are 'nrservice' and 'nrchatbot', which reflect the total number of customer service calls and chatbot interactions.

Model 1 -> ‘nrservice’ as dependent variable

Model 2 -> ‘nrchatbot’ as dependent variable

The independent variables are selected based on their potential to influence customer interactions via service calls or chatbots. These can be grouped into five main categories, including service quality indicators, customer profile, service utilization, marketing effort, and energy consumption.

Category 1: Service quality indicators

We create two new variables, 'diff_service' and 'diff_chatbot', to capture the net positive experiences through the call center and chatbot, respectively. The hypothesis is that more positive experiences can boost engagement (Li & Zhang, 2023). The 'satisfaction' variable, which reflects customer satisfaction with the company, is also included. Both positive satisfaction (prompting more engagement) and negative experiences (leading to help-seeking or complaint behavior) can influence service use (Mithas, Krishnan, & Fornell, 2005).

Category 2: Customer Profile

This category includes 'customer_since_yr' (reflecting the length of the customer-company relationship, with the expectation that longer relationships lead to more interactions due to familiarity with services) and 'hh_size', where larger households may have diverse service needs leading to more interactions (Reinartz & Kumar, 2000).

Category 3: Service Utilization

Variables related to solar panel usage ('nr_solar_panels' and 'solar_panels_since') were included to signify the diversity and duration of service usage.

We also created two new variables (see Table), 'days_since_first_service' (the number of days between the first-ever service call in the dataset and a customer's first service call) and 'days_since_first_chatbot' (defined analogously for chatbot interactions), to better capture the customer's service usage history. The expectation is that longer usage periods (early customers) may lead to more interactions due to arising questions or issues.

Category 4: Marketing Effort

Variables related to email communication ('email_pers_solar' and 'email_newsletter') are included. The assumption is that customers receiving personalized information or updates may be more likely to interact with customer service or chatbot for queries or assistance.

Category 5: Energy Consumption

For energy consumption, the 'av_bill' variable is used as a proxy for the customer's energy consumption levels. It is anticipated that high energy users may have more complex needs, potentially increasing their interactions with service channels.

The models for 'nrservice' and 'nrchatbot' include the same predictor variables, with one exception: 'diff_service' is excluded from the 'nrservice' model and 'diff_chatbot' is excluded from the 'nrchatbot' model, to avoid multicollinearity with the respective dependent variable.

Feature Engineering:
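The derived variables described in the categories above could be computed along the following lines. This is an illustrative Python sketch with hypothetical function names. Defining 'diff_service' as pos_service minus neg_service is an assumption: the text describes it only as the net positive experience and does not state the exact formula.

```python
from datetime import date


def net_positive(pos_count, neg_count):
    """Assumed definition of diff_service / diff_chatbot: positive minus negative interactions."""
    return pos_count - neg_count


def days_since_first(first_contact_dates):
    """Days between the earliest first contact in the data and each customer's first contact."""
    earliest = min(first_contact_dates)
    return [(d - earliest).days for d in first_contact_dates]


# Made-up toy records, for illustration only.
first_service = [date(2010, 1, 1), date(2010, 1, 11), date(2012, 1, 1)]
days_since_first_service = days_since_first(first_service)
diff_service = net_positive(pos_count=3, neg_count=1)
```

Under this definition, a value of 0 in 'days_since_first_service' marks the earliest caller in the dataset, and larger values mark customers whose first call came later.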

Modeling Approach:

Our methodological approach utilized statistical modeling techniques suitable for count data, including Poisson Regression, Negative Binomial Regression, and Truncated Count Models to investigate factors influencing customer interactions via service calls ('nrservice') and chatbot ('nrchatbot').

These models are chosen because ‘nrchatbot’ and ‘nrservice’ are counts of events: they cannot take negative values and are discrete variables that take only integer values (as shown in the figure below).

Data exploration revealed outliers in the count data of 'nrservice' (mean = 4.4, max =207) and 'nrchatbot' (mean = 10.5, max =610). Rather than eliminating these outliers, we decided to apply the ‘winsorization’ technique to retain valuable information about the customers in the outlier groups.

Specifically, we found that customers who are in the outlier group in both ‘nrchatbot’ and ‘nrservice’ are marked by significantly more positive call center interactions, longer service length, higher usage rates, and higher energy consumption (as shown by the higher mean values of av_gas, av_elec, and av_bill) than the general customer base. The outliers therefore represent a specific subset of customers with unique characteristics rather than erroneous records. Capping the outliers through winsorization lets us retain valuable information about these customers.

‘nrservice’ distribution and the comparison of customers behavior & characteristics in outlier vs non-outlier groups

‘nrchatbot’ distribution and the comparison of customers behavior & characteristics in outlier vs non-outlier groups

Overview of the reasons behind the adoption of 3 models in this project

Initial analysis using Poisson Regression identified evidence of overdispersion (dispersion test, p < 2.2e-16), where the variance of the count data significantly exceeds the mean. This violation of the equal mean and variance assumption led us to adopt the Negative Binomial Regression model, which includes a parameter to account for excess variance. However, our dataset is zero-truncated: observations with a count of zero are not included. We therefore proceeded to employ a Truncated Count Model.
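The overdispersion and zero-truncation arguments above can be illustrated with a short sketch. It is written in Python purely for illustration, with hypothetical function names. The variance-to-mean ratio is only a descriptive check, not the formal dispersion test reported above, and a zero-truncated Poisson is shown instead of the truncated negative binomial for brevity.

```python
import math
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; values well above 1 hint at overdispersion."""
    return variance(counts) / mean(counts)


def zero_truncated_poisson_pmf(k, lam):
    """P(Y = k | Y > 0) for a Poisson variable with rate lam.

    The ordinary Poisson probability is rescaled by 1 - P(Y = 0) = 1 - exp(-lam),
    because zero counts cannot be observed.
    """
    if k < 1:
        return 0.0
    log_poisson = k * math.log(lam) - lam - math.lgamma(k + 1)
    return math.exp(log_poisson) / -math.expm1(-lam)


# Made-up toy counts with one heavy user.
toy_counts = [1, 1, 2, 3, 40]
ratio = dispersion_ratio(toy_counts)

# The truncated probabilities over k = 1, 2, ... still sum to one.
total_mass = sum(zero_truncated_poisson_pmf(k, 2.5) for k in range(1, 60))
```

The rescaling term is what distinguishes a truncated model from its untruncated counterpart: ignoring it would spread probability over a zero outcome that never appears in the data.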

Of the 2,000 observations, 871 and 689 take the minimum value of 1 in ‘nrservice’ and ‘nrchatbot’, respectively.

To compare the fit of the three models, we use the AIC and BIC information criteria. The results consistently identify the Truncated Count Model as the best fit for both 'nrservice' and 'nrchatbot'. For 'nrservice', the AIC and BIC scores were lowest for the Truncated Count Model (AIC=7452.185, BIC=7524.997) when compared to both the Poisson (AIC=9726.4, BIC=9793.61) and Negative Binomial Models (AIC=8580.325, BIC=8653.137). Similar results were observed for 'nrchatbot', where the Truncated Count Model (AIC=9777.1, BIC=9849.9) outperformed the Poisson (AIC=20028.17, BIC=20100.98) and Negative Binomial Models (AIC=10989.33, BIC=11062.14). These findings demonstrate the importance of selecting a model based on the characteristics of the data, and the usefulness of the Truncated Count Model for zero-truncated count data. The truncated negative binomial models provide substantial insights into the variables influencing customer service calls ('nrservice').
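As a reminder of how this comparison works, the sketch below states the standard AIC and BIC formulas and picks the lowest-scoring model from the values reported above. It is an illustrative Python sketch with hypothetical names; the scores are copied from the text, not recomputed from the data.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# (AIC, BIC) pairs as reported in the text.
reported = {
    "nrservice": {
        "Poisson": (9726.4, 9793.61),
        "Negative Binomial": (8580.325, 8653.137),
        "Truncated Count": (7452.185, 7524.997),
    },
    "nrchatbot": {
        "Poisson": (20028.17, 20100.98),
        "Negative Binomial": (10989.33, 11062.14),
        "Truncated Count": (9777.1, 9849.9),
    },
}


def best_model(scores, criterion=0):
    """Name of the model with the lowest score (criterion 0 = AIC, 1 = BIC)."""
    return min(scores, key=lambda name: scores[name][criterion])


best_by_aic = {target: best_model(scores, 0) for target, scores in reported.items()}
best_by_bic = {target: best_model(scores, 1) for target, scores in reported.items()}
```

Both criteria penalize additional parameters, BIC more heavily for large samples, so agreement between them gives some reassurance that the ranking is not driven by model complexity alone.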

Summarization of results

Coding & Results for ‘nrservice‘

A unit increase in 'diff_chatbot' leads to an estimated 1.82% increase in the expected count of 'nrservice', which implies that customers who have more positive experiences with the chatbot tend to use the service calls more frequently. This highlights the intertwined nature of different service channels and the need for integrated management strategies that synergistically improve both.
The variable 'days_since_first_service' has a negative coefficient: each additional day between the first service call in the dataset and a customer's own first call is associated with about 0.074% fewer service calls. In other words, customers who started calling in the early periods make more service calls, which implies that older customers may be accustomed to and rely more on the traditional service channel (calls).
Conversely, the 'days_since_first_chatbot' variable positively impacts 'nrservice': each additional day in this variable is associated with about 0.04% more service calls, suggesting that more recent (new) chatbot users make more calls. This could indicate that after an initial interaction with the chatbot, customers may have unmet needs or questions that lead them to seek support via a service call. Alternatively, it might indicate that chatbot users are more comfortable with technology and, thus, more likely to use multiple service channels.
The 'years_since_customer' shows that longer-term customers make fewer service calls, decreasing by about 20.6% for each additional year. Customers may become less dependent on service calls as they become more familiar with the products.
Additionally, for each extra member in the household ('hh_size'), 'nrservice' increases by around 20.5%, suggesting larger households may have more complex needs and require more customer service interactions.
Lastly, a slightly positive impact exists with the 'av_bill' variable, where a unit increase leads to a 0.01% increase in expected 'nrservice'.
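The percentage effects listed above follow from the log link used by count models, where a coefficient beta translates into a percentage change of (exp(beta) - 1) * 100 in the expected count per unit increase in the predictor. The sketch below illustrates this conversion in Python with hypothetical function names. The coefficients are backed out from the percentages reported above rather than taken from the model output, which is not shown here.

```python
import math


def percent_change(beta, delta=1.0):
    """Percentage change in the expected count when the predictor rises by delta units."""
    return math.expm1(beta * delta) * 100


def implied_coefficient(pct):
    """Coefficient that corresponds to a given percentage change per unit."""
    return math.log1p(pct / 100)


# Implied coefficients, derived from the reported percentages (not model output).
beta_hh_size = implied_coefficient(20.5)      # +20.5% per extra household member
beta_days_service = implied_coefficient(-0.074)  # -0.074% per additional day

# Effects compound multiplicatively: a full year of delay is not simply 365 * 0.074%.
effect_one_year = percent_change(beta_days_service, delta=365)
```

This is why seemingly negligible per-day effects can still matter: over hundreds of days they accumulate into a substantial difference in expected calls.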

Coding & Results for ‘nrchatbot‘

Analyzing the truncated negative binomial model on chatbot interactions (‘nrchatbot’) shows several significant indicators.
For every unit increase in the net positive experience in service calls (diff_service), there's approximately a 10% increase in the expected count of ‘nrchatbot’, which is consistent with the findings earlier on the adoption of multiple service channels.
The 'days_since_first_service' and 'days_since_first_chatbot' variables also play a role, but with marginal effects: a 0.01% increase and a 0.12% decrease in chatbot usage per additional day, respectively. Customer satisfaction has a positive impact, with each unit increase in satisfaction predicting approximately a 14% increase in chatbot usage.
Household size is a positive predictor, with each unit increase in household size leading to a 24% increase in chatbot usage.
The number of years since the installation of solar panels has a negative impact on chatbot usage. For each additional year since solar panel installation, chatbot usage decreases by approximately 7.4%. This could suggest that customers with older solar panel installations interact less with the chatbot, possibly because they are more familiar with the technology and have fewer questions.
Interestingly, each additional personalized solar email received is associated with about 4.2% fewer chatbot interactions, while each additional newsletter email is associated with about 0.19% more.
Finally, the average bill has a positive but only marginally significant effect: a 0.01% increase in chatbot usage per unit increase in the bill.
Overall, this analysis suggests that proactive communication and enhanced onboarding support for new customers can reduce their need for further assistance. Investments should be targeted towards improving chatbot technology, especially considering its increased use with larger households, while ensuring high customer satisfaction. Lastly, personalized services based on customer traits and thorough solar panel installation education can further augment customer satisfaction while reducing dependence on customer service and chatbot interactions.