The factors of yearly beer sales — Linear model in time series analysis
Beer is one of the most popular drinks in Germany and, of course, part of German culture! A grocery delivery company in Berlin collected a full year of data on beer sales and a few other factors. They would like to know which factors matter for beer sales so that they can adjust their business strategy accordingly.
All the code and the original data can be found here on GitHub.
Data
This dataset contains the following values for each day of 2019:
- Number of crates of beers sold
- Hours of daylight
- Public holiday (0 or 1)
- The mean temperature in °C
Exploratory Analysis
Before running any models, we want to get a better feel for the data.
- Time-based data trend: daily
We can see that sales clearly track the hours of daylight and the temperature (not surprising at all: who doesn't enjoy beer more in summer?).
- Time-based data trend: weekday
We can see that crates_sold is higher on Thursday, Friday, and Saturday (also not surprising).
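The weekday comparison can be reproduced with a simple group-by. The snippet below is only a sketch on synthetic data: the `date` and `weekday` column names and the generated sales figures are assumptions for illustration, and only `crates_sold` comes from the dataset described above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: one row per day of 2019.
rng = np.random.default_rng(0)
df = pd.DataFrame({"date": pd.date_range("2019-01-01", "2019-12-31", freq="D")})
df["weekday"] = df["date"].dt.day_name()

# Made-up sales with a bump on Thursday, Friday and Saturday.
bump = df["weekday"].isin(["Thursday", "Friday", "Saturday"]) * 50
df["crates_sold"] = 200 + bump + rng.normal(0, 5, len(df))

# Mean sales per weekday, shown in calendar order.
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
weekday_mean = df.groupby("weekday")["crates_sold"].mean().reindex(order)
print(weekday_mean.round(1))
```

On the real data, the same `groupby` call on the actual date column should show which weekdays stand out.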
Correlation between independent variables
I calculated the correlation between hours_of_daylight and mean_temperature and drew a scatter plot of the two. The correlation is very high (Pearson correlation 0.801), which could cause a collinearity problem in the model.
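As a rough illustration of this check, the sketch below computes the Pearson correlation with pandas and the matching variance inflation factor (VIF). The seasonal curves are synthetic, not the real Berlin measurements, and the formula VIF = 1 / (1 - r^2) only holds when these two variables are the only predictors being compared.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
day = np.arange(365)

# Synthetic seasonal curves (assumed shapes, for illustration only).
hours_of_daylight = 12 + 4.5 * np.sin(2 * np.pi * (day - 80) / 365)
mean_temperature = (10 + 9 * np.sin(2 * np.pi * (day - 110) / 365)
                    + rng.normal(0, 3, 365))
df = pd.DataFrame({"hours_of_daylight": hours_of_daylight,
                   "mean_temperature": mean_temperature})

# Series.corr uses the Pearson coefficient by default.
r = df["hours_of_daylight"].corr(df["mean_temperature"])

# With exactly two predictors, the VIF of either one is 1 / (1 - r^2).
vif = 1.0 / (1.0 - r ** 2)
print(f"Pearson r = {r:.3f}, VIF = {vif:.2f}")
```

Plugging the reported r = 0.801 into the same formula gives a VIF of about 2.8.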
Simple linear model
Because hours_of_daylight and mean_temperature are highly correlated, we only put mean_temperature in the model to avoid collinearity. I also tried adding an interaction term to the model, but it was not significant, so the final model contains only the main effects.
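import statsmodels.formula.api as smf

# Ordinary least squares: sales explained by temperature and the holiday flag
mod = smf.ols(formula='crates_sold ~ mean_temperature + public_holiday', data=data)
reg = mod.fit()
# Internally studentized residuals, kept for the residual diagnostics
model_norm_residuals = reg.get_influence().resid_studentized_internal
print(reg.summary())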
Model Diagnostic of the simple linear model
Assumptions of the linear model: the residuals should be
- Normally distributed with a mean equal to 0
- Of equal variance
- Independent and identically distributed (i.i.d.)
Normality looks fine, but what about correlation between the residuals?
The residuals are not random, which means they are somehow correlated.
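One common numeric companion to this visual check is the Durbin-Watson statistic, which the statsmodels OLS summary also reports. The sketch below implements it directly with NumPy on synthetic residuals. The white-noise and AR(1)-style series are made up for illustration. Values near 2 suggest no first-order autocorrelation, while values well below 2 suggest positive autocorrelation.

```python
import numpy as np

def durbin_watson(resid):
    """Sum of squared successive differences over the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(4)
n = 365

# Independent residuals (what the model assumptions ask for).
white = rng.normal(0, 1, n)

# Positively autocorrelated residuals: each value carries over 80% of the last.
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = 0.8 * ar[t - 1] + white[t]

dw_white = durbin_watson(white)
dw_ar = durbin_watson(ar)
print(f"white noise: {dw_white:.2f}, autocorrelated: {dw_ar:.2f}")
```

Note that this statistic only looks at lag 1, so it can miss a purely weekly pattern; that is one reason to inspect the full autocorrelation plot as well.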
Autocorrelation
The errors should be independent of each other. But in the following plot, we can see a pattern that repeats every 7 lags, which might indicate that the error term is related to the weekday.
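import statsmodels.tsa.api as smt  # the smt alias is assumed from the call below

# Autocorrelation of the residuals up to lag 40, with 95% confidence bands
acf = smt.graphics.plot_acf(reg.resid, lags=40, alpha=0.05)
acf.show()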
Within each 7-day period there are three peaks, which match Thursday, Friday, and Saturday.
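To see why a weekday effect left in the residuals shows up as a 7-lag pattern, here is a minimal NumPy sketch of the sample autocorrelation. The residuals are synthetic (elevated on three consecutive weekdays plus noise), `sample_acf` is a helper written for this example, and the band is only a rough white-noise approximation.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[k:] * x[:len(x) - k]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(2)
n = 364  # 52 full weeks
weekday = np.arange(n) % 7

# Synthetic residuals: three consecutive weekdays are systematically higher.
resid = np.where(np.isin(weekday, [3, 4, 5]), 30.0, 0.0) + rng.normal(0, 5, n)

acf_values = sample_acf(resid, 21)
band = 1.96 / np.sqrt(n)  # rough 95% band if the residuals were white noise

for lag in (1, 4, 7, 14, 21):
    outside = abs(acf_values[lag]) > band
    print(f"lag {lag:2d}: acf = {acf_values[lag]: .2f}, outside band: {outside}")
```

In this toy series the autocorrelation is strongly positive at lags 7, 14 and 21, which is the kind of weekly signature described above.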
Re-model by adding new features
I added weekday indicators (Thursday, Friday, Saturday) and month indicators (months 3 to 12) to the model:
mod_final = smf.ols(formula='crates_sold ~ mean_temperature +public_holiday +Weekday_Name_Friday + Weekday_Name_Saturday + Weekday_Name_Thursday + Month_3 + Month_4 +Month_5 + Month_6 + Month_7 +Month_8 +Month_9 + Month_10 + Month_11 + Month_12', data=new_data)
reg_final = mod_final.fit()
print(reg_final.summary())
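The formula refers to indicator columns such as Weekday_Name_Friday and Month_3, but the post does not show how new_data was built. One plausible way to get columns with those names, assuming a `date` column (hypothetical here) and helper columns called `Weekday_Name` and `Month`, is `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical stand-in for the original frame; only a date column is needed.
data = pd.DataFrame({"date": pd.date_range("2019-01-01", "2019-12-31", freq="D")})
data["Weekday_Name"] = data["date"].dt.day_name()
data["Month"] = data["date"].dt.month

# One 0/1 column per weekday and per month,
# named like Weekday_Name_Friday and Month_3.
dummies = pd.get_dummies(data[["Weekday_Name", "Month"]].astype(str), dtype=int)
new_data = pd.concat([data, dummies], axis=1)

print(new_data[["date", "Weekday_Name_Friday", "Month_3"]].head())
```

Because the formula lists only some of these indicators, the omitted ones (Sunday to Wednesday, and months 1 and 2) act as the baseline that the listed coefficients are compared against.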
Model Diagnostic again
The autocorrelation is gone!
Conclusion and Remarks
The final model fits the data quite well, and the residuals satisfy the model assumptions.
Concern:
The R-squared of the model is 0.94. This is extremely high, even a bit unusual, and might be due to overfitting. To address this potential problem, I would verify the model on more historical data.
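A simple way to carry out that verification is to fit on one year and score on another. The sketch below does this with NumPy least squares. Both "years" are synthetic, and `make_year` and its coefficients are made up for illustration, so the numbers say nothing about the real beer data. The point is only the comparison between in-sample and out-of-sample R-squared.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat against y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def make_year(rng, n=364):
    """Synthetic year: temperature effect, late-week bump and noise."""
    day = np.arange(n)
    temp = 10 + 9 * np.sin(2 * np.pi * (day - 110) / 365) + rng.normal(0, 3, n)
    late_week = np.isin(day % 7, [3, 4, 5]).astype(float)
    y = 150 + 4 * temp + 40 * late_week + rng.normal(0, 10, n)
    X = np.column_stack([np.ones(n), temp, late_week])  # intercept + features
    return X, y

rng = np.random.default_rng(3)
X_train, y_train = make_year(rng)  # stands in for 2019
X_test, y_test = make_year(rng)    # stands in for another year

# Fit on the first year only, then score both years with the same coefficients.
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
r2_in = r_squared(y_train, X_train @ beta)
r2_out = r_squared(y_test, X_test @ beta)
print(f"in-sample R^2 = {r2_in:.3f}, out-of-sample R^2 = {r2_out:.3f}")
```

If the out-of-sample value on real historical data were much lower than 0.94, that would support the overfitting concern.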
Business Remarks:
- Make the sales strategy:
Based on the data, sales are lower on certain weekdays and in certain months. We can suggest that the marketing team offer special deals on those days to boost orders.
- Make the plan for resource arrangements:
Sales depend heavily on the season. We can adjust the allocation of drivers and vehicles for delivery accordingly. The same approach can also be applied to drink storage.
- Difference between cities:
This data is from Berlin. When new markets are opened in other cities, this model can be used as a baseline, and we can change the model structure according to the real data.
PS.
This was an exercise I did when I applied for a Data Scientist position at an online drink delivery company back in 2020. It used to be one of my dream companies. Unfortunately, I was not selected, but on the other hand, this was not too bad: the company merged with another one a few months later, and a lot of the employees were let go. Maybe this was from the Lord.
During the interview process, I got to know one of the data scientists there, and we even had a meal together. That was a good experience, and it was my pleasure to get to know Daniel :)