The factors of yearly beer sales — Linear model in time series analysis

Wei hung Pan
5 min read · Mar 28, 2022
Beer in Supermarket!

Beer is one of the most famous drinks in Germany and, of course, part of German culture! A grocery delivery company in Berlin has collected a full year of data on beer sales and related factors. They would like to know which factors drive beer sales so they can adjust their business strategy accordingly.

All the code and the original data can be found here on GitHub.

Data

This dataset contains the following values for each day of 2019:

  • Number of crates of beers sold
  • Hours of daylight
  • Public holiday (0 or 1)
  • The mean temperature in °C
A sample of the sales data
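The dataset itself lives in the GitHub repo; as a minimal sketch of its shape, here is a hypothetical DataFrame with the same columns (the column names `crates_sold`, `hours_of_daylight`, `public_holiday`, and `mean_temperature` are taken from the model formulas used later):

```python
import pandas as pd

# Illustrative rows mirroring the dataset's schema; the real file is on GitHub.
data = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=5, freq="D"),
    "crates_sold": [120, 135, 150, 210, 230],
    "hours_of_daylight": [7.8, 7.9, 7.9, 8.0, 8.1],
    "public_holiday": [1, 0, 0, 0, 0],
    "mean_temperature": [2.1, 3.4, 1.8, 4.0, 3.2],
})
print(data.head())
```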

Exploratory Analysis

Before we run any models, we want to get a deeper understanding of the data.

  • Time-based data trend — Daily

We can see that sales clearly have some relationship with hours of daylight and temperature (not surprising at all; who does not enjoy beer more in summer?).
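A sketch of how such a daily comparison plot can be drawn, using synthetic seasonal stand-ins for the real columns (the actual data is in the repo):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic seasonal stand-ins for the real columns
dates = pd.date_range("2019-01-01", "2019-12-31", freq="D")
temp = 10 + 8 * np.sin(2 * np.pi * (np.asarray(dates.dayofyear) - 100) / 365)
sales = 100 + 6 * temp + np.random.default_rng(4).normal(0, 10, len(dates))

fig, ax1 = plt.subplots(figsize=(10, 4))
ax1.plot(dates, sales, color="tab:blue", label="crates_sold")
ax1.set_ylabel("crates_sold")
ax2 = ax1.twinx()  # second y-axis, since the two scales differ
ax2.plot(dates, temp, color="tab:red", label="mean_temperature")
ax2.set_ylabel("mean_temperature (°C)")
ax1.set_xlabel("date")
fig.savefig("daily_trend.png")
```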

  • Time-based data trend — Weekday
Weekday trend of beer sales

We can see that crates_sold is higher on Thursday, Friday, and Saturday (also not surprising).
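The weekday averages can be computed with a simple groupby; a sketch using a synthetic series with the same Thursday-to-Saturday bump:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2019-01-01", "2019-12-31", freq="D")

# Synthetic sales with a Thu/Fri/Sat bump (dayofweek 3, 4, 5)
bump = 40 * dates.dayofweek.isin([3, 4, 5]).astype(int)
sales = pd.DataFrame({
    "date": dates,
    "crates_sold": 150 + bump + rng.normal(0, 10, len(dates)),
})

weekday_mean = (sales.assign(weekday=sales["date"].dt.day_name())
                     .groupby("weekday")["crates_sold"]
                     .mean())
print(weekday_mean.round(1))
```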

Correlation between explanatory variables

I calculated the correlation and made a scatter plot of hours_of_daylight against mean_temperature. The correlation is very high (Pearson correlation 0.801), which could cause a collinearity problem in the model.
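The correlation check itself is a one-liner with SciPy; a sketch with illustrative seasonal values standing in for the real columns:

```python
import numpy as np
from scipy import stats

# Illustrative seasonal values standing in for the real columns
daylight = np.array([7.8, 9.5, 12.1, 14.6, 16.2, 15.8, 13.9, 11.0, 8.4])
temperature = np.array([1.0, 4.2, 8.5, 13.0, 18.1, 19.5, 15.2, 9.8, 3.1])

r, p = stats.pearsonr(daylight, temperature)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```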

correlation between temperature and daylight

Simple linear model

Because of the high correlation between hours_of_daylight and mean_temperature, we include only mean_temperature in the model to avoid collinearity. I tried adding an interaction term, but it was not significant, so the final model contains only the main effects.

import statsmodels.formula.api as smf

# OLS regression of daily sales on temperature and the holiday indicator
mod = smf.ols(formula='crates_sold ~ mean_temperature + public_holiday', data=data)
reg = mod.fit()

# Internally studentized residuals for the diagnostic plots below
model_norm_residuals = reg.get_influence().resid_studentized_internal
print(reg.summary())
Result of the simple linear model

Model Diagnostic of the simple linear model

Assumptions of the linear model: the residuals should be

  1. Normally distributed with mean 0
  2. Of equal variance (homoscedastic)
  3. Independent and identically distributed (i.i.d.)
Residual histogram and QQ-plot

Normality looks good, but what about correlation between the residuals?

The residual plot is not random, which means the residuals are somehow correlated.
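One quick numerical check for this is the Durbin-Watson statistic: near 2 for independent residuals, well below 2 for positively autocorrelated ones. A sketch on synthetic residual series:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)

independent = rng.normal(0, 1, 365)   # i.i.d. residuals -> DW near 2

# AR(1) residuals with coefficient 0.7 -> DW roughly 2 * (1 - 0.7) = 0.6
correlated = np.empty(365)
correlated[0] = rng.normal()
for t in range(1, 365):
    correlated[t] = 0.7 * correlated[t - 1] + rng.normal()

print(f"independent: {durbin_watson(independent):.2f}")
print(f"correlated:  {durbin_watson(correlated):.2f}")
```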

Autocorrelation

The errors should be independent of each other. But in the following plot, we can see a pattern with period 7, which suggests the error term is related to the day of the week.

import statsmodels.api as sm

# Autocorrelation of the residuals up to lag 40, with 95% confidence bands
acf = sm.graphics.tsa.plot_acf(reg.resid, lags=40, alpha=0.05)
acf.show()

We can see three peaks within each 7-day period, matching Thursday, Friday, and Saturday.

Re-model by adding new features

I add dummy variables for weekdays (Thursday, Friday, Saturday) and months (3–12) to the model.

mod_final = smf.ols(formula='crates_sold ~ mean_temperature + public_holiday + Weekday_Name_Friday + Weekday_Name_Saturday + Weekday_Name_Thursday + Month_3 + Month_4 + Month_5 + Month_6 + Month_7 + Month_8 + Month_9 + Month_10 + Month_11 + Month_12', data=new_data)
reg_final = mod_final.fit()
print(reg_final.summary())
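The Weekday_Name_* and Month_* dummy columns referenced in the formula can be created with pandas; a sketch assuming the data has a date column:

```python
import pandas as pd

data = pd.DataFrame({"date": pd.date_range("2019-01-01", periods=10, freq="D")})
data["Weekday_Name"] = data["date"].dt.day_name()
data["Month"] = data["date"].dt.month

# One 0/1 column per category, e.g. Weekday_Name_Friday, Month_1
new_data = pd.get_dummies(data, columns=["Weekday_Name", "Month"])
print(sorted(c for c in new_data.columns if c.startswith("Weekday_Name_")))
```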

Model Diagnostic again

The autocorrelation is GONE!

Conclusion and Remarks

The final model fits the data quite well, and the residuals satisfy the model assumptions.

Concern:

The R-squared of the model is 0.94, which is extremely high, even a bit unusual. This might be due to overfitting. To address this potential problem, I would use more historical data to validate the model.
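One way to probe for overfitting even within a single year is a chronological train/test split; a sketch on synthetic data with the same seasonal structure (the real check would use the repo's data or an additional year):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
dates = pd.date_range("2019-01-01", "2019-12-31", freq="D")
temp = 10 + 8 * np.sin(2 * np.pi * (np.asarray(dates.dayofyear) - 100) / 365)
df = pd.DataFrame({
    "mean_temperature": temp,
    "crates_sold": 100 + 6 * temp + rng.normal(0, 10, len(dates)),
})

train, test = df.iloc[:300], df.iloc[300:]  # chronological split, no shuffling
reg = smf.ols("crates_sold ~ mean_temperature", data=train).fit()
pred = reg.predict(test)

# Out-of-sample error near the noise level suggests no severe overfitting
rmse = float(np.sqrt(((test["crates_sold"] - pred) ** 2).mean()))
print(f"out-of-sample RMSE = {rmse:.1f}")
```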

Business Remarks:

  1. Set the sales strategy:
    Based on the data, sales on certain weekdays and in certain months are lower. We can suggest that the marketing team offer special deals on those days to boost orders.
  2. Plan resource allocation:
    Sales depend heavily on the season. We can adjust the allocation of drivers and vehicles for delivery accordingly. The same applies to drink storage.
  3. Differences between cities:
    This data is from Berlin. When new markets open in other cities, this model can serve as a baseline, and we can adapt the model structure to the actual data.

PS.

This was an exercise from when I applied for a Data Scientist position at an online drink delivery company back in 2020. It used to be one of my dream companies. Unfortunately, I was not selected, but on the other hand, that was not too bad: the company merged with another one a few months later, and a lot of the employees were let go. Maybe this was from the Lord.

During the interview process, I got to know one of the data scientists there, and we even had a meal together; that was a good experience! It was my pleasure to meet Daniel :)
