Our astrophysicist Dobroslav Slijepcevic, as a data analyst for our agency Inceptly, recently slowly introduced you to the world of MMM, in order to be able to understand its massive value for marketers generally.
If you missed it, you could read it here.
Now is a time for part 2.
Finally, let's do some modeling...
… Or wait a little bit more. First, we need to do some descriptive and exploratory analysis of our data, to become familiar with it. We don’t know anything about them, maybe our assumption that a linear model could fit doesn’t make sense.
You can find data on google spreadsheet here.
I recommend you first calculate the average weekly spending by channel and average weekly revenue, by using the functions ‘AVERAGE’ and ‘SUM’. For our example data, that would be:
Now, it’s a good idea to plot some graphs and do some simple analysis to see what we are dealing with. Since revenue is the variable we want to explain and predict, usually, it’s smart to plot it on the timeline:
Now, we visually showed the changes in revenue by the week. You must admit that it’s a better way to become a friend with your data, then just to stare into the table of 51 rows and try to capture anything. Now, we clearly see what weeks are the spikes and what are the dips in our revenue.
Hence, we see that we had a minimum in week 1, while the maximum in revenue is at week 17. Generally speaking, we have a slightly uprising trend, which we can check by adding a trendline to our graph. Clearly, our revenue has had an uprising trend in those weeks. You can even add the average revenue to your graph and see what weeks had revenue below average, or what weeks got income above average.
Now it’s time to see how we spent on channels by week. Generally, we spent the most on Facebook Ads, then on Google videos, and then on TV. We have calculated the weekly average spending for each channel and confirmed what we saw in the timeline graphs. Also, by comparison of the revenue timeline graph and ad spend timeline graph we should conclude that ad spending is correlated to revenue, when we spend more, we have a bigger gross in revenue! One of the most usual ways to visually confirm that is to make a scatter plot of Revenue vs each of the Ad channels.
It is very simple to plot these graphs in Gsheets, you just need to mark the proper two columns (Revenue and one of the ad channels) and insert a new chart. In setup, you must choose a scatter chart, and set up the Ad spend channel as axis x (horizontal) while Revenue is axis y (vertical). By doing that for all ad channels, you’ll get three independent graphs, shown in the figure below:
How to read these graphs? Each dot represents one pair of variables (revenue - ad spend) for the appropriate week. And we are trying to assume what pattern these dots are following. One look is enough to catch the trend all graphs follow: An increase in spending for each of the channels separately is followed by an increase in revenue! The relationships are not perfectly linear, but we are good with that, I doubt that there’s a person who saw a perfect linear relationship in the dataset from the real world ever. So, we showed that there is a correlation between revenue and ad spends, but, for now, only qualitatively.
To express this correlation by the exact numbers, which means quantitatively, we will use the so-called correlation matrix. Basically, it is a table that shows us the coefficients of correlation between the variables we use in our dataset. Coefficients of correlation are numbers between -1 and +1 which give us information about how strong is the linear relationship between the variables. Coefficient +1 means perfect positive correlation, the relationship between variables is perfectly linear, and an increase in one variable leads to an increase in another variable too. Coefficient -1 means perfect negative correlation, the relationship between variables is perfectly linear, but an increase in one variable leads to a decrease in another variable! All dots lie along a straight line and the relationship is perfectly described with linear function (remember that we had y=ax+b):
It’s clear that we consider the linear relationship stronger as the coefficient of correlation is closer to +1 or -1. In practice, if we have at least +0.7 to +1 (or between -0.7 and -1), we consider it strong enough.
In google sheets, you should insert the function =CORREL (RANGE of variable 1, Range of Variable 2) into any field if you want to calculate the coefficient of correlation. For example, if I want to calculate the correlation coefficient between the revenue (Column B) and Facebook Spend (Column C) and I have 51 rows of data, I should insert the function =CORREL(B2:B52, C2:C52) into the field I want. If you arrange the fields properly, you can organize a table like this one below, and even color it to easily categorize coefficients by the value:
As you can see, the coefficient of correlation between Revenue and Facebook spend has the highest value (0.87), but the correlation between Revenue and Google Spend and TV spend is very strong, too (0.74 and 0.73). With this, we are pretty sure that the relationship between our dependent variable (revenue) and our independent variables (three channels of ad spending) could be described as linear, and we can finally make a simple model.
Multiple linear regression
How to make a model: We defined what is simple linear regression, and I explained it through one dependent variable (y, or Revenue in our example) which we want to explain and predict, and one independent variable (x, or Ad spend). But, as you notice for sure, we have three channels of Ad spends and one column which shows the weeks when we had some kind of special events (promotions) when we changed prices, had actions like buy 1 and get 1 gratis and similar… The easiest way to somehow incorporate special events into the model is to input data about them as 1 (true) or 0 (false). So, we did it, weeks with promotions have 1 as the value in that column, and weeks without promotions have zero values. Data organized in this way is called DUMMY DATA.
Hence, we have 4 independent variables: Facebook Spend (x1), TV Spend (x2), Google Spend (x3), and Promotions (x4) and we want to predict Revenue. Thus, our equation will have a form:
y=a1× x1 +a2× x2 +a3× x3 + a4× x4+b
where each independent variable will have the slope coefficient a for itself, while b is the baseline again.
It is quite easy to do multiple linear regression in Gsheets by using the function LINEST, which requires four arguments. All you need to do is to choose a random empty field and insert =LINEST(y_data, x_data, 1, 1). In our example, it would be =LINEST(B2:B52, C2:F52, 1, 1).
It’s clear why we insert the first two arguments of the function, but I must explain the meaning of two numbers of “1” as the third and fourth arguments. “1” as the third argument means that you want to calculate the intercept because there are situations when you want to “enforce” the line to pass through the origin (where all of the variables have the value of zero). In that case, you set up the third argument as zero, otherwise, you choose “1” and get the y-intercept (or baseline).
If the fourth argument is “1”, you’ll get some additional statistics as a result along with the slope coefficients and y-intercept. It’s important statistics, but it isn’t part of this simple course (except for one field, but we shall discuss it later). If you follow the described steps, you should get the following:
ii) Interpreting the results:
Let me interpret what you have got, but have in mind that we are especially interested in the first row. First row, but the last column, that field is the y-intercept, or better said our baseline. Thus, if we don't spend a single penny on the ads, we should have a revenue of about $4,687.
The other values in the first row are the slopes, but the LINEST function gives them in the reverse order, the first column is for the fourth independent variable, the second column is for the third independent variable, etc…(you have the correct order in the last row). Thus, the slope for FB Spend is 6.32, the slope for TV Spend is 2.79, and the slope for Google Spend is 5.92. Also, the slope for Promotions is 312.48.
As I said before, the meaning of these slopes, (if we used revenue as a dependent variable we want to predict, and if we did linear regression without any transformations) is that they represent the ROAS of our marketing channels.
The model tells us that for $1.00 invested in Google video we earn $5.92, for $1.00 invested in Facebook ads, we earn $6.32 and for each $1.00 invested in TV ads, we earn $2.79. And also, every time when we have some kind of promotion, we earned an extra $312. Uhm, maybe it’s not totally realistic, but this is only the exercise example, why not be optimistic 🙂
iii) Describe what is a response to all marketing activities by the channels on our revenue:
Now, you can easily calculate the effect share of each marketing channel and promotion in your total revenue! What do you need to do? If the coefficient of Facebook is 6.32, then you should multiply it by the total Facebook spend. Total facebook spend can be calculated as =SUM(C2:C52), and in my Gsheet file, the coefficient for Facebook is the P1 field. Thus, the final formula for Facebook Share in total revenue is =P1*SUM(C2:C52). You should do the similar for Google Spend and TV Spend.
For baseline and promotions is a little bit different. There, you only need to multiply the number of weeks (51) by the baseline (for promotions, you multiply the number of weeks with promotions, which is six, in our case, by the promotions coefficient).
Percentage Effect shares in total revenue you can calculate only by dividing each share by total revenue, and by formatting the fields as a percent (Format > Number > Percent)
Percentage Spend share you can calculate if you divide total channel spend by the total ad spend
When you do all of that, you’ll get the following:
Show this visually (Try to google how to make these charts, two bar charts and one waterfall chart):
So, about 32% of your revenue is actually your baseline, if you stop your marketing activities, you could expect to keep getting that part of your revenue.
But, even 36% of your total revenue is generated by Facebook while you spent 45% of all your marketing costs on Facebook. Google participates with about 23% in total revenue, but you spent significantly less on it in comparison to Facebook. It seems that TV generates the least amount of money (9%), although the cost of TV is not negligible at all (25%). If this was a real situation, the simple recommendation would be:
1. Don't spend too much on TV, better spend on Facebook and Google,
2. It’s very common for other marketing channels to steal from TV. The effect of TV activities is long-term effect. Maybe we need to incorporate that into our model and then decide!
iv) Calculate the predicted revenue by the model and make a chart of Actual/Predicted to visually evaluate your model’s performance.
How to calculate the weekly revenue which our model predicts? Note that we fitted the best straight line through the dots which don’t lie exactly on the line. Thus, predicted values will differ from the real values. But, for how much? Let us see. Our equation for the predicted revenue is:
Predicted Revenue = Facebook Coefficient * Facebook Spend +TV Coefficient * TV Spend + Google Coefficient * Google Spend + Promotions Coefficient * 1, if there were promotions + Baseline
For the first week, it would be:
predicted revenue = 6.32 ✕ 755.43 + 2.79 ✕ 250.00 + 5.92 ✕ 377.06 +4687.35
Actual value is $9,657, thus, the difference (known as RESIDUAL in statistics) is:
Actual - Predicted = $9,657 - $12,391= - $2,734
Expressed as a percentage, this error would be:
Actual - Predicted Actual 100% = -2,7349,657 100% = -28.3%
It seems as a very large error, but note that this is only one week of 51. We are interested in the whole picture, so we must calculate all predicted weekly revenues, residuals, and percentage errors. On our data set, it is very easy to do. The formula for predicted revenue (in our case) is:
where we use $$ sign to keep the proper fields (with coefficients) constant while we go through the column. If we insert the formula above to the field 2 of some column and drag down ‘till field 52, you’ll get all predicted revenues very quick!
I calculated the Predicted revenue in Column R. To get the chart Actual vs Predicted, you need to mark the week column (A in my case), Actual Revenue (Column B), and Predicted Revenue (Column R) and go to insert > chart. Set the chart to be a Line chart. Choose Week as the X axis, and set Revenue and Predicted Revenue as Series. Add names to the axis, and change the type of the lines, if you want. You should get something like this:
Doesn’t look bad at all! We caught all dips and peaks, it seems that the predicted and actual values agree with each other.
v) Calculate errors, and evaluate the accuracy of a model
There are many ways to evaluate the accuracy of a model, but for now, we shall talk about two parameters. The first of them is R2 (or R-squared), and the second is MAPE (Mean Absolute Percentage ERROR). The goal is to get as high an R-squared as possible, and MAPE as LOW as possible.
We don’t need to calculate R-squared. The LINEST function calculated it for us, it is in the first column of the LINEST results, and in the third row:
As we can see, R-squared is 0.84. The highest possible value is 1.00, so it counts as pretty OK. It tells us how much of the values we wanted to predict can be explained by the input data (spending and promotions), and we got about 84% which is not bad. Anyway, it’s wrong only to evaluate the model by looking at R-squared. There are many types of errors we can choose to estimate the accuracy of the model. One of them is MAPE, and it is easy to calculate. MAPE is the average value of all percentage errors, and as I said, we want it to be low! How to calculate it?
We need to calculate all of the residuals
Then, we need to calculate percentage errors
Now, we need to transform percentage errors into their absolute values
At the end, we must sum all the absolute percentage errors and divide them by the number of observation (51)
If you follow these steps, you’ll get that:MAPE = 5.66%
Let it be your Homework 🙂
vi) Do some forecasts. Allocate budget. For example, your maximum weekly revenue was about $20,000, while the budget was about $2,900. What would be, if we try to spend the same money, but allocate it like this: $1,200 on Facebook, $1,400 on Google, and $300.00 on TV? Forecasting is:
Forecast =6.32 ✕ 1,200 + 2.79 ✕ 300 + 5.92 ✕ 1,400 +4687.35=$21.500
It would get an increment of $21.500 - $1.500=$1.500 or 1,500/20,000 * 100% =7.5% in our revenue!
Find the best scenario for you!
In the next lesson, we shall discuss how to incorporate a time lag in marketing effects (you can’t expect all of your marketing activities to give the best result immediately) and also how to handle “diminishing returns”, because every marketing activity must become “saturated” and get a less response in return.
What are your YouTube ad questions?
In the meantime though, what questions do YOU have about YouTube ads?
Let us know in the Comments section below, and we’ll make sure to cover your question in an upcoming post.
Have a nice day!
The VidTao Team
Sign up for VidTao.com: The Free Tool to Discover & Track Your Market's Best YouTube Ads.
Are you spending over $1k/day on paid traffic and want to scale with YouTube ads? Schedule your free YouTube ad brainstorming call here: inceptly.com/call
VidTao.com is brought to you by Inceptly.com - High-Performance YouTube Ad Creative & Media Buying Agency Managing $5M/month+ in YouTube Ad Traffic.