In this article I’m going to analyze a marketing database using Principal Component Analysis or PCA.
PCA is particularly useful when working with data sets that have many variables. Such data sets cannot easily be visualized in their raw format, which makes it difficult to get a sense of the trends and relationships within them. PCA enables us to summarize the data, to identify similarities between individuals, and to see which variables correlate with one another and contribute strongly to the same principal component.
There are two general methods to perform PCA in R:
- Spectral decomposition, which examines the covariances / correlations between variables
- Singular value decomposition (SVD), which examines the covariances / correlations between individuals
The difference between the two approaches lies in the orthogonality and generality of the decomposition:
- For an arbitrary square matrix, the basis of a spectral decomposition is not necessarily orthogonal, whereas the singular vectors of the SVD are always orthonormal. For the symmetric covariance or correlation matrix used in PCA, however, the eigenvectors are orthogonal as well, so both methods lead to the same components.
- Every matrix has an SVD: it does not need to be square or fulfil any other requirement. By contrast, not even every square matrix has a spectral decomposition. This fundamental difference is what makes the SVD so powerful.
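To make the comparison concrete, here is a minimal sketch that runs both decompositions on the same data and checks that they agree. It uses R's built-in USArrests data set as a stand-in (an assumption for illustration only, not the marketing data of this article).

```r
# Stand-in data: the built-in USArrests data set (4 numeric variables).
X <- scale(as.matrix(USArrests), center = TRUE, scale = TRUE)
n <- nrow(X)

# Route 1: spectral (eigen) decomposition of the correlation matrix.
eig <- eigen(cor(USArrests), symmetric = TRUE)

# Route 2: singular value decomposition of the centred, scaled data matrix.
sv <- svd(X)

# Both routes give the same component variances ...
var_eigen <- eig$values
var_svd   <- sv$d^2 / (n - 1)

# ... and the same directions, up to the sign of each column.
same_directions <- all.equal(abs(eig$vectors), abs(sv$v),
                             check.attributes = FALSE)

# The eigenvectors of a symmetric matrix are orthonormal.
orthonormal <- all.equal(crossprod(eig$vectors), diag(ncol(X)),
                         check.attributes = FALSE)

# prcomp() takes the SVD route internally and reports the same variances.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
same_as_prcomp <- all.equal(pca$sdev^2, var_eigen)
```

In base R, prcomp() is the SVD-based function and princomp() is the eigen-based one, so for a well-behaved data set the choice mostly affects numerical accuracy rather than the result.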
In this section, you will perform a PCA on a marketing data set. I compiled the data set from different sources and anonymized it. The data set can be downloaded from this link. It contains data for an unnamed brand in an unnamed category: sales data in terms of market share, store variables, some competition variables and, more importantly, advertising activity data.
Definition: Private label products are those manufactured by one company for sale under a store's own brand name; they are sometimes also known as white labels.
GRP: The GRP (gross rating point) is an indicator of the advertising pressure of a given medium. It corresponds to the average number of advertising contacts obtained per 100 individuals of the target audience.
Reach: refers to the total number of different people or households exposed, at least once, to a medium during a given period.
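The two definitions can be made concrete with a small worked example. The exposure counts below are invented for illustration and have nothing to do with the article's data set.

```r
# Hypothetical toy campaign: 10 people in the target audience and the
# number of times each of them was exposed to the ad during one week.
exposures  <- c(0, 0, 1, 3, 2, 0, 1, 0, 4, 1)
population <- length(exposures)

# Reach: share of the audience exposed at least once, in percent.
reach_pct <- 100 * sum(exposures >= 1) / population

# Average frequency: number of contacts per person reached.
avg_frequency <- sum(exposures) / sum(exposures >= 1)

# GRP: total number of contacts per 100 people in the target audience.
grp <- 100 * sum(exposures) / population

# By construction, GRP equals reach (in percent) times average frequency.
grp_check <- reach_pct * avg_frequency
```

Because GRP is reach multiplied by frequency, the two measures tend to move together, which is worth keeping in mind when both appear in the same analysis.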
Here is a description of the fields in the data set:
- week: the week number
- Year: the data span approximately 3 years, from mid-2010 to mid-2013
- Market.Share: the category market share of the product
- Av.Price.per.kg: average price of 1 kilogram of the product
- Non.Promo.Price.per.kg: Non promotional price of the product
- Promo.Vol.Share: ratio of promotional sales to normal sales
- Total.Weigh: total weight of the product sold
- Share.of.Ean.Weigh:
- Avg.price.vs.PLB: ratio of the average price to that of the store's private label brand (PLB) in the same category
- Non.promo.price.vs.PLB: ratio of the average non-promotional price to that of the private label brand
- Promo.vol.sh.index.vs.PLB: ratio of the promotional volume to that of the private label brand
- Total.cm.shelf: Total of linear space taken by the product in centimeters
- Shelf.share: share of the total shelf taken by the category
- Top.of.mind: ratio of interviewees who cited the brand top of mind (in answer to the question: can you cite some brands in category X?)
- Spontaneous: ratio of interviewees who spontaneously cited the brand
- Aided: ratio of interviewees who recognized the brand by its logo
- Penetration: ratio of households that bought the brand at least once in the year
- Competitor: one competitor market share. This is a competitor brand that is of interest in the analysis.
- GRP.radio: GRP of the radio in a given week.
- Reach.radio: Reach of the radio advertising in a given week.
- GRP.TV: GRP of TV advertising
- Reach.TV: reach of TV advertising
- Reach.cinema: Reach of Cinema advertising
- GRP.outdoor: GRP of outdoor advertising
- GRP.print: GRP of Print advertising
- Share.of.spend: share of the marketing budget in these activities in the given week.
Data acquisition
The data is hosted in a GitHub repository. Use the following code to load it.
    library(readr)

    # Load the marketing data set from the GitHub repository
    file <- "https://raw.githubusercontent.com/mbenhaddou/MarketingAnalytics/master/data/data.csv"
    data <- read_csv(file)
    head(data)
Find highly correlated variables
Let us remove some correlated variables. Whether this is necessary is always questionable. I think there is some merit in discarding variables thought to measure the same underlying aspect of a collection of variables, for example Reach and GRP in our case. But it should be abundantly clear that setting aside variables known to be strongly correlated with others can have a substantial effect on the PCA results.
There are several reasons why we might want to remove correlated variables:
- Including nearly redundant variables can cause the PCA to overemphasise their contribution.
- It helps visualisation when displaying factor plots or biplots.
Keep in mind that there is nothing mathematically right (or wrong) about such a procedure; it is a judgment call based on the analytical objectives and knowledge of the data.
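The overemphasis effect can be illustrated with a small simulation. This is only a sketch on synthetic data: the variables x1, x2, x3 and their near-copies are invented for the purpose.

```r
set.seed(42)
n <- 200

# Three independent, hypothetical variables.
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
base <- data.frame(x1 = x1, x2 = x2, x3 = x3)

# The same data plus two near-copies of x1 (think Reach and GRP).
redundant <- cbind(base,
                   x1_copy_a = x1 + rnorm(n, sd = 0.1),
                   x1_copy_b = x1 + rnorm(n, sd = 0.1))

# Share of the total variance captured by the first principal component.
share_pc1 <- function(df) {
  p <- prcomp(df, center = TRUE, scale. = TRUE)
  p$sdev[1]^2 / sum(p$sdev^2)
}

share_base      <- share_pc1(base)
share_redundant <- share_pc1(redundant)
```

With the near-copies included, the first component is pulled towards x1 and its duplicates and accounts for a much larger share of the variance, even though no new information was added.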
    library(caret)

    marketing.data <- data
    corr <- cor(as.matrix(marketing.data))

    # Columns that findCorrelation suggests removing (cutoff 0.95)
    highCorr <- findCorrelation(corr, 0.95)
    print(highCorr)

    # While findCorrelation suggests fields to eliminate, I prefer checking for myself.
    names.data <- names(marketing.data)
    for (i in seq_along(marketing.data)) {
      for (j in seq_along(marketing.data)) {
        # Reuse the correlation matrix computed above
        corval <- corr[i, j]
        # i < j reports each pair only once
        if (i < j && corval > 0.95) {
          print("+++++++++++++")
          print(paste(names.data[i], " ---> ", i))
          print(paste(names.data[j], " ---> ", j))
          print("+++++++++++++")
        }
      }
    }
    [1] 12 13 14 16 21 20
    [1] "+++++++++++++"
    [1] "Total.cm.shelf ---> 12"
    [1] "Shelf.share ---> 13"
    [1] "+++++++++++++"
    [1] "+++++++++++++"
    [1] "Top.of.mind ---> 14"
    [1] "Spontaneous ---> 15"
    [1] "+++++++++++++"
    [1] "+++++++++++++"
    [1] "Top.of.mind ---> 14"
    [1] "Aided ---> 16"
    [1] "+++++++++++++"
    [1] "+++++++++++++"
    [1] "Spontaneous ---> 15"
    [1] "Aided ---> 16"
    [1] "+++++++++++++"
    [1] "+++++++++++++"
    [1] "GRP.radio ---> 19"
    [1] "Reach.radio ---> 20"
    [1] "+++++++++++++"
    [1] "+++++++++++++"
    [1] "GRP.TV ---> 21"
    [1] "Reach.TV ---> 22"
    [1] "+++++++++++++"
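The same pair listing can be obtained without an explicit double loop by working on the upper triangle of the correlation matrix. The sketch below uses a tiny synthetic data frame (the columns reach, grp and price are invented) and, unlike the loop above, uses absolute values so that strong negative correlations would be flagged too.

```r
set.seed(1)
n <- 100

# Hypothetical toy data: grp is almost a multiple of reach, price is unrelated.
reach <- runif(n)
toy <- data.frame(
  reach = reach,
  grp   = 3 * reach + rnorm(n, sd = 0.05),
  price = rnorm(n)
)

corr <- cor(toy)

# Keep each pair once: blank out the diagonal and the lower triangle.
corr[!upper.tri(corr)] <- NA

# Row/column positions of the strongly correlated pairs (NA entries are skipped).
idx <- which(abs(corr) > 0.95, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)
print(high_pairs)
```

The result is a small data frame with one row per pair, which is easier to sort or filter than printed separator lines.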
As expected, Reach and GRP are correlated. We can also see that Total.cm.shelf and Shelf.share are correlated, and that Top.of.mind, Spontaneous and Aided are strongly correlated with one another. We will therefore remove the following variables: Total.cm.shelf, Spontaneous, Aided, GRP.radio and GRP.TV. We will also remove the column *Year* from the data set, as this column will be used to group the data points.
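    # Drop Year (2), Total.cm.shelf (12), Spontaneous (15), Aided (16),
    # GRP.radio (19) and GRP.TV (21)
    marketing.data <- marketing.data[, -c(2, 12, 15, 16, 19, 21)]
    head(marketing.data)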
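Dropping columns by position works, but it silently removes the wrong fields if the column order ever changes. A possible alternative, sketched here on a made-up miniature data frame, is to drop by name and fail loudly when a name is missing.

```r
# Hypothetical miniature of the data frame, for illustration only.
toy <- data.frame(
  week         = 1:3,
  Year         = c(2010, 2010, 2010),
  Market.Share = c(0.20, 0.21, 0.19),
  GRP.TV       = c(100, 0, 50),
  Reach.TV     = c(40, 0, 25)
)

# Names of the columns to remove.
drop_cols <- c("Year", "GRP.TV")

# Stop early if any of the names is not present in the data.
stopifnot(all(drop_cols %in% names(toy)))

# Keep every column whose name is not in the drop list.
toy_reduced <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_reduced)
```

For the article's data the drop list would contain Year, Total.cm.shelf, Spontaneous, Aided, GRP.radio and GRP.TV.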
Computing Principal Components
After removing the correlated fields, we are left with a matrix of 20 columns and 156 rows, which we pass to the prcomp() function, assigning the output to the variable marketing.pca. We also set two arguments, center and scale., to TRUE.
    # PCA on centred and scaled variables
    marketing.pca <- prcomp(marketing.data, center = TRUE, scale. = TRUE)
    summary(marketing.pca)
    Importance of components:
                              PC1    PC2    PC3     PC4     PC5    PC6     PC7     PC8     PC9
    Standard deviation     2.3867 1.8473 1.4594 1.35062 1.12545 1.0030 0.98249 0.96700 0.83463
    Proportion of Variance 0.2848 0.1706 0.1065 0.09121 0.06333 0.0503 0.04826 0.04675 0.03483
    Cumulative Proportion  0.2848 0.4554 0.5619 0.65314 0.71647 0.7668 0.81503 0.86179 0.89662
                              PC10    PC11    PC12    PC13    PC14   PC15    PC16    PC17   PC18
    Standard deviation     0.71005 0.61199 0.56523 0.46799 0.44543 0.4050 0.33044 0.26692 0.2365
    Proportion of Variance 0.02521 0.01873 0.01597 0.01095 0.00992 0.0082 0.00546 0.00356 0.0028
    Cumulative Proportion  0.92183 0.94055 0.95653 0.96748 0.97740 0.9856 0.99106 0.99462 0.9974
                              PC19    PC20
    Standard deviation     0.18106 0.13728
    Proportion of Variance 0.00164 0.00094
    Cumulative Proportion  0.99906 1.00000
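Setting scale. to TRUE matters because the variables are measured in very different units (centimetres of shelf, ratios, prices). The following sketch, on synthetic data with invented column names, shows what can happen when scaling is left out.

```r
set.seed(7)
n <- 150

# Hypothetical variables on very different scales.
toy <- data.frame(
  shelf_cm = rnorm(n, mean = 500,  sd = 80),    # centimetres, large numbers
  share    = rnorm(n, mean = 0.20, sd = 0.02),  # ratio, small numbers
  price    = rnorm(n, mean = 4,    sd = 0.5)    # price per kg
)

unscaled <- prcomp(toy, center = TRUE, scale. = FALSE)
scaled   <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of variance per component in each case.
share_unscaled <- unscaled$sdev^2 / sum(unscaled$sdev^2)
share_scaled   <- scaled$sdev^2   / sum(scaled$sdev^2)

# Without scaling, PC1 is essentially the shelf_cm axis.
loading_shelf <- abs(unscaled$rotation["shelf_cm", "PC1"])
```

Without scaling, the variable with the largest numeric variance dominates the first component; with scaling, every variable enters with unit variance and the PCA is effectively run on the correlation matrix.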
The variance retained by each principal component can be obtained and plotted using the factoextra package as follows:
    library(factoextra)

    # Scree plot of the first 10 components
    fviz_screeplot(marketing.pca, ncp = 10)
We can see that about 46% of the variance is explained by the first 2 components (cumulative proportion 0.4554 in the summary above).
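The proportions of variance can be recomputed by hand from the standard deviations reported in the summary table. The sketch below copies those 20 values and derives the proportion and cumulative proportion rows; the 80% threshold at the end is an arbitrary choice for illustration.

```r
# Standard deviations of the 20 components, copied from the summary output.
sdev <- c(2.3867, 1.8473, 1.4594, 1.35062, 1.12545, 1.0030, 0.98249,
          0.96700, 0.83463, 0.71005, 0.61199, 0.56523, 0.46799, 0.44543,
          0.4050, 0.33044, 0.26692, 0.2365, 0.18106, 0.13728)

# Each component's variance is its squared standard deviation.
variance <- sdev^2

# Proportion of variance and cumulative proportion.
prop <- variance / sum(variance)
cum  <- cumsum(prop)

# Cumulative share of the first three components.
round(cum[1:3], 4)

# Smallest number of components that reaches 80% of the variance
# (the 80% cut-off is an illustrative assumption, not a rule).
n_for_80 <- which(cum >= 0.80)[1]
```

Since the variables were scaled, the variances sum to roughly the number of variables, which is a quick sanity check on the table.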
Plotting PCA
Now it’s time to plot our PCA. We will make a biplot, a type of plot that shows both the position of each sample in terms of the first 2 principal components, PC1 and PC2, and how the initial variables map onto them. This lets us visualize how the samples relate to one another and, at the same time, how each variable contributes to each principal component. We will use the ggbiplot package, which offers a user-friendly and pretty function for plotting biplots. First, let us install the ggbiplot library.
    library(devtools)

    # Install ggbiplot from GitHub, then load it
    devtools::install_github("vqv/ggbiplot")
    library(ggbiplot)

    ggbiplot(marketing.pca)
The original variables are drawn as arrows originating from the center point. Here, you see that the variables *Market.Share*, *Shelf.share* and *Competitor* all contribute to PC1, with higher values in those variables moving the samples to the right on this plot. This lets us see how the data points relate to the axes, but it’s not very informative without knowing which point corresponds to which sample. To gain better insight, we will show the data point ids and colour each point according to the Year variable.
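The two ingredients of a biplot are stored in the prcomp object: the points come from the scores and the arrows from the loadings. The sketch below shows how they relate, again using the built-in USArrests data as a stand-in; the plotting function may rescale both before drawing.

```r
# Stand-in data, for illustration only.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Arrows: loadings of each original variable on PC1 and PC2.
loadings <- pca$rotation[, 1:2]

# Points: coordinates (scores) of each sample on PC1 and PC2.
scores <- pca$x[, 1:2]

# The scores are the centred, scaled data projected onto the loadings.
Z <- scale(USArrests, center = TRUE, scale = TRUE)
scores_by_hand <- Z %*% loadings

scores_match <- all.equal(scores, scores_by_hand, check.attributes = FALSE)
```

Because a score is a weighted sum of the scaled variables, a sample with a high value on a variable whose PC1 loading is positive is pushed towards the positive side of PC1, which is what the arrow direction expresses.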
    # Label each point with its row id and colour it by year
    ggbiplot(marketing.pca, ellipse = TRUE,
             labels = rownames(marketing.data),
             groups = as.factor(data$Year))
Now you see something interesting: the years are clearly clustered together. 2010 forms a distinct cluster to the right. Looking at the axes, you see that 2010 is characterised by high values for Promo.Vol.Share, Promo.vol.sh.index.vs.PLB and Shelf.share, while 2011 is characterised by higher Reach.cinema; both years are characterised by higher market and shelf shares. 2012 and 2013, on the other hand, are characterised by high marketing activity (Reach.TV, Reach.radio and Share.of.spend) and by the appearance of a competitor. If we focus on market share, we can see that:
- The competitor is cannibalising sales.
- Market share is strongly correlated with Shelf.share.
- Market share is correlated with the promotional volume.
- The non-promotional price does not appear to impact market share, which could be good news for the manufacturer.
- Outdoor and print advertising show almost no correlation with sales.
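Since the first two components only hold part of the total variance, readings taken from a biplot are approximate, and it can be worth confirming them against the plain correlations. The sketch below does this on synthetic data; the column names mirror those of the article, but the values are simulated and the resulting numbers say nothing about the real data set.

```r
set.seed(123)
n <- 156

# Simulated stand-in: market share driven by shelf share, unrelated to print GRP.
shelf <- rnorm(n)
toy <- data.frame(
  Market.Share = 0.8 * shelf + rnorm(n, sd = 0.4),
  Shelf.share  = shelf,
  GRP.print    = rnorm(n)
)

# Correlation of every other variable with Market.Share.
cors <- cor(toy)["Market.Share", ]
cors <- cors[names(cors) != "Market.Share"]

# Strongest positive correlations first.
sort(cors, decreasing = TRUE)
```

On the real data, the same three lines applied to marketing.data would give a ranked list to set against the angles between the arrows.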