Association rules are about discovering patterns in data, usually transactional data such as sales (each product in a purchase is an item) or temporal events (purchases in sequential order). They can also be applied to text (where each item is a word).
So what is the trick behind it? The Apriori algorithm mainly counts how often each item, and each combination of items, appears, and then calculates metrics such as "support" and "confidence" in each iteration.
Here are a few association rule concepts.
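The counting idea can be sketched in a few lines of base R. This is only an illustration of the first two passes, not the arules implementation: the toy transactions, the names `transactions`, `min_support`, `frequent_items` and `frequent_pairs`, and the 0.5 threshold are all assumptions made up for the example.

```r
# Toy transactions (hypothetical data, not from the post)
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("milk", "eggs")
)
min_support <- 0.5  # assumed threshold for this sketch

# Pass 1: support of every single item (share of transactions containing it)
items <- sort(unique(unlist(transactions)))
item_support <- sapply(items, function(i) {
  mean(sapply(transactions, function(tx) i %in% tx))
})
frequent_items <- names(item_support)[item_support >= min_support]

# Pass 2: candidate pairs are built only from the frequent single items
candidates <- combn(frequent_items, 2, simplify = FALSE)
pair_support <- sapply(candidates, function(pair) {
  mean(sapply(transactions, function(tx) all(pair %in% tx)))
})
names(pair_support) <- sapply(candidates, paste, collapse = ",")
frequent_pairs <- pair_support[pair_support >= min_support]

print(frequent_items)
print(frequent_pairs)
```

Because "eggs" falls below the threshold in the first pass, no pair containing it is ever counted in the second pass. That pruning is what keeps the number of candidates manageable.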
Support: the proportion of transactions in which an item appears.
X: the number of transactions in which the item appears.
N: the total number of transactions.
S(x) = X/N
Confidence: indicates how accurate a rule is. For a rule A => B, it is the proportion of the transactions containing A that also contain B: C(A => B) = S(A and B) / S(A).
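To make the two metrics concrete, here is a minimal base R sketch. The baskets and the helper names `item_support` and `rule_confidence` are hypothetical. They are not part of arules, which computes these values for you.

```r
# Hypothetical toy baskets, one character vector per transaction
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk")
)

# Support: share of transactions that contain all the given items
item_support <- function(items, baskets) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Confidence of the rule lhs => rhs: support(lhs and rhs) / support(lhs)
rule_confidence <- function(lhs, rhs, baskets) {
  item_support(c(lhs, rhs), baskets) / item_support(lhs, baskets)
}

print(item_support("bread", baskets))
print(rule_confidence("bread", "milk", baskets))
```

In this toy data bread appears in 3 of 4 baskets, so its support is 3/4. Two of those three baskets also contain milk, so the confidence of bread => milk is 2/3.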
The transaction data can come in one of these formats:
Single.
Taking sales as an example, in this format each line represents one product, so a transaction with several products spans several lines that all refer to the same transaction. For example, a purchase of bread and milk in transaction 1 would take two lines, "1,bread" and "1,milk".
Basket sparse sequential.
Each line represents a transaction, so you get a sparse format where the number of columns varies from row to row, instead of a CSV format with the same number of columns in every row.
Basket.
Each line represents a transaction, but with the same number of columns in every row, so with a large number of products this can be a nightmare if your machine doesn't have a lot of memory. This format is supported by SPSS (Clementine or Modeler).
R uses the basket sparse and single formats; here I used the basket sparse format.
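The relationship between the single and the basket sparse formats can be shown with a small base R sketch. The six input lines and the column names `tid` and `item` are invented for illustration. With arules you would normally let read.transactions do this work, through its format="single" or format="basket" argument.

```r
# Hypothetical single-format data: one line per (transaction id, item)
single_lines <- c("1,bread", "1,milk", "2,bread", "2,butter", "2,eggs", "3,milk")
single <- read.csv(text = paste(single_lines, collapse = "\n"),
                   header = FALSE, col.names = c("tid", "item"),
                   stringsAsFactors = FALSE)

# Group the items by transaction id: one list element per transaction
baskets <- split(single$item, single$tid)

# Basket sparse format: one line per transaction, variable number of fields
basket_lines <- sapply(baskets, paste, collapse = ",")
writeLines(basket_lines)
```

The six single-format lines collapse into three basket lines of different lengths, which is why the basket sparse format cannot be read as a regular CSV with a fixed number of columns.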
install.packages("arules")
install.packages("arulesViz")
install.packages("arulesSequences")
library(arules)     # read.transactions, apriori, inspect
library(arulesViz)  # plot methods for rules
We need to define the support and the confidence. You can edit these in the file arules.r.
support1 = c(0.2)    # a low support, because I want to see what happens at this level
support2 = c(0.7)    # a higher support
confidence = c(0.9)  # confidence should often be over 0.8
tr = read.transactions("transacciones.basket", sep=',', cols=c(1), format="basket")
image(tr)
summary(tr)
The image plot is like a heatmap for displaying frequently bought products. If the product list is too big, it is not useful. On the other hand, "summary" gives us an overview.
itemFrequencyPlot(tr, support=support1)
The command above plots the frequency of the items whose support is at least support1.
And here we execute the Apriori algorithm on the transaction data (tr) with the parameters we defined before:
rules = apriori(tr, parameter=list(supp=support1, conf=confidence))
inspect(rules)
plot(rules, method="graph", control=list(type="items"))
plot(rules, method="grouped")