Thursday, 14 September 2017

Apriori algorithm with R

 

The apriori algorithm is used to discover association rules, and what is that?.

Association rules is about discover pattern in data, usually transnational data,  like sales (each product when you do a purchase is an item), temporal events (each purchase with sequential order), and could be used in texts (where each item would be a word ).

So what is the trick behind that?, apriori algorithm  mainly counts every time an item appears, later calculated some metrics like "confidence", and "support" in each iteration.

Here a few concepts association rules.

Support:  it show the transaction proportion where a item appears.
X: count the times that an item appears in the dataset
N: quantity of transaction.

S(x) = X/N

Confidence: it's the confidence of a rule. that indicates how much accurate is a rule.

So, the transaction format could be:

Single.
taken the example of sales, in this format a line represent a product, so should be more of one lines with different products which referrer to the same transaction. here a example:

Basket sparse sequential.

Each line represent a transaction, so you get a sparse format with variation of the number of columns by row instead of a csv format with equals columns.

Basket.

Each line represent a transaction but with equals columns, so for large products
this could be a nightmare, if your machine doesn't have a lot of memory. this is support by SPSS (clementine or modeler)




Well first, we need to install these packages,  "arules""arulesViz", "arulessecuences".
R use the format basket sparse and single, here I used format basket sparse.

install.packages("arules");
install.packages("arulesViz");
install.packages("arulesSecuences");

We need to define the support and the confidence,
you could edit this in the file arules.r

support1 = c(0.2) #it's a low support because 
                  #I want to see what happens
                  #at this level 
support2 = c(0.7)   # a higher support,
confidence = c(0.9) # and confidence often should be over 0.8

tr = read.transactions("transacciones.basket",
                       sep=',',
                       cols=c(1),
                       format="basket");
image(tr);
summary(tr);
Image plot is like a heatmap for display frequently bought products. If the list products is too big,
this is not useful. On the other hand "summary" show us an overview.

itemFrequencyPlot(tr, supp=support1)

the command above makes this graph:

And here we, execute the apriori algorithm with the data transaction (tr) and the parameters we defined before:

rules = apriori(tr, parameter= list(supp=support1, conf=confidence))
inspect(rules)

plot(rules, method="graph", control=list(type="items"))
plot(rules, method="grouped")
 
 
 

No comments:

Post a Comment

Azure AzCopy Command in Action

Azure AzCopy Command  in Action -  Install - Module - Name Az - Scope CurrentUser - Repository PSGallery - Force # This simple PowerShell ...