Showing posts with label Data science. Show all posts

Thursday 14 September 2017

Apriori algorithm with R

 

The apriori algorithm is used to discover association rules. So what are association rules?

Association rules are about discovering patterns in data, usually transactional data such as sales (each product in a purchase is an item) or temporal events (purchases in sequential order). They can also be applied to text, where each item is a word.

So what is the trick behind it? The apriori algorithm mainly counts how often each item (and each combination of items) appears, and then computes metrics such as "support" and "confidence" in each iteration.
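
To make the counting idea concrete, here is a minimal sketch in base R. It is not the implementation used by the arules package: the toy transactions, the function names and the 0.5 threshold are all made up for illustration, and the candidate generation is simpler (and less efficient) than the join and prune steps of the real algorithm. It only shows the level-wise pattern: count single items, keep the frequent ones, then build larger itemsets from the survivors.

```r
# Toy transactions (hypothetical data, not from this post)
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("milk", "butter"),
  c("bread", "milk", "butter")
)

# Fraction of transactions that contain every item of `itemset`
support_of <- function(itemset, transactions) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Level-wise search: itemsets of size k are built only from items
# that still appear in some frequent itemset of size k - 1.
frequent_itemsets <- function(transactions, min_support) {
  items <- sort(unique(unlist(transactions)))
  result <- list()
  k <- 1
  while (length(items) >= k) {
    candidates <- combn(items, k, simplify = FALSE)
    supports <- vapply(candidates, support_of, numeric(1),
                       transactions = transactions)
    keep <- supports >= min_support
    if (!any(keep)) break
    for (i in which(keep)) {
      result[[paste(candidates[[i]], collapse = ",")]] <- supports[i]
    }
    items <- sort(unique(unlist(candidates[keep])))
    k <- k + 1
  }
  unlist(result)
}

freq <- frequent_itemsets(transactions, min_support = 0.5)
print(freq)
```

With this toy data every single item and every pair passes the threshold, while the three-item set does not, so the search stops at the third level.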

Here are a few association rule concepts.

Support: the proportion of transactions in which an item appears.
X: the number of transactions in which the item x appears
N: the total number of transactions

S(x) = X/N

Confidence: a measure of how accurate a rule is, that is, how often the rule holds in the transactions where it applies.
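
As a small worked example, the sketch below computes both metrics by hand in base R. The transactions are hypothetical, and the confidence formula used, conf(A => B) = S(A and B) / S(A), is the usual textbook definition rather than something spelled out in this post.

```r
# Toy transactions (hypothetical data)
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("milk", "butter"),
  c("bread", "milk", "butter")
)

n <- length(transactions)  # N: total number of transactions

# Number of transactions that contain every item of `itemset`
count_containing <- function(itemset) {
  sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

count_bread <- count_containing("bread")             # X for "bread"
count_both  <- count_containing(c("bread", "milk"))  # X for {bread, milk}

support_bread <- count_bread / n           # S(bread) = X / N
support_both  <- count_both / n            # S({bread, milk})
confidence_rule <- count_both / count_bread  # conf(bread => milk)

cat("support(bread)         =", support_bread, "\n")
cat("support(bread, milk)   =", support_both, "\n")
cat("confidence(bread=>milk) =", confidence_rule, "\n")
```

Here bread appears in 4 of 5 transactions, bread and milk together in 3 of 5, so the rule "bread => milk" holds in 3 of the 4 transactions that contain bread.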

The transaction data can come in one of these formats:

Single.
Taking sales as an example, each line in this format represents one product, so a transaction with several products spans several lines that all refer to the same transaction. Here is an example:

Basket sparse sequential.

Each line represents one transaction, so you get a sparse format where the number of columns varies from row to row, unlike a CSV file with the same number of columns in every row.

Basket.

Each line represents one transaction, but every row has the same number of columns. With a large product catalogue this can be a nightmare if your machine doesn't have a lot of memory. This format is supported by SPSS (Clementine or Modeler).
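
The relation between the three formats can be sketched in base R. The data frame below is invented for illustration (the column names "id" and "item" are assumptions, not part of any file used in this post). It starts from the single format, groups the items into sparse basket lines with the transaction id in the first column, and finally builds the fixed-width table with one column per product.

```r
# Hypothetical sales in "single" format: one row per (transaction, item)
single <- data.frame(
  id   = c(1, 1, 2, 2, 2, 3),
  item = c("bread", "milk", "bread", "butter", "milk", "butter"),
  stringsAsFactors = FALSE
)

# Sparse basket: group the items by transaction id
baskets <- split(single$item, single$id)

# One comma-separated line per transaction, id first;
# the number of fields varies from line to line
basket_lines <- vapply(
  names(baskets),
  function(id) paste(c(id, baskets[[id]]), collapse = ","),
  character(1)
)
writeLines(basket_lines)

# Fixed-width basket: one row per transaction, one TRUE/FALSE
# column per product (this is the layout that grows quickly
# when there are many products)
wide <- table(single$id, single$item) > 0
print(wide)
```

The sparse lines keep only the items that were actually bought, while the wide table needs a cell for every transaction and product pair.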




Well, first we need to install these packages: "arules", "arulesViz" and "arulesSequences".
R reads the basket sparse and single formats; here I used the basket sparse format.

install.packages("arules")
install.packages("arulesViz")
install.packages("arulesSequences")

We need to define the support and the confidence;
you can edit these values in the file arules.r

support1 = 0.2    # a low support, because I want to see
                  # what happens at this level
support2 = 0.7    # a higher support
confidence = 0.9  # confidence should usually be above 0.8

tr = read.transactions("transacciones.basket",
                       sep=',',
                       cols=c(1),
                       format="basket");
image(tr)
summary(tr)

The image plot works like a heatmap of the transactions, which helps to spot frequently bought products. If the list of products is too long, it is not useful. "summary", on the other hand, gives us an overview.

itemFrequencyPlot(tr, supp=support1)

The command above produces this graph:

And here we execute the apriori algorithm on the transaction data (tr) with the parameters we defined before:

rules = apriori(tr, parameter= list(supp=support1, conf=confidence))
inspect(rules)

plot(rules, method="graph", control=list(type="items"))
plot(rules, method="grouped")
 
 
 

Friday 11 August 2017

Data Science Project Checklist

 
Before Starting The Data Science Project Checklist
The checklist to go through before starting the project is broken down into five sections.
  1. What question are you asking/answering and for whom?
  2. What data are you using?
  3. What techniques are you going to try?
  4. How will you evaluate your methods and results?
  5. What do you expect the result to be?
For each section, there will be additional questions that you should think about and answer before you get started with your data science project.

Refer to the full article for more detail.

Saturday 17 June 2017

Peekaboo: A Wordcloud in Python

Peekaboo: A Wordcloud in Python: Last week I was at Pycon DE , the German Python conference. After hacking on scikit-learn a lot last week, I decided to to something differe...

Wednesday 31 May 2017

Scikit-learn cheat sheet



Often the hardest part of solving a machine learning problem can be finding the right estimator for the job.
Different estimators are better suited for different types of data and different problems.
The flowchart below is designed to give users a bit of a rough guide on how to approach problems with regard to which estimators to try on your data.




https://qph.ec.quoracdn.net/main-qimg-fa26fa7e8001fd19dd53057397da999f
http://scikit-learn.org/stable/tutorial/machine_learning_map/

