Showing posts with label Machine learning. Show all posts

Thursday 14 September 2017

Apriori algorithm with R

 

The Apriori algorithm is used to discover association rules. And what are association rules?

Association rule mining is about discovering patterns in data, usually transactional data such as sales (each product in a purchase is an item) or temporal events (purchases in sequential order); it can also be applied to text (where each item would be a word).

So what is the trick behind it? The Apriori algorithm essentially counts how often each item (and itemset) appears and, in each iteration, calculates metrics such as "support" and "confidence".

Here are a few key concepts of association rules.

Support: the proportion of transactions in which an item (or itemset) appears.
X: the number of transactions in which the item appears
N: the total number of transactions

S(X) = X/N

Confidence: measures how reliable a rule X => Y is, i.e. the proportion of transactions containing X that also contain Y:

C(X => Y) = S(X ∪ Y) / S(X)
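As a made-up example: if {bread} appears in 40 of 100 transactions and {bread, butter} appears in 30 of them, then S(bread) = 40/100 = 0.4, S(bread, butter) = 30/100 = 0.3, and the rule bread => butter has confidence 0.3 / 0.4 = 0.75.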

So, the transaction data can come in one of these formats:

Single.
Taking the sales example, in this format each line represents a single product, so a transaction with several products spans several lines that all refer to the same transaction. Here is an example:
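(Made-up data for illustration; each line holds a transaction ID and one item.)

1001,bread
1001,butter
1002,milk
1002,bread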

Basket sparse sequential.

Each line represents one transaction as the list of its items, so the number of columns varies from row to row (a sparse format), instead of a CSV with the same columns in every row.
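A made-up example, with one comma-separated transaction per line:

bread,butter,milk
milk,eggs
bread,butter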

Basket.

Each line also represents a transaction, but with the same fixed columns (one per product), so with a large product catalogue this can be a nightmare if your machine does not have a lot of memory. This format is supported by SPSS (Clementine, now Modeler).
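A made-up example, with a fixed column per product and a flag marking whether it was bought:

bread,butter,milk,eggs
T,T,T,F
F,F,T,T
T,T,F,F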




Well, first we need to install these packages: "arules", "arulesViz" and "arulesSequences".
R reads the basket (sparse) and single formats; here I use the basket format.

install.packages("arules");
install.packages("arulesViz");
install.packages("arulesSequences");

We need to define the support and the confidence; you can edit these values in the file arules.r:

support1 = c(0.2) #it's a low support because 
                  #I want to see what happens
                  #at this level 
support2 = c(0.7)   # a higher support,
confidence = c(0.9) # confidence should usually be over 0.8

tr = read.transactions("transacciones.basket",
                       sep=',',
                       cols=c(1),
                       format="basket");
image(tr);
summary(tr);
The image plot is like a heatmap for displaying frequently bought products; if the list of products is too big, it is not very useful. On the other hand, summary() shows us an overview of the transactions.

itemFrequencyPlot(tr, supp=support1)

The command above produces the following item frequency plot:

And here we execute the Apriori algorithm with the transaction data (tr) and the parameters we defined before:

rules = apriori(tr, parameter= list(supp=support1, conf=confidence))
inspect(rules)

plot(rules, method="graph", control=list(type="items"))
plot(rules, method="grouped")
 
 
 

Sunday 27 August 2017

R vs Python for machine learning

Some really important differences to consider when choosing between R and Python:
  • Machine Learning has 2 phases. Model Building and Prediction phase. Typically, model building is performed as a batch process and predictions are done realtime. The model building process is a compute intensive process while the prediction happens in a jiffy. Therefore, performance of an algorithm in Python or R doesn't really affect the turn-around time of the user. Python 1, R 1.
  • Production: The real difference between Python and R comes in being production ready. Python, as such is a full fledged programming language and many organisations use it in their production systems. R is a statistical programming software favoured by many academia and due to the rise in data science and availability of libraries and being open source, the industry has started using R. Many of these organisations have their production systems either in Java, C++, C#, Python etc. So, ideally they would like to have the prediction system in the same language to reduce the latency and maintenance issues. Python 2, R 1.
  • Libraries: Both the languages have enormous and reliable libraries. R has over 5000 libraries catering to many domains while Python has some incredible packages like Pandas, NumPy, SciPy, Scikit Learn, Matplotlib. Python 3, R 2.
  • Development: Both languages are interpreted. Many say that Python is easy to learn, it's almost like reading English (to put it on a lighter note), but R requires more initial studying effort. Also, both of them have good IDEs (Spyder etc. for Python and RStudio for R). Python 4, R 2.
  • Speed: R software initially had problems with large computations (say, like nxn matrix multiplications). But, this issue is addressed with the introduction of R by Revolution Analytics. They have re-written computation intensive operations in C which is blazingly fast. Python being a high level language is relatively slow. Python 4, R 3.
  • Visualizations: In data science, we frequently tend to plot data to showcase patterns to users. Therefore, visualisations become an important criteria in choosing a software and R completely kills Python in this regard. Thanks to Hadley Wickham for an incredible ggplot2 package. R wins hands down. Python 4, R 4.
  • Dealing with Big Data: One of the constraints of R is it stores the data in system memory (RAM). So, RAM capacity becomes a constraint when you are handling Big Data. Python does well, but I would say, as both R and Python have HDFS connectors, leveraging Hadoop infrastructure would give substantial performance improvement. So, Python 5, R 5.
So, both the languages are equally good. Therefore, depending upon your domain and the place you work, you have to smartly choose the right language. The technology world usually prefers using a single language. Business users (marketing analytics, retail analytics) usually go with statistical programming languages like R, since they frequently do quick prototyping and build visualisations (which is faster done in R than Python).

https://datascience.stackexchange.com/questions/326/python-vs-r-for-machine-learning

Tuesday 22 August 2017

Top data scientists to follow

Andrew Ng

Andrew Ng is chief scientist at Baidu Research, associate professor at Stanford University, and founder and chairman of the board at Coursera. He is the founder of the Google Brain project, which developed large-scale artificial neural networks, one of which taught itself to recognize cats in videos. He specializes in deep learning and has published over 100 papers in machine learning and other fields. You can find links to many of them from the Andrew Ng web page.

Geoffrey Hinton

You must have heard the term ‘Back Propagation’? He is the brain (co-inventor) behind this algorithm for training neural nets and deep learning simulations. Moreover, Geoff coined the term ‘Dark Knowledge’. It's inspired by the idea that most of the knowledge is in the ratios of tiny probabilities that have virtually no influence on the cost function used for training or on the test performance. He is widely known for his work on artificial neural networks. In 2013, he joined Google and led its AI team. Geoff holds a PhD in Artificial Intelligence from Edinburgh. He and his research group have been the driving force behind the resurgence of neural networks and deep learning.

Dean Abbott

Dean Abbott is co-founder and chief data scientist at SmarterHQ, and founder and president of Abbott Analytics. He is a co-author of the IBM SPSS Modeler Cookbook, and the author of Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst. Follow his blog at http://abbottanalytics.blogspot.com.

Kenneth Cukier

Kenneth Cukier is the data editor for The Economist. He's a co-author of the book Big Data: A Revolution That Will Transform How We Live, Work, and Think, and is a popular speaker. Watch his fascinating TED talk, "Big data is better data."

Sebastian Thrun

Sebastian Thrun is CEO at Udacity and a research professor at Stanford University. He founded Google[x], where he led the Google driverless car project, and continues as an advisor. He is the recipient of numerous awards and prizes, a popular influencer on LinkedIn, and has a large following on Twitter. Thrun has authored and contributed to 11 books and hundreds of papers, and was named one of Popular Science's Brilliant Ten scientists of 2005.

Vincent Granville


Visionary data science executive with broad spectrum of domain expertise, technical knowledge, and proven success in bringing measurable added value to companies ranging from startups to fortune 100, across multiple industries (finance, Internet, media, IT, security), domains (data science, operations research, machine learning, computer science, business intelligence, statistics, applied mathematics, growth hacking, IoT) and roles (data scientist, founder, CFO, CEO, HR, product development, marketing, media buyer, operations, management consulting).

Harrison Kinsley


Founder of multiple businesses, all of which leverage the Python programming language. From using Flask web development on all of my business sites, to Scikit-Learn and Pandas for machine learning and data analysis with Ensmo.com, to the Natural Language Toolkit for natural language processing with Sentdex.com, to teaching a massive variety of Python programming topics on PythonProgramming.net, Python and programming is a major part of my life and work.

Karlijn Willems


Although Karlijn is a data science journalist rather than a data scientist, she deserves a place in this list. Karlijn Willems holds a degree in Literature and Linguistics (English and Spanish) and Information Management from KU Leuven. Before joining DataCamp as a data science journalist, she worked as a junior big data developer with Hadoop, Spark and Scala. Now, she writes for the DataCamp community, focusing on data science and data science education.

Martin Görner

He has extensive knowledge of deep learning and TensorFlow, and gives great, accessible explanations of how neural networks learn. Watch "TensorFlow and deep learning - without a PhD" by Martin Görner.

Dr. DJ Patil

Although not residing in India, no conversation about data science is complete without talking about the Indian-origin Dhanurjay “DJ” Patil. The first Indian to be appointed the Chief Data Scientist of the United States Office of Science and Technology Policy, DJ famously tweeted, saying “Data Science doesn’t care about what you majored in or even if you got a degree. It’s what you do with the data that matters”. To top it all, DJ Patil is credited with having coined the term “Data Science”.

Ravi Vijayaraghavan

A Silicon Valley veteran with experience heading digital marketing and analytics teams at [24]7 and Mu Sigma, he has worked across domains from scientific research to IT to marketing. A Big Data specialist, Mr. Vijayaraghavan received the Henry Ford Technology Award in 2014 – the highest technical recognition at Ford Motor Company. Mr. Vijayaraghavan has been instrumental in shaping Flipkart into the data-driven powerhouse that it is today.

Rohan Rao

Numbers, puzzles, and sudoku are Rohan’s fascination and perhaps that’s one of the reasons he is keen about Data Science too. With a Masters degree from IIT Bombay in applied statistics, Rohan started off his career with 64 Squares as a data scientist. Post his two-year stint there, he moved to Belong.co as a data scientist and later joined as the lead data scientist for AdWyze. He now works as a senior data scientist for Paytm.

Sudalai Rajkumar

Freshdesk’s lead data scientist, Sudalai Rajkumar takes care of building scalable machine learning and analytics systems for his organization. He ranks among Kaggle's top 25 data scientists and has been one of the top solvers on the CrowdAnalytix platform. To quote more of his achievements, he has published papers at reputed global conferences and even holds a patent. He has worked in the Research and Development team at Global Analytics and at Tiger Analytics, where he lent his expertise to solving numerous customer problems.

Friday 11 August 2017

Data Science Project Checklist

 
Before Starting The Data Science Project Checklist
The checklist to go through before starting the project is broken down into five sections.
  1. What question are you asking/answering and for whom?
  2. What data are you using?
  3. What techniques are you going to try?
  4. How will you evaluate your methods and results?
  5. What do you expect the result to be?
For each section, there will be additional questions that you should think about and answer before you get started with your data science project.

Refer to the full article for more detail.

Wednesday 5 July 2017

Librosa raises NoBackendError() in a Windows environment



Librosa raises a NoBackendError() in a Windows environment when reading a media file.
Librosa does not handle audio coding directly.  It relies on the audioread package to interface between different decoding libraries (pymad, gstreamer, ffmpeg, etc).

The exception that you're getting is coming from audioread because it can't find a back-end to handle mp3 encoding.  You should verify that one of the mp3-supporting back-ends is properly installed.
For more detail, follow https://groups.google.com/forum/#!topic/librosa/Z1-HXBsHLi0
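A minimal sketch to check that decoding works (the file name below is just a placeholder for one of your own media files):

import librosa

# If no mp3-capable back-end such as ffmpeg is available on the PATH,
# audioread raises NoBackendError at this point.
y, sr = librosa.load("example.mp3")
print(sr, y.shape)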

FFmpeg is the leading multimedia framework to decode, encode, transcode, mux, demux, stream, filter and play. FFmpeg supports a wide variety of video formats and can utilize hardware acceleration to minimize conversion time.


Download the FFmpeg program. When you visit the download page, you will see a variety of options. Download the latest static build in either 32-bit or 64-bit, depending on your operating system, and follow the instructions given at http://www.wikihow.com/Install-FFmpeg-on-Windows




Tuesday 4 July 2017

Visual C++ 2015 Build Tools required for librosa




pypi

The simplest way to install librosa is through the Python Package Index (PyPI). This will ensure that all required dependencies are fulfilled. This can be achieved by executing the following command:
pip install librosa
or:
sudo pip install librosa
to install system-wide, or:
pip install --user librosa
to install just for your own user.
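After installing, a quick sanity check (a minimal sketch) is to import the package and print its version:

import librosa

# If the install succeeded, this prints the installed version string.
print(librosa.__version__)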

conda

If you use conda/Anaconda environments, librosa can be installed from the conda-forge channel:
conda install -c conda-forge librosa 
 
This command was extremely helpful for resolving the issue of librosa asking for the "Visual C++ 2015 Build Tools".

Source

If you’ve downloaded the archive manually from the releases page, you can install using the setuptools script:
tar xzf librosa-VERSION.tar.gz
cd librosa-VERSION/
python setup.py install
Alternately, the latest development version can be installed via pip:
pip install git+https://github.com/librosa/librosa

ffmpeg

To fuel audioread with more audio-decoding power, you can install ffmpeg which ships with many audio decoders. Note that conda users on Linux and OSX will have this installed by default; Windows users must install ffmpeg separately.
OSX users can use homebrew to install ffmpeg by calling brew install ffmpeg or get a binary version from their website https://www.ffmpeg.org.

Originally posted on librosa.github.io

Monday 3 July 2017

TextBlob Sentiment: Calculating Polarity and Subjectivity


The TextBlob package for Python is a convenient way to do a lot of Natural Language Processing (NLP) tasks. For example:
 
from textblob import TextBlob
TextBlob("not a very great calculation").sentiment
## Sentiment(polarity=-0.3076923076923077, subjectivity=0.5769230769230769)

This tells us that the English phrase “not a very great calculation” has a polarity of about -0.3, meaning it is slightly negative, and a subjectivity of about 0.6, meaning it is fairly subjective.
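The two scores can also be read separately from the returned Sentiment namedtuple, for example:

from textblob import TextBlob

blob = TextBlob("not a very great calculation")
print(blob.sentiment.polarity)      # roughly -0.31
print(blob.sentiment.subjectivity)  # roughly 0.58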
But where do these numbers come from?

Please read the original post here; it is very informative:
http://planspace.org/20150607-textblob_sentiment/

Sunday 2 July 2017

Pandas vs SQL


Comparison with SQL

Since many potential pandas users have some familiarity with SQL, this page is meant to provide some examples of how various SQL operations would be performed using pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the library.
As is customary, we import pandas and numpy as follows:
In [1]: import pandas as pd

In [2]: import numpy as np
Most of the examples will utilize the tips dataset found within pandas tests. We’ll read the data into a DataFrame called tips and assume we have a database table of the same name and structure.
In [3]: url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'

In [4]: tips = pd.read_csv(url)

In [5]: tips.head()
Out[5]: 
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

SELECT

In SQL, selection is done using a comma-separated list of columns you’d like to select (or a * to select all columns):
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
With pandas, column selection is done by passing a list of column names to your DataFrame:
In [6]: tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
Out[6]: 
   total_bill   tip smoker    time
0       16.99  1.01     No  Dinner
1       10.34  1.66     No  Dinner
2       21.01  3.50     No  Dinner
3       23.68  3.31     No  Dinner
4       24.59  3.61     No  Dinner
Calling the DataFrame without the list of column names would display all columns (akin to SQL’s *).

WHERE

Filtering in SQL is done via a WHERE clause.
SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.
In [7]: tips[tips['time'] == 'Dinner'].head(5)
Out[7]: 
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows with True.
In [8]: is_dinner = tips['time'] == 'Dinner'

In [9]: is_dinner.value_counts()
Out[9]: 
True     176
False     68
Name: time, dtype: int64

In [10]: tips[is_dinner].head(5)
Out[10]: 
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).
-- tips of more than $5.00 at Dinner meals
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
# tips of more than $5.00 at Dinner meals
In [11]: tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
Out[11]: 
     total_bill    tip     sex smoker  day    time  size
23        39.42   7.58    Male     No  Sat  Dinner     4
44        30.40   5.60    Male     No  Sun  Dinner     4
47        32.40   6.00    Male     No  Sun  Dinner     4
52        34.81   5.20  Female     No  Sun  Dinner     4
59        48.27   6.73    Male     No  Sat  Dinner     4
116       29.93   5.07    Male     No  Sun  Dinner     4
155       29.85   5.14  Female     No  Sun  Dinner     5
170       50.81  10.00    Male    Yes  Sat  Dinner     3
172        7.25   5.15    Male    Yes  Sun  Dinner     2
181       23.33   5.65    Male    Yes  Sun  Dinner     2
183       23.17   6.50    Male    Yes  Sun  Dinner     4
211       25.89   5.16    Male    Yes  Sat  Dinner     4
212       48.33   9.00    Male     No  Sat  Dinner     4
214       28.17   6.50  Female    Yes  Sat  Dinner     3
239       29.03   5.92    Male     No  Sat  Dinner     3
-- tips by parties of at least 5 diners OR bill total was more than $45
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
# tips by parties of at least 5 diners OR bill total was more than $45
In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
Out[12]: 
     total_bill    tip     sex smoker   day    time  size
59        48.27   6.73    Male     No   Sat  Dinner     4
125       29.80   4.20  Female     No  Thur   Lunch     6
141       34.30   6.70    Male     No  Thur   Lunch     6
142       41.19   5.00    Male     No  Thur   Lunch     5
143       27.05   5.00  Female     No  Thur   Lunch     6
155       29.85   5.14  Female     No   Sun  Dinner     5
156       48.17   5.00    Male     No   Sun  Dinner     6
170       50.81  10.00    Male    Yes   Sat  Dinner     3
182       45.35   3.50    Male    Yes   Sun  Dinner     3
185       20.69   5.00    Male     No   Sun  Dinner     5
187       30.46   2.00    Male    Yes   Sun  Dinner     5
212       48.33   9.00    Male     No   Sat  Dinner     4
216       28.15   3.00    Male    Yes   Sat  Dinner     5
NULL checking is done using the notnull() and isnull() methods.
In [13]: frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
   ....:                       'col2': ['F', np.NaN, 'G', 'H', 'I']})
   ....: 

In [14]: frame
Out[14]: 
  col1 col2
0    A    F
1    B  NaN
2  NaN    G
3    C    H
4    D    I
Assume we have a table of the same structure as our DataFrame above. We can see only the records where col2 IS NULL with the following query:
SELECT *
FROM frame
WHERE col2 IS NULL;
In [15]: frame[frame['col2'].isnull()]
Out[15]: 
  col1 col2
1    B  NaN
Getting items where col1 IS NOT NULL can be done with notnull().
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [16]: frame[frame['col1'].notnull()]
Out[16]: 
  col1 col2
0    A    F
1    B  NaN
3    C    H
4    D    I

GROUP BY

In pandas, SQL’s GROUP BY operations are performed using the similarly named groupby() method. groupby() typically refers to a process where we’d like to split a dataset into groups, apply some function (typically aggregation) , and then combine the groups together.
A common SQL operation would be getting the count of records in each group throughout a dataset. For instance, a query getting us the number of tips left by sex:
SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female     87
Male      157
*/
The pandas equivalent would be:
In [17]: tips.groupby('sex').size()
Out[17]: 
sex
Female     87
Male      157
dtype: int64
Notice that in the pandas code we used size() and not count(). This is because count() applies the function to each column, returning the number of not null records within each.
In [18]: tips.groupby('sex').count()
Out[18]: 
        total_bill  tip  smoker  day  time  size
sex                                             
Female          87   87      87   87    87    87
Male           157  157     157  157   157   157
Alternatively, we could have applied the count() method to an individual column:
In [19]: tips.groupby('sex')['total_bill'].count()
Out[19]: 
sex
Female     87
Male      157
Name: total_bill, dtype: int64
Multiple functions can also be applied at once. For instance, say we’d like to see how tip amount differs by day of the week - agg() allows you to pass a dictionary to your grouped DataFrame, indicating which functions to apply to specific columns.
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri   2.734737   19
Sat   2.993103   87
Sun   3.255132   76
Thur  2.771452   62
*/
In [20]: tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
Out[20]: 
           tip  day
day                
Fri   2.734737   19
Sat   2.993103   87
Sun   3.255132   76
Thur  2.771452   62
Grouping by more than one column is done by passing a list of columns to the groupby() method.
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No     Fri      4  2.812500
       Sat     45  3.102889
       Sun     57  3.167895
       Thur    45  2.673778
Yes    Fri     15  2.714000
       Sat     42  2.875476
       Sun     19  3.516842
       Thur    17  3.030000
*/
In [21]: tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
Out[21]: 
              tip          
             size      mean
smoker day                 
No     Fri    4.0  2.812500
       Sat   45.0  3.102889
       Sun   57.0  3.167895
       Thur  45.0  2.673778
Yes    Fri   15.0  2.714000
       Sat   42.0  2.875476
       Sun   19.0  3.516842
       Thur  17.0  3.030000

JOIN

JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their indices. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indices).
In [22]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
   ....:                     'value': np.random.randn(4)})
   ....: 

In [23]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
   ....:                     'value': np.random.randn(4)})
   ....: 
Assume we have two database tables of the same name and structure as our DataFrames.
Now let’s go over the various types of JOINs.

INNER JOIN

SELECT *
FROM df1
INNER JOIN df2
  ON df1.key = df2.key;
# merge performs an INNER JOIN by default
In [24]: pd.merge(df1, df2, on='key')
Out[24]: 
  key   value_x   value_y
0   B -0.318214  0.543581
1   D  2.169960 -0.426067
2   D  2.169960  1.138079
merge() also offers parameters for cases when you’d like to join one DataFrame’s column with another DataFrame’s index.
In [25]: indexed_df2 = df2.set_index('key')

In [26]: pd.merge(df1, indexed_df2, left_on='key', right_index=True)
Out[26]: 
  key   value_x   value_y
1   B -0.318214  0.543581
3   D  2.169960 -0.426067
3   D  2.169960  1.138079

LEFT OUTER JOIN

-- show all records from df1
SELECT *
FROM df1
LEFT OUTER JOIN df2
  ON df1.key = df2.key;
# show all records from df1
In [27]: pd.merge(df1, df2, on='key', how='left')
Out[27]: 
  key   value_x   value_y
0   A  0.116174       NaN
1   B -0.318214  0.543581
2   C  0.285261       NaN
3   D  2.169960 -0.426067
4   D  2.169960  1.138079

RIGHT JOIN

-- show all records from df2
SELECT *
FROM df1
RIGHT OUTER JOIN df2
  ON df1.key = df2.key;
# show all records from df2
In [28]: pd.merge(df1, df2, on='key', how='right')
Out[28]: 
  key   value_x   value_y
0   B -0.318214  0.543581
1   D  2.169960 -0.426067
2   D  2.169960  1.138079
3   E       NaN  0.086073

FULL JOIN

pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined columns find a match. As of writing, FULL JOINs are not supported in all RDBMS (MySQL).
-- show all records from both tables
SELECT *
FROM df1
FULL OUTER JOIN df2
  ON df1.key = df2.key;
# show all records from both frames
In [29]: pd.merge(df1, df2, on='key', how='outer')
Out[29]: 
  key   value_x   value_y
0   A  0.116174       NaN
1   B -0.318214  0.543581
2   C  0.285261       NaN
3   D  2.169960 -0.426067
4   D  2.169960  1.138079
5   E       NaN  0.086073

UNION

UNION ALL can be performed using concat().
In [30]: df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
   ....:                     'rank': range(1, 4)})
   ....: 

In [31]: df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
   ....:                     'rank': [1, 4, 5]})
   ....: 
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
      Chicago     1
       Boston     4
  Los Angeles     5
*/
In [32]: pd.concat([df1, df2])
Out[32]: 
            city  rank
0        Chicago     1
1  San Francisco     2
2  New York City     3
0        Chicago     1
1         Boston     4
2    Los Angeles     5
SQL’s UNION is similar to UNION ALL, however UNION will remove duplicate rows.
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
       Boston     4
  Los Angeles     5
*/
In pandas, you can use concat() in conjunction with drop_duplicates().
In [33]: pd.concat([df1, df2]).drop_duplicates()
Out[33]: 
            city  rank
0        Chicago     1
1  San Francisco     2
2  New York City     3
1         Boston     4
2    Los Angeles     5

Pandas equivalents for some SQL analytic and aggregate functions

Top N rows with offset

-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
In [34]: tips.nlargest(10+5, columns='tip').tail(10)
Out[34]: 
     total_bill   tip     sex smoker   day    time  size
183       23.17  6.50    Male    Yes   Sun  Dinner     4
214       28.17  6.50  Female    Yes   Sat  Dinner     3
47        32.40  6.00    Male     No   Sun  Dinner     4
239       29.03  5.92    Male     No   Sat  Dinner     3
88        24.71  5.85    Male     No  Thur   Lunch     2
181       23.33  5.65    Male    Yes   Sun  Dinner     2
44        30.40  5.60    Male     No   Sun  Dinner     4
52        34.81  5.20  Female     No   Sun  Dinner     4
85        34.83  5.17  Female     No  Thur   Lunch     4
211       25.89  5.16    Male    Yes   Sat  Dinner     4

Top N rows per group

-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
  FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
In [35]: (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)
   ....:                     .groupby(['day'])
   ....:                     .cumcount() + 1)
   ....:      .query('rn < 3')
   ....:      .sort_values(['day','rn'])
   ....: )
   ....: 
Out[35]: 
     total_bill    tip     sex smoker   day    time  size  rn
95        40.17   4.73    Male    Yes   Fri  Dinner     4   1
90        28.97   3.00    Male    Yes   Fri  Dinner     2   2
170       50.81  10.00    Male    Yes   Sat  Dinner     3   1
212       48.33   9.00    Male     No   Sat  Dinner     4   2
156       48.17   5.00    Male     No   Sun  Dinner     6   1
182       45.35   3.50    Male    Yes   Sun  Dinner     3   2
197       43.11   5.00  Female    Yes  Thur   Lunch     4   1
142       41.19   5.00    Male     No  Thur   Lunch     5   2
the same using rank(method=’first’) function
In [36]: (tips.assign(rnk=tips.groupby(['day'])['total_bill']
   ....:                      .rank(method='first', ascending=False))
   ....:      .query('rnk < 3')
   ....:      .sort_values(['day','rnk'])
   ....: )
   ....: 
Out[36]: 
     total_bill    tip     sex smoker   day    time  size  rnk
95        40.17   4.73    Male    Yes   Fri  Dinner     4  1.0
90        28.97   3.00    Male    Yes   Fri  Dinner     2  2.0
170       50.81  10.00    Male    Yes   Sat  Dinner     3  1.0
212       48.33   9.00    Male     No   Sat  Dinner     4  2.0
156       48.17   5.00    Male     No   Sun  Dinner     6  1.0
182       45.35   3.50    Male    Yes   Sun  Dinner     3  2.0
197       43.11   5.00  Female    Yes  Thur   Lunch     4  1.0
142       41.19   5.00    Male     No  Thur   Lunch     5  2.0
-- Oracle's RANK() analytic function
SELECT * FROM (
  SELECT
    t.*,
    RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
  FROM tips t
  WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;
Let’s find tips with (rank < 3) per gender group for (tips < 2). Notice that when using rank(method='min') function rnk_min remains the same for the same tip (as Oracle’s RANK() function)
In [37]: (tips[tips['tip'] < 2]
   ....:      .assign(rnk_min=tips.groupby(['sex'])['tip']
   ....:                          .rank(method='min'))
   ....:      .query('rnk_min < 3')
   ....:      .sort_values(['sex','rnk_min'])
   ....: )
   ....: 
Out[37]: 
     total_bill   tip     sex smoker  day    time  size  rnk_min
67         3.07  1.00  Female    Yes  Sat  Dinner     1      1.0
92         5.75  1.00  Female    Yes  Fri  Dinner     2      1.0
111        7.25  1.00  Female     No  Sat  Dinner     1      1.0
236       12.60  1.00    Male    Yes  Sat  Dinner     2      1.0
237       32.83  1.17    Male    Yes  Sat  Dinner     2      2.0

UPDATE

UPDATE tips
SET tip = tip*2
WHERE tip < 2;
In [38]: tips.loc[tips['tip'] < 2, 'tip'] *= 2

DELETE

DELETE FROM tips
WHERE tip > 9;
In pandas we select the rows that should remain, instead of deleting them
In [39]: tips = tips.loc[tips['tip'] <= 9] 
 
https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html#compare-with-sql-join
 
