I will do data analysis and data visualization using Python and Jupyter Notebook

Python is a popular programming language for data analysis and data visualization. Jupyter Notebook is an open-source web application that supports many languages, including Python and R, and it’s well suited for data analysis and visualization.

Get data analysis and data visualization using python, jupyter notebook

pandas is a primary library for data manipulation and analysis in Python.

Bokeh is another Python library that can be used for data visualization. It provides interactive visualization capabilities that the default Matplotlib-based pandas plots don’t offer.

Whether you’re just getting to know a dataset or preparing to publish your findings, visualization is an essential tool. Python’s popular data analysis library, pandas, provides several different options for visualizing your data with .plot(). Even if you’re at the beginning of your pandas journey, you’ll soon be creating basic plots that will yield valuable insights into your data.

In this tutorial, you’ll learn:

What the different types of pandas plots are and when to use them
How to get an overview of your dataset with a histogram
How to discover correlation with a scatter plot
How to analyze different categories and their ratios

Set Up Your Environment

You can best follow along with the code in this tutorial in a Jupyter Notebook. This way, you’ll immediately see your plots and be able to play around with them.

You’ll also need a working Python environment including pandas. If you don’t have one yet, then you have several options:

If you have more ambitious plans, then download the Anaconda distribution. It’s huge (around 500 MB), but you’ll be equipped for most data science work.
If you prefer a minimalist setup, then check out the section on installing Miniconda in Setting Up Python for Machine Learning on Windows.
If you want to stick to pip, then install the libraries discussed in this tutorial with pip install pandas matplotlib. You can also grab JupyterLab, which runs Jupyter notebooks, with pip install jupyterlab.
If you don’t want to do any setup, then follow along in an online Jupyter Notebook trial.

Once your environment is set up, you’re ready to download a dataset. In this tutorial, you’re going to analyze data on college majors sourced from the American Community Survey 2010–2012 Public Use Microdata Sample. It served as the basis for the Economic Guide To Picking A College Major featured on the website FiveThirtyEight.

First, download the data by passing the download URL to pandas.read_csv():

In [1]: import pandas as pd

In [2]: download_url = (
   ...:     "https://raw.githubusercontent.com/fivethirtyeight/"
   ...:     "data/master/college-majors/recent-grads.csv"
   ...: )

In [3]: df = pd.read_csv(download_url)

In [4]: type(df)
Out[4]: pandas.core.frame.DataFrame

By calling read_csv(), you create a DataFrame, which is the main data structure used in pandas.
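
If you’d like to try read_csv() without a network connection, here’s a minimal sketch that reads from an in-memory string instead of the download URL. The column names mirror the ones used in this tutorial, but the two rows and the sample_df name are made up purely for illustration:

```python
import io

import pandas as pd

# Hypothetical two-row sample. The values are invented and are NOT taken
# from the real recent-grads.csv file.
csv_text = """Rank,Major,P25th,Median,P75th
1,EXAMPLE MAJOR A,90000,100000,110000
2,EXAMPLE MAJOR B,50000,70000,95000
"""

# read_csv() accepts a file-like object as well as a URL or a file path.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))
print(sample_df.shape)
print(list(sample_df.columns))
```

Either way, the result is a regular DataFrame, so everything that follows works the same on the sample as on the downloaded data.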

Note: You can follow along with this tutorial even if you aren’t familiar with DataFrames. But if you’re interested in learning more about working with pandas and DataFrames, then you can check out Using Pandas and Python to Explore Your Dataset and The Pandas DataFrame: Make Working With Data Delightful.

Now that you have a DataFrame, you can take a look at the data. First, you should configure the display.max_columns option to make sure pandas doesn’t hide any columns. Then you can view the first few rows of data with .head():

In [5]: pd.set_option("display.max_columns", None)

In [6]: df.head()

You’ve just displayed the first five rows of the DataFrame df using .head(). Your output should look like this:

The output of df.head()

The default number of rows displayed by .head() is five, but you can specify any number of rows as an argument. For example, to display the first ten rows, you would use df.head(10).
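
Here’s a small sketch of how the argument to .head() changes the result. It uses a hypothetical twelve-row DataFrame with invented values rather than the college majors data, so you can run it on its own:

```python
import pandas as pd

# Hypothetical 12-row frame standing in for the real dataset.
sample_df = pd.DataFrame(
    {
        "Rank": range(1, 13),
        "Median": range(120000, 0, -10000),
    }
)

default_rows = sample_df.head()        # no argument: the first five rows
ten_rows = sample_df.head(10)          # the first ten rows
all_but_last_two = sample_df.head(-2)  # negative n: everything except the last two rows

print(len(default_rows), len(ten_rows), len(all_but_last_two))
```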

Create Your First Pandas Plot

Your dataset contains some columns related to the earnings of graduates in each major:

"Median" is the median earnings of full-time, year-round workers.
"P25th" is the 25th percentile of earnings.
"P75th" is the 75th percentile of earnings.
"Rank" is the major’s rank by median earnings.

Let’s start with a plot displaying these columns. First, you need to set up your Jupyter Notebook to display plots with the %matplotlib magic command:

In [7]: %matplotlib
Using matplotlib backend: MacOSX

The %matplotlib magic command sets up your Jupyter Notebook for displaying plots with Matplotlib. The standard Matplotlib graphics backend is used by default, and your plots will be displayed in a separate window.

Note: You can change the Matplotlib backend by passing an argument to the %matplotlib magic command.

For example, the inline backend is popular for Jupyter Notebooks because it displays the plot in the notebook itself, immediately below the cell that creates the plot:

In [7]: %matplotlib inline

There are a number of other backends available. For more information, check out the Rich Outputs tutorial in the IPython documentation.
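
The %matplotlib magic command only exists in IPython and Jupyter. In a plain Python script, you can choose a backend with matplotlib.use() instead. The following is a minimal sketch, assuming you want the non-interactive Agg backend, which renders to files or memory and needs no display. The tiny DataFrame and its values are hypothetical:

```python
import io

import matplotlib

matplotlib.use("Agg")  # select the backend before any figure is created

import pandas as pd

# Hypothetical data, only here so that there is something to draw.
sample_df = pd.DataFrame({"Rank": [1, 2, 3], "Median": [110000, 75000, 73000]})

ax = sample_df.plot(x="Rank", y="Median")

# With a non-interactive backend, write the figure out instead of showing it.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")

print(matplotlib.get_backend())
print(len(buffer.getvalue()), "bytes of PNG data")
```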

Now you’re ready to make your first plot! You can do so with .plot():

In [8]: df.plot(x="Rank", y=["P25th", "Median", "P75th"])
Out[8]: <AxesSubplot:xlabel='Rank'>

.plot() draws a line graph containing data from every row in the DataFrame and returns the Matplotlib Axes object that holds it. The x-axis values represent the rank of each major, and the "P25th", "Median", and "P75th" values are plotted on the y-axis.
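
Because the call hands back a Matplotlib Axes object, you can keep a reference to it and adjust the figure afterward. Here’s a minimal sketch using a hypothetical four-row DataFrame with invented earnings figures. The Agg backend is selected only so that the example runs without a display:

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run, no plot window needed

import pandas as pd

# Hypothetical values with the same column names as the tutorial's dataset.
sample_df = pd.DataFrame(
    {
        "Rank": [1, 2, 3, 4],
        "P25th": [95000, 55000, 50000, 40000],
        "Median": [110000, 75000, 73000, 70000],
        "P75th": [125000, 90000, 105000, 80000],
    }
)

ax = sample_df.plot(x="Rank", y=["P25th", "Median", "P75th"])

# The returned Axes object exposes the usual Matplotlib customization methods.
ax.set_ylabel("Earnings (USD)")
ax.set_title("Earnings by major rank")

print(type(ax).__name__)
print(len(ax.get_lines()), "lines drawn")
```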

Note: If you aren’t following along in a Jupyter Notebook or in an IPython shell, then you’ll need to use the pyplot interface from matplotlib to display the plot.

Here’s how to show the figure in a standard Python shell:

>>> import matplotlib.pyplot as plt
>>> df.plot(x="Rank", y=["P25th", "Median", "P75th"])
>>> plt.show()

Notice that you must first import the pyplot module from Matplotlib before calling plt.show() to display the plot.

The figure produced by .plot() is displayed in a separate window by default and looks like this:

line plot with P25, median, P75 earnings

Looking at the plot, you can make the following observations:

The median income decreases as the rank number increases, that is, as you move down the ranking. This is expected because the rank is determined by the median income.

Some majors have large gaps between the 25th and 75th percentiles. People with these degrees may earn significantly less or significantly more than the median income.

Other majors have very small gaps between the 25th and 75th percentiles. People with these degrees earn salaries very close to the median income.

Your first plot already hints that there’s a lot more to discover in the data! Some majors have a wide range of earnings, and others have a rather narrow range. To discover these differences, you’ll use several other types of plots.
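
One way to put a number on those gaps is to subtract the 25th percentile from the 75th percentile for each major. The sketch below does this on a hypothetical three-row DataFrame. The "Spread" column name and all values are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical majors with invented earnings figures.
sample_df = pd.DataFrame(
    {
        "Major": ["EXAMPLE A", "EXAMPLE B", "EXAMPLE C"],
        "P25th": [95000, 50000, 40000],
        "Median": [110000, 73000, 45000],
        "P75th": [125000, 105000, 48000],
    }
)

# Gap between the 75th and 25th percentiles (the interquartile range).
sample_df["Spread"] = sample_df["P75th"] - sample_df["P25th"]

# Majors with the widest earnings range come first.
widest = sample_df.sort_values("Spread", ascending=False)

print(widest[["Major", "Spread"]])
```

A large spread corresponds to the majors with big gaps in the plot, and a small spread to the ones whose earnings sit close to the median.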

Note: For an introduction to medians, percentiles, and other statistics, check out Python Statistics Fundamentals: How to Describe Your Data.

.plot() has several optional parameters. Most notably, the kind parameter accepts eleven different string values and determines which kind of plot you’ll create:

"area" is for area plots.
"bar" is for vertical bar charts.
"barh" is for horizontal bar charts.
"box" is for box plots.
"hexbin" is for hexbin plots.
"hist" is for histograms.
"kde" is for kernel density estimate charts.
"density" is an alias for "kde".
"line" is for line graphs.
"pie" is for pie charts.
"scatter" is for scatter plots.

The default value is "line". Line graphs, like the one you created above, provide a good overview of your data. You can use them to detect general trends. They rarely provide sophisticated insight, but they can give you clues as to where to zoom in.
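
To see the kind parameter in action, here’s a short sketch that draws the same hypothetical data as a vertical bar chart, a horizontal bar chart, and a histogram. The data is invented, and the Agg backend is an assumption that keeps the example runnable without a display:

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run

import pandas as pd

# Hypothetical majors and median earnings.
sample_df = pd.DataFrame(
    {
        "Major": ["EXAMPLE A", "EXAMPLE B", "EXAMPLE C", "EXAMPLE D"],
        "Median": [110000, 73000, 45000, 60000],
    }
)

# Same data, three different values for kind. Each call creates a new figure.
ax_bar = sample_df.plot(x="Major", y="Median", kind="bar")
ax_barh = sample_df.plot(x="Major", y="Median", kind="barh")
ax_hist = sample_df.plot(y="Median", kind="hist", bins=3)

print(len(ax_bar.patches), "bars")
print(len(ax_barh.patches), "horizontal bars")
print(len(ax_hist.patches), "histogram bins")
```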

If you don’t provide a parameter to .plot(), then it creates a line plot with the index on the x-axis and all the numeric columns on the y-axis. While this is a useful default for datasets with only a few columns, for the college majors dataset and its several numeric columns, it looks like quite a mess.
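
Here’s a minimal sketch of that default behavior on a hypothetical DataFrame with one text column and three numeric columns. Only the numeric columns end up as lines, and the index supplies the x-values:

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run

import pandas as pd

# Hypothetical data: "Major" is text, the other columns are numeric.
sample_df = pd.DataFrame(
    {
        "Major": ["EXAMPLE A", "EXAMPLE B", "EXAMPLE C"],
        "Rank": [1, 2, 3],
        "Median": [110000, 73000, 45000],
        "Total": [2000, 15000, 8000],
    }
)

# No arguments: one line per numeric column, plotted against the index.
ax = sample_df.plot()

labels = [line.get_label() for line in ax.get_lines()]
print(labels)
```

With columns on very different scales, like "Rank" and "Median" here, the smaller values flatten against the x-axis, which is part of why the default plot gets messy.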

Note: As an alternative to passing strings to the kind parameter of .plot(), the .plot accessor of a DataFrame has several methods that you can use to create the various kinds of plots described above. You call them as df.plot.area(), df.plot.bar(), and so on:

.area()
.bar()
.barh()
.box()
.hexbin()
.hist()
.kde()
.density()
.line()
.pie()
.scatter()

In this tutorial, you’ll use the .plot() interface and pass strings to the kind parameter. You’re encouraged to try out the methods mentioned above as well.
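
Both spellings produce the same plot. The sketch below compares them on a hypothetical DataFrame, assuming the methods are reached through the .plot accessor as in current pandas releases. The data is invented, and the Agg backend keeps the example runnable without a display:

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run

import pandas as pd

# Hypothetical ranks and median earnings.
sample_df = pd.DataFrame(
    {
        "Rank": [1, 2, 3],
        "Median": [110000, 73000, 45000],
    }
)

# Passing a string to kind ...
ax_kind = sample_df.plot(kind="scatter", x="Rank", y="Median")

# ... draws the same points as the matching accessor method.
ax_method = sample_df.plot.scatter(x="Rank", y="Median")

points_kind = ax_kind.collections[0].get_offsets()
points_method = ax_method.collections[0].get_offsets()
print((points_kind == points_method).all())

# The method lives on the .plot accessor, not on the DataFrame itself.
print(hasattr(sample_df.plot, "scatter"), hasattr(sample_df, "scatter"))
```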

BASIC : $50

I will do simple data analysis for a single dataset.

STANDARD : $110

I will do standard data analysis and visualization.

PREMIUM : $220

I will do advanced data analysis and visualization.