Drawing scatterplots

The incenp.plotting.scatterplot module provides a scatterplot function to facilitate the creation of scatter plots.

Note that what I call a ”scatter plot” here may not be the most common acceptation of the term. I do not mean the 2-dimensional plotting of two variables (one on the x-axis, the other on the y-axis). Rather, I mean the plotting of a single variable on the y-axis, akin to a bar chart, but with all data points depicted as scattered dots.

_images/scatterplot1.png

A sample scatter plot.

The figure above is a sample “scatter plot”. The orange boxes are not part of the plot, but have been added to illustrate what are tracks and subtracks in the context of the incenp.plotting.scatterplot module.

Sample data

The module is intended to work with indexed DataFrame objects (including multi-indexed DataFrame). Let’s create such an object, which we will use throughout this page:

index = pd.MultiIndex.from_arrays([
        ['foo'] * 40 + ['bar'] * 40 + ['baz'] * 40 + ['qux'] * 40,
        ['one', 'two'] * 80
    ],
    names=['first', 'second']
    )
df = pd.DataFrame(np.random.randn(160,4), index = index,
    columns=['A', 'B', 'C', 'D'])

This creates a DataFrame with 4 columns (A to D) and 160 rows, indexed in two levels (level first, with 4 distinct values foo, bar, baz, and qux; and level second, with 2 distinct values one and two).

Quick start

As an initial example, here is the call to scatterplot to draw the graph above (ax is supposed to be a matplotlib.axes.Axes object):

scatterplot(ax, df, columns='A',
            tracks=['foo', 'bar', 'baz'], trackname='first',
            subtracks=['one', 'two'], subtrackname='second')
ax.legend(['one', 'two'])

The columns parameter indicates that the values to be plotted comes from the column named A.

The tracks parameter gives the index values used to distribute the values of column A into three different tracks (one track for rows with index value foo, one track for rows with index value bar, and so on); the associated trackname parameter indicates which index level to use to lookup the values specified in the previous parameter, if df is a multi-indexed DataFrame.

The subtracks and subtrackname parameters are similar to the tracks and trackname parameter above, but for subtracks instead of tracks. Here, they are used to say that values from rows with index value one are to be plotted on one subtrack, while values from rows with index value two are to be plotted on another subtrack.

Playing with tracks, subtracks, columns

The following code will plot the same values as above, but will invert the tracks and the subtracks: the second-level index (second) will be used to distribute values along tracks while the first-level index (first) will be used to distribute values along subtracks:

scatterplot(ax, df, columns='A',
            tracks=['one', 'two'], trackname='second',
            subtracks=['foo', 'bar', 'baz'], subtrackname='first')
ax.legend(['foo', 'bar', 'baz'])
_images/scatterplot2.png

A scatterplot with inverted tracks and subtracks.

Values from several columns in the source DataFrame can be plotted at once, by giving a list of column names (instead of a single name) to the columns parameter. By default, values from each column are plotted in a different track. In the following examples, values from the columns A, B, and C are plotted; the first-level index is used to distribute values along three different subtracks; the second-level index is used to filter the DataFrame prior to plotting so that only rows with the index value one are plotted.

scatterplot(ax, df.xs('one', level='second'),
            columns=['A', 'B', 'C'],
            subtracks=['foo', 'baz', 'qux'], subtrackname='first')
 ax.legend(['foo', 'baz', 'qux'])
_images/scatterplot3.png

A scatterplot with values from several columns of the source DataFrame.

To plot values from several columns as different subtracks rather than different tracks, use the subtrackcolumns parameter as in the example below. The tracks and trackname parameters may then be used to define what goes into the tracks.

scatterplot(ax, df.xs('one', level='second'),
            columns=['A', 'B', 'C'], subtrackcolumns=True,
            tracks=['foo', 'baz', 'qux'], trackname='first')
ax.legend(['A', 'B', 'C'])
_images/scatterplot4.png

A scatterplot with values from several columns of the source DataFrame, plotted as separate subtracks.

Miscellaneous features

When plotting two subtracks, the testfunc parameter may be used to have the scatterplot function draws the result of a statistical test comparing the values from each subtrack in each track.

The value of the testfunc parameter should be a function accepting two DataSeries and returning a P-value, such as a the following wrapper around Scipy’s mannwhitneyu function:

from scipy.stats import mannwhitneyu

def do_mannwhitney(a, b):
    result = mannwhitneyu(a, b)
    return result.pvalue

Below is an example of using such a wrapper, with the resulting plot:

scatterplot(ax, df, columns='B',
            tracks=['foo', 'baz', 'qux'], trackname='first',
            subtracks=['one', 'two'], subtrackname='second',
            testfunc=do_mannwhitney,
            colors='cm')
ax.legend(['one', 'two'])
_images/scatterplot5.png

A scatterplot with results of statistical tests between subtracks.

The example above also shows the colors parameter, used to change the colors for the different subtracks. It can either be a string containing one-letter color codes, or a list of Matplotlib colors. The string or the list must be at least as long as the number of subtracks to plot.