********************
Drawing scatterplots
********************

The ``incenp.plotting.scatterplot`` module provides a ``scatterplot``
function to facilitate the creation of scatter plots.

Note that what I call a “scatter plot” here may not match the most
common meaning of the term. I do *not* mean the 2-dimensional plotting
of two variables (one on the x-axis, the other on the y-axis). Rather, I
mean the plotting of a single variable on the y-axis, akin to a bar
chart, but with all data points depicted as scattered dots.

.. figure:: scatterplot1.png

   A sample scatter plot.

The figure above is a sample “scatter plot”. The orange boxes are not
part of the plot; they have been added to illustrate what *tracks* and
*subtracks* are in the context of the ``incenp.plotting.scatterplot``
module.


Sample data
===========

The module is intended to work with indexed pandas `DataFrame` objects
(including multi-indexed ones). Let’s create such an object, which we
will use throughout this page:

.. code-block:: python

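   import numpy as np
   import pandas as pd

   # Two-level index: 4 values on the first level, 2 on the second
   index = pd.MultiIndex.from_arrays([
           ['foo'] * 40 + ['bar'] * 40 + ['baz'] * 40 + ['qux'] * 40,
           ['one', 'two'] * 80
       ],
       names=['first', 'second']
       )
   # 160 rows of random values in 4 columns
   df = pd.DataFrame(np.random.randn(160, 4), index=index,
       columns=['A', 'B', 'C', 'D'])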

This creates a `DataFrame` with 4 columns (``A`` to ``D``) and 160
rows, indexed in two levels (level ``first``, with 4 distinct values
``foo``, ``bar``, ``baz``, and ``qux``; and level ``second``, with 2
distinct values ``one`` and ``two``).
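
The layout described above can be checked directly with pandas. The
snippet below is only a sanity check: it rebuilds the same sample data
so that it can be run on its own, and it does not use anything from the
``scatterplot`` module.

.. code-block:: python

   import numpy as np
   import pandas as pd

   index = pd.MultiIndex.from_arrays(
       [
           ['foo'] * 40 + ['bar'] * 40 + ['baz'] * 40 + ['qux'] * 40,
           ['one', 'two'] * 80,
       ],
       names=['first', 'second'],
   )
   df = pd.DataFrame(np.random.randn(160, 4), index=index,
                     columns=['A', 'B', 'C', 'D'])

   # Number of rows and columns
   shape = df.shape
   # Names of the two index levels
   level_names = list(df.index.names)
   # Number of rows for each (first, second) combination
   group_sizes = df.groupby(level=['first', 'second']).size()

   print(shape, level_names)
   print(group_sizes)

Each of the 8 combinations of ``first`` and ``second`` holds 20 rows, so
every subtrack in the plots on this page is drawn from 20 values.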


Quick start
===========

As an initial example, here is the call to ``scatterplot`` to draw the
graph above (``ax`` is assumed to be a `matplotlib.axes.Axes` object):

.. code-block:: python

   from incenp.plotting.scatterplot import scatterplot

   scatterplot(ax, df, columns='A',
               tracks=['foo', 'bar', 'baz'], trackname='first',
               subtracks=['one', 'two'], subtrackname='second')
   ax.legend(['one', 'two'])
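
The ``ax`` object used in the call above has to come from Matplotlib.
One common way of obtaining it is ``pyplot.subplots``, as sketched
below. The figure size and the ``Agg`` backend are arbitrary choices
made for this illustration, not requirements of the module.

.. code-block:: python

   import matplotlib
   matplotlib.use('Agg')  # non-interactive backend, no display needed
   import matplotlib.pyplot as plt

   # Create a figure holding a single Axes object
   fig, ax = plt.subplots(figsize=(6, 4))

   # ax can now be passed as the first argument to scatterplot
   print(type(ax))

Once the plot is drawn, the figure can be written to a file with
``fig.savefig`` or displayed with ``plt.show`` (the latter requires an
interactive backend).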

The ``columns`` parameter indicates that the values to be plotted come
from the column named ``A``.

The ``tracks`` parameter gives the index values used to distribute the
values of column ``A`` into three different tracks (one track for rows
with index value ``foo``, one track for rows with index value ``bar``,
and so on); the associated ``trackname`` parameter indicates which index
level to search for those values when ``df`` is a multi-indexed
`DataFrame`.

The ``subtracks`` and ``subtrackname`` parameters are similar to the
``tracks`` and ``trackname`` parameters above, but for subtracks instead
of tracks. Here, they are used to say that values from rows with index
value ``one`` are to be plotted on one subtrack, while values from rows
with index value ``two`` are to be plotted on another subtrack.
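
To make the distribution into tracks and subtracks more concrete, the
sketch below selects with plain pandas the values that belong to the
``foo`` track, split by subtrack. It only illustrates which rows end up
where. It is not meant to describe how ``scatterplot`` is implemented
internally.

.. code-block:: python

   import numpy as np
   import pandas as pd

   index = pd.MultiIndex.from_arrays(
       [
           ['foo'] * 40 + ['bar'] * 40 + ['baz'] * 40 + ['qux'] * 40,
           ['one', 'two'] * 80,
       ],
       names=['first', 'second'],
   )
   df = pd.DataFrame(np.random.randn(160, 4), index=index,
                     columns=['A', 'B', 'C', 'D'])

   # Index values of each row, one array per index level
   first = df.index.get_level_values('first')
   second = df.index.get_level_values('second')

   # Column A, track 'foo', subtrack 'one' and subtrack 'two'
   foo_one = df.loc[(first == 'foo') & (second == 'one'), 'A']
   foo_two = df.loc[(first == 'foo') & (second == 'two'), 'A']

   print(len(foo_one), len(foo_two))  # 20 values in each subtrack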


Playing with tracks, subtracks, columns
=======================================

The following code will plot the same values as above, but will invert
the tracks and the subtracks: the second-level index (``second``) will
be used to distribute values along tracks while the first-level index
(``first``) will be used to distribute values along subtracks:

.. code-block:: python

   scatterplot(ax, df, columns='A',
               tracks=['one', 'two'], trackname='second',
               subtracks=['foo', 'bar', 'baz'], subtrackname='first')
   ax.legend(['foo', 'bar', 'baz'])

.. figure:: scatterplot2.png

   A scatterplot with inverted tracks and subtracks.


Values from several columns in the source `DataFrame` can be plotted at
once, by giving a list of column names (instead of a single name) to the
``columns`` parameter. By default, values from each column are plotted
in a different track. In the following example, values from the columns
``A``, ``B``, and ``C`` are plotted; the first-level index is used to
distribute values along three different subtracks; the second-level
index is used to filter the `DataFrame` prior to plotting so that only
rows with the index value ``one`` are plotted.

.. code-block:: python

   scatterplot(ax, df.xs('one', level='second'),
               columns=['A', 'B', 'C'],
               subtracks=['foo', 'baz', 'qux'], subtrackname='first')
   ax.legend(['foo', 'baz', 'qux'])

.. figure:: scatterplot3.png

   A scatterplot with values from several columns of the source
   DataFrame.
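
The filtering step used in this example relies on the pandas
``DataFrame.xs`` method. The sketch below shows what it returns for the
sample data. By default ``xs`` drops the level it selects on, so the
filtered `DataFrame` is indexed by the ``first`` level only.

.. code-block:: python

   import numpy as np
   import pandas as pd

   index = pd.MultiIndex.from_arrays(
       [
           ['foo'] * 40 + ['bar'] * 40 + ['baz'] * 40 + ['qux'] * 40,
           ['one', 'two'] * 80,
       ],
       names=['first', 'second'],
   )
   df = pd.DataFrame(np.random.randn(160, 4), index=index,
                     columns=['A', 'B', 'C', 'D'])

   # Keep only the rows whose second-level index value is 'one'
   filtered = df.xs('one', level='second')

   print(filtered.shape)       # (80, 4)
   print(filtered.index.name)  # first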


To plot values from several columns as different subtracks rather than
different tracks, use the ``subtrackcolumns`` parameter as in the
example below. The ``tracks`` and ``trackname`` parameters may then be
used to define what goes into the tracks.

.. code-block:: python

   scatterplot(ax, df.xs('one', level='second'),
               columns=['A', 'B', 'C'], subtrackcolumns=True,
               tracks=['foo', 'baz', 'qux'], trackname='first')
   ax.legend(['A', 'B', 'C'])

.. figure:: scatterplot4.png

   A scatterplot with values from several columns of the source
   DataFrame, plotted as separate subtracks.


Miscellaneous features
======================

When plotting *two* subtracks, the ``testfunc`` parameter may be used to
have the ``scatterplot`` function draw the result of a statistical test
comparing the values from the two subtracks in each track.

The value of the ``testfunc`` parameter should be a function accepting
two pandas `Series` and returning a P-value, such as the following
wrapper around SciPy’s ``mannwhitneyu`` function:

.. code-block:: python

   from scipy.stats import mannwhitneyu

   def do_mannwhitney(a, b):
       result = mannwhitneyu(a, b)
       return result.pvalue
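
Before handing a wrapper to ``scatterplot``, it can be useful to call it
directly on two `Series` to check that it returns a single P-value. The
sketch below does so with made-up data: the two samples, their sizes and
the random seed are assumptions chosen for illustration.

.. code-block:: python

   import numpy as np
   import pandas as pd
   from scipy.stats import mannwhitneyu

   def do_mannwhitney(a, b):
       # Same wrapper as above: return only the P-value of the test
       return mannwhitneyu(a, b).pvalue

   # Two made-up samples drawn from clearly different distributions
   rng = np.random.default_rng(0)
   a = pd.Series(rng.normal(loc=0.0, size=20))
   b = pd.Series(rng.normal(loc=3.0, size=20))

   pvalue = do_mannwhitney(a, b)
   print(pvalue)  # a small value is expected for such different samples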

Below is an example of using such a wrapper, with the resulting plot:

.. code-block:: python

   scatterplot(ax, df, columns='B',
               tracks=['foo', 'baz', 'qux'], trackname='first',
               subtracks=['one', 'two'], subtrackname='second',
               testfunc=do_mannwhitney,
               colors='cm')
   ax.legend(['one', 'two'])

.. figure:: scatterplot5.png

   A scatterplot with results of statistical tests between subtracks.

The example above also shows the ``colors`` parameter, used to change
the colors for the different subtracks. It can either be a string
containing one-letter color codes, or a list of Matplotlib colors. The
string or the list must be at least as long as the number of subtracks
to plot.
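
The two accepted forms of the ``colors`` parameter can be checked ahead
of time with Matplotlib’s ``is_color_like`` function. The
``check_colors`` helper below is hypothetical and is not part of the
module. Whether ``scatterplot`` performs a similar validation itself is
not covered here.

.. code-block:: python

   from matplotlib.colors import is_color_like

   def check_colors(colors, n_subtracks):
       """Hypothetical helper: validate a value for ``colors``."""
       # A string such as 'cm' becomes ['c', 'm']; a list is kept as is
       specs = list(colors)
       if len(specs) < n_subtracks:
           raise ValueError('not enough colors for the subtracks')
       invalid = [c for c in specs if not is_color_like(c)]
       if invalid:
           raise ValueError('invalid Matplotlib colors: %r' % invalid)
       return specs

   # A string of one-letter color codes
   print(check_colors('cm', 2))
   # A list of Matplotlib colors
   print(check_colors(['tab:blue', '#ff7f0e'], 2))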