Box plot in Python

First we need a sample data set. The Data and Story Library (DASL) provides survival times, in days, of patients with advanced cancer who were treated with vitamin C (ascorbate). See http://lib.stat.cmu.edu/DASL/Stories/CancerSurvival.html for further information. Download the data and save it as cancer1.txt.

In [1]:
# Import the packages we will use
import numpy as np
import matplotlib.pyplot as plt

We want matplotlib to draw plots inside this document rather than in a separate window. The command below tells matplotlib to produce figures inline.

In [2]:
%matplotlib inline

Next we need to read the data into an array. There are many ways to do that in Python; we will use NumPy's genfromtxt function.

In [3]:
# read the data into a NumPy structured array named CX
CX=np.genfromtxt('cancer1.txt', dtype=[('survival','i4'),('organ','S8')],skip_header=1)

In the call above, 'cancer1.txt' is the name of the file. Each row has two fields: 'survival' (number of days), an integer read as 'i4', and 'organ', a string read as 'S8'. Setting 'skip_header=1' skips the first row of the file. (Older NumPy releases also accepted 'skiprows=1' here; 'skip_header' is the name genfromtxt documents.) Next let's print the array to see what we have read:

In [4]:
print(CX)
[(124, 'Stomach') (42, 'Stomach') (25, 'Stomach') (45, 'Stomach')
 (412, 'Stomach') (51, 'Stomach') (1112, 'Stomach') (46, 'Stomach')
 (103, 'Stomach') (876, 'Stomach') (146, 'Stomach') (340, 'Stomach')
 (396, 'Stomach') (81, 'Bronchus') (461, 'Bronchus') (20, 'Bronchus')
 (450, 'Bronchus') (246, 'Bronchus') (166, 'Bronchus') (63, 'Bronchus')
 (64, 'Bronchus') (155, 'Bronchus') (859, 'Bronchus') (151, 'Bronchus')
 (166, 'Bronchus') (37, 'Bronchus') (223, 'Bronchus') (138, 'Bronchus')
 (72, 'Bronchus') (245, 'Bronchus') (248, 'Colon') (377, 'Colon')
 (189, 'Colon') (1843, 'Colon') (180, 'Colon') (537, 'Colon')
 (519, 'Colon') (455, 'Colon') (406, 'Colon') (365, 'Colon') (942, 'Colon')
 (776, 'Colon') (372, 'Colon') (163, 'Colon') (101, 'Colon') (20, 'Colon')
 (283, 'Colon') (1234, 'Ovary') (89, 'Ovary') (201, 'Ovary') (356, 'Ovary')
 (2970, 'Ovary') (456, 'Ovary') (1235, 'Breast') (24, 'Breast')
 (1581, 'Breast') (1166, 'Breast') (40, 'Breast') (727, 'Breast')
 (3808, 'Breast') (791, 'Breast') (1804, 'Breast') (3460, 'Breast')
 (719, 'Breast')]
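
To experiment with genfromtxt without downloading anything, the same kind of call can be tried on a small in-memory sample. This is only a sketch: the header text and the space-separated layout of sample_text are assumptions about the file, and 'U8' is swapped in for 'S8' so that the strings are ordinary str values on Python 3. The three data rows are taken from the output above.

```python
from io import StringIO

import numpy as np

# Hypothetical stand-in for cancer1.txt: a header row plus three data rows.
# The real file's header text and column separator may differ.
sample_text = """Survival Organ
124 Stomach
42 Stomach
81 Bronchus
"""

# Same field layout as in the text, with a unicode string field ('U8').
sample = np.genfromtxt(
    StringIO(sample_text),
    dtype=[('survival', 'i4'), ('organ', 'U8')],
    skip_header=1,  # skip the header row
)

print(sample['survival'])
print(sample['organ'])
```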

The field names and the size of each record in bytes can be found with the following commands:

In [5]:
print(CX.dtype.names)
('survival', 'organ')
In [6]:
print(CX.dtype.itemsize)
12

The item size is 12 bytes because each record holds a 4-byte integer ('i4') and an 8-byte string ('S8').
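
The arithmetic behind that item size can be checked directly on the dtype, without reading any file. A minimal sketch, where the names record and field_sizes are introduced only for illustration:

```python
import numpy as np

# The same record layout that was passed to genfromtxt.
record = np.dtype([('survival', 'i4'), ('organ', 'S8')])

# Size in bytes of each field, then of the whole record.
field_sizes = {name: record[name].itemsize for name in record.names}
print(field_sizes)
print(record.itemsize)
```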

The column 'survival' can be accessed as CX['survival'] and the column 'organ' as CX['organ'], as shown below:

In [7]:
print(CX['survival'])
[ 124   42   25   45  412   51 1112   46  103  876  146  340  396   81  461
   20  450  246  166   63   64  155  859  151  166   37  223  138   72  245
  248  377  189 1843  180  537  519  455  406  365  942  776  372  163  101
   20  283 1234   89  201  356 2970  456 1235   24 1581 1166   40  727 3808
  791 1804 3460  719]
In [8]:
print(CX['organ'])
['Stomach' 'Stomach' 'Stomach' 'Stomach' 'Stomach' 'Stomach' 'Stomach'
 'Stomach' 'Stomach' 'Stomach' 'Stomach' 'Stomach' 'Stomach' 'Bronchus'
 'Bronchus' 'Bronchus' 'Bronchus' 'Bronchus' 'Bronchus' 'Bronchus'
 'Bronchus' 'Bronchus' 'Bronchus' 'Bronchus' 'Bronchus' 'Bronchus'
 'Bronchus' 'Bronchus' 'Bronchus' 'Bronchus' 'Colon' 'Colon' 'Colon'
 'Colon' 'Colon' 'Colon' 'Colon' 'Colon' 'Colon' 'Colon' 'Colon' 'Colon'
 'Colon' 'Colon' 'Colon' 'Colon' 'Colon' 'Ovary' 'Ovary' 'Ovary' 'Ovary'
 'Ovary' 'Ovary' 'Breast' 'Breast' 'Breast' 'Breast' 'Breast' 'Breast'
 'Breast' 'Breast' 'Breast' 'Breast' 'Breast']

The unique labels for our box plot can be copied to an array 'labels' with the following command:

In [9]:
labels=np.unique(CX['organ'])

'labels' should now hold the unique elements of the column 'organ', in sorted order. Let's check:

In [10]:
print(labels)
['Breast' 'Bronchus' 'Colon' 'Ovary' 'Stomach']
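
np.unique can also report how many times each label occurs, which is a quick way to see the group sizes before plotting. The sketch below uses a small made-up array (organs) rather than the full data set, so the counts are illustrative only.

```python
import numpy as np

# Small illustrative array, not the full 'organ' column.
organs = np.array(['Stomach', 'Colon', 'Stomach', 'Breast'])

# return_counts=True adds an array with the number of occurrences per label.
names, counts = np.unique(organs, return_counts=True)

for name, count in zip(names, counts):
    print(name, count)
```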

Data for any label can be obtained with NumPy's nonzero function, which returns the indices of the array elements that are nonzero, here the elements where a condition is true. For example, we can get the survival data for 'Colon' cancer by running:

In [11]:
print(CX['survival'][np.nonzero(CX['organ']=='Colon')])
[ 248  377  189 1843  180  537  519  455  406  365  942  776  372  163  101
   20  283]

np.nonzero(CX['organ']=='Colon') returns the indices of CX where the field 'organ' equals 'Colon'.
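
It may help to look at what np.nonzero actually returns: a tuple holding one index array per dimension. Indexing with the boolean condition itself selects the same elements. The sketch below uses two short arrays (survival and organ) built from four rows of the output above instead of the full CX array.

```python
import numpy as np

# Four rows taken from the data, as two plain arrays.
survival = np.array([124, 248, 377, 81])
organ = np.array(['Stomach', 'Colon', 'Colon', 'Bronchus'])

mask = organ == 'Colon'      # boolean array, True where the condition holds
indices = np.nonzero(mask)   # tuple with one index array per dimension

by_index = survival[indices]  # select using the indices from np.nonzero
by_mask = survival[mask]      # select using the boolean mask directly

print(indices)
print(by_index)
print(by_mask)
```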

Next we need the data for each label. We can collect it in a list and use that list to draw the box plot.

In [12]:
xd=[] # start with an empty list
for label in labels:
    xd.append(CX['survival'][np.nonzero(CX['organ']==label)[0]])
In [13]:
print(xd)
[array([1235,   24, 1581, 1166,   40,  727, 3808,  791, 1804, 3460,  719], dtype=int32), array([ 81, 461,  20, 450, 246, 166,  63,  64, 155, 859, 151, 166,  37,
       223, 138,  72, 245], dtype=int32), array([ 248,  377,  189, 1843,  180,  537,  519,  455,  406,  365,  942,
        776,  372,  163,  101,   20,  283], dtype=int32), array([1234,   89,  201,  356, 2970,  456], dtype=int32), array([ 124,   42,   25,   45,  412,   51, 1112,   46,  103,  876,  146,
        340,  396], dtype=int32)]

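"for label in labels:" iterates over each element of labels, and np.nonzero(CX['organ']==label)[0] gives the indices of the rows that satisfy 'organ'==label. The list xd therefore holds one array of survival times per organ, in the same order as labels.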
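
The same grouping can be written more compactly with a list comprehension and a boolean mask. This is an alternative sketch rather than the approach used in this notebook; it runs on a few rows of the data, and the names survival, organ and groups are introduced only for the example.

```python
import numpy as np

# A few rows of the data, as two plain arrays.
survival = np.array([124, 42, 248, 377, 1234, 89])
organ = np.array(['Stomach', 'Stomach', 'Colon', 'Colon', 'Ovary', 'Ovary'])

labels = np.unique(organ)

# One array of survival times per label, in the same order as labels.
groups = [survival[organ == label] for label in labels]

for label, values in zip(labels, groups):
    print(label, values)
```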

We now have everything we need for the box plot. Let's draw it using matplotlib.

In [14]:
plt.boxplot(xd,labels=labels,sym='ro',whis=1.5);
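
In the call above, whis=1.5 places each whisker end at the most extreme data point that lies within 1.5 times the interquartile range of the box, and sym='ro' draws the points beyond the whiskers as red circles. The sketch below reproduces that rule by hand for the 'Ovary' group, using NumPy's default percentile interpolation; the variable names are introduced only for illustration.

```python
import numpy as np

# Survival times for the 'Ovary' group, copied from the output above.
ovary = np.array([1234, 89, 201, 356, 2970, 456])

q1, q3 = np.percentile(ovary, [25, 75])  # edges of the box
iqr = q3 - q1                            # interquartile range

# Limits beyond which a point is drawn as an outlier when whis=1.5.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

inside = ovary[(ovary >= lower_limit) & (ovary <= upper_limit)]
lower_whisker = inside.min()  # whisker ends sit on actual data points
upper_whisker = inside.max()
outliers = ovary[(ovary < lower_limit) | (ovary > upper_limit)]

print(q1, q3, iqr)
print(lower_whisker, upper_whisker)
print(outliers)
```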

To see all the details about boxplot command in matplotlib, please visit the following: http://matplotlib.org/api/axes_api.html?highlight=boxplot#matplotlib.axes.Axes.boxplot
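
Outside a notebook, the %matplotlib inline command is not available, so a script has to save the figure to a file instead. The following is a minimal sketch under a few assumptions: it uses the non-interactive 'Agg' backend, plots only two of the groups (values copied from the output above), sets the tick labels with set_xticklabels instead of the labels argument, and writes to a hypothetical file name boxplot.png.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend: render to a file, no window
import matplotlib.pyplot as plt
import numpy as np

# Two of the groups, copied from the printed data.
names = ['Ovary', 'Stomach']
groups = [
    np.array([1234, 89, 201, 356, 2970, 456]),
    np.array([124, 42, 25, 45, 412, 51, 1112, 46, 103, 876, 146, 340, 396]),
]

fig, ax = plt.subplots()
result = ax.boxplot(groups, sym='ro', whis=1.5)  # same options as in the text
ax.set_xticklabels(names)                        # one tick label per box
ax.set_ylabel('Survival (days)')

fig.savefig('boxplot.png')  # hypothetical output file name
```

boxplot returns a dictionary of the drawn artists (for example result['boxes'] and result['fliers']), which can be used to restyle individual elements afterwards.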
