First we need a sample data set. The Data and Story Library (DASL) provides survival times, in days, of patients with advanced cancer who were treated with vitamin C (ascorbate). See http://lib.stat.cmu.edu/DASL/Stories/CancerSurvival.html for further information. Download the data and save it as cancer1.txt.
# Import all the packages we will use
import numpy as np
import scipy as sc
import matplotlib.pyplot as plt
We want matplotlib to draw plots inside this document rather than in a separate window. The magic command below tells matplotlib to render figures inline.
%matplotlib inline
Next we need to read the data into an array. There are many ways to do that in Python; we will use numpy's genfromtxt function.
# import data into numpy array named CX
CX=np.genfromtxt('cancer1.txt', dtype=[('survival','i4'),('organ','S8')],skip_header=1)
In the above call, 'cancer1.txt' is the name of the file, which holds two fields: 'survival' (number of days), read as a 4-byte integer 'i4', and 'organ', read as a string of up to 8 bytes 'S8'. Setting 'skip_header=1' skips the first row of the file (the column headings); older numpy releases also accepted 'skiprows=1' for this, but that argument has since been removed from genfromtxt. Next let's print the array to see what we have read:
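The call above can be tried without the real file. The following is a minimal sketch that reads a few made-up rows from an in-memory buffer; the rows and the names `sample` and `demo` are assumptions for illustration only and are not the actual DASL data.

```python
import io
import numpy as np

# Made-up rows in the same layout as cancer1.txt: a header line,
# then one "days organ" pair per line (illustrative values only).
sample = io.StringIO(
    "Survival Organ\n"
    "120 Stomach\n"
    "45 Colon\n"
    "300 Breast\n"
    "80 Colon\n"
)

# Same dtype and header handling as in the tutorial call.
demo = np.genfromtxt(sample,
                     dtype=[('survival', 'i4'), ('organ', 'S8')],
                     skip_header=1)

print(demo.shape)        # one record per data row
print(demo['survival'])  # the integer column
```

Because the header line is skipped, the field names come from the dtype rather than from the file.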
print(CX)
The field names and the size of each record in bytes can be found with the following commands:
print(CX.dtype.names)
print(CX.dtype.itemsize)
The item size is 12 bytes because each record holds an 8-byte string ('S8') and a 4-byte integer ('i4').
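To see where the record size comes from, the dtype can be inspected on its own. This is a small sketch that rebuilds the same dtype by hand; the variable name `dt` is only illustrative.

```python
import numpy as np

# The same structured dtype that is passed to genfromtxt above.
dt = np.dtype([('survival', 'i4'), ('organ', 'S8')])

print(dt.names)     # field names
print(dt.itemsize)  # bytes per record

# Size and byte offset of each field inside one record.
for name in dt.names:
    field_dtype, offset = dt.fields[name][:2]
    print(name, field_dtype.itemsize, offset)
```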
The column 'survival' can be accessed as CX['survival'] and the column 'organ' as CX['organ'], as shown below:
print(CX['survival'])
print(CX['organ'])
The unique labels for our box plot can be collected into an array 'labels' with the following command:
labels=np.unique(CX['organ'])
'labels' should now hold the unique elements of the column 'organ', in sorted order. Let's check:
print(labels)
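As a quick illustration of what np.unique returns, here is a sketch on a small hand-made array. The values in `organs` are made up, and `return_counts=True` is an optional extra that the tutorial itself does not use.

```python
import numpy as np

# Hypothetical organ column with repeated entries.
organs = np.array(['Colon', 'Breast', 'Colon', 'Stomach', 'Breast', 'Colon'])

# np.unique returns the distinct values in sorted order;
# return_counts=True additionally reports how often each one occurs.
names, counts = np.unique(organs, return_counts=True)

print(names)
print(counts)
```

The counts are a convenient check that every group has enough observations before it is plotted.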
The data for any label can be obtained with numpy's nonzero function, which returns the indices of the array elements that are nonzero (for a boolean array, the elements that are True). For example, we can extract the survival times for 'Colon' cancer with the following command:
print(CX['survival'][np.nonzero(CX['organ']=='Colon')])
np.nonzero(CX['organ']=='Colon') finds the indices of CX where the field 'organ' is equal to 'Colon'.
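The selection step can be followed on a tiny example. This sketch uses made-up arrays named `survival` and `organ` in place of the fields of CX, and it assumes Python 3, where an 'S8' field holds byte strings, so the comparison value is written as b'Colon'.

```python
import numpy as np

# Made-up stand-ins for CX['survival'] and CX['organ'].
survival = np.array([120, 45, 300, 80])
organ = np.array([b'Stomach', b'Colon', b'Breast', b'Colon'])

mask = organ == b'Colon'  # boolean array, True where the organ matches
idx = np.nonzero(mask)    # tuple holding one array of matching indices

via_indices = survival[idx]  # select by index, as in the tutorial
via_mask = survival[mask]    # select by boolean mask, same result

print(idx[0])
print(via_indices)
print(via_mask)
```

Indexing with the boolean mask directly gives the same values as going through np.nonzero, so either form can be used.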
Next we need the data for each label. We can collect it in a list and use this list to draw our box plot.
xd=[] # start with an empty list
for label in labels:
    # survival times of all patients whose 'organ' matches this label
    xd.append(CX['survival'][np.nonzero(CX['organ']==label)[0]])
print(xd)
"for label in labels:" iterates over each element of labels, and np.nonzero(CX['organ']==label)[0] gives the indices of the records whose 'organ' equals label. The [0] picks the index array out of the tuple that np.nonzero returns.
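The grouping loop can also be written as a list comprehension with a boolean mask. This is an alternative sketch, not the tutorial's own code; the arrays `survival` and `organ` and the values in them are made up for illustration.

```python
import numpy as np

# Made-up stand-ins for CX['survival'] and CX['organ'].
survival = np.array([120, 45, 300, 80, 210])
organ = np.array(['Stomach', 'Colon', 'Breast', 'Colon', 'Breast'])

labels = np.unique(organ)

# One array of survival times per label, in the same order as labels.
groups = [survival[organ == label] for label in labels]

for label, group in zip(labels, groups):
    print(label, group)
```

Keeping the groups in the same order as labels matters, because the box plot pairs the two by position.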
We now have everything we need, so let's draw the box plot using matplotlib.
plt.boxplot(xd,labels=labels,sym='ro',whis=1.5);
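For readers without the data file, the same kind of plot can be sketched on made-up numbers. Everything in `groups` and `names` below is hypothetical, the 'Agg' backend is selected only so that the script runs without a display (drop that line inside a notebook), and the tick labels are set with set_xticklabels as one possible way of labelling the boxes.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # assumption: off-screen rendering for a standalone script
import matplotlib.pyplot as plt

# Three made-up groups; the first contains one value far above the rest.
groups = [
    np.array([20, 35, 40, 55, 60, 400]),
    np.array([10, 25, 30, 45, 50]),
    np.array([100, 150, 200, 250, 300]),
]
names = ['A', 'B', 'C']  # hypothetical group labels

fig, ax = plt.subplots()
# sym='ro' draws outliers as red circles; whis=1.5 puts the whisker
# limits at 1.5 times the interquartile range beyond the quartiles.
parts = ax.boxplot(groups, sym='ro', whis=1.5)
ax.set_xticklabels(names)
ax.set_ylabel('survival (days)')

# boxplot returns a dict of the drawn artists, e.g. boxes and fliers.
print(len(parts['boxes']))
print(parts['fliers'][0].get_ydata())
```

In the first group the value 400 lies beyond the upper whisker limit, so it is drawn as an outlier point rather than as part of the whisker.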
For all the details of the boxplot command in matplotlib, see http://matplotlib.org/api/axes_api.html?highlight=boxplot#matplotlib.axes.Axes.boxplot