An Introduction to NumPy

from:

Continuum Analytics

and presented by: Dr. Evil, Ph.D.

Preferred title format, but --to latex --post PDF isn't happy with URL-referenced images.

An Introduction to NumPy

Presented By

your name in lights

Some Simple Setup

We're going to run a few quick commands in IPython to shorten a few names and to enable nice inline graphics (if you are in the IPython notebook).

In [2]:
%pylab inline --no-import-all

import numpy as np
import matplotlib.pyplot as plt
import os.path as osp
Populating the interactive namespace from numpy and matplotlib

META

Conventions

  • use H2 headings to break up notebook sections
  • use H3 to highlight a sub-section/topic
  • centered images look ok when large (in the notebook view)
    • smaller images look a bit out of place when surrounded by smaller text
  • I had used italics to mark Python identifiers outside of code cells.
    • replace it with inline code (backquotes)
    • or maybe bold inline code (bold.block)
    • currently using just inline code
  • I didn't do this with earlier versions, but run Cell -> All Output -> Clear before committing.

    • this prevents output images (matplotlib.plot) from living in the repo and making ugly diffs
  • cells (should normally) hold one idea, set of comparisons, or thought

    • be kind to the output reader. spacing and labels count. yes, they could just look up a few lines. but they are learning.
    • if the cell has one (and only one) output, labelled output isn't mandatory
    • conversely, if the cell has more than one output, then strongly consider labeling them all

Pedagogy

  • while the active notebook is a vast improvement over static slides (for student engagement), it can still become an instructor talk-fest if you let it
  • don't simply type examples and think that the fact "it is running" is enough to excite your audience
  • elicit some input from the students (make them think about what concept is being expressed)
  • encourage them to predict what the result of a given code execution will be
    • when they can't: you probably need to backfill some knowledge
    • when they can, you can step on the gas (a little)

MEF Local Resources (aka, why it's better to have the material in one place)

  • /home/mfenner/continuum/classes/2013-12-python_for_finance
  • /home/mfenner/continuum/workspace/gold_via_matt-t/pfs-2013-aug/PfSClass_20130829/Exercises (index.html)
  • /home/mfenner/continuum/repos/presentations/python_for_science/numpy_chapter (served via python -m SimpleHTTPServer)

    #!/usr/bin/env bash
    
    cd ~/continuum/repos/presentations/python_for_science/numpy_chapter
    python -m SimpleHTTPServer &
    chromium localhost:8000 &
    
    cd ~/continuum/classes/2013-12-python_for_finance
    ipython notebook &
    
    chromium ~/continuum/workspace/gold_via_matt-t/pfs-2013-aug/PfSClass_20130829/Exercises/index.html &
    
    cd ~/continuum/repos/presentations/modular/numpy_chapter
    ipython notebook &

What is NumPy?

NumPy is a Python library that provides multi-dimensional arrays, matrices, and fast operations on these data structures.

NumPy arrays have:

  • fixed size
  • all elements have the same type
    • that type may be compound and/or user defined
  • fast operations from:
    • vectorization - implicit looping (see the quick sketch after this list)
    • pre-compiled C code using high-quality libraries
      • NumPy default
      • BLAS/ATLAS
      • Intel's MKL
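
A quick sketch of what "vectorization" buys: the same doubling written as a Python-level loop and as a single array expression (exact timings will vary by machine).

In [ ]:
vals = range(100000)
arr  = np.arange(100000)

%timeit [v * 2 for v in vals]   # explicit Python-level loop
%timeit arr * 2                 # the loop happens in pre-compiled C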

NumPy's Uses and Capabilities

  • Image and signal processing
  • Linear algebra
  • Data transformation and query
  • Time series analysis
  • Statistical analysis

NumPy Ecosystem

![](img/ecosystem.lightbg.png)

NumPy Arrays

NumPy arrays (numpy.ndarray) are the fundamental data type in NumPy. They have:

  • shape
  • an element type called dtype

For example:

In [3]:
M, N = 6, 5
arr = np.zeros(shape = (M,N), dtype=float)
print arr
[[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]

is the NumPy array corresponding to the two-dimensional matrix:

(figure: the same values written as a 6x5 matrix)

NumPy has both a general N-dimensional array and a specific 2-dimensional matrix data type. NumPy arrays may have an arbitrary number of dimensions.

NumPy arrays support vectorized mathematical operations.

In [4]:
arr = np.arange(15).reshape(3,5)
print "original:"
print arr

print

print "doubled:"
print arr * 2
original:
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]

doubled:
[[ 0  2  4  6  8]
 [10 12 14 16 18]
 [20 22 24 26 28]]

We are generally going to visually inspect the str representation of an array. Here's how the str and repr differ:

In [5]:
arr = np.arange(10).reshape(2,5)
print str(arr)
print repr(arr)
[[0 1 2 3 4]
 [5 6 7 8 9]]
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

Array Shape

Many array creation functions take a shape parameter. For a 1D array, the shape can be an integer.

In [6]:
print np.zeros(shape=5, dtype=float)
[ 0.  0.  0.  0.  0.]

For nD arrays, the shape needs to be given as a tuple.

In [7]:
print np.zeros(shape=(4,3,2), dtype=float)
[[[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]]

In the above example, the outer-most dimension (the first item in the tuple passed to shape, 4) is the dimension that varies the most slowly. The inner-most dimension (the last item in the tuple, 2) is the dimension that varies the most quickly. Visually, this means that for 3D arrays, we might describe the shape as: (panels, rows, columns). Contrast this with the 2D case where we usually discuss shape as: (rows, columns).
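
A small sketch of that interpretation: indexing the outer-most dimension of a (4,3,2) array pulls out one (3,2) "panel".

In [ ]:
arr = np.arange(24).reshape(4,3,2)
print "whole array shape:", arr.shape       # (panels, rows, columns)
print "one panel, arr[0], has shape:", arr[0].shape
print arr[0]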

We can always ask an array its shape by accessing the shape attribute (e.g., arr.shape).

In [8]:
arr = np.zeros(shape=(4,3,2), dtype=float)
print "shape is:", arr.shape
shape is: (4, 3, 2)

Array Type

All arrays have a specific type for their associated elements. Every element in the array shares that type. The NumPy terminology for this type is dtype. The basic types are bool, int, uint, float, and complex. The types may be modified by a number indicating their size in bits. Python's built-in types can be used as a corresponding dtype. Note, the generic NumPy types end with an underscore ("_") to differentiate the name from the Python built-in.

META HTML table not supported by latex/pdf converter

Python Type    NumPy dtype
-----------    -----------
bool           np.bool_
int            np.int_
float          np.float_
complex        np.complex_

META No markdown table (sad face):

+------------+-----------+
|Python Type |NumPy dtype|
+============+===========+
|bool        |np.bool_   |
+------------+-----------+
|int         |np.int_    |
+------------+-----------+
|float       |np.float_  |
+------------+-----------+
|complex     |np.complex_|
+------------+-----------+

Here is one example of specifying a dtype:

In [9]:
arr = np.zeros(shape=(5,), dtype=np.float_)     # NumPy default sized float
print arr, "->", arr.dtype
[ 0.  0.  0.  0.  0.] -> float64

And, we can make a clever loop over some possible dtypes and pass them through eval:

In [10]:
# META:  FIXME  possibly too clever.  the trade off is between an ugly cell that loses the 
#        thread of what's being compared to more complicated Python
for dt in ["float", "np.float_", "np.float64", "np.uint8", "np.int"]:
    arr = np.zeros(shape=(5,), dtype=eval(dt)) 
    print "%10s: %s -> %s" % (dt, arr, arr.dtype)
     float: [ 0.  0.  0.  0.  0.] -> float64
 np.float_: [ 0.  0.  0.  0.  0.] -> float64
np.float64: [ 0.  0.  0.  0.  0.] -> float64
  np.uint8: [0 0 0 0 0] -> uint8
    np.int: [0 0 0 0 0] -> int64

Now, we'll take just a moment to define one quick helper function to show us these details in a pretty format.

In [12]:
def dumpArray(arr):
    print "%s array of %s:" % (arr.shape, arr.dtype)
    print arr

META Are scattered "in-text" links at all valuable? Should they be aggregated at the end (in an outline: General Links, Data Types, etc.)?

See also: http://docs.scipy.org/doc/numpy/user/basics.types.html

Array Creation

NumPy provides a number of ways to create an array.

np.zeros and np.ones

In [13]:
zrr = np.zeros(shape=(2,3))
dumpArray(zrr)
(2, 3) array of float64:
[[ 0.  0.  0.]
 [ 0.  0.  0.]]
In [ ]:
print np.ones(shape=(2,5))
In [14]:
one_arr = np.ones(shape = (2,2), dtype=int)
dumpArray(one_arr)
(2, 2) array of int64:
[[1 1]
 [1 1]]

np.empty

np.empty is lightning quick because it simply requests some amount of memory from the operating system and then does nothing with it. Thus, the array returned by np.empty is uninitialized. Consider yourself warned. np.empty is very useful if you know you are going to fill up all the (used) elements of your array later.

In [15]:
# DANGER!  uninitialized array 
# (re-run this cell and you will very likely see different values)
err = np.empty(shape=(2,3), dtype=int)
dumpArray(err)
(2, 3) array of int64:
[[0 0 0]
 [0 0 0]]

np.arange

np.arange generates sequences of numbers like Python's range built-in. Non-integer step values may lead to unexpected results; for these cases, you may prefer np.linspace (see below). (For a quick - and mostly practical - discussion of the perils of floating-point approximations, see https://docs.python.org/2/tutorial/floatingpoint.html.)

  • a single value is a stopping point
  • two values are a starting point and a stopping point
  • three values are a start, a stop, and a step size

As with range, the ending point is not included.

In [16]:
print "int arg: %s\n" % np.arange(10)          # python's range/xrange --> stopping

print "float arg: %s\n" % np.arange(10.0)   # python's range/xrange --> stopping

print "step: %s\n" % np.arange(0, 12, 2)    # end point not included, like python's range()

print "neg. step: %s\n" % np.arange(10, 0, -1.0)
int arg: [0 1 2 3 4 5 6 7 8 9]

float arg: [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]

step: [ 0  2  4  6  8 10]

neg. step: [ 10.   9.   8.   7.   6.   5.   4.   3.   2.   1.]

In [17]:
.1 + .2 == .3
Out[17]:
False
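
That same rounding is why a float step in np.arange can behave unexpectedly near the stop value (a short sketch; the exact result depends on rounding), and why np.linspace, which takes a count of points instead of a step, is often the safer choice:

In [ ]:
print np.arange(1.0, 1.3, 0.1)    # may include a value near 1.3 due to rounding
print np.linspace(1.0, 1.2, 3)    # exactly 3 points, end-points included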

np.linspace and np.logspace

np.linspace(BEGIN, END, NUMPT) generates exactly NUMPT points, evenly spaced, on $[BEGIN, END]$. Unlike Python's range and np.arange, these functions are inclusive at BEGIN and END (they produce a closed interval).

In [18]:
print "End-points are included:"
print np.linspace(0, 10, 2) 
print np.linspace(0, 10, 3)
print np.linspace(0, 10, 4)
End-points are included:
[  0.  10.]
[  0.   5.  10.]
[  0.           3.33333333   6.66666667  10.        ]

For np.logspace(BEGIN, END, NUMPT), the array is closed on $[10^{BEGIN}, 10^{END}]$ with NUMPT points spread evenly on a logarithmic scale.

In [19]:
# (10^BEGIN, 10^END, NUMPT)
print np.logspace(0, 3, 4)
[    1.    10.   100.  1000.]

Diagonal arrays: np.eye and np.diag

np.eye(N) produces an array with shape (N,N) and ones on the diagonal (an NxN identity matrix).

In [20]:
print np.eye(3)
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]

np.diag produces a diagonal 2D array from an array argument.

In [21]:
print np.diag(np.arange(1,4))
[[1 0 0]
 [0 2 0]
 [0 0 3]]

It can also be used to extract a diagonal from a given array.

In [22]:
# the diagonal of an identity matrix ...
print np.diag(np.eye(3))
[ 1.  1.  1.]

Arrays from Random Distributions

It is common to create arrays whose elements are samples from a random distribution. For the many options, see http://docs.scipy.org/doc/numpy/reference/routines.random.html.

In [23]:
npr = np.random

Uniform on [0,1)

In [24]:
print "Uniform on [0,1):"
dumpArray(npr.random((2,5)))
Uniform on [0,1):
(2, 5) array of float64:
[[ 0.52447153  0.49869214  0.5245176   0.92427828  0.14256492]
 [ 0.90636824  0.58998608  0.40450844  0.82303538  0.21461324]]

Standard Normal

np.random has some redundancy. It also has some variation in calling conventions.

  • standard_normal takes one tuple argument
  • randn (which is very common to see in code) takes n arguments where n is the number of dimensions in the result
In [25]:
print "std. normal - N(0,1):"
dumpArray(npr.standard_normal((2,5)))

print
dumpArray(npr.randn(2,5)) # n separate arguments, not a tuple
std. normal - N(0,1):
(2, 5) array of float64:
[[ 0.00555851  1.03820014  2.63506338  1.34211666  0.8911758 ]
 [-0.36020101  0.3082247  -0.01768406  0.55375311 -0.03641624]]

(2, 5) array of float64:
[[ 0.32189608 -0.41538282 -0.79358559 -1.6054438   1.28340597]
 [ 0.85340538 -0.05498719 -0.10689896  1.35270264  0.86757328]]

Uniform Integers

There are also some differences among the discrete distributions:

  • randint excludes its upper bound
  • random_integers includes its upper bound
In [26]:
print "Uniform ints on [0,5) - upper open:"
dumpArray(npr.randint(0, 5, (2,5)))

print "\nUniform ints on [0,5] - upper closed:"
dumpArray(npr.random_integers(0, 5, (2,5)))
Uniform ints on [0,5) - upper open:
(2, 5) array of int64:
[[1 0 4 1 2]
 [1 1 1 3 3]]

Uniform ints on [0,5] - upper closed:
(2, 5) array of int64:
[[2 2 0 4 0]
 [2 0 2 2 1]]
/Users/lchen/anaconda/lib/python2.7/site-packages/ipykernel_launcher.py:5: DeprecationWarning: This function is deprecated. Please call randint(0, 5 + 1) instead
  """

Arrays From a Python List

Note, it is also possible to create NumPy arrays from Python lists and tuples. While this is a nice capability, remember that instantiating a Python list can take a relatively long time compared to directly using NumPy building blocks. Other containers and iterables will not, generally, give useful results.

In [27]:
dumpArray(np.array([1, 2, 3]))

print
dumpArray(np.array([10.0, 20.0, 3]))
(3,) array of int64:
[1 2 3]

(3,) array of float64:
[ 10.  20.   3.]

Dimensionality is maintained within nested lists:

In [28]:
dumpArray(np.array([[1, 2, 3], 
                    [4, 5, 6]]))

print
dumpArray(np.array([[1, 2],
                    [3, 4],
                    [5, 6]]))
(2, 3) array of int64:
[[1 2 3]
 [4 5 6]]

(3, 2) array of int64:
[[1 2]
 [3 4]
 [5 6]]
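
Regarding the note above that other containers and iterables do not generally give useful results: np.array does not walk a generator, but np.fromiter does (a short sketch).

In [ ]:
gen = (x * x for x in range(5))
dumpArray(np.array(gen))                      # typically a 0-d object array -- not what we want

gen = (x * x for x in range(5))
dumpArray(np.fromiter(gen, dtype=np.int_))    # iterates and builds a real array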

From row and column stacks: np.r_ and np.c_

np.r_ and np.c_ are special objects that, when sliced (think obj[some, :, indexers]), return NumPy arrays constructed from the contents of the slices. These tools can be used to abbreviate some array constructions.

In [29]:
np.c_[np.arange(10), np.arange(10)*2]
Out[29]:
array([[ 0,  0],
       [ 1,  2],
       [ 2,  4],
       [ 3,  6],
       [ 4,  8],
       [ 5, 10],
       [ 6, 12],
       [ 7, 14],
       [ 8, 16],
       [ 9, 18]])
In [32]:
np.r_[np.arange(10), np.arange(10)*2]
Out[32]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9,  0,  2,  4,  6,  8, 10, 12,
       14, 16, 18])

Exercise I

[15 minutes to do. 10 minutes to discuss.]

Unless otherwise specified, try to solve these problems using NumPy but not using raw Python.

  1. Create a Python list with the ints from 1 to 10. Create a NumPy array from that list.
    1. For both, add one to each element.
    2. For both, multiply each element by two.
  2. Create an int array of all zeros.
  3. Create a float array of all zeros.
  4. Create an evenly spaced grid of 100 floating point values on [-10, 10].
  5. Create an int array with the powers of two from 1 to 1024.
    1. Bonus: Can you figure out a second "NumPy only" way to do it (Hint: help(function) is your friend)?
  6. META: move to later, when we discuss dtypes in more depth. Images can be stored as (R,G,B) value triples. Frequently, the color values (red, green, or blue) range from [0, 255]. What would be an ideal NumPy data type for one color value?
  7. Explain what NumPy dtype would be well-suited for (and why):
    • Temperatures
    • Counts of occurrences of an event
    • Differences in counts
    • Probabilities
  8. Come up with two ways to create a (2,5,3) shaped int array filled with the value 42.
  9. Generate a (5,5) array with values from a Normal distribution of mean 10 and standard deviation 1.
    1. Now, try to do it another way.
  10. Define a function of N, that returns an array with N values all equal to $1/N$.

FIXME The following problem implies giving the students at least one way to time things (and probably two). timeit.timeit/repeat (from a script) and %timeit seem like the best options (a sketch follows the problem below).

  1. Create a Python list with the floating-point values [1.0, 2.0, 3.0, ..., 1E6].
    • Do the same with a NumPy array.
    • Time how long it takes to multiply each sequence by np.pi.
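
A minimal timing sketch for that problem, using %timeit in the notebook and timeit.repeat from plain Python (exact numbers will vary by machine):

In [ ]:
# FIXME  a solution sketch
import timeit

lst = [float(i) for i in range(1, 1000001)]
arr = np.arange(1.0, 1000001.0)

%timeit [x * np.pi for x in lst]     # Python list: loop at the Python level
%timeit arr * np.pi                  # NumPy array: loop in C

# from a script (no IPython), timeit.repeat does the same job
print min(timeit.repeat(lambda: arr * np.pi, number=10, repeat=3))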

Accessing Array Items

Indexing

Items in NumPy arrays may be accessed using a single index composed of multiple values.

(figure: indexing a 2-D array with a row, column pair)

META Compare several small outputs in one cell to several small cells. [next section has a nice sequence of small cells]. is one preferred all the time? is it case by case? [markfe: slight preference for many small cells]

In [33]:
arr = np.arange(24).reshape(4,6) # random.randint(11, size=(4, 6))

print "the array:"
print arr

print "index [2,2] :", arr[2, 2]

print "index [2]   :", arr[2]

# non-idiomatic, creates an intermediate array for arr[2] and then indexes into that
print "index [2][2]:", arr[2][2] 
the array:
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]]
index [2,2] : 14
index [2]   : [12 13 14 15 16 17]
index [2][2]: 14

Compare this with indexing into a nested Python list

In [ ]:
aList = [list(row) for row in arr]
print aList
print aList[2][2]

# META:  I (MarkFe) find these to be far more helpful (for students) than a raw traceback.  
try:
    print aList[2,2]
except TypeError as e:
    print "Unhappy with multi-value index"
    print "Exception message:", e

Slicing

We can also use slicing to select entire rows and columns at once:

FIXME Images need some re-love (want consistent row,col ... not i,j).

(figure: slicing out a row)

(figure: slicing out a column)

In [ ]:
print "array:"
print arr

print "accessing a row:"
dumpArray(arr[2,:])

print "accessing a column:"
dumpArray(arr[:,2])

#print "a row:", arr[2,:], "has shape:", arr[2,:].shape # third row, "all" columns
#print "a col:", arr[:,2], "has shape:", arr[:,2].shape # "all" rows, at third column

Bear in mind that numerical indexing will reduce the dimensionality of the array. Slicing from index to index + 1 can be used to keep that dimension if you need it.

In [ ]:
print "lost dimension:", 
dumpArray(arr[2, 1:4])

print "kept dimension:", 
dumpArray(arr[2:3, 1:4])

Region Selection and Assignment

Multiple slices, as part of an index, can select a region out of an array

In [ ]:
print "array:"
print arr

print "a sub-array:"
dumpArray(arr[1:3, 2:4])

Slices are always views of the underlying array. Thus, modifying them modifies the underlying array

In [34]:
print "even elements (at odd indices) of first row:"
print arr[0, ::2] # select every other element from first row

arr[0,::2] = -1   # update is done in-place, no copy

print "after assinging to those:"
print arr
even elements (at odd indices) of first row:
[0 2 4]
after assigning to those:
[[-1  1 -1  3 -1  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]]

You may have noticed something peculiar in the cell above. There is an assignment of a scalar to an array. This is called broadcasting and in this simple case, it simply expands the scalar value to fill the elements of the target. Here's another example:

In [ ]:
arr[0] = 10
print arr

And, here is one more example of broadcasting and slicing: assigning a value to fill a column.

In [ ]:
arr[:,2] = 99
print arr

As with Python lists, empty start/end points in a slice represent the beginning/end of the NumPy array.

In [ ]:
# fill the visual lower-left box with 0
arr[2:, :2] = 0 
print arr
In [ ]:
# fill the visual lower-right box with -1
arr[2:, 3:] = -1 
print arr

We can also assign sequences, if the shapes on the left-hand side and the right-hand side match

In [ ]:
arr[3, :] = [10, 20, 30, 40, 50, 60]
print arr

Sequence assignment extends to multi-dimensional objects

In [ ]:
arr[1:3, 3:5] = [[2, 4], [8, 16]]
print arr

Other Common Slicing Patterns

Shifting

As with Python, a negative index -i is the equivalent of len(seq) - i (so -1 refers to the last element).

META What to do with teaching notes like the following? In an empty notebook they make sense. In a filled notebook, they do not make sense.

  • PREDICT: what element is -not- updated by this assignment? what value is lost/clobbered by shifting?
  • so, i rewrote the text below. and it turns into a nice thinking question (for the reader ... i.e., a student reviewing this) ... which is a good way to make the reading more active.

-1 (on the RHS) is an end-point so it is -not- included. So, the interpretation is that we "do not include the last element". Thus, the RHS drops the last element and the LHS skips the first element. Both sides have the same length - the original length of the array minus one - and it is a legal assignment. But not all items of arr have been updated. Which element is unaffected by the assignment?

In [ ]:
# shift array
arr = 2**np.arange(10)
print "original:"
dumpArray(arr)

arr[1:] = arr[:-1]
print "after slicing assignment"
dumpArray(arr)

Reversal

In [ ]:
# using a negative stride indicates walking backward
# it may be surprising, but this is still -not- a copy
# the reverse striding still shares data with the underlying array
rev_arr = arr[::-1]
rev_arr[0] = -99
print arr

Exercise II

[Est to do. Est to discuss.]

This is from way later. Makes far more sense here:

  1. Let a = np.arange(200)

    1. access the last element of the array
    2. slice all but the last element of the array
    3. slice the last 5 elements of the array
    4. slice the first 5 elements of the array
  2. Create a sample array with shape (3,4).

    1. Using single item assignments, place your favorite number in the four corners.
    2. Make the first column equal to -1.
    3. Make the last row equal to 99.
    4. Make a 2x2 block in the bottom-center contain the values .25, .5, .75, and 1.0
    5. Replace a row with the values: 2, 4, 8, and 16.
  3. We used slicing to do a right shift on a 1-D array. Do a left shift on a 1-D array.
  4. Can you replace every element of an array with a particular value (say, 42.0)?

These two are more difficult. We won't answer them now (we will revisit them in a bit), but see if you can figure them out.

  1. Can you replace every row with a particular row (for example, 2, 4, 8, 16)?
  2. [Don't strain yourself] Can you replace every column with a particular column?

Working with Arrays

Math is quite simple -- and this is part of the reason that using NumPy arrays can significantly simplify numerical code. The generic pattern array OP scalar (or scalar OP array), applies OP (with the scalar value) across elements of array.

In [ ]:
# array OP scalar applies across all elements and creates a new array
arr = np.arange(10)
print "     arr:", arr
print " arr + 1:", arr + 1
print " arr * 2:", arr * 2
print "arr ** 2:", arr ** 2
print "2 ** arr:", 2 ** arr

print " arr | 1:", arr | 1   # bit-wise ops (see also np.logical_and, etc.)
print " arr & 1:", arr & 1

# NOTE:  arr += 1, etc. for in-place
In [ ]:
# FIXME:  possibly way too cute/fancy
for expr in ["arr", 
             "arr + 1", "arr * 2", "arr ** 2", "2 ** arr", 
             "arr | 1", "arr & 1"]: # bitwise ops (see also:  np.logical_and, etc.)
    print "%10s: %s" % (expr, eval(expr))
In [ ]:
# array OP array works element-by-element and creates a new array
arr1 = np.arange(5)
arr2 = 2 ** arr1 # makes a new array

print arr1, "+", arr2, "=", arr1 + arr2
print arr1, "*", arr2, "=", arr1 * arr2

A number of important mathematical operations on arrays are defined as functions in the NumPy module (not as methods on NumPy arrays). Some operations are even available both ways. Some of the more important mathematical routines include: sin, cos, tan, exp, log. We can use these as np.sin, for example. For a complete list, see http://docs.scipy.org/doc/numpy/reference/routines.math.html

In [ ]:
arr = np.arange(-np.pi, np.pi, np.pi/4)
print "some multiples of pi:"
print arr

print "... and their cosines:"
print np.cos(arr)

Exercise III

Suppose we wanted to graph several functions on the interval [-2.0, 2.0). matplotlib lets us graph xy-values very quickly using: plt.plot(xvals, yvals) given in the form [x1, x2, x3, ...], [y1, y2, y3, ...] (i.e., pairwise sequences). Repeated plt.plot commands will go to the same graph. Graph the following functions:

  • $y=x + 1$
  • $y=e^x$
  • $y=cos(x^2) + sin(x^2)$
  • $y=cos(x)^2 + sin(x)^2$

Graph a parametric equation over $t$ on $[0,2\pi]$ defined by:

  • $y(t) = sin(t)$
  • $x(t) = cos(t)$

You may want to issue a matplotlib statement: plt.axis('equal') to ensure you don't get a skewed perspective on your result.

In [35]:
#FIXME a solution
ts = np.linspace(0, 2*np.pi, 1000)
plt.axis('equal')
plt.plot(np.sin(ts), np.cos(ts))
Out[35]:
[<matplotlib.lines.Line2D at 0x1183da9d0>]
In [36]:
# FIXME:  a solution
xvals = np.linspace(-2, 2, 100)
plt.plot(xvals, xvals+1)
plt.plot(xvals, np.e**xvals)
plt.plot(xvals, np.cos(xvals**2) + np.sin(xvals**2))
plt.plot(xvals, np.cos(xvals)**2 + np.sin(xvals)**2)
Out[36]:
[<matplotlib.lines.Line2D at 0x11b6dfc90>]

Selection from Arrays

META What is the name for this in the numpy docs?

Boolean operations on an array proceed elementwise and result in bool values.

In [37]:
# this also holds for comparisons, but results in a boolean array
arr = 2 ** np.arange(5)
print "arr:"
dumpArray(arr)

print "after a boolean test:"
tested = arr > 4
dumpArray(tested)
arr:
(5,) array of int64:
[ 1  2  4  8 16]
after a boolean test:
(5,) array of bool:
[False False False  True  True]

We can also use the boolean array as an indexer back into the original array.

In [38]:
arr[tested]
Out[38]:
array([ 8, 16])

Sometimes we want to remove elements from an array that don't meet a certain criteria. We can do that with np.where. np.where is defined in the NumPy module. It is not a method on NumPy arrays. We will encounter more of these functions shortly. One note: np.where is designed to work on N-dimensional arrays. Because of that, np.where insists on returning a tuple of arrays: one array of indices per dimension of the original array. This makes perfect sense when the number of dimensions is greater than one. However, it can be a gotcha when there is only one dimension. In either case, the indices can be used directly as an array index.

META Note, this is a first instance of fancy indexing. But, let that slide. We'll get into it more later.

In [39]:
# pick the indices of the even elements of arr 
arr = np.arange(10, 20)
indices = np.where(arr % 2 == 0)
print "the indices:  ", indices
print "the elements: ", arr[indices]
the indices:   (array([0, 2, 4, 6, 8]),)
the elements:  [10 12 14 16 18]

In the multi-dimensional case from np.where, the indices are lined up pairwise and used to select elements at that position.

In [ ]:
arr = np.arange(20).reshape(5,4)
print "a 2D array:"
print arr

indices = np.where(arr % 2 == 0)
print "the indices:  ",  indices
print "the elements: ", arr[indices]

np.where also has an alternate usage pattern: np.where(CONDITION, VALUE_WHEN_TRUE, VALUE_WHEN_FALSE). Here, two arrays provide the values when the condition is met or fails. If VALUE_WHEN_TRUE/FALSE is not of the right shape, it will be broadcast (expanded) into a compatible shape, if possible.

In [ ]:
# np.where can also be used to select elements out of arrays (directly)
arr = np.arange(10, 20)
print "                        original:", arr
print " map evens -> 1.0, odds -> -99.0:", np.where(np.logical_not(arr % 2), 1.0, -99.0)

# either or both VALUEs can be arrays
print "map odds -> 0.0, evens stay same:", np.where(np.logical_not(arr % 2), arr, 0.0)

Exercise IV

  1. Generate an array of 66 random integers from 0 to 60.
    1. Determine what percent of them are even.
    2. Determine what percent of them are greater than 50.
  2. Suppose we wanted to compute the square roots of an array that could include negative values. To deal with this, we might define $sqrt(x)=0$ for $x<0$. Generate an array of random values from [-5, 5) and show how to do this.

Grid/Window/Neighbor Operations

FIXME: the graphic (one below) for this could use a little help. Adding color to denote the segments might be very useful. For example, the image here: scipy-lectures.github.io/_images/numpy_indexing.png :

Also, the LHS of the equation confuses the issue by asking "why are we assigning back into (part of) the original array". A good learning point: the selection we've done reduces the total length by two.

In [ ]:
test = np.random.randint(10, size=(10,))
print "some sample data:"
dumpArray(test)

firstVals  = test[ :-2]  # how many values are in each of these?
secondVals = test[1:-1]
thirdVals  = test[2:  ]

movingWindowAverage = (firstVals + secondVals + thirdVals) / 3.0 # PITFALL:  divide by 3
print "a 3-element moving average"
dumpArray(movingWindowAverage)

# we can write that more compactly as:
# (test[:-2] + test[1:-1] + test[2:]) / 3.0

Exercise V: Grid Operations

Recall that the derivative of a function $f$ at $x$ can be defined as $\lim_{\Delta h \rightarrow 0} \frac{f(x+\Delta h) - f(x)}{\Delta h}$. We can approximate this very quickly using NumPy. Compute a numerical derivative, $\frac{y_{i+1}-y_{i}}{x_{i+1}-x_{i}}$, for:

  • $f(x)=x+1$
  • $f(x)=e^x$
In [40]:
# FIXME  a solution
xvals = np.linspace(-2, 2, 10)
f1 = xvals + 1
f1_p = (f1[1:] - f1[:-1]) / (xvals[1:] - xvals[:-1])
plt.plot(xvals, f1)
plt.plot(xvals[:-1], f1_p)
Out[40]:
[<matplotlib.lines.Line2D at 0x11b7b84d0>]
In [ ]:
# FIXME a solution
xvals = np.linspace(-2, 2, 10)
f1 = np.e**xvals
f1_p = (f1[1:] - f1[:-1]) / (xvals[1:] - xvals[:-1])
plt.plot(xvals, f1)
plt.plot(xvals[:-1], f1_p) 

# this has high error b/c of large spacing, exp() grows quickly to the right
# and we're using right-end point of the interval for evaluation
# try with 1000 points

Some Other Array Operations

Several useful operations are defined as methods on NumPy arrays. For a full list, see the NumPy docs:

http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html#array-methods

In [ ]:
arr = np.random.randint(0,10, size=(10,))# arange(1,10)
print "arr: ", arr

print "%18s : %s" % ("mean", arr.mean())
print "%18s : %s" % ("variance", arr.var())
print "%18s : %s" % ("std. deviation", arr.std())
print "%18s : %s" % ("cumulative sum", arr.cumsum())
print "%18s : %s" % ("cumulative product", arr.cumprod())
In [ ]:
# two other useful methods for defining predicates based on an array are .any() and .all()
arr = np.array([True, False, False])
print "arr:", arr
print "any true?:", arr.any()
print "all true?:", arr.all()

Some Array Methods

FIXME make a more comprehensive list in reference/here? notes about when to use some of the non-obvious tools? also, a later list has the ufuncs

  • Predicates

    • a.any(), a.all()
  • Reductions

    • a.mean(), a.argmin(), a.argmax(), a.trace(), a.cumsum(), a.cumprod()
  • Manipulation

    • a.argsort(), a.transpose(), a.reshape(...), a.ravel(), a.fill(...), a.clip(...)
  • Complex Numbers

    • a.real, a.imag, a.conj()

Exercise VI

  1. Suppose, poof, arr.all() (and np.all()) just disappeared. Write a function myAll that replaces them.
  2. Define a function noneTrue that returns True when no element of an array is True and False otherwise.
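
One possible solution sketch (one approach among several; np.count_nonzero and np.any are assumed to still be available):

In [ ]:
# FIXME  a solution sketch
def myAll(arr):
    # true when every element is truthy -- without calling .all()/np.all()
    arr = np.asarray(arr)
    return np.count_nonzero(arr) == arr.size

def noneTrue(arr):
    return not np.any(arr)

test = np.array([True, True, False])
print myAll(test), test.all()                                  # should agree
print noneTrue(test), noneTrue(np.zeros(3, dtype=np.bool_))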

Array Shape and Reshaping

We've used the following to get arrays of different shapes:

In [ ]:
arr = np.arange(24).reshape((3,4,2))
print arr.shape

Let's investigate shapes in a bit more detail. When modifying shapes, -1 can be used as a wildcard that will fill in the shape for one dimension, based on the others.

In [41]:
arr = np.arange(1,11)
dumpArray(arr)

# shape of 5 x ? -> 5 x 2
reshaped = arr.reshape(5, -1)
dumpArray(reshaped)
(10,) array of int64:
[ 1  2  3  4  5  6  7  8  9 10]
(5, 2) array of int64:
[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]]
In [ ]:
print arr.reshape(-1, 5)
In [ ]:
print arr.reshape(2,5)

Some other shaping utilities include:

  • np.ravel (also available as the method arr.ravel())
  • arr.flatten() (a method on arrays)
  • np.squeeze

arr.flatten(), like its name implies, makes a flat, 1-D copy of the array. np.ravel behaves similarly, but it will try to avoid making a copy

In [42]:
arr = np.arange(10).reshape(5,2)
print "arr:"
dumpArray(arr)

flat = arr.flatten()
print "flattened (my own data? %s):" % flat.flags.owndata
dumpArray(flat)

rav = arr.ravel()
print "raveled (my own data? %s)" % rav.flags.owndata
dumpArray(rav)
arr:
(5, 2) array of int64:
[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]
flattened (my own data? True):
(10,) array of int64:
[0 1 2 3 4 5 6 7 8 9]
raveled (my own data? False)
(10,) array of int64:
[0 1 2 3 4 5 6 7 8 9]

np.squeeze will remove any dimensions that have length 1. Occasionally, you might get a length 1 dimension by selection

In [43]:
arr = np.arange(10).reshape(5,1,2,1)
print "arr:"
print arr

print "whittled down:"
print arr.squeeze()
arr:
[[[[0]
   [1]]]


 [[[2]
   [3]]]


 [[[4]
   [5]]]


 [[[6]
   [7]]]


 [[[8]
   [9]]]]
whittled down:
[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]

Exercises on Array Shaping

Array Meta-data and dtype

Internally, NumPy arrays pair a flat block of memory with meta-data (shape, dtype, strides) that describes how to interpret it.

Individual elements (scalars) have lightweight wrappers around them that treat them as single-element arrays.
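
A quick sketch of that: an element pulled out of an array is a NumPy scalar, and it still carries a dtype and an (empty) shape.

In [ ]:
arr  = np.arange(10)
elem = arr[3]

print "type :", type(elem)       # a NumPy scalar type (e.g., numpy.int64)
print "dtype:", elem.dtype
print "shape:", elem.shape       # () -- it acts like a 0-dimensional array
print "item :", elem.item()      # back to a plain Python int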

In [44]:
# arrays have several pieces of meta-data, driven in part by the dtype of the array
def dumpArrayInfo(arr):
    print "%15s: %s" % ("shape", arr.shape)
    print "%15s: %s" % ("dtype", arr.dtype)                         
    print "%15s: %s" % ("size", arr.size)                           
    print "%15s: %s" % ("itemsize", arr.itemsize)                   
    print "%15s: %s" % ("size * itemsize", arr.size * arr.itemsize)

    
arr = np.arange(10)
dumpArray(arr)
dumpArrayInfo(arr)
(10,) array of int64:
[0 1 2 3 4 5 6 7 8 9]
          shape: (10,)
          dtype: int64
           size: 10
       itemsize: 8
size * itemsize: 80
In [ ]:
arr = np.arange(10.0).reshape(5,2)
dumpArrayInfo(arr)

And the dtype itself can be queried for information:

In [ ]:
def dumpDtypeInfo(dt):
    print "%15s: %s" % ("name", dt.name)
    print "%15s: %s" % ("byteorder", dt.byteorder)
    print "%15s: %s" % ("itemsize",  dt.itemsize)
    print "%15s: %s" % ("type", dt.type)

dumpDtypeInfo(arr.dtype)

Usually, we can get the right dtype through inference:

In [ ]:
arr1 = np.array([1,2,3])
arr2 = np.array([1, 2, 3.14150])

print "arr1 type: ", arr1.dtype
print "arr2 type: ", arr2.dtype

We can also manually specify a dtype in many array creation routines

In [ ]:
arr1 = np.array([1,2,3], dtype=np.float32)
dumpArray(arr1)

And we can convert dtypes with array.astype

In [ ]:
arr = np.arange(10)
dumpArray(arr)

converted = arr.astype(np.float_)
print "after converting types:"
dumpArray(converted)

FIXME both of these have weaknesses right now

One other set of information about arrays comes from the flags attribute

In [45]:
arr = np.arange(10)
print arr.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

When a NumPy array is sliced, the slice is created with new dtype, shape, and stride information. However, the underlying data is referenced (not copied). We can determine if an array owns its own data via its array.flags.owndata attribute.

In [46]:
arr1 = np.arange(10)
arr2 = arr1[3:7]

print "arr1 owndata: ", arr1.flags.owndata
print "arr2 owndata: ",arr2.flags.owndata

# and what will happen if we write into arr2?
arr2[:] = 0
print "arr1 (after assigning into arr2):", arr1
arr1 owndata:  True
arr2 owndata:  False
arr1 (after assigning into arr2): [0 1 2 0 0 0 0 7 8 9]

Universal Functions: ufuncs

NumPy ufuncs are functions that operate elementwise on one or more arrays.

When called, ufuncs dispatch to optimized C inner-loops based on the array dtype.

Built-in ufuncs

  • comparison: <, <=, ==, !=, >=, >
  • arithmetic: +, -, *, /, reciprocal, square
  • exponential: exp, expm1, exp2, log, log10, log1p, log2, power, sqrt
  • trig: sin, cos, tan, arcsin, arccos, arctan, sinh, cosh, tanh, arcsinh, arccosh, arctanh
  • bitwise: &, |, ~, ^, left_shift, right_shift
  • logical operations: logical_and, logical_or, logical_not, logical_xor
  • predicates: isfinite, isinf, isnan, signbit
  • other: abs, ceil, floor, mod, modf, round, sinc, sign, trunc
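
Beyond elementwise application, every ufunc also carries a few useful methods; a brief sketch with np.add (reduce, accumulate, and outer are standard ufunc methods):

In [ ]:
arr = np.arange(1, 6)
print "reduce     :", np.add.reduce(arr)        # same result as arr.sum()
print "accumulate :", np.add.accumulate(arr)    # same result as arr.cumsum()
print "outer (all pairwise sums):"
print np.add.outer(arr, arr)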

Axis

Methods that reduce the information in (or summarize) an array (such as sum) take an optional parameter called axis which specifies the dimension over which to perform a reduction.

  • axis=None, the default, reduces over all dimensions
  • axis=0, reduces over the outer-most/zeroth dimension
    • if we think about this dimension as the rows, we can imagine that it produces a new row
  • axis=1, reduces over the first dimension
    • if we think about this dimension as the columns, we can imagine that it produces a new column
In [ ]:
arr = np.arange(15).reshape(3,5)
dumpArray(arr)

grandSum   = arr.sum()
colSums = arr.sum(axis=0)
rowSums    = arr.sum(axis=1)

print "grandSum:", grandSum

print "colSums (a new pseudo-row):", colSums, colSums.shape
print "rowSums (a new pseudo-col):", rowSums, rowSums.shape

One other array creator, np.tile, rewards careful examination of some examples. We'll use the following arr as our base array.

In [47]:
arr = np.arange(1,5).reshape(2,2)
dumpArray(arr)
(2, 2) array of int64:
[[1 2]
 [3 4]]

Tiling over one axis appends "tiles" of data on the ends of the rows

In [48]:
print np.tile(arr,1)
print np.tile(arr,2)
print np.tile(arr,4)
[[1 2]
 [3 4]]
[[1 2 1 2]
 [3 4 3 4]]
[[1 2 1 2 1 2 1 2]
 [3 4 3 4 3 4 3 4]]

Tiling over multiple dimensions creates tiles (of the original) in the specified shape

In [49]:
print "with a (2,2) inner"
print "... and a (2,1) outer"
print np.tile(arr, (2,1)) # 2x1 tiling of a 2x2 array
print "... and a (1,2) outer"
print np.tile(arr, (1,2)) # 1x2 tiling of a 2x2 array
print "... and a (2,2) outer"
print np.tile(arr, (2,2)) # 2x2 tiling of a 2x2 array
with a (2,2) inner
... and a (2,1) outer
[[1 2]
 [3 4]
 [1 2]
 [3 4]]
... and a (1,2) outer
[[1 2 1 2]
 [3 4 3 4]]
... and a (2,2) outer
[[1 2 1 2]
 [3 4 3 4]
 [1 2 1 2]
 [3 4 3 4]]

Things get slightly different in more dimensions, especially when the dimensions of the tile and the tiling differ. The tile is first expanded to the same number of dimensions as the tiling.

In [ ]:
# original 2x2 promoted to 1x2x2 ... then used for 3x1x1 tiling
print np.tile(arr, (3,1,1)) 
In [ ]:
print np.tile(arr, (1,1,3))
In [ ]:
arr = np.tile(np.arange(1,10).reshape(3,3), (3,1,1))
print arr

As mentioned earlier, in greater than two dimensions, you need to be very careful thinking in terms of "rows and columns". Specifically, in the string representation of a 3-D array, the outer-most dimension is no longer the visual rows: it's the different panels. Thus, for sums over different axes, we might talk about:

sum(axis=?)   over which dim?   visual element    added elements
axis=0        outer-most        across panels     [1+1+1, 2+2+2, ...]
axis=1        middle            across columns    [1+4+7, 2+5+8, ...]
axis=2        inner-most        across rows       [1+2+3, 4+5+6, ...]
In [ ]:
print "axis=0"
print arr.sum(axis=0)  

print "axis=1"
print arr.sum(axis=1)  

print "axis=2"
print arr.sum(axis=2)  

print "shapes: ", arr.sum(axis=0).shape, arr.sum(axis=1).shape, arr.sum(axis=2).shape

Exercise VII

In [ ]:
# FIXME: i'd really like something with axis and reduction.  Tilling as well.
# cumsum is referred to later, could be an exercise here
In [ ]:
arr.cumsum()

Array Joining and Splitting

Several functions for combining (concatenating) and splitting up arrays are free functions in the NumPy module. The main functions (warning, these take a single tuple as their argument) are:

  • np.concatenate
  • np.vstack
  • np.column_stack
  • np.hstack

Combining arrays together is not ideal (because we have to reallocate memory - a slow operation). An alternative is to use Python lists (which append quickly on the right) until all (or sufficient) data is gathered and then convert to a NumPy array.

Note, if you find yourself constantly needing to restack vectors into 2-D arrays (when doing linear algebra), you may want to look at NumPy's matrix class. Essentially, matrices keep their 2-D shape throughout operations applied to them. They also define $*$ as matrix multiplication.
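
A tiny sketch of that difference: np.matrix results stay 2-D, and * means matrix multiplication rather than the elementwise product.

In [ ]:
a = np.arange(4).reshape(2,2)
m = np.matrix(a)

print "array row slice stays 1-D:  shape", a[0].shape
print "matrix row slice stays 2-D: shape", m[0].shape

print "array * array (elementwise):"
print a * a
print "matrix * matrix (matrix product):"
print m * m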

In [ ]:
joined = np.concatenate((np.array([1,2,3]),
                         np.array([4,5,6]),
                         np.array([7,8,9])))
print joined
In [ ]:
arr1 = np.arange(4).reshape(2,2)
arr2 = (np.arange(4).reshape(2,2) + 1) * 10

print np.concatenate((arr1, arr2), axis=0)
print np.concatenate((arr1, arr2), axis=1)

Note, for concatenation, all dimensions (except the dimension being concatenated) must have the same size.

In [ ]:
arr1 = np.array([1,   2,  3])
arr2 = np.array([11, 22, 33])

# create a vertical stack
print "vstack:"
print np.vstack((arr1, 
                 arr2))       

# creates a column stack (horizontally)
print "column_stack:"
print np.column_stack((arr1, arr2)) 

# compare np.column_stack with np.hstack()
print "hstack (1D):"
print np.hstack((arr1, arr2)) # the 1-D arrays are not treated as columns
In [ ]:
arr = np.vstack((arr1, arr2))
print "original array:"
print arr

print "hstacking:"
print np.hstack((arr, arr)) # concatenate with 1D, but ok if we have 2D things

print "vstacking:"
print np.vstack((arr, arr))

Exercises: Array Joining (old ex. 5)

  1. Earlier, we computed a moving average across an array. Write a function that produces an array of the moving average values and is padded on the left so it has the same size as the original array.
  2. Write two functions list_append and array_append that take an iterable of data, loop across it, and append the values onto a Python list and a NumPy array. For the NumPy array, use np.concatenate. For the Python list, use append. Time the execution of the two methods on input data of 2000 elements.
In [ ]:
# FIXME  a solution
# some of the information in "subtleties (optional)" (old exercise 5) is useful 
# for understanding why the joiners have their "quirks"
def list_append(itr):
    lst = []
    for i in itr:
        lst.append(i)
    return lst

def array_append(itr):
    arr = np.array([])
    for i in itr:
        arr = np.concatenate((arr, np.array([i])))
    return arr

input_data = np.random.random(2000)
%timeit -n10 list_append(input_data)
%timeit -n10 array_append(input_data)

Bivariate Grids

If you are working with functions of more than one variable, you may need to generate values over a 2-D (or higher) grid. NumPy has some utilities to help in this case. np.meshgrid produces arrays of values that, when paired elementwise (or in higher dimensions), give the Cartesian product (all pairings) of the input values.

In [54]:
x = np.array([1,2,3])
y = np.array([10,20])
# xy-values like:  (1,10), (1,20), (2,10), (2,20), (3,10), (3,20)

xx, yy = np.meshgrid(x, y) # note, x expanded across columns
print "xx:"
dumpArray(xx)
print "yy:"
dumpArray(yy)

# pair xx, yy elementwise to get all possible (x,y)-values from the two base grids
print "sum:"
zz = xx + yy
print zz



# or, zz = np.cos(xx) + np.sin(yy)
xx:
(2, 3) array of int64:
[[1 2 3]
 [1 2 3]]
yy:
(2, 3) array of int64:
[[10 10 10]
 [20 20 20]]
sum:
[[11 12 13]
 [21 22 23]]

If we pass the xy-values around, we may wish to keep them together and de-tuple them at a later point.

In [65]:
dx, dy = 0.15, 0.05
y, x = np.mgrid[slice(-3, 3 + dy, dy), slice(-3, 3 + dx, dx)]
print x
print y
z = (1 - x / 2. + x ** 5 + y ** 3) * np.exp(-x ** 2 - y ** 2)
plt.pcolor(x, y, z)
plt.show()
[[-3.   -2.85 -2.7  ...,  2.85  3.    3.15]
 [-3.   -2.85 -2.7  ...,  2.85  3.    3.15]
 [-3.   -2.85 -2.7  ...,  2.85  3.    3.15]
 ..., 
 [-3.   -2.85 -2.7  ...,  2.85  3.    3.15]
 [-3.   -2.85 -2.7  ...,  2.85  3.    3.15]
 [-3.   -2.85 -2.7  ...,  2.85  3.    3.15]]
[[-3.   -3.   -3.   ..., -3.   -3.   -3.  ]
 [-2.95 -2.95 -2.95 ..., -2.95 -2.95 -2.95]
 [-2.9  -2.9  -2.9  ..., -2.9  -2.9  -2.9 ]
 ..., 
 [ 2.9   2.9   2.9  ...,  2.9   2.9   2.9 ]
 [ 2.95  2.95  2.95 ...,  2.95  2.95  2.95]
 [ 3.    3.    3.   ...,  3.    3.    3.  ]]

Broadcasting (still to come in details) can achieve similar effects (the orientation of the result is slightly different here):

In [64]:
# but note, we can also do this:
x = np.array([1,2,3])
y = np.array([10,20])
# x -> row
x2d, y2d = np.atleast_2d(x,y)
print "shapes: ", x2d.shape, y2d.shape
print x2d
print y2d

x2d = x2d.T
print x2d + y2d # uses broadcasting (more in the next section!)
shapes:  (1, 3) (1, 2)
[[1 2 3]]
[[10 20]]
[[11 21]
 [12 22]
 [13 23]]

Often, we like to specify grids with slice notation, using :. But we can't make a function call with a slice as an argument. For example, this won't work in Python: np.ogrid(1:4, 1:3). So, there is a special NumPy object called ogrid that lets us directly specify grids with slices.

In [ ]:
xx, yy = np.ogrid[1:4, 10:21:10] # upper point(s) not included
print "xx:"
dumpArray(xx)

print "yy:"
dumpArray(yy)

print "sum:"
zz = xx + yy
print zz

# mgrid is similar; values are transposed wrt meshgrid
# print np.mgrid[1:4, 8:10]
# print np.meshgrid(range(1,4), range(8,10))

np.searchsorted

FIXME not a huge fan of search sorted. seems low on priority list.

In [66]:
arr = np.concatenate([np.arange(1,6), np.arange(20,26), np.arange(50,55)])
#values = [1]*3 + [2]*3 + 3*[3]
#arr = np.array(values)
print arr

breakpts = np.searchsorted(arr, [5, 20, 30])
print "breakpoints: ", breakpts

print "arr indexed by pairs of breakpoints"
print arr[           :breakpts[0]]
print arr[breakpts[0]:breakpts[1]]
print arr[breakpts[1]:breakpts[2]]
print arr[breakpts[2]:           ]
[ 1  2  3  4  5 20 21 22 23 24 25 50 51 52 53 54]
breakpoints:  [ 4  5 11]
arr indexed by pairs of breakpoints
[1 2 3 4]
[5]
[20 21 22 23 24 25]
[50 51 52 53 54]

Broadcasting

In short, broadcasting lets arrays with different but compatible shapes be arguments to ufuncs.

In [67]:
arr1 = np.arange(5)
print "arr1 + scalar:"
print arr1 + 10

print "arr1 + arr1 (same shape):"
print arr1 + arr1


arr2 = np.arange(5).reshape(5,1) * 10
arr3 = np.arange(5).reshape(1,5) * 100

print "arr2 shape:", arr2.shape
print "arr3 shape:", arr3.shape

print "arr1 + arr2 [ %s + %s --> %s ]:" % (arr1.shape, arr2.shape, (arr1 + arr2).shape)
print arr1 + arr2
print "arr1 + arr3 [ %s + %s --> %s ]:" % (arr1.shape, arr3.shape, (arr1 + arr3).shape)
print arr1 + arr3
arr1 + scalar:
[10 11 12 13 14]
arr1 + arr1 (same shape):
[0 2 4 6 8]
arr2 shape: (5, 1)
arr3 shape: (1, 5)
arr1 + arr2 [ (5,) + (5, 1) --> (5, 5) ]:
[[ 0  1  2  3  4]
 [10 11 12 13 14]
 [20 21 22 23 24]
 [30 31 32 33 34]
 [40 41 42 43 44]]
arr1 + arr3 [ (5,) + (1, 5) --> (1, 5) ]:
[[  0 101 202 303 404]]
In [ ]:
arr1 = np.arange(6).reshape(3,2)
arr2 = np.arange(10, 40, 10).reshape(3,1)

print arr1 + arr2

FIXME: array names

Here, an array of shape (3, 1) is broadcast to an array with shape (3, 2)

What are the rules for broadcasting?

In order for an operation to broadcast, the size of all the trailing dimensions for both arrays must either be equal or be one. Dimensions that are one and dimensions that are missing from the "head" are duplicated to match the larger number. So, we have:

A      (1d array):              3
B      (2d array):          2 x 3
Result (2d array):          2 x 3

A      (2d array):          6 x 1
B      (3d array):      1 x 6 x 4
Result (3d array):      1 x 6 x 4

A      (4d array):  3 x 1 x 6 x 1
B      (3d array):      2 x 1 x 4
Result (4d array):  3 x 2 x 6 x 4

Some other interpretations of compatibility:

  • Tails must be the same, ones are wild.
  • If one shape is shorter than the other, pad the shorter shape on the LHS with 1s.
    • Now, from the right, the shapes must be identical with ones acting as wild cards.
In [ ]:
a1 = np.array([1,2,3])       #   3  --> 1x3
b1 = np.array([[10, 20, 30], # 2x3
               [40, 50, 60]]) 
print a1 + b1
In [ ]:
result = (np.ones((  6,1)) +  # 3rd dimension replicated
          np.ones((1,6,4)))
print result.shape

result = (np.ones((3,6,1)) + 
          np.ones((1,6,4)))   # 1st and 3rd dimension replicated
print result.shape

Sometimes, it is useful to explicitly insert a new dimension in the shape. We can do this with a fancy slice that takes the value np.newaxis.

In [ ]:
arr1 = np.arange(6).reshape((2,3))  # 2x3
arr2 = np.array([10, 100])          #   2

try:
    arr1 + arr2
except ValueError:
    print "fail"

# let's massage the shape
arr3 = arr2[:, np.newaxis] # arr2 -> 2x1
print "arr3 shape:", arr3.shape
print "arr1 + arr3"
print arr1 + arr3
In [ ]:
arr = np.array([10, 100])
print "original shape:", arr.shape

arrNew = arr[np.newaxis, :]
print "arrNew shape:", arrNew.shape

Fancy Indexing

There are a few forms of fancy indexing in NumPy. The first is indexing an array by other arrays (we've seen one example of this). The result has the same shape as the indexing arrays.

In [ ]:
arr = np.arange(15).reshape((3,5))
print arr

# select 0,0  0,2  1,3   2,4  as a 2x2 array
rows = np.array([[0, 0], 
                 [1, 2]])
cols = np.array([[0, 2], 
                 [3, 4]])

# values are lined up pairwise (from inputs) to positions of output
print arr[rows, cols] # ---> in shape of row,col index matrices

Another form of fancy indexing comes when we use an array of np.bool_. Here, the indexing array must have an element (True/False) for every position in the base array.

In [ ]:
evens = (arr % 2 == 0) # note, arr % 2 is not boolean. 
                       # quick question:  how can we check that?
    
print arr[evens] # boolean array:  we have to say yes or no for each element
                 # compare with indices:  pick elements
In [ ]:
# FIXME:  this is more confusing than helpful
#print "arr % 2"
#dumpArray(arr % 2)

# is it boolean?  no.  so, it will be selecting elements by (triples) of indices [0,1,0]
#print "not what we might expect"
#dumpArray(arr[arr%2]) # only indexing outer-most dimension (each 5 element long)

One other gotcha: Python's bool type is not the same as np.bool. In fact, Python's bools are Python ints (you can check it below). Thus, NumPy will use them as the values 0 and 1 -- which means numerical indexes.

In [ ]:
arr = np.arange(35).reshape(5,7)
print "arr:"
print arr

b = arr>20
print "a boolean selection:"
print arr[b]

print "some of the boolean indices:"
print b[:,5]

# compare, Python's True/False (python bool) are -ints-.  
# prove it: print isinstance(True, int)
# NumPy uses them as 0/1 integers
#      ... unless they are in a np.array of type np.bool
print "raw Python bools as indices"
# select row[0], row[0], row[0], row[1], row[1]
pyBools = [False, False, False, True, True]
print arr[pyBools]

print "np.bools"
npBools = np.array(pyBools, dtype=np.bool)
# an np.bool_ array used as an index selects the rows where it is True
print arr[npBools]  

Exercise

We saw these two questions earlier. Can you answer them now? If you answered them before (nice work!), can you explain them now?

  1. In a test array (2x4), replace every row with a particular row (for example, 2, 4, 8, 16)?
  2. In a test array (2x4), replace every column with a particular column (for example, 1,3)?
  3. Let a = np.random.randint(0, 10, (10,10))
    1. select the even numbers
    2. select the odd numbers
    3. select the values in the interval (4,7)
  4. In Matlab, if you index a like: a[[1,2],[1,2]], you get a 2x2 cube made up of [[a[1,1], a[1,2]], [a[2,1], a[2,2]]]. If you do that in Python, you get: [a[1,1], a[2,2]]. Write a Python function, matdex, which takes a list of indexes, and returns a list of indexes that, when used to index a NumPy array, will return the Matlab-style selection. Note, this is either ridiculous or trivial, depending on how you solve it.
In [ ]:
# META - FIXME:  this is an answer
def matdex1(*indices):
    return np.meshgrid(*indices)

def matdex2(*indices):
    return np.meshgrid(*indices, indexing="ij")

arr = np.arange(15).reshape(3,5)
print "original array:"
print arr

print "desired output:"
print [[arr[1,1], arr[1,2]], [arr[2,1], arr[2,2]]]

print "method 1:"
print arr[matdex1([1,2], [1,2])].T    # you're clever

print "method 2:"
print arr[matdex2([1,2], [1,2])]      # you're super!

Some Additional NumPy Subpackages

  • np.fft — Fast Fourier transforms

  • np.polynomial — Orthogonal polynomials, spline fitting

  • np.linalg — Linear algebra

    • cholesky, det, eig, eigvals, inv, lstsq, norm, qr, svd
  • np.math — C standard library math functions

  • np.random — Random number generation

    • beta, gamma, geometric, hypergeometric, lognormal, normal, poisson, uniform, weibull
    • many others, if you need it NumPy probably has it

FFT

In [68]:
PI = np.pi
t = np.linspace(0, 120, 4000)
nrr = np.random.random

signal  =  12 * np.sin(3 * 2*PI*t)   # 3 Hz
signal +=   6 * np.sin(8 * 2*PI*t)   # 8 Hz
signal += 1.5 * nrr(len(t))          # noise

FFT   = abs(np.fft.fft(signal))
freqs = np.fft.fftfreq(signal.size, t[1] - t[0])

plt.plot(t,signal); plt.xlim(0, 4); plt.show()
plt.plot(freqs, FFT);

Random distributions and sampling: np.random

In [ ]:
np.random.normal(loc = 0, scale = 2, size = (2,10)) # loc = mean, scale = standard deviation
In [ ]:
np.random.permutation(np.arange(10))

Linear Algebra: np.linalg and np.matrix

In [ ]:
import numpy.linalg as la

# Ax = b

A = np.array([[2,1],
              [1,2]])
b = np.array([.5, .75])

soln1 = la.solve(A,b)

Ainv = la.inv(A)
soln2 = np.dot(Ainv, b)  # generally don't want to take explicit inverse

np.allclose(soln1, soln2)
In [ ]:
# note also, matrices --> (a 2d numpy array)
b = np.array([.5, .75])

A = np.matrix('2 1; 1 2') # matlab style
b = np.matrix(b).T        # convert a np.array

# matrix has .I --> this is the matrix inverse (pseudo-inverse)
#  -and- 
# * is matrix multiplication
A.I * b

Compound Data: Structured Arrays / Record Arrays: np.record

NumPy arrays have elements with a single type. But, that type can be a compound type (i.e., a record or a struct).

Two main recommended ways of specifying type codes:

  • b1, i1, i2, i4, i8, u1, u2, u4, u8, f2, f4, f8, c8, c16, a<n> (bytes, ints, unsigned ints, floats, complex and fixed length strings of a given byte lengths)
  • int8,...,uint8,...,float16, float32, float64, complex64, complex128 (similar but with bit sizes)
In [ ]:
# a record with a 4 byte int, a 4 byte float, and 10 bytes of characters (ascii values)
x = np.zeros((2,),dtype=('i4,f4,a10'))
print x
print repr(x)

x[:] = [(1, 5., 'Hello'), (2, 6., 'World')]
print x
print repr(x)

print "a field:"
print x['f1']
print repr(x['f1'])
In [ ]:
%%file patient-records.csv
name,date,weight(kg),height(cm)
mark,2011-01-01,86.1,180
barb,2012-02-03,65.7,167
ethan,2013-04-06,29.45,127
In [ ]:
patient_dtype = [("name", "a10"),
                 ("visit_date", 'datetime64[D]'),   # [D] means sepcify to "Day" units
                 ("weight", np.float),
                 ("height", np.int)]
data = np.loadtxt("patient-records.csv", skiprows=1, delimiter=",", dtype=patient_dtype)

print "first row: ", data[0]
print "all weights: ", data['weight']

# BMI = kg / m**2
print "BMIs:", data['weight'] / (data['height']/100.0)**2

We can also save and load arrays

In [ ]:
#saving / load data
np.savez('data.npz',data=data) # list of arrays to store
dataz = np.load('data.npz')

print dataz.files     # list of arrays stored in this archive
print dataz['data']
In [ ]:
# cleanup
# !rm data.npz

Exercise: Data I/O

Using the code snippet below to grab some stock market data from Yahoo, read the data into an array of appropriately structured records. The data grabbed is for all of 2013. Find the highest high and the lowest low for your ticker symbol. For extra credit, graph the daily closes.

In [ ]:
from urllib import urlretrieve
sym = "GE"
quote_url = "http://ichart.finance.yahoo.com/table.csv?s=%s&d=11&e=31&f=2013&g=d&a=0&b=2&c=2013&ignore=.csv"
urlretrieve(quote_url % sym, sym+".csv")
In [ ]:
# note, on unix-heritage boxes, you can use head and tail to glance at the data
# !head -10 GE.csv
# !tail -10 GE.csv
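
A possible solution sketch for the exercise above. It assumes the downloaded CSV has the usual Yahoo header Date,Open,High,Low,Close,Volume,Adj Close; the field names and quote_dtype below are illustrative, not part of any library.

In [ ]:
# FIXME  a solution sketch -- assumes the Yahoo columns:
#   Date,Open,High,Low,Close,Volume,Adj Close
quote_dtype = [("date", "datetime64[D]"),
               ("open", np.float_), ("high", np.float_),
               ("low", np.float_), ("close", np.float_),
               ("volume", np.int_), ("adj_close", np.float_)]

quotes = np.loadtxt(sym + ".csv", skiprows=1, delimiter=",", dtype=quote_dtype)

print "highest high:", quotes["high"].max()
print "lowest low  :", quotes["low"].min()

plt.plot(quotes["close"][::-1])   # extra credit: daily closes (Yahoo files were typically newest-first)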

NumPy Resources
