from:
and presented by: Dr. Evil, Ph.D.
Preferred title format, but --to latex --post PDF isn't happy with url referenced images.
We're going to run a few quick commands in ipython to shorten a few names and to make some nice graphics interaction (if you are in ipython notebook).
%pylab inline --no-import-all
import numpy as np
import matplotlib.pyplot as plt
import os.path as osp
block
(back quotes)bold.block
block
I didn't do this with earlier version, but run Cell-> All Output -> Clear before commiting.
cells (should normally) hold one idea, set of comparisons, or thought
/home/mfenner/continuum/repos/presentations/python_for_science/numpy_chapter (served via python -m SimpleHTTPServer)
#!/usr/bin/env bash
cd ~/continuum/repos/presentations/python_for_science/numpy_chapter
python -m SimpleHTTPServer &
chromium localhost:8000 &
cd ~/continuum/classes/2013-12-python_for_finance
ipython notebook &
chromium ~/continuum/workspace/gold_via_matt-t/pfs-2013-aug/PfSClass_20130829/Exercises/index.html &
cd ~/continuum/repos/presentations/modular/numpy_chapter
ipython notebook &
Numpy is a Python library that provides multi-dimensional arrays, matrices, and fast operations on these data structures.
NumPy arrays (numpy.ndarray
) are the fundamental data type in NumPy. They have:
For example:
M, N = 6, 5
arr = np.zeros(shape = (M,N), dtype=float)
print arr
is the NumPy array corresponding to the two-dimensional matrix:
NumPy has both a general N-dimensional array and a specific 2-dimensional matrix data type. NumPy arrays may have an arbitrary number of dimensions.
NumPy arrays support vectorized mathematical operation.
arr = np.arange(15).reshape(3,5)
print "original:"
print arr
print
print "doubled:"
print arr * 2
We are generally going to visually inspect the str
representation of array. Here's how the str
and repr
differ:
arr = np.arange(10).reshape(2,5)
print str(arr)
print repr(arr)
Many array creation functions take a shape parameter. For a 1D array, the shape can be an integer.
print np.zeros(shape=5, dtype=float)
For nD arrays, the shape needs to be given as a tuple.
print np.zeros(shape=(4,3,2), dtype=float)
In the above example, the outer-most dimension (the first item in the tuple passed to shape
, 4) is the dimension that varies the most slowly. The inner-most dimension (the last item in the tuple, 2) is the dimension that varies the most quickly. Visually, this means that for 3D arrays, we might describe the shape as: (panels, rows, columns). Contrast this with the 2D case where we usually discuss shape as: (rows, columns).
We can always ask an array its shape by accessing the shape
attribute (e.g., arr.shape
).
arr = np.zeros(shape=(4,3,2), dtype=float)
print "shape is:", arr.shape
All arrays have a specific type for their associated elements. Every element in the array shares that type. The NumPy terminology for this type is dtype. The basic types are bool, int, uint, float, and complex. The types may be modified by a number indicating their size in bits. Python's built-in types can be used as a corresponding dtype. Note, the generic NumPy types end with an underscore ("_") to differentiate the name from the Python built-in.
META HTML table not supported by latex/pdf converter
Python Type | NumPy dtype |
---|---|
bool | np.bool\_ |
int | np.int\_ |
float | np.float\_ |
complex | np.complex\_ |
META No markdown table (sad face):
+------------+-----------+
+Python Type |NumPy dtype+
==========================
+------------+-----------+
|bool |np.bool_ |
+------------+-----------+
|int |np.int_ |
+------------+-----------+
|float |np.float_ |
+------------+-----------+
|complex |np.complex_|
+------------+-----------+
Here is one example of specifying a dtype:
arr = np.zeros(shape=(5,), dtype=np.float_) # NumPy default sized float
print arr, "->", arr.dtype
And, we can make a clever loop over some possible dtypes and pass them through eval:
# META: FIXME possibly too clever. the trade off is between an ugly cell that loses the
# thread of what's being compared to more complicated Python
for dt in ["float", "np.float_", "np.float64", "np.uint8", "np.int"]:
arr = np.zeros(shape=(5,), dtype=eval(dt))
print "%10s: %s -> %s" % (dt, arr, arr.dtype)
Now, we'll take just a moment to define one quick helper function to show us these details in a pretty format.
def dumpArray(arr):
print "%s array of %s:" % (arr.shape, arr.dtype)
print arr
META Are scattered "in-text" links at all valuable? Should they be aggregated at the end (in an outline: General Links: , Data Types:, etc.
See also: http://docs.scipy.org/doc/numpy/user/basics.types.html
NumPy provides a number of ways to create an array.
np.zeros
and np.ones
¶zrr = np.zeros(shape=(2,3))
dumpArray(zrr)
print np.ones(shape=(2,5))
one_arr = np.ones(shape = (2,2), dtype=int)
dumpArray(one_arr)
np.empty
¶np.empty
is lightning quick because it simply requests some amount of memory from the operating system and then does nothing with it. Thus, the array returned by np.empty
is uninitialized. Consider yourself warned. np.empty
is very useful is you know you are going to fill up all the (used) elements of your array later.
# DANGER! uninitialized array
# (re-run this cell and you will very likely see different values)
err = np.empty(shape=(2,3), dtype=int)
dumpArray(err)
np.arange
¶np.arange
generates sequences of numbers like Python's range
built-in. Non-integer step values may lead to unexpected results: for these cases, you may prefer np.linspace
and see below. (For a quick - and mostly practical - discussion of the perils of floating-point approximations, see https://docs.python.org/2/tutorial/floatingpoint.html).
As with range
, the ending point it not included.
print "int arg: %s\n" % np.arange(10) # python's range/xrange --> stopping
print "float arg: %s\n" % np.arange(10.0) # python's range/xrange --> stopping
print "step: %s\n" % np.arange(0, 12, 2) # end point not included, like python's range()
print "neg. step: %s\n" % np.arange(10, 0, -1.0)
.1 + .2 == .3
np.linspace
and np.logspace
¶np.linspace(BEGIN, END, NUMPT)
generates exactly NUMPT number of points, evenly spaced, on $[BEGIN, END]$. Unlike Python's range
and np.arange
, these functions are inclusive at BEGIN and END (they produce a closed interval).
print "End-points are included:"
print np.linspace(0, 10, 2)
print np.linspace(0, 10, 3)
print np.linspace(0, 10, 4)
For np.logspace(BEGIN, END, NUMPTS)
, the array is closed on $[10^{BEGIN}, 10^{END}]$ with NUMPT points spread evenly on a logarithmic scale.
# (10^BEGIN, 10^END, NUMPT)
print np.logspace(0, 3, 4)
np.eye
and np.diag
¶np.eye(N)
produces an array with shape (N,N) and ones on the diagonal (an NxN identity matrix).
print np.eye(3)
np.diag
produces a diagonal 2D array from an array argument.
print np.diag(np.arange(1,4))
It can also be used to extract a diagonal from a given array.
# the diagonal of an identity matrix ...
print np.diag(np.eye(3))
It is common to create arrays whose elements are samples from a random distribution. For the many options, see:
npr = np.random
print "Uniform on [0,1):"
dumpArray(npr.random((2,5)))
np.random
has some redundancy. It also has some variation in calling conventions.
standard_normal
takes one tuple argumentrandn
(which is very common to see in code) takes n arguments where n is the number of dimensions in the resultprint "std. normal - N(0,1):"
dumpArray(npr.standard_normal((2,5)))
print
dumpArray(npr.randn(2,5)) # one tuple parameter
There are also some differences on discrete distributions:
randint
excludes its upper boundrandom_integers
includes its upper boundprint "Uniform ints on [0,5) - upper open:"
dumpArray(npr.randint(0, 5, (2,5)))
print "\nUniform ints on [0,5] - upper closed:"
dumpArray(npr.random_integers(0, 5, (2,5)))
Note, it is also possible to create NumPy arrays from Python lists and tuples. While this is a nice capability, remember that instantiating a Python list can take relatively long compared to directly using NumPy building blocks. Other containers and iterables will not, generally, give useful results.
dumpArray(np.array([1, 2, 3]))
print
dumpArray(np.array([10.0, 20.0, 3]))
Dimensionality is maintained within nested lists:
dumpArray(np.array([[1, 2, 3],
[4, 5, 6]]))
print
dumpArray(np.array([[1, 2],
[3, 4],
[5, 6]]))
np.r_
and np.c_
¶np.r_
and np.c_
are special objects, then when sliced (think obj[some, :, indexers]
), return NumPy array
s constructed from the contents of the slices. These tools can be used to abbreviate some array constructions.
np.c_[np.arange(10), np.arange(10)*2]
np.r_[np.arange(10), np.arange(10)*2]
[15 minutes to do. 10 minutes to discuss.]
Unless otherwise specified, try to solve these problems using NumPy but not using raw Python.
FIXME The following problem implies giving the students at least one way to time things (and probably two). timeit.timeit/repeat (from a script) and %timeit seem like the best options.
[1.0, 2.0, 3.0, ..., 1E6]
.np.pi
.Items in NumPy arrays may be accessed using a single index composed of multiple values
META Compare several small outputs in one cell to several small cells. [next section has a nice sequence of small cells]. is one preferred all the time? is it case by case? [markfe: slight preference for many small cells]
arr = np.arange(24).reshape(4,6) # random.randint(11, size=(4, 6))
print "the array:"
print arr
print "index [2,2] :", arr[2, 2]
print "index [2] :", arr[2]
# non-idiomatic, creates a copy of arr[2] and then indexes into that copy
print "index [2][2]:", arr[2][2]
Compare this with indexing into a nested Python list
aList = [list(row) for row in arr]
print aList
print aList[2][2]
# META: I (MarkFe) find these to be far more helpful (for students) than a raw traceback.
try:
print aList[2,2]
except TypeError as e:
print "Unhappy with multi-value index"
print "Exception message:", e
We can also use slicing to select entire row and columns at once:
FIXME Images needs some re-love (want consistent row,col ... not i,j).
print "array:"
print arr
print "accessing a row:"
dumpArray(arr[2,:])
print "accessing a column:"
dumpArray(arr[:,2])
#print "a row:", arr[2,:], "has shape:", arr[2,:].shape # third row, "all" columns
#print "a col:", arr[:,2], "has shape:", arr[:,2].shape # "all" rows, at third column
Bear in mind that numerical indexing will reduce the dimensionality of the array. Slicing from index
to index + 1
can be used to keep that dimension if you need it.
print "lost dimension:",
dumpArray(arr[2, 1:4])
print "kept dimension:",
dumpArray(arr[2:3, 1:4])
Multiple slices, as part of an index, can select a region out of an array
print "array:"
print arr
print "a sub-array:"
dumpArray(arr[1:3, 2:4])
Slices are always views of the underlying array. Thus, modifying them modifies the underlying array
print "even elements (at odd indices) of first row:"
print arr[0, ::2] # select every other element from first row
arr[0,::2] = -1 # update is done in-place, no copy
print "after assinging to those:"
print arr
You may have noticed something peculiar in the cell above. There is an assignment of a scalar to an array. This is called broadcasting and in this simple case, it simply expands the scalar value to fill the elements of the target. Here's another example:
arr[0] = 10
print arr
And, here is one more example of broadcasting and slicing: assigning a value to fill a column.
arr[:,2] = 99
print arr
As with Python lists, empty start/end points in a slice represent the beginning/end of the NumPy array.
# fill the visual lower-left box with 0
arr[2:, :2] = 0
print arr
# fill the visual lower-right box with -1
arr[2:, 3:] = -1
print arr
We can also assign sequences, if the shapes on the left-hand side and the right-hand side match
arr[3, :] = [10, 20, 30, 40, 50, 60]
print arr
Sequence assignment extends to multi-dimensional objects
arr[1:3, 3:5] = [[2, 4], [8, 16]]
print arr
As with Python, -index
is the equivalent of len(seq) - 1
.
META What to do with teaching notes like the following? In an empty notebook they make sense. In a filled notebook, they do not makes sense.
-1
(on the RHS) is an end-point so it is -not- included. So, the interpreation is that we "do not include the last element". Thus, on the RHS we lose one element and on the LHS we lose one element. Both sides are the same length - the original length of the array minus one - and it is a legal assignment. But, not all items of arr
have been updated. Which element is unaffected by the assignment?
# shift array
arr = 2**np.arange(10)
print "original:"
dumpArray(arr)
arr[1:] = arr[:-1]
print "after slicing assignment"
dumpArray(arr)
# using a negative stride indicates walking backward
# it may be surprising, but this is still -not- a copy
# the reverse striding still shares data with the underlying array
rev_arr = arr[::-1]
rev_arr[0] = -99
print arr
[Est to do. Est to discuss.]
This is from way later. Makes far more sense here:
Let a = np.arange(200)
Create a sample array with shape (3,4).
These two are more difficult. We won't answer them now (we will revisit them in a bit), but see if you can figure them out.
Math is quite simple -- and this is part of the reason that using NumPy arrays can significantly simplify numerical code. The generic pattern array OP scalar
(or scalar OP array
), applies OP
(with the scalar
value) across elements of array
.
# array OP scalar applies across all elements and creates a new array
arr = np.arange(10)
print " arr:", arr
print " arr + 1:", arr + 1
print " arr * 2:", arr * 2
print "arr ** 2:", arr ** 2
print "2 ** arr:", 2 ** arr
print " arr | 1:", arr | 1 # bit-wise ops (see also np.logical_and, etc.)
print " arr & 1:", arr & 1
# NOTE: arr += 1, etc. for in-place
# FIXME: possibly way too cute/fancy
for expr in ["arr",
"arr + 1", "arr * 2", "arr ** 2", "2 ** arr",
"arr | 1", "arr & 1"]: # bitwise ops (see also: np.logical_and, etc.)
print "%10s: %s" % (expr, eval(expr))
# array OP array works element-by-element and creates a new array
arr1 = np.arange(5)
arr2 = 2 ** arr1 # makes a new array
print arr1, "+", arr2, "=", arr1 + arr2
print arr1, "*", arr2, "=", arr1 * arr2
A number of important mathematical operations on arrays are defined as functions in the NumPy module (not as methods on NumPy arrays). Some operations are even available both ways. Some of the more important mathematical routines include: sin, cos, tan, exp, log
. We can use these as np.sin
, for example. For a complete list, see http://docs.scipy.org/doc/numpy/reference/routines.math.html
arr = np.arange(-np.pi, np.pi, np.pi/4)
print "some multiples of pi:"
print arr
print "... and their cosines:"
print np.cos(arr)
Suppose we wanted to graph several functions on the interval [-2.0, 2.0). matplotlib lets us graph xy-values very quickly using: plt.plot(xvals, yvals)
given in the form [x1, x2, x3, ...], [y1, y2, y3, ...]
(i.e., pairwise sequences). Repeated plt.plot
commands will go to the same graph. Graph the following functions:
Graph a parametric equation over $t$ on $[0,2\pi]$ defined by:
You may want to issue a matplotlib statement: plot.axis('equal')
to ensure you don't get a skewed perspective on your result.
#FIXME a solution
ts = np.linspace(0, 2*np.pi, 1000)
plt.axis('equal')
plt.plot(np.sin(ts), np.cos(ts))
# FIXME: a solution
xvals = np.linspace(-2, 2, 100)
plt.plot(xvals, xvals+1)
plt.plot(xvals, np.e**xvals)
plt.plot(xvals, np.cos(xvals**2) + np.sin(xvals**2))
plt.plot(xvals, np.cos(xvals)**2 + np.sin(xvals)**2)
META What is the name for this in the numpy docs?
Boolean operations on an array proceed elementwise and result in bool values.
# this also holds for comparisons, but results in a boolean array
arr = 2 ** np.arange(5)
print "arr:"
dumpArray(arr)
print "after a boolean test:"
tested = arr > 4
dumpArray(tested)
We can also use the boolean array as an indexer back into the orignal array.
arr[tested]
Sometimes we want to remove elements from an array that don't meet a certain criteria. We can do that with np.where
. np.where
is defined in the NumPy module. It is not a method on NumPy arrays. We will encounter more of these functions shortly. One note: np.where
is designed to work on N-dimensional arrays. Because of that, np.where
insists on returning a tuple of arrays one array of indices per dimension of the original array. This makes perfect sense when the number of dimension is greater than one. However, it can be a gotcha when there is only one dimension. In either case, the indices can be used directly as an array
index.
META Note, this is a first instance of fancy indexing. But, let that slide. We'll get into it more later.
# pick the indices of the even elements of arr
arr = np.arange(10, 20)
indices = np.where(arr % 2 == 0)
print "the indices: ", indices
print "the elements: ", arr[indices]
In the multi-dimensional case from np.where
, the indices are lined up pairwise and used to select elements at that position.
arr = np.arange(20).reshape(5,4)
print "a 2D array:"
print arr
indices = np.where(arr % 2 == 0)
print "the indices: ", indices
print "the elements: ", arr[indices]
np.where
also has an alternate usage pattern: np.where(CONDITION, VALUE_WHEN_TRUE, VALUE_WHEN_FALSE)
. Here, two arrays provide the values when the condition is met or fails. If VALUE_WHEN_TRUE/FALSE
is not of the right shape, it will be broadcast (expanded) into a compatible shape, if possible.
# np.where can also be used to select elements out of arrays (directly)
arr = np.arange(10, 20)
print " original:", arr
print " map evens -> 1.0, odds -> -99.0:", np.where(np.logical_not(arr % 2), 1.0, -99.0)
# either or both VALUEs can be arrays
print "map odds -> 0.0, evens stay same:", np.where(np.logical_not(arr % 2), arr, 0.0)
FIXME: the graphic (one below) for this could use a little help. Adding color to denote the segments might be very useful. For example, the image here: scipy-lectures.github.io/_images/numpy_indexing.png :
Also, the LHS of the equation confuses the issue by asking "why are we assigning back into (part of) the original array". A good learning point: the selection we've done reduces the total length by two.
test = np.random.randint(10, size=(10,))
print "some sample data:"
dumpArray(test)
firstVals = test[ :-2] # how many values are in each of these?
secondVals = test[1:-1]
thirdVals = test[2: ]
movingWindowAverage = (firstVals + secondVals + thirdVals) / 3.0 # PITFALL: divide by 3
print "a 3-element moving average"
dumpArray(movingWindowAverage)
# we can write that more compactly as:
# (test[:-2] + test[1:-1] + test[-2:]) / 3.0
Recall that the derivative of a function $f$ at $x$ can be defined as $\lim_{\Delta h \rightarrow 0} \frac{f(x+\Delta h) - f(x)}{\Delta h}$. We can approximate this very quickly using NumPy. Compute a numerical derivative, $\frac{y_{i+1}-y_{i}}{x_{i+1}-x_{i}}$, for:
# FIXME a solution
xvals = np.linspace(-2, 2, 10)
f1 = xvals + 1
f1_p = (f1[1:] - f1[:-1]) / (xvals[1:] - xvals[:-1])
plt.plot(xvals, f1)
plt.plot(xvals[:-1], f1_p)
# FIXME a solution
xvals = np.linspace(-2, 2, 10)
f1 = np.e**xvals
f1_p = (f1[1:] - f1[:-1]) / (xvals[1:] - xvals[:-1])
plt.plot(xvals, f1)
plt.plot(xvals[:-1], f1_p)
# this has high error b/c of large spacing, exp() grows quickly to the right
# and we're using right-end point of the interval for evaluation
# try with 1000 points
Several useful operations are definied as methods on NumPy arrays. For a full list, see the NumPy docs:
http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html#array-methods
arr = np.random.randint(0,10, size=(10,))# arange(1,10)
print "arr: ", arr
print "%18s : %s" % ("mean", arr.mean())
print "%18s : %s" % ("variance", arr.var())
print "%18s : %s" % ("std. deviation", arr.std())
print "%18s : %s" % ("cumulative sum", arr.cumsum())
print "%18s : %s" % ("cumulative product", arr.cumprod())
# two other useful methods for defining predicates based on an array are .any() and .all()
arr = np.array([True, False, False])
print "arr:", arr
print "any true?:", arr.any()
print "all true?:", arr.all()
FIXME make a more comprehensive list in reference/here? notes about when to use some of the non-obvious tools? also, a later list has the ufuncs
Predicates
a.any(), a.all()
Reductions
a.mean(), a.argmin(), a.argmax(), a.trace(), a.cumsum(), a.cumprod()
Manipulation
a.argsort(), a.transpose(), a.reshape(...), a.ravel(), a.fill(...), a.clip(...)
Complex Numbers
a.real, a.imag, a.conj()
arr.all()
(and np.all()
) just disappeared. Write a function myAll
that replaces them.noneTrue
that returns True
when no element of an array is True
and False
otherwise.We've used the following to get arrays of different shapes:
arr = np.arange(24).reshape((3,4,2))
print arr.shape
Let's investigate shapes in a bit more detail. When modifying shapes, -1
can be used as a wildcard that will fill in the shape for one dimension, based on the others.
arr = np.arange(1,11)
dumpArray(arr)
# shape of 5 x ? -> 5 x 2
reshaped = arr.reshape(5, -1)
dumpArray(reshaped)
print arr.reshape(-1, 5)
print arr.reshape(2,5)
Some other shaping utilities include:
np.ravel
np.flatten
np.squeeze
np.flatten
, like its name implies, makes a 1-D version of the array. np.ravel
behaves similarly, but it will try to avoid making a copy
arr = np.arange(10).reshape(5,2)
print "arr:"
dumpArray(arr)
flat = arr.flatten()
print "flattened (my own data? %s):" % flat.flags.owndata
dumpArray(flat)
rav = arr.ravel()
print "raveled (my own data? %s)" % rav.flags.owndata
dumpArray(rav)
np.squeeze
will remove any dimensions that have length 1. Occassionally, you might get a length 1 dimension by selection
arr = np.arange(10).reshape(5,1,2,1)
print "arr:"
print arr
print "whittled down:"
print arr.squeeze()
Internally, NumPy arrays look like this:
Individual elements (scalars) have lightweight wrappers around them that treat them as single-element arrays.
# arrays have several pieces of meta-data, driven in part by the dtype of the array
def dumpArrayInfo(arr):
print "%15s: %s" % ("shape", arr.shape)
print "%15s: %s" % ("dtype", arr.dtype)
print "%15s: %s" % ("size", arr.size)
print "%15s: %s" % ("itemsize", arr.itemsize)
print "%15s: %s" % ("size * itemsize", arr.size * arr.itemsize)
arr = np.arange(10)
dumpArray(arr)
dumpArrayInfo(arr)
arr = np.arange(10.0).reshape(5,2)
dumpArrayInfo(arr)
And the dtype itself can be querried for information:
def dumpDtypeInfo(dt):
print "%15s: %s" % ("name", dt.name)
print "%15s: %s" % ("byteorder", dt.byteorder)
print "%15s: %s" % ("itemsize", dt.itemsize)
print "%15s: %s" % ("type", dt.type)
dumpDtypeInfo(arr.dtype)
Usually, we can get the right dtype through inference:
arr1 = np.array([1,2,3])
arr2 = np.array([1, 2, 3.14150])
print "arr1 type: ", arr1.dtype
print "arr2 type: ", arr2.dtype
We can also manually specify a dtype in many array creation routines
arr1 = np.array([1,2,3], dtype=np.float32)
dumpArray(arr1)
And we can convert dtypes with array.astype
arr = np.arange(10)
dumpArray(arr)
converted = arr.astype(np.float_)
print "after converting types:"
dumpArray(converted)
FIXME both of these have weaknesses right now
One other set of information about array
s comes from the flags
attribute
arr = np.arange(10)
print arr.flags
When a NumPy array is sliced, the slice is created with new dtype, shape, and stride information. However, the underlying data is referenced (not copied). We can determine if an array owns its own data via its array.flags.owndata attribute.
arr1 = np.arange(10)
arr2 = arr1[3:7]
print "arr1 owndata: ", arr1.flags.owndata
print "arr2 owndata: ",arr2.flags.owndata
# and what will happen if we write into arr2?
arr2[:] = 0
print "arr1 (after assigning into arr2):", arr1
NumPy ufuncs are functions that operate elementwise on one or more arrays.
When called, ufuncs dispatch to optimized C inner-loops based on the array dtype.
<, <=, ==, !=, >=, >
+, -, *, /, reciprocal, square
exp, expm1, exp2, log, log10, log1p, log2, power, sqrt
sin, cos, tan, acsin, arccos, atctan, sinh, cosh, tanh, acsinh, arccosh, atctanh
&, |, ~, ^, left_shift, right_shift
and, logical_xor, not, or
isfinite, isinf, isnan, signbit
abs, ceil, floor, mod, modf, round, sinc, sign, trunc
Methods that reduce the information in (or summarize) an array (such as sum) take an optional parameter called axis which specifies the dimension over which to perform a reduction.
arr = np.arange(15).reshape(3,5)
dumpArray(arr)
grandSum = arr.sum()
colSums = arr.sum(axis=0)
rowSums = arr.sum(axis=1)
print "grandSum:", grandSum
print "colSums (a new pseudo-row):", colSums, colSums.shape
print "rowSums (a new pseudo-col):", rowSums, rowSums.shape
One other array creator, np.tile
, rewards careful examination of some examples. We'll use the following arr
as our base array.
arr = np.arange(1,5).reshape(2,2)
dumpArray(arr)
Tileing over one axis appends "tiles" of data on the ends of the rows
print np.tile(arr,1)
print np.tile(arr,2)
print np.tile(arr,4)
Tileing over multipled dimensions creates tiles (of the original) in the specified shape
print "with a (2,2) inner"
print "... and a (2,1) outer"
print np.tile(arr, (2,1)) # 2x1 tileing of a 2x2 array
print "... and a (1,2) outer"
print np.tile(arr, (1,2)) # 1x2 tileing of a 2x2 array
print "... and a (2,2) outer"
print np.tile(arr, (2,2)) # 2x2 tileing of a 2x2 array
Things get slightly different in more dimensions, especially when the dimensions of the tile and the tileing differ. The tile is first expanded to the same number of dimensions as the tiling.
# original 2x2 promoted to 1x2x2 ... then used for 3x1x1 tileing
print np.tile(arr, (3,1,1))
print np.tile(arr, (1,1,3))
arr = np.tile(np.arange(1,10).reshape(3,3), (3,1,1))
print arr
As mentioned earlier, in greater than two dimensions, you need to be very careful thinking in terms of "rows and columns". Specifically, in the string representation of a 3-D array, the outer-most dimension is no longer the visual rows: it's the different panels. Thus, for sums over different axes, we might talk about:
sum(axis=?) | over which dim? | visual element | added elements |
---|---|---|---|
axis=0 | outer-most | across panels | [1+1+1, 2+2+2, ...] |
axis=1 | middle | across colums | [1+4+7, 2+5+8, ...] |
axis=2 | inner-most | across rows | [1+2+3, 4+5+6, ...] |
print "axis=0"
print arr.sum(axis=0)
print "axis=1"
print arr.sum(axis=1)
print "axis=2"
print arr.sum(axis=2)
print "shapes: ", arr.sum(axis=0).shape, arr.sum(axis=1).shape, arr.sum(axis=2).shape
# FIXME: i'd really like something with axis and reduction. Tilling as well.
# cumsum is referred to later, could be an exercise here
arr.cumsum()
Several functions for combining (concatenating) and splitting up arrays are free functions in the NumPy module. The main functions (warning, these take a single tuple as their argument) are:
np.concatenate
np.vstack
np.column_stack
np.hstack
Combining arrays together is not ideal (because we have to reallocate memory - a slow operation). An alternative is to use Python lists (which append quickly on the right) until all (or sufficient) data is gathered and then convert to NumPy array.
Note, if you find yourself constantly needing to restack vectors into 2-D arrays (when doing linear algebra), you may want to look at NumPy's matrix
class. Essentially, matrixs keep their 2-D shape throughout operations applied to them. They also define $*$ as matrix multiplication.
joined = np.concatenate((np.array([1,2,3]),
np.array([4,5,6]),
np.array([7,8,9])))
print joined
arr1 = np.arange(4).reshape(2,2)
arr2 = (np.arange(4).reshape(2,2) + 1) * 10
print np.concatenate((arr1, arr2), axis=0)
print np.concatenate((arr1, arr2), axis=1)
Note, for concatenation, all dimensions (except the dimension being concatenated) must have the same size.
arr1 = np.array([1, 2, 3])
arr2 = np.array([11, 22, 33])
# create a vertical stack
print "vstack:"
print np.vstack((arr1,
arr2))
# creates a column stack (horizontally)
print "column_stack:"
print np.column_stack((arr1, arr2))
# compare np.column_stack with np.hstack()
print "hstack (1D):"
print np.hstack((arr1, arr2)) # the 1-D arrays are not treated as columns
arr = np.vstack((arr1, arr2))
print "original array:"
print arr
print "hstacking:"
print np.hstack((arr, arr)) # concatenate with 1D, but ok if we have 2D things
print "vstacking:"
print np.vstack((arr, arr))
# FIXME a solution
# some of the information in "subtleties (optional)" (old exercise 5) is useful
# for understanding why the joiners have their "quirks"
def list_append(itr):
lst = []
for i in itr:
lst.append(i)
return lst
def array_append(itr):
arr = np.array([])
for i in itr:
arr = np.concatenate((arr, np.array([i])))
return arr
input_data = np.random.random(2000)
%timeit -n10 list_append(input_data)
%timeit -n10 array_append(input_data)
If you are working with functions of more than one variable, you may need to generate values over a 2-D (or higher) grid. NumPy has some utilities to help in this case. np.meshgrid
produces arrays of values that when paired elementwise (or in higher dimensions) gives the Cartesian product (all pairings) of the input values.
x = np.array([1,2,3])
y = np.array([10,20])
# xy-values like: (1,10), (1,20), (2,10), (2,2), (3,10), (3,20)
xx, yy = np.meshgrid(x, y) # note, x expanded across columns
print "xx:"
dumpArray(xx)
print "yy:"
dumpArray(yy)
# pair xx, yy elementwise to get all possible (x,y)-values from the two base grids
print "sum:"
zz = xx + yy
print zz
# or, zz = np.cos(xx) + np.sin(yy)
If we pass the xy-values around, we may wish to keep them together and de-tuple them at a later point.
dx, dy = 0.15, 0.05
y, x = np.mgrid[slice(-3, 3 + dy, dy), slice(-3, 3 + dx, dx)]
print x
print y
z = (1 - x / 2. + x ** 5 + y ** 3) * np.exp(-x ** 2 - y ** 2)
plt.pcolor(x, y, z)
plt.show()
Broadcasting (still to come in details) can achieve similar effects (the orientation of the result is slightly different here):
# but note, we can also do this:
x = np.array([1,2,3])
y = np.array([10,20])
# x -> row
x2d, y2d = np.atleast_2d(x,y)
print "shapes: ", x2d.shape, y2d.shape
print x2d
print y2d
x2d = x2d.T
print x2d + y2d # uses broadcasting (more in the next section!)
Often, we like to specify grids with slice notation, using :
. But we can't make a function call with a slice as an argument. For example, this won't work in Python: np.ogrid(1:4, 1:3)
. So, there is a special NumPy object called ogrid
that lets us directly specify grids with slices.
xx, yy = numpy.ogrid[1:4, 10:21:10] # upper point(s) not included
print "xx:"
dumpArray(xx)
print "yy:"
dumpArray(yy)
print "sum:"
zz = xx + yy
print zz
# mgrid is similar; values are transposed wrt meshgrid
# print np.mgrid[1:4, 8:10]
# print np.meshgrid(range(1,4), range(8,10))
np.searchsorted
¶FIXME not a huge fan of search sorted. seems low on priority list.
arr = np.concatenate([np.arange(1,6), np.arange(20,26), np.arange(50,55)])
#values = [1]*3 + [2]*3 + 3*[3]
#arr = np.array(values)
print arr
breakpts = np.searchsorted(arr, [5, 20, 30])
print "breakpoints: ", breakpts
print "arr indexed by pairs of breakpoints"
print arr[ :breakpts[0]]
print arr[breakpts[0]:breakpts[1]]
print arr[breakpts[1]:breakpts[2]]
print arr[breakpts[2]: ]
In short, broadcasting lets arrays with different but compatible shapes be arguments to ufuncs.
arr1 = np.arange(5)
print "arr1 + scalar:"
print arr1 + 10
print "arr1 + arr1 (same shape):"
print arr1 + arr1
arr2 = np.arange(5).reshape(5,1) * 10
arr3 = np.arange(5).reshape(1,5) * 100
print "arr2 shape:", arr2.shape
print "arr3 shape:", arr3.shape
print "arr1 + arr2 [ %s + %s --> %s ]:" % (arr1.shape, arr2.shape, (arr1 + arr2).shape)
print arr1 + arr2
print "arr1 + arr3 [ %s + %s --> %s ]:" % (arr1.shape, arr3.shape, (arr1 + arr3).shape)
print arr1 + arr3
arr1 = np.arange(6).reshape(3,2)
arr2 = np.arange(10, 40, 10).reshape(3,1)
print arr1 + arr2
FIXME: array names
Here, an array of shape (3, 1)
is broadcast to an array with shape (3, 2)
In order for an operation to broadcast, the size of all the trailing dimensions for both arrays must either be equal or be one. Dimensions that are one and dimensions that are missing from the "head" are duplicated to match the larger number. So, we have:
A (1d array): 3 B (2d array): 2 x 3 Result (2d array): 2 x 3
A (2d array): 6 x 1 B (3d array): 1 x 6 x 4 Result (3d array): 1 x 6 x 4
A (4d array): 3 x 1 x 6 x 1 B (3d array): 2 x 1 x 4 Result (4d array): 3 x 2 x 6 x 4
Some other interpretations of compatibility:
1
s.a1 = np.array([1,2,3]) # 3 --> 1x3
b1 = np.array([[10, 20, 30], # 2x3
[40, 50, 60]])
print a1 + b1
result = (np.ones(( 6,1)) + # 3rd dimension replicated
np.ones((1,6,4)))
print result.shape
result = (np.ones((3,6,1)) +
np.ones((1,6,4))) # 1st and 3rd dimension replicated
print result.shape
Sometimes, it is useful to explicitly insert a new dimension in the shape. We can do this with a fancy slice that takes the value np.newaxis
.
arr1 = np.arange(6).reshape((2,3)) # 2x3
arr2 = np.array([10, 100]) # 2
try:
arr1 + arr2
except ValueError:
print "fail"
# let's massage the shape
arr3 = arr2[:, np.newaxis] # arr2 -> 2x1
print "arr3 shape:", arr3.shape
print "arr1 + arr3"
print arr1 + arr3
arr = np.array([10, 100])
print "original shape:", arr.shape
arrNew = arr2[np.newaxis, :]
print "arrNew shape:", arrNew.shape
There are a few forms of fancy indexing in NumPy. The first is indexing an array by other arrays (we've seen one example of this). The result has the same shape as the indexing arrays.
arr = np.arange(15).reshape((3,5))
print arr
# select 0,0 0,2 1,3 2,4 as a 2x2 array
rows = np.array([[0, 0],
[1, 2]])
cols = np.array([[0, 2],
[3, 4]])
# values are lined up pairwise (from inputs) to positions of output
print arr[rows, cols] # ---> in shape of row,col index matrices
Another form of fancy indexing comes when we use an array of np.bool_. Here, the indexing array must have an element (True/False) for every position in the base array.
evens = (arr % 2 == 0) # note, arr % 2 is not boolean.
# quick question: how can we check that?
print arr[evens] # boolean array: we have to say yes or no for each element
# compare with indices: pick elements
# FIXME: this is more confusing than helpful
#print "arr % 2"
#dumpArray(arr % 2)
# is it boolean? no. so, it will be selecting elements by (triples) of indices [0,1,0]
#print "not what we might expect"
#dumpArray(arr[arr%2]) # only indexing outer-most dimension (each 5 element long)
One other gotcha: Python's bool
type is not the same as np.bool
. In fact, Python's bool
s are Python int
s (you can check it below). Thus, NumPy will use them as the values 0
and 1
-- which means numerical indexes.
arr = np.arange(35).reshape(5,7)
print "arr:"
print arr
b = arr>20
print "a boolean selection:"
print arr[b]
print "some of the boolean indices:"
print b[:,5]
# compare, Python's True/False (python bool) are -ints-.
# prove it: print isinstance(True, int)
# NumPy uses them as 0/1 integers
# ... unless they are in a np.array of type np.bool
print "raw Python bools as indices"
# select row[0], row[0], row[0], row[1], row[1]
pyBools = [False, False, False, True, True]
print arr[pyBools]
print "np.bools"
npBools = np.array(pyBools, dtype=np.bool)
# broadcast out across columns, so apply selection to each column
print arr[npBools]
We saw these two questions earlier. Can you answer them now? If you answered them before (nice work!), can you explain them now?
2x4
), replace every row with a particular row (for example, 2, 4, 8, 16)?2x4
), replace every column with a particular column (for example, 1,3)?a[[1,2],[1,2]]
, you get a 2x2 cube made up of [[a[1,1], a[1,2]], [a[2,1], a[2,2]]]
. If you do that in python, you get: [a[1,1], a[2,2]]
. Write a Python function, matdex
, which takes a list of indexes, and returns a list of indexes, that when used to index a NumPy, will return the matlab style selection. Note, this is either ridiculous or trivial, depending on how you solve it.# META - FIXME: this is an answer
def matdex1(*indices):
return np.meshgrid(*indices)
def matdex2(*indices):
return np.meshgrid(*indices, indexing="ij")
arr = np.arange(15).reshape(3,5)
print "original array:"
print arr
print "desired output:"
print [[arr[1,1], arr[1,2]], [arr[2,1], arr[2,2]]]
print "method 1:"
print arr[matdex1([1,2], [1,2])].T # you're clever
print "method 2:"
print arr[matdex2([1,2], [1,2])] # you're super!
np.fft
— Fast Fourier transforms
np.polynomial
— Orthogonal polynomials, spline fitting
np.linalg
— Linear algebra
cholesky, det, eig, eigvals, inv, lstsq, norm, qr, svd
np.math
— C standard library math functions
np.random
— Random number generation
beta, gamma, geometric, hypergeometric, lognormal, normal, poisson, uniform, weibull
PI = np.pi
t = np.linspace(0, 120, 4000)
nrr = np.random.random
signal = 12 * np.sin(3 * 2*PI*t) # 3 Hz
signal += 6 * np.sin(8 * 2*PI*t) # 8 Hz
signal += 1.5 * nrr(len(t)) # noise
FFT = abs(np.fft.fft(signal))
freqs = np.fft.fftfreq(signal.size, t[1] - t[0])
plt.plot(t,signal); plt.xlim(0, 4); plt.show()
plt.plot(freqs, FFT);
np.random
¶np.random.normal(loc = 0, scale = 2, size = (2,10)) # loc = mean, scale = standard deviation
np.random.permutation(np.arange(10))
import numpy.linalg as la
# Ax = b
A = np.array([[2,1],
[1,2]])
b = np.array([.5, .75])
soln1 = la.solve(A,b)
Ainv = la.inv(A)
soln2 = np.dot(Ainv, b) # generally don't want to take explicit inverse
np.allclose(soln1, soln2)
# note also, matrices --> (a 2d numpy array)
b = np.array([.5, .75])
A = np.matrix('2 1; 1 2') # matlab style
b = np.matrix(b).T # convert a np.array
# matrix has .I --> this is the matrix inverse (pseudo-inverse)
# -and-
# * is matrix multiplication
A.I * b
np.record
¶NumPy arrays have elements with a single type. But, that type can be a compound type (i.e., a record or a struct).
Two main recommended ways of specifying type codes:
# a record with a 4 byte int, a 4 byte float, and 10 bytes of characters (ascii values)
x = np.zeros((2,),dtype=('i4,f4,a10'))
print x
print repr(x)
x[:] = [(1, 5., 'Hello'), (2, 6., 'World')]
print x
print repr(x)
print "a field:"
print x['f1']
print repr(x['f1'])
%%file patient-records.csv
name,date,weight(kg),height(cm)
mark,2011-01-01,86.1,180
barb,2012-02-03,65.7,167
ethan,2013-04-06,29.45,127
patient_dtype = [("name", "a10"),
("visit_date", 'datetime64[D]'), # [D] means sepcify to "Day" units
("weight", np.float),
("height", np.int)]
data = np.loadtxt("patient-records.csv", skiprows=1, delimiter=",", dtype=patient_dtype)
print "first row: ", data[0]
print "all weights: ", data['weight']
# BMI = kg / m**2
print "BMIs:", data['weight'] / (data['height']/100.0)**2
We can also save and load arrays
#saving / load data
np.savez('data.npz',data=data) # list of arrays to store
dataz = np.load('data.npz')
print dataz.files # list of arrays stored in this archive
print dataz['data']
# cleanup
# !rm data.npz
Using the code snippet below to grab some stock market data from Yahoo, read the data into an array of appropriately structured records. The data grabbed is for all of 2013. Find the highest high and the lowest low for your ticker symbol. For extra credit, graph the daily closes.
from urllib import urlretrieve
sym = "GE"
quote_url = "http://ichart.finance.yahoo.com/table.csv?s=%s&d=11&e=31&f=2013&g=d&a=0&b=2&c=2013&ignore=.csv"
urlretrieve(quote_url % sym, sym+".csv")
# note, on unix-heritage boxes, you can use head and tail to glance at the data
# !head -10 GE.csv
# !tail -10 GE.csv
import sys;
sy