Table Of Contents

Previous topic

Loop

Next topic

PyCUDA/CUDAMat/Gnumpy compatibility

This Page

Using the GPU

One of the Theano’s design goals is to specify computations at an abstract level, so that the internal function compiler has a lot of flexibility about how to carry out those computations. One of the ways we take advantage of this flexibility is in carrying out calculations on an Nvidia graphics card when there is a CUDA-enabled device in your computer.

Setting up CUDA

If you have not done so already, you will need to install Nvidia’s GPU-programming toolchain (CUDA) and configure Theano to use it. We provide installation instructions for Linux, MacOS and Windows.

Testing Theano with GPU

To see if your GPU is being used, cut and paste the following program into a file and run it.

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
t0 = time.time()
for i in xrange(iters):
    r = f()
print 'Looping %d times took'%iters, time.time() - t0, 'seconds'
print 'Result is', r
print 'Used the','cpu' if numpy.any( [isinstance(x.op,T.Elemwise) for x in f.maker.env.toposort()]) else 'gpu'

The program just computes the exp() of a bunch of random numbers. Note that we use the shared function to make sure that the input x are stored on the graphics device.

If I run this program (in thing.py) with device=cpu, my computer takes a little over 7 seconds, whereas on the GPU it takes just over 0.4 seconds. Note that the results are close but not identical! The GPU will not always produce the exact same floating-point numbers as the CPU. As a point of reference, a loop that calls numpy.exp(x.value) also takes about 7 seconds.

$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python thing.py
Looping 1000 times took 7.17374897003 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753 1.62323285]

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python thing.py
Using gpu device 0: GeForce GTX 285
Looping 1000 times took 0.418929815292 seconds
Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761 1.62323296]

Note that for now GPU operations in Theano require floatX to be float32 (see below also).

Returning a handle to device-allocated data

The speedup is not greater in the example above because the function is returning its result as a numpy ndarray which has already been copied from the device to the host for your convenience. This is what makes it so easy to swap in device=gpu, but if you don’t mind being less portable, you might prefer to see a bigger speedup by changing the graph to express a computation with a GPU-stored result. The gpu_from_host Op means “copy the input from the host to the gpu” and it is optimized away after the T.exp(x) is replaced by a GPU version of exp().

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
t0 = time.time()
for i in xrange(iters):
    r = f()
print 'Looping %d times took'%iters, time.time() - t0, 'seconds'
print 'Result is', r
print 'Numpy result is', numpy.asarray(r)
print 'Used the','cpu' if numpy.any( [isinstance(x.op,T.Elemwise) for x in f.maker.env.toposort()]) else 'gpu'

The output from this program is

Using gpu device 0: GeForce GTX 285
Looping 1000 times took 0.185714006424 seconds
Result is <CudaNdarray object at 0x3e9e970>
Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761 1.62323296]

Here we’ve shaved off about 50% of the run-time by simply not copying the resulting array back to the host. The object returned by each function call is now not a numpy array but a “CudaNdarray” which can be converted to a numpy ndarray by the normal numpy casting mechanism.

Running the GPU at Full Speed

To really get maximum performance in this simple example, we need to use an Out instance to tell Theano not to copy the output it returns to us. Theano allocates memory for internal use like a working buffer, but by default it will never return a result that is allocated in the working buffer. This is normally what you want, but our example is so simple that it has the un-wanted side-effect of really slowing things down.

from theano import function, config, shared, sandbox, Out
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([],
        Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)),
            borrow=True))
t0 = time.time()
for i in xrange(iters):
    r = f()
print 'Looping %d times took'%iters, time.time() - t0, 'seconds'
print 'Result is', r
print 'Numpy result is', numpy.asarray(r)
print 'Used the','cpu' if numpy.any( [isinstance(x.op,T.Elemwise) for x in f.maker.env.toposort()]) else 'gpu'

Running this version of the code takes just under 0.05 seconds, over 140x faster than the CPU implementation!

Using gpu device 0: GeForce GTX 285
Looping 1000 times took 0.0497219562531 seconds
Result is <CudaNdarray object at 0x31eeaf0>
Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761 1.62323296]

This version of the code using borrow=True is slightly less safe because if we had saved the r returned from one function call, we would have to take care and remember that its value might be over-written by a subsequent function call. Although borrow=True makes a dramatic difference in this example, be careful! The advantage of borrow=True is much weaker in larger graphs, and there is a lot of potential for making a mistake by failing to account for the resulting memory aliasing.

What can be accelerated on the GPU?

The performance characteristics will change as we continue to optimize our implementations, and vary from device to device, but to give a rough idea of what to expect right now:

  • Only computations with float32 data-type can be accelerated. Better support for float64 is expected in upcoming hardware but float64 computations are still relatively slow (Jan 2010).
  • Matrix multiplication, convolution, and large element-wise operations can be accelerated a lot (5-50x) when arguments are large enough to keep 30 processors busy.
  • Indexing, dimension-shuffling and constant-time reshaping will be equally fast on GPU as on CPU.
  • Summation over rows/columns of tensors can be a little slower on the GPU than on the CPU
  • Copying of large quantities of data to and from a device is relatively slow, and often cancels most of the advantage of one or two accelerated functions on that data. Getting GPU performance largely hinges on making data transfer to the device pay off.

Tips for improving performance on GPU

  • Consider adding floatX = float32 to your .theanorc file if you plan to do a lot of GPU work.
  • Prefer constructors like ‘matrix’ ‘vector’ and ‘scalar’ to ‘dmatrix’, ‘dvector’ and ‘dscalar’ because the former will give you float32 variables when floatX=float32.
  • Ensure that your output variables have a float32 dtype and not float64. The more float32 variables are in your graph, the more work the GPU can do for you.
  • Minimize tranfers to the GPU device by using shared ‘float32’ variables to store frequently-accessed data (see shared()). When using the GPU, ‘float32’ tensor shared variables are stored on the GPU by default to eliminate transfer time for GPU ops using those variables.
  • If you aren’t happy with the performance you see, try building your functions with mode=’PROFILE_MODE’. This should print some timing information at program termination (atexit). Is time being used sensibly? If an Op or Apply is taking more time than its share, then if you know something about GPU programming have a look at how it’s implemented in theano.sandbox.cuda. Check the line like ‘Spent Xs(X%) in cpu Op, Xs(X%) in gpu Op and Xs(X%) transfert Op’ that can tell you if not enought of your graph is on the gpu or if their is too much memory transfert.

Changing the value of shared variables

To change the value of a shared variable, e.g. to provide new data to process, use shared_variable.set_value(new_value). For a lot more detail about this, see Understanding Memory Aliasing for Speed and Correctness.