There are many kinds of bugs that might come up in a computer program. This page is structured as an FAQ. It should provide recipes to tackle common problems, and introduce some of the tools that we use to find problems in our Theano code, and even (it happens) in Theano’s internals, such as Using DebugMode.
You can run your Theano function in a DebugMode(Using DebugMode). This test the Theano optimizations and help to find where NaN, inf and other problem come from.
As of v.0.4.0, Theano has a new mechanism by which graphs are executed on-the-fly, before a theano.function is ever compiled. Since optimizations haven’t been applied at this stage, it is easier for the user to locate the source of some bug. This functionality is enabled through the config flag theano.config.compute_test_value. Its use is best shown through the following example.
# compute_test_value is 'off' by default, meaning this feature is inactive
theano.config.compute_test_value = 'off'
# configure shared variables
W1val = numpy.random.rand(2,10,10).astype(theano.config.floatX)
W1 = theano.shared(W1val, 'W1')
W2val = numpy.random.rand(15,20).astype(theano.config.floatX)
W2 = theano.shared(W2val, 'W2')
# input which will be of shape (5,10)
x = T.matrix('x')
# transform the shared variable in some way. Theano does not
# know off hand that the matrix func_of_W1 has shape (20,10)
func_of_W1 = W1.dimshuffle(2,0,1).flatten(2).T
# source of error: dot product of 5x10 with 20x10
h1 = T.dot(x,func_of_W1)
# do more stuff
h2 = T.dot(h1,W2.T)
# compile and call the actual function
f = theano.function([x], h2)
f(numpy.random.rand(5,10))
Running the above code generates the following error message:
Definition in:
File "/u/desjagui/workspace/PYTHON/theano/gof/opt.py", line 1102, in apply
lopt_change = self.process_node(env, node, lopt)
File "/u/desjagui/workspace/PYTHON/theano/gof/opt.py", line 882, in process_node
replacements = lopt.transform(node)
File "/u/desjagui/workspace/PYTHON/Theano/theano/tensor/blas.py", line 1030, in local_dot_to_dot22
return [_dot22(*node.inputs)]
File "/u/desjagui/workspace/PYTHON/Theano/theano/gof/op.py", line 324, in __call__
self.add_tag_trace(node)
For the full definition stack trace set the Theano flags traceback.limit to -1
Traceback (most recent call last):
File "test.py", line 29, in <module>
f(numpy.random.rand(5,10))
File "/u/desjagui/workspace/PYTHON/theano/compile/function_module.py", line 596, in __call__
self.fn()
File "/u/desjagui/workspace/PYTHON/theano/gof/link.py", line 288, in streamline_default_f
raise_with_op(node)
File "/u/desjagui/workspace/PYTHON/theano/gof/link.py", line 284, in streamline_default_f
thunk()
File "/u/desjagui/workspace/PYTHON/Theano/theano/gof/cc.py", line 1111, in execute
raise exc_type, exc_value, exc_trace
ValueError: ('Shape mismatch: x has 10 cols but y has 20 rows',
_dot22(x, <TensorType(float64, matrix)>), [_dot22.0],
_dot22(x, InplaceDimShuffle{1,0}.0), 'Sequence id of Apply node=4')
Needless to say the above is not very informative and does not provide much in the way of guidance. However, by instrumenting the code ever so slightly, we can get Theano to give us the exact source of the error.
# enable on-the-fly graph computations
theano.config.compute_test_value = 'warn'
...
# input which will be of shape (5,10)
x = T.matrix('x')
# provide Theano with a default test-value
x.tag.test_value = numpy.random.rand(5,10)
In the above, we’re tagging the symbolic matrix x with a special test value. This allows Theano to evaluate symbolic expressions on-the-fly (by calling the perform method of each Op), as they are being defined. Sources of error can thus be identified with much more precision and much earlier in the compilation pipeline. For example, running the above code yields the following error message, which properly identifies line 23 as the culprit.
Traceback (most recent call last):
File "test2.py", line 23, in <module>
h1 = T.dot(x,func_of_W1)
File "/u/desjagui/workspace/PYTHON/Theano/theano/gof/op.py", line 360, in __call__
node.op.perform(node, input_vals, output_storage)
File "/u/desjagui/workspace/PYTHON/Theano/theano/tensor/basic.py", line 4458, in perform
z[0] = numpy.asarray(numpy.dot(x, y))
ValueError: ('matrices are not aligned', (5, 10), (20, 10))
The compute_test_value mechanism works as follows:
compute_test_value can take the following values:
Note
This feature is currently not compatible with Scan and also with Ops which do not implement a perform method.
Theano provides a ‘Print’ Op to do this.
x = theano.tensor.dvector('x')
x_printed = theano.printing.Print('this is a very important value')(x)
f = theano.function([x], x * 5)
f_with_print = theano.function([x], x_printed * 5)
#this runs the graph without any printing
assert numpy.all( f([1,2,3]) == [5, 10, 15])
#this runs the graph with the message, and value printed
assert numpy.all( f_with_print([1,2,3]) == [5, 10, 15])
Since Theano runs your program in a topological order, you won’t have precise control over the order in which multiple Print() Ops are evaluted. For a more precise inspection of what’s being computed where, when, and how, see the How do I step through a compiled function with the WrapLinker?.
Theano provides two functions (theano.pp() and theano.printing.debugprint()) to print a graph to the terminal before or after compilation. These two functions print expression graphs in different ways: pp() is more compact and math-like, debugprint() is more verbose. Theano also provides pydotprint() that creates a png image of the function.
You can read about them in printing – Graph Printing and Symbolic Print Statement.
First, make sure you’re running in FAST_RUN mode. FAST_RUN is the default mode, but make sure by passing mode='FAST_RUN' to theano.function (or theano.make) or by setting config.mode to FAST_RUN.
Second, try the theano ProfileMode. This will tell you which Apply nodes, and which Ops are eating up your CPU cycles.
Tips:
This is not exactly an FAQ, but the doc is here for now... It’s pretty easy to roll-your-own evaluation mode. Check out this one:
class PrintEverythingMode(Mode):
def __init__(self):
def print_eval(i, node, fn):
print i, node, [input[0] for input in fn.inputs],
fn()
print [output[0] for output in fn.outputs]
wrap_linker = theano.gof.WrapLinkerMany([theano.gof.OpWiseCLinker()], [print_eval])
super(PrintEverythingMode, self).__init__(wrap_linker, optimizer='fast_run')
When you use mode=PrintEverythingMode() as the mode for Function or Method, then you should see [potentially a lot of] output. Every Apply node will be printed out, along with its position in the graph, the arguments to the perform or c_code and the output it computed.
>>> x = T.dscalar('x')
>>> f = function([x], [5*x], mode=PrintEverythingMode())
>>> f(3)
>>> # print: 0 Elemwise{mul,no_inplace}(5, x) [array(5, dtype=int8), array(3.0)] [array(15.0)]
>>> # print: [array(15.0)]
Admittedly, this may be a huge amount of output to read through if you are using big tensors... but you can choose to put logic inside of the print_eval function that would, for example, only print something out if a certain kind of Op was used, at a certain program position, or if a particular value shows up in one of the inputs or outputs. Use your imagination :)
This can be a really powerful debugging tool. Note the call to fn inside the call to print_eval; without it, the graph wouldn’t get computed at all!
In the majority of cases, you won’t be executing from the interactive shell but from a set of Python scripts. In such cases, the use of the Python debugger can come in handy, especially as your models become more complex. Intermediate results don’t necessarily have a clear name and you can get exceptions which are hard to decipher, due to the “compiled” nature of functions.
Consider this example script (“ex.py”):
import theano
import numpy
import theano.tensor as T
a = T.dmatrix('a')
b = T.dmatrix('b')
f = theano.function([a,b], [a*b])
# matrices chosen so dimensions are unsuitable for multiplication
mat1 = numpy.arange(12).reshape((3,4))
mat2 = numpy.arange(25).reshape((5,5))
f(mat1, mat2)
This is actually so simple the debugging could be done easily, but it’s for illustrative purposes. As the matrices can’t be element-wise multiplied (unsuitable shapes), we get the following exception:
File "ex.py", line 14, in <module>
f(mat1, mat2)
File "/u/username/Theano/theano/compile/function_module.py", line 451, in __call__
File "/u/username/Theano/theano/gof/link.py", line 271, in streamline_default_f
File "/u/username/Theano/theano/gof/link.py", line 267, in streamline_default_f
File "/u/username/Theano/theano/gof/cc.py", line 1049, in execute ValueError: ('Input dimension mis-match. (input[0].shape[0] = 3, input[1].shape[0] = 5)', Elemwise{mul,no_inplace}(a, b), Elemwise{mul,no_inplace}(a, b))
The call stack contains a few useful informations to trace back the source of the error. There’s the script where the compiled function was called – but if you’re using (improperly parameterized) prebuilt modules, the error might originate from ops in these modules, not this script. The last line tells us about the Op that caused the exception. In thise case it’s a “mul” involving Variables name “a” and “b”. But suppose we instead had an intermediate result to which we hadn’t given a name.
After learning a few things about the graph structure in Theano, we can use the Python debugger to explore the graph, and then we can get runtime information about the error. Matrix dimensions, especially, are useful to pinpoint the source of the error. In the printout, there are also 2 of the 4 dimensions of the matrices involved, but for the sake of example say we’d need the other dimensions to pinpoint the error. First, we re-launch with the debugger module and run the program with “c”:
python -m pdb ex.py
> /u/username/experiments/doctmp1/ex.py(1)<module>()
-> import theano
(Pdb) c
Then we get back the above error printout, but the interpreter breaks in that state. Useful commands here are
Here, for example, I do “up”, and a simple “l” shows me there’s a local variable “node”. This is the “node” from the computation graph, so by following the “node.inputs”, “node.owner” and “node.outputs” links I can explore around the graph.
That graph is purely symbolic (no data, just symbols to manipulate it abstractly). To get information about the actual parameters, you explore the “thunks” objects, which bind the storage for the inputs (and outputs) with the function itself (a “thunk” is a concept related to closures). Here, to get the current node’s first input’s shape, you’d therefore do “p thunk.inputs[0][0].shape”, which prints out “(3, 4)”.