For anyone interested in really flexible differentiable graphs, Chainer is the most flexible and convenient library I've used. It's all I use for prototyping neural nets anymore, and I'm surprised not to see more adoption. It feels like working in numpy.
In large part, on the CPU path, it _is_ working in numpy. Of the neural network libraries, I think Chainer is the one made by the people who actually like coding the most.
For example, a lot of TensorFlow's type checking gets done in Eigen, via C++ template metaprogramming (that's how it worked when I last looked, anyhow); Chainer's type checking just gets done by runtime inspection.
Which one is faster? TF, by far. Which one would you rather have in _your_ codebase?
Edit: after reading the damned thing, they add more runtime type checks on top. And after looking over TF again, it still has this hybrid thing going on where some checks live in Eigen and some happen at runtime. I mean....
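To make the "feels like numpy" point concrete, here's a minimal define-by-run sketch in Chainer (layer sizes and inputs are made up for illustration; a shape mismatch here would surface as an ordinary Python exception at runtime):

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

# Define-by-run: the graph is recorded as this Python code executes,
# so type/shape problems show up as ordinary runtime exceptions.
layer = L.Linear(4, 3)  # made-up sizes, just for illustration
x = chainer.Variable(np.random.randn(2, 4).astype(np.float32))
y = F.relu(layer(x))    # numpy-like: compute as you go
loss = F.sum(y)
loss.backward()         # gradients follow the path that was actually taken
```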
There is also http://pytorch.org - which started as a fork of Chainer. At a high level it is built for the same purpose - to support dynamic graphs.
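For comparison, a minimal PyTorch sketch of the same dynamic-graph idea (threshold and sizes are arbitrary, just to show data-dependent graph shape):

```python
import torch

# Dynamic graph: ordinary Python control flow decides the graph shape,
# so the number of nodes here depends on the data itself.
x = torch.randn(3, requires_grad=True)
y = x * 2
while y.norm() < 100:   # arbitrary threshold, for illustration
    y = y * 2
loss = y.sum()
loss.backward()         # backprop through whatever graph the loop built
print(x.grad)
```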
The paper is dense and I'm on a train. Can anyone summarize the difference between TensorFlow Fold and Chainer?
Also, self-promotion: Gorgonia (https://github.com/chewxy/gorgonia) has supported dynamic computation graphs a la Chainer since day 1... however, batched computation remains difficult to implement.
TensorFlow Fold provides a TensorFlow implementation of the dynamic batching algorithm (described in detail in our paper [1]). Dynamic batching is an execution strategy for computation graphs; you could also implement it in PyTorch or Chainer or any other framework.
Our particular implementation of dynamic batching uses the TF while loop, which means that you don't need to make run-time modifications to the actual TF computation graph. At runtime, we essentially encode the computation graph for (let's say) a parse tree as a serialized protocol buffer (tf.string), so instead of varying the computation graph itself, we vary the input to a static computation graph. This particular implementation strategy is very much a byproduct of how TensorFlow works (static computation graph, heavy lifting happens in ops implemented in C++).
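To give a feel for the scheduling idea (this is a toy sketch in plain numpy, not the Fold implementation; the tree format and the single "add" op are made up): nodes at the same depth with the same op type get evaluated together in one vectorized call.

```python
import numpy as np

# Toy sketch of dynamic batching: nodes at the same depth with the
# same op are evaluated together in one vectorized call, so per-op
# overhead is paid once per batch rather than once per node.
# Hypothetical tree format: ("leaf", vector) or ("add", left, right).
def eval_batched(trees):
    levels = {}                                  # depth -> nodes at that depth
    def depth(node):
        if node[0] == "leaf":
            d = 0
        else:
            d = 1 + max(depth(node[1]), depth(node[2]))
        levels.setdefault(d, []).append(node)
        return d
    for t in trees:
        depth(t)

    results = {}                                 # id(node) -> computed value
    for d in sorted(levels):
        for n in levels[d]:
            if n[0] == "leaf":
                results[id(n)] = n[1]
        adds = [n for n in levels[d] if n[0] == "add"]
        if adds:                                 # one batched add per level
            left = np.stack([results[id(n[1])] for n in adds])
            right = np.stack([results[id(n[2])] for n in adds])
            out = left + right                   # single vectorized op
            for n, o in zip(adds, out):
                results[id(n)] = o
    return [results[id(t)] for t in trees]

# Two tree-shaped expressions of different depth, batched together:
a = ("leaf", np.ones(3))
b = ("leaf", 2 * np.ones(3))
print(eval_batched([("add", a, b), ("add", ("add", a, b), b)]))
```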
Congratulations on the nice work! It is very elegant to use combinators to formulate and solve this problem. My worry with combinators, though, is that they can make setting up 'long-range' DAG edges awkward: you can end up manually shuttling things around through tuples and whatnot. I'm not sure how it is with your framework.
Am I right in thinking that there is no bucketing going on here? In other words, each batch is fixed length, and a short-lived DAG is planned and then simulated with tf.while to accommodate the set of shapes in that particular batch? Are there any problems when the input shapes are wildly different in cost? For example, imagine a size-agnostic convnet. Maybe some of the images in the training set are small, others are large; how would that look in your framework, if it can be done? Is junk padding part of the picture, to make nearly-equal tensors match so they can be batched?
You're absolutely right about combinators and long-range dependencies. We have a special block type in the high-level API (https://github.com/tensorflow/fold/blob/master/tensorflow_fo...) for accumulating results without explicitly shuttling them around; it's not perfect, but it's very handy in many cases.
Regarding your second question, the equivalent of padding in dynamic batching is "pass through" do-nothing ops, which are introduced transparently but worth being aware of if you want to understand the machinery. The worst case scenario here is chains of wildly varying lengths, where we need to add pass-throughs to the shorter chains to match the length of the longest chain.
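For intuition, here is a tiny made-up sketch of that worst case: the shorter chains get do-nothing "pass" steps appended so every chain has the same number of steps, and each step can then run as one batch.

```python
# Made-up data, just to show the shape of the problem: pad shorter
# chains with do-nothing "pass" steps so each position can be batched.
def pad_chains(chains):
    max_len = max(len(c) for c in chains)
    return [c + ["pass"] * (max_len - len(c)) for c in chains]

chains = [["op"] * 2, ["op"] * 5, ["op"] * 3]   # chains of varying length
for c in pad_chains(chains):
    print(c)
```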
From my understanding you cannot do similar dynamic batching with PyTorch, since it executes eagerly. Not sure about Chainer. Can you explain a bit more about how this can be applied to other frameworks?
I've never used PyTorch or Chainer, so maybe I'm wildly wrong, but here goes...
Dynamic batching requires you to predefine a set of "batch-wise" operations (node types for computation graphs). This works fine for eager execution as well, because the eager code is inside a function definition. Of course, if you want to do dynamic batching you need to introduce a layer of abstraction: e.g. instead of saying a+b, you create a dynamic batching operation for addition and add edges to an execution graph. This is no different than in our TF implementation.
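A hypothetical sketch of what that layer of abstraction might look like (names and structures are made up): instead of computing a + b eagerly, you record an "add" node and its edges so a scheduler can batch the nodes later.

```python
# Hypothetical recorder: nothing is computed here, we only build the
# execution graph that a dynamic-batching scheduler would consume.
class Graph:
    def __init__(self):
        self.nodes = []               # (op, value, input_node_ids)

    def leaf(self, value):
        self.nodes.append(("leaf", value, ()))
        return len(self.nodes) - 1    # node id

    def add(self, a, b):
        self.nodes.append(("add", None, (a, b)))
        return len(self.nodes) - 1

g = Graph()
a = g.leaf(1.0)
b = g.leaf(2.0)
c = g.add(a, b)   # no arithmetic happens; we just added edges
print(g.nodes)
```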
Thanks. Yeah, my understanding is that you cannot have eager execution all the way down: some eager execution inside the function is fine, but you shouldn't have it outside the function blocks.
Without this function block abstraction, it doesn't seem like you can implement dynamic batching very far though (or even at all).
The concept seems interesting. I've stopped my close investigation of the stack I use at the "Keras level" and mostly treat things below that as a black box. I'm defaulting to Theano since I only have one GPU to work with, but as far as I can tell switching to TensorFlow is basically a small config change. I've only browsed this, but since I mostly do NLP (and virtually no image recognition) I suppose it could be worthwhile to switch. I guess I'll need to open the black boxes a bit and see what Theano does :)
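For what it's worth, the switch really is small: Keras reads the backend from ~/.keras/keras.json, and (if I remember right) the KERAS_BACKEND environment variable overrides it, so something like this should be all it takes:

```python
import os

# Set before the first `import keras`; this overrides the "backend"
# entry in ~/.keras/keras.json ("theano" or "tensorflow").
os.environ["KERAS_BACKEND"] = "tensorflow"
import keras  # should report which backend it is using
```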
Please note that the GitHub page says "not an official Google product", rather than "project". An official Google product would be something like Gmail.
The way I see it, TF is about to pull _way_ ahead thanks to XLA JIT/AOT. All of a sudden you get the ability to fuse things at a much finer granularity, which could reduce memory bandwidth requirements by a lot. Frameworks like Torch can't do any fusing at all, since their computation is fully imperative. Tactical win for imperative frameworks, I suppose, but strategically a functional graph is the way to go. DB people realized this in the '70s; ML people are realizing it now.
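A rough sketch of the memory-bandwidth argument (plain numpy, not XLA): the unfused pipeline writes and re-reads full-size temporaries, while a fused kernel reads the input once and writes the output once.

```python
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

# Unfused: each step is a separate pass over memory.
t1 = x * 2.0             # temporary #1 materialized
t2 = t1 + 1.0            # temporary #2 materialized
y = np.maximum(t2, 0.0)  # final output

# What a fusing compiler can effectively emit (one pass, per element):
#   y[i] = max(x[i] * 2.0 + 1.0, 0.0)
```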
TF is way behind on UI, which is why it's making Keras its front-end. It's fairly slow on multiple GPUs compared to Torch and neon. It might pull ahead in performance on GCE, but that's just for lock-in.
TF is in a fortunate position of having several UIs at this point. It's a lower level framework with a lot of power. If you don't need all that power, Keras or TFLearn or Slim are pretty great. If you do, it's there for you. I see no evidence that Google's goal with TF is to lock you into anything, and especially GCE. I'm a former Google employee, and I can tell you unequivocally — that's not how Google actually works.