For anyone interested in really flexible differentiable graphs, Chainer is the most flexible and convenient library I've used. It's all I use for prototyping neural nets anymore, and I'm surprised not to see more adoption. It feels like working in numpy.
In large part, on the CPU path, it _is_ working in numpy. Of the neural network libraries, I think Chainer is the one made by the people who actually like coding the most.
For example, a lot of TensorFlow's type checking gets done in Eigen, via C++ template metaprogramming (that's how it worked when I last looked, anyhow); Chainer's type checking just gets done by runtime inspection.
Which one is faster? TF, by far. Which one would you rather have in _your_ codebase?
Edit: after reading the damned thing, they add more runtime type checks on top. And after looking over TF again, it still has this hybrid thing going on where some checks live in Eigen and some happen at runtime. I mean....
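To make the "feels like numpy" point concrete, here's a minimal define-by-run sketch in Chainer (layer sizes and inputs are made up for illustration; a shape mismatch here would surface as an ordinary Python exception at runtime):

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

# Define-by-run: the graph is recorded as this Python code executes,
# so type/shape problems show up as ordinary runtime exceptions.
layer = L.Linear(4, 3)  # made-up sizes, just for illustration
x = chainer.Variable(np.random.randn(2, 4).astype(np.float32))
y = F.relu(layer(x))    # numpy-like: compute as you go
loss = F.sum(y)
loss.backward()         # gradients follow the path that was actually taken
```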
There is also http://pytorch.org - which started as a fork of Chainer. At a high level it is built for the same purpose - to support dynamic graphs.
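For comparison, a minimal PyTorch sketch of the same dynamic-graph idea (threshold and sizes are arbitrary, just to show data-dependent graph shape):

```python
import torch

# Dynamic graph: ordinary Python control flow decides the graph shape,
# so the number of nodes here depends on the data itself.
x = torch.randn(3, requires_grad=True)
y = x * 2
while y.norm() < 100:   # arbitrary threshold, for illustration
    y = y * 2
loss = y.sum()
loss.backward()         # backprop through whatever graph the loop built
print(x.grad)
```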
The paper is dense and I'm on a train. Can anyone summarize the difference between TensorFlow Fold and Chainer?
Also, self-promotion: Gorgonia (https://github.com/chewxy/gorgonia) has supported dynamic computation graphs a la Chainer since day 1... however, batched computation remains difficult to implement.
TensorFlow Fold provides a TensorFlow implementation of the dynamic batching algorithm (described in detail in our paper [1]). Dynamic batching is an execution strategy for computation graphs; you could also implement it in PyTorch or Chainer or any other framework.
Our particular implementation of dynamic batching uses the TF while loop, which means that you don't need to make run-time modifications to the actual TF computation graph. At runtime, we essentially encode the computation graph for (let's say) a parse tree as a serialized protocol buffer (tf.string), so instead of varying the computation graph itself, we vary the input to a static computation graph. This particular implementation strategy is very much a byproduct of how TensorFlow works (static computation graph, heavy lifting happens in ops implemented in C++).
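To give a feel for the scheduling idea (this is a toy sketch in plain numpy, not the Fold implementation; the tree format and the single "add" op are made up): nodes at the same depth with the same op type get evaluated together in one vectorized call.

```python
import numpy as np

# Toy sketch of dynamic batching: nodes at the same depth with the
# same op are evaluated together in one vectorized call, so per-op
# overhead is paid once per batch rather than once per node.
# Hypothetical tree format: ("leaf", vector) or ("add", left, right).
def eval_batched(trees):
    levels = {}                                  # depth -> nodes at that depth
    def depth(node):
        if node[0] == "leaf":
            d = 0
        else:
            d = 1 + max(depth(node[1]), depth(node[2]))
        levels.setdefault(d, []).append(node)
        return d
    for t in trees:
        depth(t)

    results = {}                                 # id(node) -> computed value
    for d in sorted(levels):
        for n in levels[d]:
            if n[0] == "leaf":
                results[id(n)] = n[1]
        adds = [n for n in levels[d] if n[0] == "add"]
        if adds:                                 # one batched add per level
            left = np.stack([results[id(n[1])] for n in adds])
            right = np.stack([results[id(n[2])] for n in adds])
            out = left + right                   # single vectorized op
            for n, o in zip(adds, out):
                results[id(n)] = o
    return [results[id(t)] for t in trees]

# Two tree-shaped expressions of different depth, batched together:
a = ("leaf", np.ones(3))
b = ("leaf", 2 * np.ones(3))
print(eval_batched([("add", a, b), ("add", ("add", a, b), b)]))
```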
Congratulations on the nice work! It is very elegant to use combinators to formulate and solve this problem. My worry with combinators, though, is that they can make setting up 'long-range' DAG edges awkward: you can end up manually shuttling things around through tuples and whatnot. I'm not sure how it is with your framework.
Am I right in thinking that there is no bucketing going on here? In other words, each batch is fixed length, and a short-lived DAG is planned and then simulated with tf.while to accommodate the set of shapes in that particular batch? Are there any problems when the input shapes are wildly different in cost? For example, imagine a size-agnostic convnet. Maybe some of the images in the training set are small, others are large; how would that look in your framework, if it can be done? Is junk padding part of the picture, to make nearly-equal tensors match so they can be batched?
You're absolutely right about combinators and long-range dependencies. We have a special block type in the high-level API (https://github.com/tensorflow/fold/blob/master/tensorflow_fo...) for accumulating results without explicitly shuttling them around; it's not perfect, but it's very handy in many cases.
Regarding your second question, the equivalent of padding in dynamic batching is "pass through" do-nothing ops, which are introduced transparently but worth being aware of if you want to understand the machinery. The worst case scenario here is chains of wildly varying lengths, where we need to add pass-throughs to the shorter chains to match the length of the longest chain.
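For intuition, here is a tiny made-up sketch of that worst case: the shorter chains get do-nothing "pass" steps appended so every chain has the same number of steps, and each step can then run as one batch.

```python
# Made-up data, just to show the shape of the problem: pad shorter
# chains with do-nothing "pass" steps so each position can be batched.
def pad_chains(chains):
    max_len = max(len(c) for c in chains)
    return [c + ["pass"] * (max_len - len(c)) for c in chains]

chains = [["op"] * 2, ["op"] * 5, ["op"] * 3]   # chains of varying length
for c in pad_chains(chains):
    print(c)
```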
From my understanding you cannot do similar dynamic batching with PyTorch, since it executes eagerly. Not sure about Chainer. Can you explain a bit more about how this can be applied to other frameworks?
I've never used PyTorch or Chainer, so maybe I'm wildly wrong, but here goes...
Dynamic batching requires you to predefine a set of "batch-wise" operations (node types for computation graphs). This works fine for eager execution as well, because the eager code is inside a function definition. Of course, if you want to do dynamic batching you need to introduce a layer of abstraction: e.g. instead of saying a+b, you create a dynamic batching operation for addition and add edges to an execution graph. This is no different than in our TF implementation.
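A hypothetical sketch of what that layer of abstraction might look like (names and structures are made up): instead of computing a + b eagerly, you record an "add" node and its edges so a scheduler can batch the nodes later.

```python
# Hypothetical recorder: nothing is computed here, we only build the
# execution graph that a dynamic-batching scheduler would consume.
class Graph:
    def __init__(self):
        self.nodes = []               # (op, value, input_node_ids)

    def leaf(self, value):
        self.nodes.append(("leaf", value, ()))
        return len(self.nodes) - 1    # node id

    def add(self, a, b):
        self.nodes.append(("add", None, (a, b)))
        return len(self.nodes) - 1

g = Graph()
a = g.leaf(1.0)
b = g.leaf(2.0)
c = g.add(a, b)   # no arithmetic happens; we just added edges
print(g.nodes)
```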
Thanks. Yeah, my understanding is that you cannot have eager execution all the way down: some eager execution inside the function is fine, but you shouldn't have it outside the function blocks.
Without this function block abstraction, it doesn't seem like you can implement dynamic batching very far though (or even at all).
The concept seems interesting. I've stopped my close investigation of the stack I use at the "Keras level" and mostly treat things below that as a black box. I'm defaulting to Theano since I only have one GPU to work with, but as far as I can tell switching to TensorFlow is basically a small config change. I've only browsed this, but since I mostly do NLP (and virtually no image recognition) I suppose it could be worthwhile to switch. I guess I'll need to open the black boxes a bit and see what Theano does :)
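For what it's worth, the switch really is small: Keras reads the backend from ~/.keras/keras.json, and (if I remember right) the KERAS_BACKEND environment variable overrides it, so something like this should be all it takes:

```python
import os

# Set before the first `import keras`; this overrides the "backend"
# entry in ~/.keras/keras.json ("theano" or "tensorflow").
os.environ["KERAS_BACKEND"] = "tensorflow"
import keras  # should report which backend it is using
```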
Please note that the GitHub page says "not an official Google product", rather than "project". An official Google product would be something like Gmail.
The way I see it, TF is about to pull _way_ ahead thanks to XLA JIT/AOT. All of a sudden you get the ability to fuse things at a much finer granularity, which could reduce memory bandwidth requirements by a lot. Frameworks like Torch can't do any fusing at all, since their computation is fully imperative. Tactical win for imperative frameworks, I suppose, but strategically a functional graph is the way to go. DB people realized this in the '70s; ML people are realizing it now.
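A rough sketch of the memory-bandwidth argument (plain numpy, not XLA): the unfused pipeline writes and re-reads full-size temporaries, while a fused kernel reads the input once and writes the output once.

```python
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

# Unfused: each step is a separate pass over memory.
t1 = x * 2.0             # temporary #1 materialized
t2 = t1 + 1.0            # temporary #2 materialized
y = np.maximum(t2, 0.0)  # final output

# What a fusing compiler can effectively emit (one pass, per element):
#   y[i] = max(x[i] * 2.0 + 1.0, 0.0)
```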
TF is way behind on UI, which is why it's making Keras its front-end. It's fairly slow on multiple GPUs compared to Torch and neon. It might pull ahead in performance on GCE, but that's just for lock-in.
TF is in a fortunate position of having several UIs at this point. It's a lower level framework with a lot of power. If you don't need all that power, Keras or TFLearn or Slim are pretty great. If you do, it's there for you. I see no evidence that Google's goal with TF is to lock you into anything, and especially GCE. I'm a former Google employee, and I can tell you unequivocally — that's not how Google actually works.