Notes on TensorFlow

After getting TensorFlow up and running, one would expect the Getting Started guide and the Programmer’s Guide to be enough to start cranking out results. Unfortunately, those expositions are only meant to be an overview. These notes are based on the Convolutional Neural Networks tutorial, specifically the cifar10 estimator source code, which uses functionality in tf.contrib that is volatile or experimental. If any of this seems too complicated, it is. The solution is PyTorch. Alternative solutions like MXNet and Keras are not as user-friendly.

Raw Data to TFRecords

Currently Docker requires all volumes to be configured when the container starts. Since containers are very quick to create, the following creates a separate volume per dataset and a new container that mounts it.

# Create named volume
docker volume create <volume_name>
# Create container
<docker_command> run -d --name <container_name> -e PASSWORD=<your_desired_pw> -v <volume_name>:<abs_dst_path> -p 8888:8888 -p 6006:6006 <image>
# Login to container
docker exec -it <container_name> bash
apt-get update
apt-get install -y wget
wget https://raw.githubusercontent.com/tensorflow/models/master/tutorials/image/cifar10_estimator/generate_cifar10_tfrecords.py
# Convert raw data to TFRecord
python generate_cifar10_tfrecords.py --data-dir=<abs_dst_path>

Everything in generate_cifar10_tfrecords.py is specific to parsing CIFAR-10 except for

def _int64_feature(value):
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def _bytes_feature(value):
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[str(value)]))
# ...
        example = tf.train.Example(features=tf.train.Features(
            feature={
                'image': _bytes_feature(data[i].tobytes()),
                'label': _int64_feature(labels[i])
            }))
        record_writer.write(example.SerializeToString())

The preceding code packs each image and its corresponding label into a single tf.train.Example and writes the serialized result as one record of a TFRecord file. The binary serialization uses Protobuf, and a Feature supports only three value types: bytes, float, and int64.
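As a rough illustration of the three-type restriction, the following sketch classifies a Python value by the kind of Feature list that could hold it. The helper name feature_kind is hypothetical and is not part of TensorFlow; it only mirrors the bytes/float/int64 split described above.

```python
def feature_kind(value):
    """Name the Feature list type that could hold `value` (hypothetical helper)."""
    if isinstance(value, (bytes, bytearray)):
        return 'bytes_list'
    if isinstance(value, int):  # note: bool is a subclass of int
        if not -2**63 <= value < 2**63:
            raise OverflowError('value does not fit in int64')
        return 'int64_list'
    if isinstance(value, float):
        return 'float_list'
    raise TypeError(
        'convert {} to bytes, float, or int first'.format(type(value).__name__))


# A CIFAR-10 image is 32 * 32 * 3 = 3072 raw bytes; its label is a small integer.
image_kind = feature_kind(bytes(3072))
label_kind = feature_kind(7)
```

Anything else, such as a text string or a nested list, has to be converted first, which is why the original code calls tobytes() on each image before wrapping it.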

Processing TFRecords

cifar10.py serves as a template for parsing TFRecords, preprocessing each image, and batching the results up for execution. The proposed template abstracts away how to scale up batching and instead focuses on what operations to perform on each data item:

    filenames = self.get_filenames()
    # Repeat infinitely.
    dataset = tf.data.TFRecordDataset(filenames).repeat()

    # Parse records.
    dataset = dataset.map(self.parser, num_parallel_calls=batch_size)
    dataset = dataset.prefetch(2 * batch_size)

As the snippet above illustrates, the filenames are just references to blobs, which could be stored on a distributed file system.
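The separation between per-item work and batching can be mimicked in plain Python. The sketch below is only an analogy under made-up assumptions: the record format ('label:payload' strings), the parse function, and input_pipeline are all hypothetical, and nothing here reflects how tf.data is implemented internally.

```python
import itertools


def parse(record):
    """Hypothetical per-item parser: split a 'label:payload' string."""
    label, payload = record.split(':')
    return payload.upper(), int(label)


def input_pipeline(records, batch_size):
    """Yield batches forever, loosely mirroring repeat -> map -> batch."""
    repeated = itertools.cycle(records)   # plays the role of .repeat()
    parsed = map(parse, repeated)         # plays the role of .map(self.parser)
    while True:                           # plays the role of batching
        yield list(itertools.islice(parsed, batch_size))


records = ['3:cat', '5:dog', '0:plane']
batches = input_pipeline(records, batch_size=2)
first = next(batches)
second = next(batches)
```

Only parse knows anything about the data; the surrounding plumbing stays the same when the record format changes, which is the point of the template.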

Modular Network Architecture

model_base.py implements all the variations of a residual block while cifar10_model.py defines a computation graph for the forward propagation using those building blocks. The backward propagation of gradients is handled by TensorFlow’s optimizers using automatic differentiation. However, if a custom operation is not constructed purely out of TensorFlow’s built-in primitives, the gradient of that operation must be provided. Take the following TensorFlow implementation of a sigmoid function as an example.

import tensorflow as tf

with tf.Session() as sess:
    x = tf.constant([3.0])
    k = tf.constant([0.1])
    y = 1 / (1 + tf.exp(-x * k))
    gr = tf.gradients(y, [x, k])
    init_op = tf.global_variables_initializer()

    sess.run(init_op)

    _ = 'x: {}\nk: {}\ny: {}\ndx: {}\ndk: {}'
    print(_.format(x.eval(), k.eval(), y.eval(),
                   gr[0].eval(), gr[1].eval()))

TensorFlow will automatically compute \(\frac{\partial y}{\partial x}\) and \(\frac{\partial y}{\partial k}\). The alternative is to define the sigmoid function as an operation using some experimental features.

import numpy as np
import tensorflow as tf

def py_func(func, inp, Tout, stateful=True, name=None, grad=None):
    unique_name = 'PyFuncGrad' + str(np.random.randint(0, 1E+8))

    tf.RegisterGradient(unique_name)(grad)
    g = tf.get_default_graph()
    with g.gradient_override_map({"PyFunc": unique_name}):
        return tf.py_func(func, inp, Tout, stateful=stateful, name=name)

def tf_sigmoid(x, k, name=None):
    def np_sigmoid(x, k):
        _ = 1 / (1 + np.exp(-x * k))
        return _.astype(np.float32)

    def tf_sigmoid_gradient(op, grad):
        x, k = op.inputs
        f = 1.0 / (1.0 + tf.exp(-x * k))
        df = f * (1 - f)
        dfdx, dfdk = k * df, x * df

        return grad * dfdx, grad * dfdk

    with tf.name_scope(name, "sigmoid", [x, k]) as name:
        _ = py_func(np_sigmoid,
                    [x, k],
                    [tf.float32],
                    name=name,
                    grad=tf_sigmoid_gradient)
        return _[0]

with tf.Session() as sess:
    x = tf.constant([3.0])
    k = tf.constant([0.1])
    y = tf_sigmoid(x, k)
    gr = tf.gradients(y, [x, k])
    init_op = tf.global_variables_initializer()

    sess.run(init_op)

    _ = 'x: {}\nk: {}\ny: {}\ndx: {}\ndk: {}'
    print(_.format(x.eval(), k.eval(), y.eval(),
                   gr[0].eval(), gr[1].eval()))

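np_sigmoid runs as ordinary Python on NumPy arrays, so it should not use any TensorFlow primitives. tf_sigmoid_gradient, in contrast, is called while the graph is being built and receives tensors, so it should be implemented purely in TensorFlow. Mixing the two tends to produce odd errors.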
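A hand-written gradient is easy to get wrong, so it may help to check the formulas used in tf_sigmoid_gradient against a finite-difference estimate. The following is a minimal sketch in plain Python using the same inputs (x = 3.0, k = 0.1); the function names are illustrative and not part of the tutorial code.

```python
import math


def sigmoid(x, k):
    return 1.0 / (1.0 + math.exp(-x * k))


def sigmoid_grads(x, k):
    """Analytic partial derivatives, as in tf_sigmoid_gradient."""
    f = sigmoid(x, k)
    df = f * (1.0 - f)
    return k * df, x * df


def finite_difference(fn, x, k, eps=1e-6):
    """Central-difference estimate of the same two partial derivatives."""
    dx = (fn(x + eps, k) - fn(x - eps, k)) / (2 * eps)
    dk = (fn(x, k + eps) - fn(x, k - eps)) / (2 * eps)
    return dx, dk


analytic = sigmoid_grads(3.0, 0.1)
numeric = finite_difference(sigmoid, 3.0, 0.1)
print('analytic: {}\nnumeric:  {}'.format(analytic, numeric))
```

If the two pairs disagree by more than a small tolerance, the custom gradient is the first place to look.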

Tuning Hyperparameters

cifar10_main.py and cifar10_utils.py serve as the glue for the preceding code [CHH+17]. They provide default values for the initial learning rate, learning rate schedule, and optimizer. The default configuration can train on a single host with CPUs or GPUs and automatically writes some summaries for TensorBoard. Training on multiple hosts requires the following code to be added to cifar10_main.py:

  # Cluster setup must be defined before RunConfig
  if replica_type is not None:
      cluster = {'master': ['localhost:2222'],
                 'ps': ['localhost:2223'],
                 'worker': ['localhost:2224']}
      os.environ['TF_CONFIG'] = json.dumps(
          {
              'cluster': cluster,
              'task': {'type': replica_type, 'index': 0},
              'environment': 'cloud'
          }
      )
  # ...
  parser.add_argument(
      '--replica-type',
      choices=['master', 'worker', 'ps'],
      type=str,
      default=None,
      help='Cluster configuration.')
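For reference, the TF_CONFIG value assembled in the snippet above is plain JSON, and presumably each process in the cluster is started with a different --replica-type while sharing the same cluster description. The sketch below builds the string for each role using only the json module; make_tf_config is a hypothetical helper, not part of cifar10_main.py.

```python
import json

# Same single-machine cluster layout as in the snippet above.
CLUSTER = {'master': ['localhost:2222'],
           'ps': ['localhost:2223'],
           'worker': ['localhost:2224']}


def make_tf_config(replica_type, index=0, cluster=CLUSTER):
    """Return the JSON string one process would export as TF_CONFIG."""
    if replica_type not in cluster:
        raise ValueError('unknown replica type: {}'.format(replica_type))
    if not 0 <= index < len(cluster[replica_type]):
        raise IndexError('no task {} for {}'.format(index, replica_type))
    return json.dumps({
        'cluster': cluster,
        'task': {'type': replica_type, 'index': index},
        'environment': 'cloud',
    })


configs = {role: make_tf_config(role) for role in CLUSTER}
for role, config in sorted(configs.items()):
    print(role, config)
```

Only the task entry differs between roles; the cluster entry has to be identical everywhere.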

Note that the default initial learning rate is too large for the full pre-activation residual unit. Make sure to halve it before training; otherwise training fails with ERROR:tensorflow:Model diverged with loss = NaN.
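Why an overly large learning rate ends in NaN can be seen on a toy problem. The sketch below runs plain gradient descent on loss(w) = w**2; it is not the CIFAR-10 model, and the learning rates 1.2 and 0.6 are made up purely to show one value that diverges and its half that converges.

```python
import math


def train(learning_rate, steps, w=1.0):
    """Run gradient descent on loss(w) = w**2 and return the final loss."""
    for _ in range(steps):
        gradient = 2.0 * w
        w = w - learning_rate * gradient
    return w * w


# Each step multiplies w by (1 - 2 * learning_rate): -1.4 versus -0.2.
diverged = train(learning_rate=1.2, steps=3000)
converged = train(learning_rate=0.6, steps=3000)
print('lr=1.2 -> loss {}'.format(diverged))
print('lr=0.6 -> loss {}'.format(converged))
```

With the larger rate the iterate grows until it overflows to infinity, after which the update computes inf - inf and the loss becomes NaN; with the halved rate it shrinks towards zero.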

Monitor Training Session

cifar10_model.py has been modified to visualize the intermediate outputs between layers.

  def visualize_tensor_as_images(self, x, name, channels=1, max_outputs=1):
    if self._data_format != 'channels_last':
      #convert NCHW -> NHWC
      x = tf.transpose(x, [0, 2, 3, 1])

    # Maps value range to [0, 1]
    x_min = tf.reduce_min(x)
    x_max = tf.reduce_max(x)
    x = (x - x_min) / (x_max - x_min)

    _ = tf.split(x, x.shape[3] // channels, axis=3)
    for i, layer in enumerate(_):
      tf.summary.image('{}-{}'.format(name, i), layer, max_outputs=max_outputs)

Even though the tensor summary operations can be called from anywhere, the preceding solution requires direct access to the outputs. An alternative is

x = tf.get_default_graph().get_tensor_by_name('resnet/tower_0/Relu:0')

where ‘resnet/tower_0/Relu:0’ can be found by manual inspection:

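# List every tensor in the graph; activations such as Relu are op outputs,
# not variables, so they do not appear in the variable collections.
for op in tf.get_default_graph().get_operations():
  for tensor in op.outputs:
    tf.logging.info(tensor.name)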

Correspondingly, when a weight variable is not named, TensorFlow provides a default name under the current variable scope. Consider the first convolutional layer of any network. TensorFlow would specify conv2d as the default name and set the corresponding weight’s name to kernel. Passing scope=’conv2d/kernel’ to tf.get_collection would return a list of variables whose names start with conv2d/kernel, because the scope is matched as a regular expression anchored at the beginning of the name. The name of a convolutional layer beyond the first layer takes the form of conv2d_i where \(i\) is a decimal integer. However, this scheme is not in any specification. Thus,

for _ in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES):
  tf.logging.info(_.name)

along with any other lookup that relies on these default names, should only be used for non-production purposes. Furthermore, visualizing the filter weights seems to have fallen out of fashion after VGGNet.
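To make the scope filtering concrete, here is a small sketch that imitates it with re.match on a list of names. The variable names are invented for illustration (in particular the resnet/ prefix), and filter_by_scope is a hypothetical stand-in for what tf.get_collection does with its scope argument, not TensorFlow code.

```python
import re


def filter_by_scope(names, scope):
    """Keep names accepted by re.match(scope, name), i.e. a prefix match."""
    return [name for name in names if re.match(scope, name)]


# Hypothetical variable names following the default naming pattern.
names = ['conv2d/kernel:0',
         'conv2d/bias:0',
         'conv2d_1/kernel:0',
         'resnet/conv2d/kernel:0']

first_kernel = filter_by_scope(names, 'conv2d/kernel')
all_top_level = filter_by_scope(names, 'conv2d')
any_depth = filter_by_scope(names, '.*conv2d/kernel')
```

Because the match is anchored at the start, a variable nested under another scope is only found when the pattern accounts for the prefix, as the last call does.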

# Original at https://gist.github.com/kukuruza/03731dc494603ceab0c5
def tensor_to_grid(tensor, padding=1):
  def factorize(n):
    for i in range(int(np.sqrt(n)), 1, -1):
      if n % i == 0: return (i, n // i)
    raise ValueError('Number of output filters cannot be a prime number')

  (grid_Y, grid_X) = factorize(tensor.shape[3].value)

  # Maps value range to [0, 1]
  x_min = tf.reduce_min(tensor)
  x_max = tf.reduce_max(tensor)
  x = (tensor - x_min) / (x_max - x_min)

  # Add padding to each filter
  x = tf.pad(x, [[padding, padding], [padding, padding], [0, 0], [0, 0]])
  Y = tensor.shape[0] + 2 * padding
  X = tensor.shape[1] + 2 * padding
  channels = tensor.shape[2]

  # Move number output channels to the 1st dimension
  x = tf.transpose(x, [3, 0, 1, 2])
  # Organize grid on Y-axis
  x = tf.reshape(x, tf.stack([grid_X, Y * grid_Y, X, channels]))

  # Swap X and Y axes
  x = tf.transpose(x, [0, 2, 1, 3])
  # Organize grid on X-axis
  x = tf.reshape(x, tf.stack([1, X * grid_X, Y * grid_Y, channels]))

  # Convert back to (height, width, input channels, output channels) order
  x = tf.transpose(x, [2, 1, 3, 0])

  # Convert to (batch size = 1, height, width, channels)
  x = tf.transpose(x, [3, 0, 1, 2])

  return x

def summarize_tensor(name, tensor):
  dim = len(tensor.shape)
  if dim == 1:
    tf.summary.scalar(name, tensor)
    for i in range(tensor.shape[0]):
      tf.summary.scalar('{}_{}'.format(name, i), tensor[i])
  elif dim == 2:
    tf.summary.histogram(name, tensor)
  else:
    grid = tensor_to_grid(tensor)
    if grid.shape[3] == 3:
      tf.summary.image(name, grid)
    else:
      if grid.shape[3] % 3 == 0:
        _ = tf.split(grid, grid.shape[3] // 3, axis=3)
      else:
        _ = tf.split(grid, grid.shape[3], axis=3)
      for i, grid in enumerate(_):
        tf.summary.image('{}-{}'.format(name, i), grid)

def visualize_weights(name, label):
  _ = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=name)
  for tensor in _:
    summarize_tensor(label, tensor)

def visualize_gradients(name, loss, label, as_scalar=False):
  x = tf.get_default_graph().get_tensor_by_name(name)
  dl_dx = tf.gradients(loss, [x])
  for tensor in dl_dx:
    if as_scalar:
      _ = tf.reshape(tensor, [-1])
    else:
      _ = tf.reshape(tensor, [2, -1])
    summarize_tensor(label, _)

Here tf.gradients returns the gradient of the loss with respect to x. This means that if the loss function is a sum of per-example losses, then the gradient is also the sum of the per-example loss gradients. To get per-example gradients, use a batch size of one or loop through each example in the batch.
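The summation property can be checked numerically on a one-parameter model. The sketch below assumes a made-up squared-error loss (w * x - y)**2 and made-up data; it compares the sum of per-example analytic gradients with a finite-difference gradient of the summed loss.

```python
def per_example_grad(w, x, y):
    """Derivative of (w * x - y)**2 with respect to w for one example."""
    return 2.0 * (w * x - y) * x


def total_loss(w, xs, ys):
    """Sum of the per-example squared errors over the batch."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))


def total_loss_grad(w, xs, ys, eps=1e-5):
    """Central-difference gradient of the summed loss."""
    return (total_loss(w + eps, xs, ys) - total_loss(w - eps, xs, ys)) / (2 * eps)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]
w = 1.5

per_example = [per_example_grad(w, x, y) for x, y in zip(xs, ys)]
batch_gradient = total_loss_grad(w, xs, ys)
print('per-example: {}\nsum: {}\nbatch: {}'.format(
    per_example, sum(per_example), batch_gradient))
```

The batch gradient only exposes the sum; the individual entries of per_example are what a loop over single examples would recover.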

Transfer Learning

TF-slim contains a lot of pre-trained models that can be extracted as follows:

# Original model at http://download.tensorflow.org/models/inception_resnet_v2_2016_08_30.tar.gz
# Extraction code at https://github.com/tensorflow/models/blob/master/research/slim/nets/inception_resnet_v2.py
import tensorflow as tf
from inception_resnet_v2 import inception_resnet_v2, inception_resnet_v2_arg_scope

width = inception_resnet_v2.default_image_size
height = width
channels = 3
X = tf.placeholder(tf.float32, shape=[None, height, width, channels])

with tf.contrib.slim.arg_scope(inception_resnet_v2_arg_scope()):
  # Specify X as the input to the network.
  last_layer, end_points = inception_resnet_v2(X, num_classes=0, is_training=False)

saver = tf.train.Saver()
with tf.Session() as sess:
  saver.restore(sess, 'inception_resnet_v2_2016_08_30.ckpt')
  saver.save(sess, './my-test-model')

Here TF-slim dynamically creates the graph for the pre-trained model to enable different configurations. The alternative is to explicitly load a model’s graph:

import tensorflow as tf

sess = tf.Session()
saver = tf.train.import_meta_graph('my-test-model.meta')
saver.restore(sess, tf.train.latest_checkpoint('./'))

# Query the desired op.
_ = 'InceptionResnetV2/Conv2d_7b_1x1/weights:0'
last_conv_layer = tf.get_default_graph().get_tensor_by_name(_)

# Instruct TensorFlow to not change any weights before this op.
last_conv_layer = tf.stop_gradient(last_conv_layer)

# Optionally augment the model

# Feed in new data for evaluation or fine-tuning
width = 299
height = width
channels = 3
image = tf.random_normal([width, height, channels])
resized_image = tf.image.resize_images(image, [height, width])

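# Note: this placeholder is new and is not connected to the restored graph;
# to drive the restored network, look up its input tensor by name instead.
X = tf.placeholder(tf.float32, shape=[None, height, width, channels])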

# Convert Tensor to numpy to avoid
#   ValueError: setting an array element with a sequence.
example = sess.run(resized_image)

# Get the current value of the desired op.
before = sess.run(last_conv_layer)

# Get the value after feeding in the new example.
after = sess.run(last_conv_layer, feed_dict={X: [example]})

Once the model is loaded, there is nothing special about augmenting the existing model.

References

[CHH+17] Heng-Tze Cheng, Zakaria Haque, Lichan Hong, Mustafa Ispir, Clemens Mewald, Illia Polosukhin, Georgios Roumpos, D Sculley, Jamie Smith, David Soergel, and others. TensorFlow Estimators: managing simplicity vs. flexibility in high-level machine learning frameworks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1763–1771. ACM, 2017.