5 TensorFlow techniques to eliminate overfitting in DNNs

Salma Ghoneim · Published in Heartbeat · Jul 17, 2019

Deep neural networks (DNNs) can have tens of thousands of parameters, and in some cases even millions. This enormous number of parameters gives the network a great deal of freedom and the flexibility to fit highly complex data.

This flexibility is only good up to a certain point. Once that point is crossed, we start talking about overfitting.

What is Overfitting?

Overfitting happens when a machine learning model is overtrained on the training portion of the dataset, so that when the model meets new data, it performs much worse than expected.

[Figure: a model (green line) that fits the training data too closely. Source: Wikipedia]

In the above figure, the green line is a perfect, too-good-to-be-true model. When such a model is evaluated on the training set, accuracy measures may reach upwards of 98% and even a solid 100%.

But this percentage gives a false impression that the model is (nearly) perfect. In reality, it isn't. Once such a model is tested on unseen data, it behaves in unexpected ways. This is because it has been trained so heavily on the training set that it has nearly memorized everything about the training data, and the problem is that training data usually contains errors and irregularities.

There are various ways to prevent overfitting when dealing with DNNs. In this post, we'll review five of these techniques and then apply each of them specifically to TensorFlow models:

  1. Early stopping
  2. L1 and L2 regularization
  3. Dropout
  4. Max-norm regularization
  5. Data augmentation

Early Stopping

This technique refers to interrupting training when the model's performance on the validation set starts dropping.

[Figure: training vs. validation error over time, with training interrupted at the early stopping point. Source: Deeplearning]

With TensorFlow

  1. Evaluate the model on the validation set at regular intervals (e.g. every 100 steps).
  2. Save a “best” snapshot whenever the model outperforms the previous “best” score.
  3. If the number of steps since the last “best” snapshot exceeds a limit (e.g. 4,000 steps), interrupt training and roll back to the last “best” snapshot.
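
Here's a minimal sketch of such a loop in the TF 1.x style used throughout this post. It assumes you've already built a training_op and a loss op, defined placeholders X and y, and have validation arrays X_valid and y_valid; next_batch() is a hypothetical helper that yields your mini-batches, and the 100-step and 4,000-step thresholds are just the example values from the list above.

import numpy as np
import tensorflow as tf

n_steps = 20000                     # total training steps (example value)
check_interval = 100                # evaluate on the validation set every 100 steps
max_steps_without_progress = 4000   # give up after this many steps without a new "best"

saver = tf.train.Saver()
best_loss = np.inf
best_step = 0

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(n_steps):
        X_batch, y_batch = next_batch()   # hypothetical mini-batch helper
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if step % check_interval == 0:
            val_loss = loss.eval(feed_dict={X: X_valid, y: y_valid})
            if val_loss < best_loss:
                best_loss, best_step = val_loss, step
                saver.save(sess, "./best_model.ckpt")        # save the "best" snapshot
            elif step - best_step > max_steps_without_progress:
                print("Early stopping at step", step)
                break
    saver.restore(sess, "./best_model.ckpt")                 # roll back to the "best" snapshot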

Early stopping works really well in practice, but you usually get better results by combining it with other techniques.

L1 and L2 Regularization

Regularization is a technique intended to discourage the complexity of a model by penalizing the loss function.

Regularization assumes that simpler models are better for generalization, and thus better on unseen test data. You can use L1 and L2 regularization to constrain a neural network’s connection weights.

Source: Chioka’s blog

L1 is also known as Least Absolute Deviations: it minimizes the sum of the absolute differences between the target values and the estimated (predicted) values.

L2 is known as Least Squares Error: it minimizes the sum of the squares of the differences between the target values and the estimated (predicted) values.
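
Written out, with y_i the target values and \hat{y}_i the predicted values over n samples:

L1 (least absolute deviations): \( S_1 = \sum_{i=1}^{n} |y_i - \hat{y}_i| \)

L2 (least squares error): \( S_2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \)

When used to regularize a neural network, the same idea is applied to the connection weights instead: the penalty added to the training loss is \( \alpha \sum_j |w_j| \) for L1 and \( \alpha \sum_j w_j^2 \) for L2, where \( \alpha \) controls how strong the penalty is (the role played by the scale argument in the TensorFlow code below).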

With TensorFlow

Many functions that create variables (such as tf.get_variable() or tf.layers.dense()) accept a *_regularizer argument for each created variable (e.g. kernel_regularizer). You can pass any function that takes weights as an argument and returns the corresponding regularization loss. The tf.contrib.layers.l1_regularizer(), l2_regularizer(), and l1_l2_regularizer() functions return exactly such functions.

from functools import partial
import tensorflow as tf

scale = 0.001  # regularization strength (example value)

my_dense_layer = partial(
    tf.layers.dense, activation=tf.nn.relu,
    kernel_regularizer=tf.contrib.layers.l1_regularizer(scale))

with tf.name_scope("dnn"):
    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    hidden2 = my_dense_layer(hidden1, n_hidden2, name="hidden2")
    logits = my_dense_layer(hidden2, n_outputs, activation=None,
                            name="outputs")

This code creates a neural network with two hidden layers and one output layer, and it also creates nodes in the graph to compute the L1 regularization loss corresponding to each layer’s weights. TensorFlow automatically adds these nodes to a special collection containing all the regularization losses. You just need to add these regularization losses to your overall loss, like this:

reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([base_loss] + reg_losses, name="loss")

Don’t forget to add the regularization losses to your overall loss, or else they will simply be ignored.

Dropout

Dropout is the most common technique to combat model overfitting.

At each training step, every neuron (except the output neurons) has a probability p of being temporarily dropped at that step, meaning it is totally ignored during this step but may be active again at the next one. The hyperparameter p is called the dropout rate, and it's typically set to 50%. After training, neurons aren't dropped anymore.

Source: Srivastava, Nitish, et al., “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” JMLR 2014

With TensorFlow

You can simply apply the tf.layers.dropout() function to the input layer and/or to the output of any hidden layer you want.

During training, the function randomly drops some items (setting them to zero) and divides the remaining ones by the keep probability. After training, the function does nothing at all.

training = tf.placeholder_with_default(False, shape=(), name='training')
dropout_rate = 0.5  # == 1 - keep probability

X_drop = tf.layers.dropout(X, dropout_rate, training=training)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                              name="hidden1")
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu,
                              name="hidden2")
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
    logits = tf.layers.dense(hidden2_drop, n_outputs, name="outputs")
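
During the training phase you then feed training=True so that dropout is actually applied, for example (with X_batch and y_batch being a mini-batch and training_op your optimizer step):

sess.run(training_op, feed_dict={training: True, X: X_batch, y: y_batch})

At test time you simply leave training at its default value of False, so dropout is switched off automatically.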

tf.nn.dropout() is similar to tf.layers.dropout(), but it doesn’t turn off the dropout when not training, which is usually not what we want.

Max-Norm Regularization

A visualization of the max-norm regularization technique.

For each neuron, max-norm regularization constrains the weights w of the incoming connections such that ‖w‖₂ ≤ r, where r is the max-norm hyperparameter and ‖·‖₂ is the L2 norm. We do this by computing ‖w‖₂ after each training step and clipping w if needed: w ← w · r / ‖w‖₂.

With TensorFlow

TensorFlow doesn’t offer an off-the-shelf max-norm regularizer, but it’s easy to implement one.

threshold = 1.0
weights = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0") ##(1)
clipped_weights = tf.clip_by_norm(weights, clip_norm=threshold, axes=1) ##(2)
clip_weights = tf.assign(weights, clipped_weights) ##(3)
  1. Get a handle on the weights of the first hidden layer.
  2. Use clip_by_norm() to create an operation that clips the weights along the second axis, so that each row vector ends up with a maximum norm of 1.0.
  3. Create an assignment operation that assigns the clipped weights back to the weights variable.
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
clip_weights.eval()

Then you just apply this operation after each training step.

You do the above for every hidden layer you’ve got.
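
For instance, a minimal sketch that builds one clipping operation per hidden layer (using the layer names "hidden1" and "hidden2" from the earlier examples) and groups them into a single op to run after each training step could look like this:

threshold = 1.0
clip_ops = []
for layer_name in ("hidden1", "hidden2"):            # one clip op per hidden layer
    weights = tf.get_default_graph().get_tensor_by_name(layer_name + "/kernel:0")
    clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=1)
    clip_ops.append(tf.assign(weights, clipped))
clip_all_weights = tf.group(*clip_ops)               # run them all in one go

# inside the training loop:
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
sess.run(clip_all_weights)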

Data Augmentation

Another way to combat overfitting is data augmentation: the artificial generation of new training instances from existing ones. This boosts the size of the training set.

The trick is to generate realistic training instances such that a human can’t tell the difference between original ones and the ones you’ve created.

For example, if you’re working on classifying dog breeds, you can simply rotate, shift, and resize every image you have by various amounts and add the resulting images in the training set. Also, you could flip or increase/decrease contrast in each photo.

There are other filters that you could apply such as de-texturizing, de-colorizing, edge enhancement, salient edge mapping, and more.

Source: “Data augmentation-assisted deep learning of hand-drawn partially colored sketches for visual search”

With TensorFlow

No special TensorFlow machinery is needed for data augmentation; basic image processing tools such as OpenCV are enough.

OpenCV comes in very handy for image processing. Here's a quick read about it, with ready-made code snippets for every manipulation technique you'll need to augment your dataset.
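
As a rough sketch of what a few of those manipulations look like with OpenCV and NumPy (the flip code, rotation angle, shift, and contrast values below are just example settings):

import cv2
import numpy as np

def augment(image):
    """Return a few simple variations of one training image."""
    h, w = image.shape[:2]

    flipped = cv2.flip(image, 1)                                 # horizontal flip
    rot_mat = cv2.getRotationMatrix2D((w / 2, h / 2), 15, 1.0)   # rotate by 15 degrees
    rotated = cv2.warpAffine(image, rot_mat, (w, h))
    shift_mat = np.float32([[1, 0, 10], [0, 1, 10]])             # shift 10 px right and down
    shifted = cv2.warpAffine(image, shift_mat, (w, h))
    contrast = cv2.convertScaleAbs(image, alpha=1.3, beta=0)     # increase contrast

    return [flipped, rotated, shifted, contrast]

In practice you would resize every augmented image back to the network's input shape and append it, with the original label, to the training set.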

A lot of the info I wrote here came from the book Hands-On Machine Learning with Scikit-Learn & TensorFlow, which I highly recommend.


Lastly, here’s a paper titled “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” that proved on the MNIST dataset that dropout + max-norm regularization gave the least error score i.e. best results.

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Comet Newsletter), join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.
