Optimized Deep Learning Pipelines: A Deep Dive into TFRecords and Protobufs (Part 1)

Kelly “Scott” Sims
Published in Heartbeat
12 min read · Jul 25, 2023


This is part 1 of a 2-part series. To view the second part, see here.

Introduction

When it comes to practicing deep learning at home vs. in industry, there’s a huge disconnect. Every course, tutorial, and YouTube video presents you with a nicely prepared dataset to feed any DL algorithm in any DL framework. TensorFlow itself comes with the Dataset API, which allows you to download and train on data with just a couple of lines of code. However, when it comes to real-life production work at a company, nowhere on this earth will someone just hand you a pristine dataset ready for consumption. Consideration must be given to things like:

  • File format — Are flat files sufficient, should the data be serialized, etc.?
  • File structure — Should there be a pattern to the directories separating training examples from labels, some hybrid data structure, etc.?
  • Data location — Can the data be batch-fetched from the cloud, or does it need to exist locally?
  • Data processing — Is there another system responsible for collecting and processing the data? And is that system in a completely different framework or programming language? If so, how much effort does it take to go from that system to a deep learning framework-ready system?
  • CPU/GPU — Are you limited to CPU-only processing (hopefully not), or is there GPU access?

Although not as sexy as model building, these items are important because time is money. Slow input pipelines mean slow training times, which has a few consequences. The longer a model takes to train, the longer engineers must wait between iterations for tweaking and updating, which keeps those engineers from working on other value propositions. If a company is utilizing cloud resources, it also means larger bills for resource utilization. And the longer a model is in development, the more time is lost that it could have spent in production generating value.

TFRecords

So today, we are going to explore how we can optimize our deep learning pipelines using TensorFlow’s TFRecords. I’ve seen a few blogs on this topic, and all of them fail to adequately describe what TFRecords are. They mostly regurgitate examples from the docs, which themselves are quite lacking. So today, I’m going to teach you everything you wanted (and didn’t want) to know about TFRecords.

The TFRecord format is a protobuf-backed format for storing a sequence of binary records. Protobufs are a cross-platform, cross-language library for efficient serialization of structured data. Protocol messages are defined in .proto files, which are often the easiest way to understand a message type.

Protobufs?

So what is a protobuf (aka protocol buffer)? To answer that question, I’m going to mix some technical jargon with some actual examples that explain the jargon.

Protobufs are a language-agnostic, platform-neutral, and extensible mechanism for serializing structured data. They were developed by Google and released as an open-source project. Protocol Buffers are widely used for efficient and reliable data exchange between different systems, especially in scenarios where performance, compactness, and language interoperability are important.

Now all of that may or may not mean anything to you, but let’s explain it by stepping through all the pros of using protobufs. We are going to touch on:

  • Language Interoperability
  • Forward & Backward Compatibility
  • Efficiency
  • Schema Validation

1. Language Interoperability:

Protocol Buffers provide support for generating code in multiple programming languages, enabling different systems written in different languages to communicate seamlessly by sharing a common data structure.

Let’s say I want to create a system that collects social media posts from people so that I can train models with the data. The backend of the site might be written in Golang, the web scraper in C++, and the data cleaning and preparation in Python. By using protobufs, we can define the schema once and compile it for all the aforementioned languages. This empowers a “write once — use everywhere” type of system, which drastically reduces engineering time.

Conversely, if we were to pass our data around as, say, JSON or XML, we would have to write wrapper classes or objects for each language that consumes that data. That opens the door to a lot of bugs, mistakes, and maintenance. Any update to the data schema means updating a bunch of code across a bunch of different frameworks.

Implementing Protobufs

We start off defining our protobufs by creating a file called “text_data.proto”. We then define the attributes of a “Post”, which comprises a body of text and a timestamp of when it was written.

syntax = "proto3";

package text_data;

message Post {
  // body of social media post
  string body = 1;

  // Timestamp when post was created
  Timestamp timestamp = 2;
}

message Timestamp {
  int64 seconds = 1;
  int32 nanos = 2;
}

Notice that we are defining the data types for each attribute. This is because all code generated from the proto file will consist of strongly, statically typed objects. Yes, even in Python (we will see this later). Next, a post belongs to a user, and that user may have 0 or more posts. So let’s define that.

message User {
  // UUID representing user
  string userId = 1;

  // All posts created by user
  repeated Post posts = 2;
}

We define 0 or more posts by using the “repeated” keyword. This signals to the protobuf compiler, at compile time, that the generated object should be an array of some sort that holds objects of type “Post” and nothing else. Finally, we just need a way to collect all the users and their posts in a single parent object.

message UserData {
  // Users and their posts
  repeated User users = 1;
}

Each of these messages defines individual objects that will be created for any language we compile for. The overall proto file should look like this:

[Image: the complete text_data.proto file, combining the Post, Timestamp, User, and UserData messages defined above.]

In order to compile, you need to install the protoc command line tool. I won’t go into much detail on this part because it’s not overly important to this post; this is just a quick crash course on protobufs and what they are. You won’t actually need to do this when it comes to TFRecords and training models. It just sets the basis for what comes later.

Compiling for Golang

# protoc command to generate Golang code from text_data.proto
protoc -I . --go_out=. --go-grpc_out=. text_data.proto

Again, don’t worry too much about this step since you won’t actually need it for TFRecords; it just signals to the compiler to generate Golang code from the protobufs defined in the proto file. Once we run this, it will generate a file called “text_data.pb.go”.

[Image: the generated text_data.pb.go file, with the Post struct visible and the editor’s outline panel listing the structs generated for every message.]

This file is the Golang version of what we defined. It uses Golang-native data types and structures. Golang doesn’t have classes; instead, it uses C-style structs. You can see that there is a Post struct representing the “Post” message we created. In the outline section on the left, you can see it created structs for all the messages we defined, plus a bunch of other methods and goodies. Some of these goodies allow us to represent our object as a string to print to the console if we so choose. They also give us the ability to convert our protobufs to and from JSON objects.


Compiling for Python

# protoc command to generate Python code from text_data.proto
# Only generate the protobuf messages and not any gRPC client/server
# code. To do this, the "--grpc_python_out" argument is dropped from
# the invocation. We may discuss gRPC client/server definitions later.
protoc -I . --python_out=. text_data.proto

When compiling for Python, you get something different. You don’t get actual class implementations of our protobufs; you get a metaclass.

“Metaclasses are deeper magic than 99% of users should ever worry about. If you wonder whether you need them, you don’t (the people who actually need them know with certainty that they need them, and don’t need an explanation about why).”

— Tim Peters

[Image: the generated Python module (text_data_pb2.py), where a DESCRIPTOR variable is populated with the serialized definition of our proto.]

You can see on line 16 in the generated Python code image above that there is a DESCRIPTOR variable being injected with a serialized definition of our proto. Although cut off in the image, that serialized string is extremely long. This description is used by the metaclass to ensure that any instance of our protobufs in Python code strictly adheres to the definition in the proto file. We will circle back and talk about this more.
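Before we do, if you’re curious, you can peek at that descriptor yourself once the generated module is importable. A tiny sketch (assuming protoc’s default module name of text_data_pb2 for text_data.proto):

# Assumes the generated text_data_pb2 module is importable.
import text_data_pb2 as pb

# The metaclass wires each generated class to a descriptor that records
# exactly the fields declared in the .proto file
print(list(pb.Post.DESCRIPTOR.fields_by_name))            # ['body', 'timestamp']
print(pb.User.DESCRIPTOR.fields_by_name["posts"].label)   # 3 == repeated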

When we are ready to use our newly compiled protos, we just import them and use them as if they were native objects of the programming language we are working with. The best way to think of protobufs while using them in your codebase is as “value classes”.

For those who don’t come from CS or SWE backgrounds, a “value class” typically refers to a specific type of class or data structure that represents a single value or entity. A value class encapsulates a value and provides operations or methods related to that value, but it does not have identity or mutability. Tangential examples would be AutoValue in Java, data classes in Kotlin, and the dataclass decorator in Python.
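To make that concrete, here’s a rough Python analogy (not protobuf code, just an illustration): a frozen dataclass also just wraps values and refuses mutation after construction.

from dataclasses import dataclass

# A plain value holder: no identity, no mutation after construction
@dataclass(frozen=True)
class PostValue:
    body: str
    seconds: int

p = PostValue(body="hello world", seconds=1690243200)
# p.body = "bye"  # would raise dataclasses.FrozenInstanceError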

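As for the protobufs themselves, here is a minimal sketch of what working with the compiled Python module looks like in practice. It assumes protoc’s default module name of text_data_pb2; everything else comes straight from the messages we defined above.

# A minimal usage sketch. Assumes protoc generated "text_data_pb2.py"
# from text_data.proto (protoc's default module naming).
import text_data_pb2 as pb

# Build a post exactly as the schema defines it (arbitrary epoch seconds)
post = pb.Post(
    body="Just trained a model that tells cats from dogs!",
    timestamp=pb.Timestamp(seconds=1690243200, nanos=0),
)

# A user with 0 or more posts ("repeated Post posts")
user = pb.User(userId="123e4567-e89b-12d3-a456-426614174000")
user.posts.append(post)

# The parent object collecting all users
data = pb.UserData(users=[user])

# Serialize to a compact binary blob and parse it back
blob = data.SerializeToString()
restored = pb.UserData.FromString(blob)
print(restored.users[0].posts[0].body)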

2. Forward and Backward Compatibility:

Protocol Buffers support versioning and evolution of data structures. You can add new fields to a message without breaking existing code that was built with the previous version of the message. This allows for easier maintenance and updates in distributed systems.

This one is very simple in explanation. If we ever want to change or improve any of our protos, we can simply add a new field. If this field is meant to replace an old field, all we have to do is mark that old field as deprecated. Any new code using our protos will be flagged not to use the deprecated field. Any old code that isn’t aware of the new field will still work as intended because we never actually removed the old field.

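As a rough sketch in proto terms, suppose we wanted to replace the plain body field on Post with a richer field (the rich_body name here is purely hypothetical):

message Post {
  // Old field kept for backward compatibility; new code is warned away from it
  string body = 1 [deprecated = true];

  // Timestamp when post was created
  Timestamp timestamp = 2;

  // Replacement field; old readers simply ignore it
  string rich_body = 3;
}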

This is far better than JSON or XML. If you are expecting JSON/XML version ‘X’ but you get version ‘Y’, your code more than likely won’t work. It will fail to parse properly because there are new fields your code isn’t aware of, or worse, fields your code expects to be there have been removed. Here, we don’t have that problem. Backward compatibility will always exist as long as you don’t delete the field from the proto message. There’s also no penalty for not using a field.

3. Efficiency:

Protocol Buffers are highly efficient in terms of both space and processing time. The serialized data is usually smaller than equivalent XML or JSON representations, resulting in reduced storage and transmission costs. Additionally, the encoding and decoding operations are faster, making it suitable for high-performance systems.

As a demonstration, we will create 1 million users, each of whom has written a social media message at the maximum length of 280 characters. We will then write the data from the proto both in serialized binary format and in JSON format. As I said earlier, protos afford you the ability to convert back and forth to JSON as long as you adhere to the strict schema. We will then time the write operation and inspect the overall file size written to disk.
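A rough sketch of such a benchmark is below. It assumes the generated text_data_pb2 module from earlier; the exact timings will of course vary by machine.

# Rough benchmark sketch: 1 million users, one 280-character post each.
# Assumes the generated text_data_pb2 module; timings vary by machine.
import time
from google.protobuf.json_format import MessageToJson

import text_data_pb2 as pb

data = pb.UserData()
for i in range(1_000_000):
    user = data.users.add(userId=f"user-{i}")
    user.posts.add(body="x" * 280, timestamp=pb.Timestamp(seconds=1690243200))

# Time the binary write
start = time.perf_counter()
with open("users.bin", "wb") as f:
    f.write(data.SerializeToString())
print(f"binary write: {time.perf_counter() - start:.3f}s")

# Time the JSON write of the same message
start = time.perf_counter()
with open("users.json", "w") as f:
    f.write(MessageToJson(data))
print(f"json write: {time.perf_counter() - start:.3f}s")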

[Image: results of writing 1 million users in binary format vs. JSON format. It took ~0.24 seconds to write the data in binary format and ~3.3 seconds to write the same data to JSON.]
How much better is protobuf than JSON (for our anecdotal example, anyway)?

| | Write Time (s) | File Size (KB) | File Size (MB) |
|-------------|----------------|----------------|----------------|
| Binary | 0.235 | 355,469 | 355.5 |
| JSON | 3.297 | 539,063 | 539.1 |
| Performance | ~14x faster | | ~35% reduction |

4. Schema Validation:

The defined message structures in Protocol Buffers act as a schema that can be used to validate the data being exchanged. It ensures that the received data adheres to the expected structure and type constraints. The reason for the Python metaclass (shown earlier) is that protobufs inherently provide type safety — meaning they have defined types that must be obeyed. They are immutable, and the structure of the class must not and cannot ever change. In other words, what we defined in the proto file and generated with protoc should be exactly how the class is, always. No code at runtime is allowed to change the structure of the class, only the data it contains. Python, on the other hand, is a dynamic “duck-typing” language that has no true concept of static types, nor any native access modifiers that make members private or protected. The examples below show the problems plain Python has in this respect.

class Car:
    def __init__(self, doors, cylinders, mpg, hp):
        self.doors = doors
        self.cylinders = cylinders
        self.mpg = mpg
        self.hp = hp

camry = Car(doors=4, cylinders=6, mpg=35, hp=280)

"""
Python has no access modifiers such as 'private' or 'protected'
like in Java or C++, where getters and setters are used. Thus there
is no way of preventing the direct access and modification of a class
member. The below is "almost" always legal in Python.
"""
camry.doors = 2

"""
Python also doesn't natively provide any means of preventing code
from updating the original structure of the class.
"""
camry.color = "blue"  # this member wasn't part of the original definition

"""
Python is not a statically typed language. It does not enforce that a
declared variable of a given type ALWAYS be that type. Although we
intended the mpg member to be an integer value, nothing prevents us
from updating it to a string value.
"""
camry.mpg = "thirty miles per gallon"

Thus, the metaclass ensures we follow the exact structure and types defined in the proto file. This type safety is what enables the platform-agnostic nature of protobufs. We can’t alter or add anything to a protobuf at runtime that wouldn’t be understood by the same protobuf running on a different platform.

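As a rough illustration (again assuming the generated text_data_pb2 module), here is what happens when you try the same tricks from the Car example on a generated message:

# Assumes the generated text_data_pb2 module; illustrates the enforced schema.
import text_data_pb2 as pb

post = pb.Post(body="hello world")

# Assigning the wrong type is rejected immediately
try:
    post.body = 123
except TypeError as err:
    print("type error:", err)

# Adding a field that isn't in the schema is rejected too
try:
    post.color = "blue"
except AttributeError as err:
    print("attribute error:", err)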

Enforcing this in statically typed languages such as Java, C++, and Go is pretty straightforward. If you define a variable as some type, it can only ever be that type. These languages also come with access modifiers, so you can make fields private and inaccessible from outside the class. This way, as you pass the protos between systems built on different platforms, each system still knows how to handle the data because it adheres to the strict schema of the proto.

Protobuf Conclusion

Overall, Protocol Buffers are a powerful and flexible tool for data serialization and interchange. They are commonly used in various domains, including distributed systems, APIs, communication protocols, and data storage formats. These are only a few of the benefits of using them; we never even touched on data transmission across networks. As a quick aside on that point, protobufs are the backbone of:

gRPC (Google Remote Procedure Call)

If you are unfamiliar with gRPC, I highly suggest you check out the docs. In short, it’s a better framework than REST services: it supports HTTP/2 and enables full-duplex communication, which means faster transfers and fewer network requests, all powered by protobufs! Something to think about as you’re constantly requesting the next batch of data from remote storage to train your model.
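To give a flavor, a gRPC service is itself declared in a proto file alongside the messages. The sketch below is hypothetical (the service and method names are made up), but it shows how request/response and bidirectional streaming RPCs over our messages would be defined:

// Hypothetical service built on the messages from text_data.proto
service UserDataService {
  // Simple request/response call
  rpc GetUserData (User) returns (UserData);

  // Bidirectional streaming: client and server exchange posts full duplex
  rpc StreamPosts (stream Post) returns (stream Post);
}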

Thanks for making it all the way to the end! Keep reading Part 2 of this series here.

