Parallel Programming for Training and Productionization of ML/AI Systems

Sonia Singla
Published in Heartbeat · Jun 28, 2022

In today’s fast-moving world, software systems are expected to deliver results that are both more accurate and faster. The emerging expectation in ML/AI is getting results within a few seconds, so adapting to this trend is essential. How can we deliver results that quickly? You guessed it: by “parallel processing.” This article goes in-depth on the fundamentals of parallel processing, how it may be used to train ML/AI systems, and how it can be used for productionization.

What is Parallel Processing?

Parallel processing means dividing a task, whether simple or complex, into smaller modules and computing each module on a different core of the system. Because all the smaller modules execute at the same time, results are produced faster than with traditional sequential methods.

In the past, machine learning has typically been executed in single-processor environments, where algorithmic bottlenecks can cause significant delays at every stage of a model’s life cycle: training, classification, distance and error calculation, and beyond.

Source: https://leonardoaraujosantos.gitbook.io/opencl/performance

Difference Between Parallel Processing and Distributed Computing

Source: https://keetmalin.wixsite.com/keetmalin/post/2017/09/24/an-introduction-to-distributed-systems

The major distinction between parallel processing and distributed computing is memory: distributed computing spreads memory across several machines, whereas parallel processing uses a single shared memory accessed by all of one system’s cores.

In a distributed system, numerous machines are connected over a high-speed network, and each machine has its own memory. In parallel processing, by contrast, a single machine’s cores work simultaneously on shared memory to produce results. Frameworks such as Spark and Hadoop are commonly used for distributed computing.

Parallel Processing for Training ML Systems

Most machine learning algorithms involve operations on matrices and vectors. Let’s begin with one of the simplest operations and see how it can be parallelized.

Adding two matrices of the same size

Let’s say we have two matrices, A and B, both of the same size (M×N), i.e., having M rows and N columns. We want to compute a third matrix C that is the sum of A and B.

Using parallel computing, the first row of A can be added to the first row of B on one core of the system, while the other rows of A are added to the corresponding rows of B on different cores. This concept is called data parallelism: each core of the system operates on a different subset of the data to produce the result. In this way, we can compute matrix C as the sum of A and B using all the cores of the system.
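Here is a rough sketch of this idea in Python. It is illustrative only: in practice NumPy’s built-in A + B is far faster, since it avoids copying rows between processes, but the sketch makes the row-per-worker data-parallel pattern concrete.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def add_rows(pair):
    # Each worker process adds one pair of corresponding rows.
    row_a, row_b = pair
    return row_a + row_b

if __name__ == "__main__":
    M, N = 8, 4
    A = np.random.rand(M, N)
    B = np.random.rand(M, N)
    # Map each (row of A, row of B) pair to an available worker.
    with ProcessPoolExecutor() as pool:
        C = np.vstack(list(pool.map(add_rows, zip(A, B))))
    assert np.allclose(C, A + B)
```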

Why not add columns instead of rows?

The answer lies in row-major order, which most programming languages use to store arrays: the data from the first row goes first, with the data from the second row appended after it, forming one long array in memory. Summing by rows therefore walks through memory sequentially, which makes it simpler and computationally less expensive than summing by columns.
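A quick way to see this in NumPy, which also stores arrays in row-major (C) order by default:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])
print(a.flags["C_CONTIGUOUS"])  # True: NumPy defaults to row-major (C) order
print(a.ravel())                # [1 2 3 4 5 6]: row 1, then row 2, in memory
```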

For parallel processing, the data should reside in RAM where every core can reach it. To achieve this, we can create a shared memory space that all the cores can access for computation. Let’s see how to create such a shared memory pool.

The first step is to import the necessary libraries for creating shared memory. We then create a NumPy array a and a SharedMemory object with the parameter create=True and a size equal to the size of array a.
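A minimal sketch of these two steps, using Python’s multiprocessing.shared_memory module:

```python
import numpy as np
from multiprocessing import shared_memory

# A source array and a shared memory block big enough to hold it.
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)
shm = shared_memory.SharedMemory(create=True, size=a.nbytes)
```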

Now we can create another array b with the same size and shape, placed in the shared space by passing the parameter buffer=shm.buf. We can then copy the contents of a into b and perform our addition operation.
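Continuing the sketch, b is simply a NumPy array view backed by the shared buffer:

```python
# An array of the same shape and dtype, backed by the shared buffer.
b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)
b[:] = a[:]  # copy the contents of a into shared memory
```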


To check the types of a and b, and the name of the shared memory block we created, we can use the following commands:
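For example (the block name shown is a placeholder; the one printed on your machine will differ):

```python
print(type(a))   # <class 'numpy.ndarray'>
print(type(b))   # <class 'numpy.ndarray'>
print(shm.name)  # e.g. 'psm_26002_26999'; other processes attach by this name
```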

For the addition operation, we can open a new terminal, attach to the shared memory block by name, and add a and b to obtain c. The same method works for multiplication, and the concept can be extended to vectors and tensors as well.
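A sketch of what the second terminal might look like, assuming the same shape and dtype as before and substituting the block name printed earlier:

```python
import numpy as np
from multiprocessing import shared_memory

# Attach to the block by the name printed in the first terminal
# ('psm_26002_26999' is a placeholder; substitute your own value).
existing = shared_memory.SharedMemory(name="psm_26002_26999")
b = np.ndarray((2, 3), dtype=np.int64, buffer=existing.buf)

a = np.array([[10, 20, 30], [40, 50, 60]], dtype=np.int64)
c = a + b  # the addition reads b straight out of shared memory
print(c)

existing.close()  # detach; call shm.unlink() in the creating process to free it
```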

Decision Trees

In a decision tree, a new decision is made at every level, and once a split is made, the computation in each resulting subtree is independent of the others. This gives us the flexibility to grow the different subtrees on different cores and extend the power of parallel programming. Here is an example of how different subtrees can be assigned to different cores:

Source: author
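To make the idea concrete, here is a toy sketch: the splitting rule below is a hypothetical stand-in for a real one (a proper implementation would search for the best split), but it shows how the two subtrees under a split, which share no data, can each be grown in a separate worker process:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def grow(X, y, depth=0, max_depth=3):
    # Toy tree builder: always splits on one feature's mean, just to
    # make the recursion concrete.
    if depth >= max_depth or len(y) == 0 or len(np.unique(y)) == 1:
        return {"leaf": int(np.bincount(y, minlength=2).argmax()) if len(y) else 0}
    feat = depth % X.shape[1]
    thr = float(X[:, feat].mean())
    mask = X[:, feat] <= thr
    return {"feat": feat, "thr": thr,
            "left": grow(X[mask], y[mask], depth + 1, max_depth),
            "right": grow(X[~mask], y[~mask], depth + 1, max_depth)}

if __name__ == "__main__":
    X = np.random.rand(200, 4)
    y = (X[:, 0] > 0.5).astype(int)
    thr = float(X[:, 0].mean())
    mask = X[:, 0] <= thr
    # The two subtrees under the root share no data, so each one can be
    # grown in its own worker process, i.e., on a different core.
    with ProcessPoolExecutor(max_workers=2) as pool:
        left = pool.submit(grow, X[mask], y[mask], 1)
        right = pool.submit(grow, X[~mask], y[~mask], 1)
        tree = {"feat": 0, "thr": thr,
                "left": left.result(), "right": right.result()}
    print(tree)
```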

In the case of a Random Forest, we can similarly train different trees on different cores and then combine their outputs. Cores C1, C2, …, C6 can be used to train trees T1, T2, …, T6, and so on.
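With scikit-learn this is a one-parameter change: setting n_jobs=-1 fits the forest’s trees across all available cores, since each tree trains independently. A minimal example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# n_jobs=-1 asks scikit-learn to fit the individual trees on all
# available cores; each tree is trained independently of the others.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```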

To make parallel processing faster and more efficient, we can use a pool of processes: once the processes are created, they are reused again and again, so we don’t have to create a new process for each new task. Creating a process from scratch costs time and resources, so a pool reduces both.
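A minimal sketch using Python’s multiprocessing.Pool, where task is a hypothetical stand-in for real work:

```python
from multiprocessing import Pool

def task(x):
    # Stand-in for any CPU-bound job, e.g. training one tree.
    return x * x

if __name__ == "__main__":
    # Four worker processes are created once and reused for every task,
    # avoiding the cost of spawning a new process per job.
    with Pool(processes=4) as pool:
        results = pool.map(task, range(100))
    print(results[:5])
```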

Productionization of ML Systems

Parallelism in production is in huge demand because more and more people have access to the web. With so much traffic arriving every second, a server may receive thousands or even millions of requests at the same time.

Parallelizing the processing of these requests is essential; otherwise, the load may break the system. For example, suppose we are running an e-commerce website that receives requests from many users at once. To handle this kind of scenario we can use tools such as Gunicorn, a Python WSGI HTTP server for UNIX that uses a pre-fork worker model.

Source: https://medium.com/@nhudinhtuan/gunicorn-worker-types-practice-advice-for-better-performance-7a299bb8f929

This WSGI server takes all the requests from users, checks which worker process is idle at that moment, and hands the request to that worker to compute the response, which is then sent back to the user. All of this concurrent request/response management is handled by the tool.
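As a minimal illustration (the file and handler names are placeholders, not from the original article), a bare WSGI app served by four Gunicorn workers might look like:

```python
# app.py: a minimal WSGI application.
# Run with, for example:
#   gunicorn --workers 4 --bind 0.0.0.0:8000 app:app
# Gunicorn pre-forks 4 worker processes and hands each incoming
# request to whichever worker is idle.
def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello from a Gunicorn worker\n"]
```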

Conclusion

There is no end to what you can accomplish with parallel processing in machine learning if you have the right data, an understanding of how the algorithms are implemented, and the ambition to apply them.

If I missed an important detail or you wish to add anything to this blog post, please feel free to ping me. I look forward to hearing your feedback; that’s how we learn 🤗
