Why should you use a data pipeline in machine learning?

Using a data pipeline lets you load a huge machine learning dataset and train a deep learning network with limited RAM.
Emroj Hossain
4 min read
Thu Feb 13 2020

If you have just started with machine learning, you probably follow these steps to train a model:

1. Load the data/images into a variable or a NumPy array

2. Preprocess the data, for example normalizing images by dividing pixel values by 255.0

3. Train the machine learning model on the loaded data (see the sketch after this list).
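In code, this load-everything-at-once workflow might look like the following minimal sketch, using Keras and the MNIST dataset as a stand-in (the model architecture and hyperparameters here are only placeholders):

```python
import numpy as np
from tensorflow import keras

# 1. Load the full dataset into NumPy arrays -- everything sits in RAM
(x_train, y_train), _ = keras.datasets.mnist.load_data()

# 2. Preprocess: scale pixel values from [0, 255] down to [0.0, 1.0]
x_train = x_train.astype("float32") / 255.0

# 3. Train a model on the fully loaded data
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=1)
```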

Advantages and disadvantages of loading the full data at once

This approach trains the machine learning model fast, since all the data is already in RAM. But it has several drawbacks:

1. It takes a huge amount of time to load the data. Since all the data is loaded at once, training doesn't start until the full dataset has been loaded.

2. RAM has a limited size, but machine learning models use huge datasets. It becomes difficult to fit such a dataset in limited memory.

3. At any moment, only a small batch of the full dataset (defined by the batch size) is used to train the model. The rest of the data just sits idle in RAM, wasting resources.

Advantages and disadvantages of using data chunks

The limitations of loading everything at once can be overcome if, instead of loading the full dataset, we load only a small chunk of data, train on it, and then load the next chunk.

But the machine learning model stays idle while the data is loading, and the loading process stays idle while the model trains on the loaded data. As a result, this process is also not efficient for machine learning.
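A minimal sketch of this chunked approach in Python, assuming the dataset has been split into hypothetical shard files on disk (shard_0.npz and so on); notice that loading and training strictly take turns, so one of them is always idle:

```python
import numpy as np

def load_in_chunks(paths, chunk_size=1000):
    """Yield normalized (images, labels) chunks from .npz shards."""
    for path in paths:
        with np.load(path) as shard:  # disk I/O: the model waits here
            images, labels = shard["images"], shard["labels"]
        for start in range(0, len(images), chunk_size):
            yield (images[start:start + chunk_size] / 255.0,
                   labels[start:start + chunk_size])

# Train chunk by chunk: while a chunk loads, training pauses,
# and while the model trains, the disk sits idle.
# for x_chunk, y_chunk in load_in_chunks(["shard_0.npz", "shard_1.npz"]):
#     model.fit(x_chunk, y_chunk, batch_size=32, epochs=1)
```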

Data Pipeline: Advantages

A data pipeline solves the drawbacks of both of the above approaches. It loads only a small chunk of data at a time, so there is no burden on the RAM, and each load takes relatively little time. While training is going on, the pipeline loads and preprocesses the next batch of data in parallel. In this way a data pipeline makes machine learning code more efficient by using the hardware resources appropriately.

To summarize the advantages:

1. Data loading does not take much time, since only a small part of the data is loaded at a time, and training starts on the first chunks without waiting for the full dataset.

2. Since only a small chunk of data is held in memory at a time, only a small amount of RAM is needed.

3. The hardware resources are used efficiently.

4. Since data preprocessing runs while the model is training, no process stays idle (see the sketch below).
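One common way to build such a pipeline is TensorFlow's tf.data API. The sketch below assumes a hypothetical folder of PNG images; the key piece is prefetch, which prepares upcoming batches in the background while the model trains on the current one:

```python
import tensorflow as tf

# Hypothetical image folder -- replace with your own data source
files = tf.data.Dataset.list_files("images/*.png")

def load_and_preprocess(path):
    # Runs in the background while the model is busy training
    raw = tf.io.read_file(path)
    image = tf.image.decode_png(raw, channels=3)
    return tf.cast(image, tf.float32) / 255.0

dataset = (files
           .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))  # overlap loading with training

# model.fit(dataset, epochs=1)  # batches arrive as they are prepared
```

Because prefetching overlaps data preparation with training, neither the model nor the input process waits on the other.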

That's why, when starting out with machine learning programming, you might load the data in a single shot. But for real-world applications with large datasets, you need a data pipeline to make efficient use of your resources.