If you have just started with machine learning, you probably follow these steps to train a model:
1. Load the data/images into a variable or a NumPy array
2. Preprocess the data, for example by normalizing images by dividing by 255.0
3. Train the machine learning model on the loaded data.
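In code, this one-shot workflow looks roughly like the sketch below. The .npy file names and the small Keras model are hypothetical placeholders, assuming TensorFlow/Keras and a 28x28 image dataset:

import numpy as np
import tensorflow as tf

# 1. Load the full dataset into memory (hypothetical .npy files;
#    substitute your own data source)
images = np.load("images.npy")   # e.g. shape (num_samples, 28, 28)
labels = np.load("labels.npy")   # e.g. shape (num_samples,)

# 2. Preprocess: scale pixel values from [0, 255] to [0, 1]
images = images.astype("float32") / 255.0

# 3. Train a model on the fully loaded data
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(images, labels, batch_size=32, epochs=5)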
Advantages and disadvantages of loading full data
This approach helps the model train fast, since all the data is already sitting in RAM. But it has several drawbacks:
1. Loading the data takes a huge amount of time. Since everything is loaded at once, training cannot start until the full dataset is in memory.
2. RAM has a limited size, but machine learning models often use huge datasets. It becomes difficult to fit such a dataset into limited memory.
3. At any moment, only one small batch of the full dataset (defined by the batch size) is used to train the model. The rest of the data sits idle in RAM, wasting resources just to hold it.
Advantages and disadvantages of using data chunks
The limitations of loading all the data at once can be overcome if, instead, we load only a small chunk of data, train on it, and then load the next chunk.
But with this naive approach, the model sits idle while a chunk is loading, and the loading process sits idle while the model trains on the loaded chunk. As a result, this process is not efficient either, as the sequential loop in the sketch below illustrates.
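A minimal sketch of this chunked approach, again assuming TensorFlow/Keras and hypothetical pre-split .npy chunk files:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Hypothetical helper: reads one pre-split chunk of the dataset from disk
def load_chunk(i):
    x = np.load(f"images_chunk_{i}.npy").astype("float32") / 255.0
    y = np.load(f"labels_chunk_{i}.npy")
    return x, y

NUM_CHUNKS = 10
for epoch in range(5):
    for i in range(NUM_CHUNKS):
        x, y = load_chunk(i)                  # the model sits idle while this loads
        model.fit(x, y, epochs=1, verbose=0)  # the disk sits idle while this trains

Note that the load step and the train step run strictly one after the other, so one resource is always waiting on the other.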
Data Pipeline: Advantages
A data pipeline solves the drawbacks of both of the above processes. It loads only a small chunk of data at a time, so there is no burden on the RAM, and loading takes relatively little time. While training is going on, the pipeline loads and preprocesses the next batch in the background. This way, a data pipeline makes machine learning code more efficient by making proper use of the hardware resources. Its advantages, with a code sketch after the list:
1. Data loading does not take much time, since only a small part of the data is loaded at once, and training starts on small chunks without requiring the full dataset to be loaded first
2. Since only a small chunk of data is loaded at once, only a small amount of RAM is needed to hold it
3. The process uses the hardware resources efficiently
4. Since data preprocessing runs while the model is training, no process stays idle
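The article does not name a specific library, but TensorFlow's tf.data API is a common way to build such a pipeline. A minimal sketch, with hypothetical image paths and labels:

import tensorflow as tf

# Hypothetical file paths and matching integer labels; adapt to your dataset
image_paths = ["img_0.png", "img_1.png"]
labels = [0, 1]

def load_and_preprocess(path, label):
    image = tf.io.read_file(path)                # read one image from disk
    image = tf.io.decode_png(image, channels=3)  # decode into a tensor
    image = tf.image.resize(image, [224, 224])
    return tf.cast(image, tf.float32) / 255.0, label

dataset = (
    tf.data.Dataset.from_tensor_slices((image_paths, labels))
    .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # parallel preprocessing
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # prepare the next batch while the current one trains
)

# model.fit(dataset, epochs=5)  # only small batches are ever held in memory

The prefetch call is what overlaps the work: while the model trains on the current batch, the pipeline is already loading and preprocessing the next one.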
That's why, when starting out with machine learning programming, you might load the data in a single shot. But for real-world applications with large datasets, you need a data pipeline to use your resources efficiently.