Deep learning without multiplication using AdderNet

How does AdderNet work without requiring any multiplications in deep learning and overcome mathematical complexity of multiplication by replacing it with addition?
Cinque Terre
Emroj Hossain
4 min read
Mon Mar 22 2021

AdderNet provides a way to overcome the requirement of multiplication in a deep neural network by replacing with addition.

Visualization of features in AdderNets and CNNs

Multiplication operation in a computer system is relatively complex and require more computation power compared to the addition operation. But widely used deep learning models use convolution neural network (CNN) and that require large multiplication between float values to find the cross-correlation that measures the similarity between the input features and filters. These large multiplications add to the computation cost of the machine learning algorithm. To solve this problem H. Chen et al have introduced an adder network (AdderNet) to tackle the problem for large multiplications in CNNs and other deep neural networks.

An overview of AdderNets

AdderNets are very similar to convolution neural networks (CNNs). The CNN uses convolutions filter to obtain features from the input data or image. More effective the filter is to find the feature of the input more better the CNNs is. CNNs are trained to find the effective filter to find the input features and the difference between the input features and the convolution filter is calculated to determine how good the filter is to determine the input features. Now to find the difference CNNs use L2-norm that requires massive multiplications between float values and make the computation cost higher. But L2-norm is not the only way to determine the difference between two vectors. (See Different types of vector norms used in machine learning). AdderNets use another type of norm known as L1 norm of the vector to find the difference between input features and the convolutions filters. For better performance. As fully connected layers can be considered as a convolution layers AdderNet has also replaced the multiplications in the fully connected layer with subtraction. Since, L1-norm use subtraction (complement of addition) between the vector components, the computation cost for the AdderNet is very low compared to CNNs. In addition to that special back-propagation approach by investigation of full precision gradient and adaptive learning rate strategy to train the AdderNet to make it more efficient.

Difference between AdderNet and CNN

AdderNets are very similar to CNNs but there is the only difference in the way to finding the dissimilarity between input features and convolution filters. There is only a few fundamental difference between them.

AdderNets use subtractions which are just a complement of additions but CNNs use multiplications

AdderNets use L1 norm and CNNs uses L2 norm to find the difference between to vectors

To illustrate the difference in mathematical terms let us consider an input filter F with kernel size d, input channel c_in, and output channel c_out. The input feature is X with height H, width W. Then the similarity between filter and input feature is defined in terms of output feature (Y) given by

$$Y(m, n, t)=\sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}}{S(X(m+i,n+j, k), F(i, j, k, t))}$$

For a cross-correlation, which is used by CNNs S becomes convolution operation of the form

$$S(x, y)=x\times y$$
$$Y(m, n, t)=\sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}}{X(m+i,n+j, k)\times F(i, j, k, t)}$$

On the other hand, since, AdderNets L1 norm. The S is of the following form

$$S(x, y)= -(x-y)$$

So, the output feature (Y) for AdderNets are of the following form

$$Y(m, n, t)=-\sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}}{|X(m+i,n+j, k)- F(i, j, k, t)|}$$

Subtraction can be converted to addition by taking the complement.

Since CNNs use multiplications between float values they have more computational cost then AdderNets

The CNNs are more used then the AdderNets in this point of time.

Algorithm of AdderNet

AdderNet is a feed-forward network that uses a back-propagation algorithm to train the network. The algorithm for AdderNet is described as below

Input: Consider an uninitialized network N with training set X, corresponding labels Y, global learning rate γ and hyper-parameter η

1. Select a batch of {x, y} form X and Y

2. Run AdderNet on the mini-batch x → N(x)

3. Calculate full-precision derivative for adder filter.

4. Use the chain rule to generate the gradients of the parameters in N

5. Calculate the adaptive learning rate α for each adder layer

6. Update the parameters using stochastic gradient descent

7. repeat step 1 to 6 until the network converges

Result: The result of this training process is a well-trained adder network with almost no multiplications involved.

Experiment verification (benchmark) of AdderNet

Experiments were done H Chen et al, to verify AdderNet on the standard datasets such as MNIST, CIFAR, ImageNet, etc to test the performance of AdderNet and the results were quite remarkable. For all the experiments, the performance of the AdderNets was very close to similar architecture CNNs and can be seen here. The performance difference was <2% for all the result and for many datasets it was very close to CNNs accuracy. For example, the convolutional neural network achieves a 99.4% accuracy with ∼435K multiplications and ∼435K additions. By replacing the multiplications in convolution with additions, the proposed AdderNet achieves a 99.4% accuracy, which is the same as that of CNNs, with ∼870K additions and almost no multiplication.I

Doesn't AdderNet use multiplication at all?

AdderNet uses addition/subtraction for finding cross-correlation between convolution filter and features. As fully connected layers can be considered as a convolution layers AdderNet has also replaced multiplication in the fully connected layer with subtraction.

But does it means that the AdderNet doesn't use multiplications at all?

The answer is No. Though cross-correlation and l1 distance both measure the similarity between input features and the filters, there is some difference between their outputs. The output of a convolution filter, as a weighted summation of values in the input feature map, can be positive or negative, but the output of an adder filter is always negative. Hence, batch normalization is used normalized the output to the appropriate range and this batch normalization process uses multiplications. Although the batch normalization layer involves multiplications, its computational cost is significantly lower than that of the convolutional layers and can be omitted

A successful demonstration AdderNet to replace multiplication by addition/subtraction operation and its results benchmark datasets suggest that it will be a very useful deep learning network architecture for high performance fast deep learning application where computation cost need to be lowered.

Different types of vector norms used in machine learning

Vector norm or distance is the length or magnitude of the vector and they are of many types such as L0, L1 (Manhattan) norm, L2 norm, L-infinity or max norm.