Training vs. Inference

For many industrial applications, off-line learning is sufficient: the neural network is first trained on a set of data and then shipped to the customer, and it can be periodically taken off-line and retrained. While machine-learning researchers and engineers today would especially welcome an architecture that speeds up training, this still represents a small market. [1][2]

A year ago, the fastest machine let us cut training time from a month to 25 hours with four Maxwell GPUs. This year, training will take two hours with eight Pascal GPUs. [GTC'2016]

Papers

Cnvlutin: Ineffectual-Neuron-Free Deep Convolutional Neural Network Computing

On average, 44% of the neuron values computed at run-time in modern DNNs are zero. This work advocates a value-based approach to accelerating DNNs in hardware and presents the CNV DNN accelerator architecture.

Furthermore, CNV's design principles and the value-based approach can be applied to network training, to other hardware and software network implementations, and to other tasks such as natural language processing.
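
The value-based idea is easy to state in software. The sketch below is my own illustration in plain NumPy (not the CNV hardware pipeline) of the functional effect: the multiplications for ineffectual (zero) neuron values are skipped, and the result is unchanged.

```python
import numpy as np

def zero_skip_dot(activations, weights):
    # Only "effectual" (nonzero) neurons contribute to the dot product,
    # so the multiplications for zero activations can be skipped entirely.
    nz = np.flatnonzero(activations)
    return float(np.dot(activations[nz], weights[nz]))

a = np.array([0.0, 1.5, 0.0, 0.0, 2.0, 0.0])   # post-ReLU activations, mostly zero
w = np.random.default_rng(0).standard_normal(6)
assert np.isclose(zero_skip_dot(a, w), np.dot(a, w))
```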

ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars

This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner.

The architecture is not used for in-the-field training; it is only used for inference, which is the dominant operation in several domains (e.g., domains where training is performed once on a cluster of GPUs and those weights are deployed on millions of devices to perform billions of inferences). Adapting ISAAC for in-the-field training would require non-trivial effort and is left for future work.
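
As a rough functional picture of in-situ analog computation (my own toy model, not ISAAC's bit-serial, bit-sliced pipeline), the sketch below stores a weight matrix as conductances on a differential pair of crossbars, applies the input activations as row voltages, and reads the dot products out as per-column currents.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, size=(128, 16))   # 128 inputs x 16 output neurons
V = rng.uniform(0, 1, size=128)          # input activations applied as row voltages

G_pos = np.clip(W, 0, None)              # positive weights on one crossbar
G_neg = np.clip(-W, 0, None)             # negative weights on the other
I = G_pos.T @ V - G_neg.T @ V            # differential bitline currents read by ADCs
assert np.allclose(I, W.T @ V)           # equals the 16 dot products in one "step"
```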

A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory

This paper proposes PRIME, a novel processing-in-memory design in ReRAM-based main memory that substantially improves the performance and energy efficiency of neural network (NN) applications, benefiting from both the PIM architecture and the efficiency of ReRAM-based NN computation.

In our work, the training of the NN is done off-line, so that the inputs of each API are already known (the NN parameter file). Prior work has explored implementing training with ReRAM crossbar arrays [12], [70]–[74], and we plan to further enhance PRIME with training capability in future work.

EIE: Efficient Inference Engine on Compressed Deep Neural Network

Previously proposed ‘Deep Compression’ makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing.
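
The data structure is the interesting part: nonzero weights are stored as small indices into a shared codebook, and zero activations are skipped at run time. The CSC-like layout and names below are my own illustrative choices in the spirit of EIE, not its exact encoding.

```python
import numpy as np

# A 4x3 weight matrix stored in compressed, weight-shared form.
codebook   = np.array([-0.7, -0.2, 0.0, 0.3, 0.9], dtype=np.float32)  # shared weight values
col_ptr    = np.array([0, 2, 3, 5])     # start of each column's nonzeros
row_idx    = np.array([0, 3, 1, 0, 2])  # row position of each nonzero
weight_idx = np.array([4, 1, 3, 0, 4])  # small index into the codebook

def spmv_shared(x):
    """y = W @ x with W in compressed form; columns whose input activation
    is zero are skipped entirely (dynamic activation sparsity)."""
    y = np.zeros(4, dtype=np.float32)
    for j, xj in enumerate(x):
        if xj == 0.0:
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += codebook[weight_idx[k]] * xj
    return y

print(spmv_shared(np.array([1.0, 0.0, 2.0], dtype=np.float32)))
```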

RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision

Targeting object recognition, we shift early convolutional processing into RedEye’s analog domain, reducing the workload of the analog readout and of the computational system.

Minerva: Enabling Low-Power, High-Accuracy Deep Neural Network Accelerators

This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators.

The parameters (weights) of a DNN are fitted to data in a process called training, which typically runs on a high-performance CPU/GPU platform. One obvious solution is to implement highly customized hardware accelerators for DNN prediction.

Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks

In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations.
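
One way to picture the RS primitive: each processing element keeps one filter row stationary in its local register file and slides one ifmap row past it, producing one row of partial sums; the 2D convolution then comes from accumulating such rows across PEs. The function below is my own illustration of that 1D primitive, not the Eyeriss mapping logic.

```python
import numpy as np

def pe_row_primitive(filter_row, ifmap_row):
    # The filter row stays resident ("stationary") in the PE and is reused
    # for every sliding-window position of the ifmap row; the output is one
    # row of partial sums to be accumulated with rows from other PEs.
    R, W = len(filter_row), len(ifmap_row)
    psum_row = np.zeros(W - R + 1)
    for x in range(W - R + 1):
        psum_row[x] = np.dot(filter_row, ifmap_row[x:x + R])
    return psum_row

print(pe_row_primitive(np.array([1.0, 0.0, -1.0]), np.arange(8.0)))
```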

Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory

Neurocube [31] maps CNNs to 3D high-density high-bandwidth memory integrated with logic, forming a mesh of digital processing elements.

Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1 (link)

We introduce a method to train Binarized Neural Networks (BNNs), i.e., neural networks with binary weights and activations at run-time. At training time, the binary weights and activations are used for computing the parameter gradients. During the forward pass, BNNs drastically reduce memory size and accesses and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power efficiency.
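
The arithmetic replacement is the standard XNOR/popcount identity for ±1 vectors; the sketch below (my own, in plain NumPy rather than with real bit-packing) just checks that identity.

```python
import numpy as np

def binarize(x):
    # Deterministic run-time binarization: sign(x) in {+1, -1}.
    return np.where(x >= 0, 1, -1).astype(np.int8)

def xnor_popcount_dot(a_bits, w_bits):
    # With +1 encoded as bit 1 and -1 as bit 0, the +/-1 dot product is
    # 2 * popcount(XNOR(a, w)) - n.
    n = len(a_bits)
    xnor = (~(a_bits ^ w_bits)) & 1
    return 2 * int(xnor.sum()) - n

rng = np.random.default_rng(0)
a, w = binarize(rng.standard_normal(64)), binarize(rng.standard_normal(64))
a_bits, w_bits = (a > 0).astype(np.uint8), (w > 0).astype(np.uint8)
assert xnor_popcount_dot(a_bits, w_bits) == int(np.dot(a, w))
```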

Binaryconnect: Training deep neural networks with binary weights during propagations (link)

In the future, faster computation at both training and test time is likely to be crucial for further progress and for consumer applications on low-power devices.

We introduce BinaryConnect, a method that trains a DNN with binary weights during the forward and backward propagations, while retaining the precision of the stored weights in which the gradients are accumulated.
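
A minimal sketch of that training loop for a single linear unit, assuming a squared-error loss (the loss, learning rate, and sizes are my own illustrative choices): binary weights are used for propagation, gradients are accumulated into the stored real-valued weights, and the stored weights are clipped to [-1, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
w_real = rng.uniform(-1, 1, size=8)            # stored high-precision weights
x, target, lr = rng.standard_normal(8), 1.0, 0.01

for step in range(200):
    w_bin = np.where(w_real >= 0, 1.0, -1.0)   # binarize for propagation
    y = np.dot(w_bin, x)                       # forward pass uses binary weights
    grad_y = y - target                        # d/dy of 0.5 * (y - target)^2
    grad_w = grad_y * x                        # backward pass w.r.t. the weights
    w_real -= lr * grad_w                      # accumulate into the real weights
    w_real = np.clip(w_real, -1.0, 1.0)        # keep stored weights bounded
```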

XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks (link)

In Binary-Weight-Networks, the filters are approximated with binary values, resulting in a 32× memory saving. In XNOR-Networks, both the filters and the inputs to the convolutional layers are binary.

Once training has finished, there is no need to keep the real-valued weights, because at inference we only perform forward propagation with the binarized weights.
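
In the Binary-Weight-Network case, each real-valued filter W is approximated as αB with B = sign(W) and α = mean(|W|); the snippet below just computes that approximation (the filter shape is an arbitrary example of mine).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3, 64))   # one real-valued convolutional filter
B = np.where(W >= 0, 1.0, -1.0)       # binary filter
alpha = np.abs(W).mean()              # per-filter scaling factor
W_approx = alpha * B                  # all that needs to be kept at inference
print(float(np.linalg.norm(W - W_approx)))
```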

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding (link)

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three-stage pipeline of pruning, trained quantization, and Huffman coding that works together to reduce the storage requirements of neural networks by 35× to 49× without affecting their accuracy.
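
The first two stages are easy to sketch in a few lines. The pruning ratio, codebook size, and plain k-means below are my own illustrative choices (the paper learns these per layer and fine-tunes afterwards), and Huffman coding is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))

# Stage 1: prune small-magnitude connections (keep the largest 10% here).
mask = np.abs(W) > np.quantile(np.abs(W), 0.9)
nz = W[mask]

# Stage 2: weight sharing -- cluster the surviving weights into a 16-entry
# (4-bit) codebook with plain k-means, linearly initialized.
k = 16
centroids = np.linspace(nz.min(), nz.max(), k)
for _ in range(10):
    assign = np.abs(nz[:, None] - centroids[None, :]).argmin(axis=1)
    for c in range(k):
        if np.any(assign == c):
            centroids[c] = nz[assign == c].mean()

W_compressed = np.zeros_like(W)
W_compressed[mask] = centroids[assign]   # stored as 4-bit indices plus the codebook
```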

ImageNet Classification with Deep Convolutional Neural Networks (link)

ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.

  1. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning (ASPLOS'14), http://pages.saclay.inria.fr/olivier.temam/files/eval/CDSWWCT14.pdf
  2. DaDianNao: A machine-learning supercomputer (MICRO'14), http://novel.ict.ac.cn/ychen/pdf/DaDianNao.pdf