d417: TPU performance, Machine Learning Chip

TPU performance: https://cloudplatform.googleblog.com/2017/04/quantifying-the-performance-of-the-TPU-our-first-machine-learning-chip.html

Erik Jonker on the Tensor Processing Unit: “One of the first dedicated machine learning chips performs much better than ordinary GPUs and CPUs. Specialised hardware is an important part of progress in AI & machine learning.” [Source]

(ISCA Paper, PDF) Tensor Processing Unit / TPU Performance Analysis In-Datacenter: https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view (includes an overview of TPU vs. CPU and GPU vs. TPU results on 8-bit integers)

Abstract

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU’s deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, …) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters’ NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X – 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X – 80X higher. Moreover, using the GPU’s GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
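
A quick way to see how the abstract's 65,536 MACs relate to the 92 TOPS peak figure is a back-of-envelope calculation. The sketch below assumes a 700 MHz clock and counts each multiply-accumulate as two operations; neither number appears in the abstract, so treat both as assumptions of this example. The function name `peak_tops` is made up for illustration.

```python
# Back-of-envelope check of the peak-throughput figure quoted in the abstract.
MAC_UNITS = 65_536      # from the abstract (equals a 256 x 256 array)
OPS_PER_MAC = 2         # assumption: one multiply plus one add per MAC per cycle
CLOCK_HZ = 700e6        # assumption: 700 MHz clock, not stated in the abstract


def peak_tops(mac_units, ops_per_mac, clock_hz):
    """Return peak throughput in tera-operations per second."""
    return mac_units * ops_per_mac * clock_hz / 1e12


peak = peak_tops(MAC_UNITS, OPS_PER_MAC, CLOCK_HZ)
print(f"{peak:.2f} TOPS")  # prints "91.75 TOPS", which rounds to 92
```

Under these assumptions the result lands within rounding of the quoted peak. It is an upper bound only: the abstract itself notes low utilization for some applications.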

Index terms: DNN, MLP, CNN, RNN, LSTM, neural network, domain-specific architecture, accelerator

Discussion (Hacker News): https://news.ycombinator.com/item?id=14043059
Discussion (Reddit / Machine Learning): https://www.reddit.com/r/MachineLearning/comments/63mne2/d_quantifying_the_performance_of_the_tpu_our/
