The 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing
Co-located with the 25th IEEE International Symposium on High-Performance Computer Architecture HPCA 2019
Workshop Objective
As artificial intelligence and other forms of cognitive computing continue to proliferate into new domains, many forums for dialogue and knowledge sharing have emerged. In the proposed workshop, the primary focus is on the exploration of energy efficient techniques and architectures for cognitive computing and machine learning, particularly for applications and systems running at the edge. For such resource constrained environments, performance alone is never sufficient, requiring system designers to carefully balance performance with power, energy, and area (overall PPA metric).
The goal of this workshop is to provide a forum for researchers who are exploring novel ideas in the field of energy efficient machine learning and artificial intelligence for a variety of applications. We also hope to provide a solid platform for forging relationships and exchanging ideas between industry and academia through discussions and active collaborations.
Call for Papers
A new wave of intelligent computing, driven by recent advances in machine learning and cognitive algorithms coupled with process technology and new design methodologies, has the potential to usher in unprecedented disruption in the way conventional computing solutions are designed and deployed. These new and innovative approaches often provide an attractive and efficient alternative not only in terms of performance but also power, energy, and area. This disruption is easily visible across the whole spectrum of computing systems, ranging from low-end mobile devices to large-scale data centers and servers.
A key class of these intelligent solutions is providing real-time, on-device cognition at the edge to enable many novel applications including vision and image processing, language translation, autonomous driving, malware detection, and gesture recognition. Naturally, these applications have diverse requirements for performance, energy, reliability, accuracy, and security that demand a holistic approach to designing the hardware, software, and intelligence algorithms to achieve the best power, performance, and area (PPA).
Topics for the Workshop
- Architectures for the edge: IoT, automotive, and mobile
- Approximation, quantization, and reduced-precision computing
- Hardware/software techniques for sparsity
- Neural network architectures for resource constrained devices
- Neural network pruning, tuning, and automatic architecture search
- Novel memory architectures for machine learning
- Communication/computation scheduling for better performance and energy
- Load balancing and efficient task distribution techniques
- Exploring the interplay between precision, performance, power and energy
- Exploration of new and efficient applications for machine learning
- Characterization of machine learning benchmarks and workloads
- Performance profiling and synthesis of workloads
- Simulation and emulation techniques, frameworks and platforms for machine learning
- Power, performance and area (PPA) based comparison of neural networks
- Verification, validation and determinism in neural networks
- Efficient on-device learning techniques
- Security, safety and privacy challenges and building secure AI systems
08:45 - 09:00
Introduction and Opening Remarks
09:00 - 10:00
Quantizing Deep Convolutional Networks for Efficient Inference
Raghuraman Krishnamoorthi, Facebook
We present an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations. We discuss different quantization schemes and show that simple techniques provide very good performance (4x reduction in model size, 2x speedup on CPUs) for classification use cases, with a 1-2% accuracy drop.
Modeling quantization during training can provide further improvements, reducing the gap to floating point to 1% at 8-bit precision. Quantization-aware training also allows for reducing the precision of weights to four bits with accuracy losses ranging from 2% to 10%, with higher accuracy drop for smaller networks.
We recommend that per-channel quantization of weights and per-layer quantization of activations be the preferred quantization scheme for hardware acceleration and kernel optimization. We also propose that future processors and hardware accelerators for optimized inference support:
- Precisions of 4, 8, and 16 bits for computation
- Per-channel quantization of weights
- Per-layer selection of bit widths for weights and activations
- On-the-fly weight compression techniques for memory bandwidth efficiency
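To make the per-channel recommendation concrete, here is a minimal sketch, not the speaker's implementation, that compares per-tensor and per-channel symmetric 8-bit quantization of a small weight matrix. The weight values, the `QMAX` constant, and all helper names are hypothetical and chosen only for illustration; the sketch assumes a symmetric scheme (no zero point) and treats each row as one output channel.

```python
from typing import List, Tuple

QMAX = 127  # assumed symmetric signed 8-bit range: [-127, 127]

Matrix = List[List[float]]


def symmetric_scale(values: List[float], qmax: int = QMAX) -> float:
    """Scale that maps the largest magnitude in `values` onto qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values: List[float], scale: float, qmax: int = QMAX) -> List[int]:
    """Round to the nearest integer step and clamp to the signed range."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def quantize_per_tensor(weights: Matrix) -> Tuple[List[List[int]], List[float]]:
    """One scale shared by every channel (row) of the tensor."""
    scale = symmetric_scale([w for row in weights for w in row])
    return [quantize(row, scale) for row in weights], [scale] * len(weights)


def quantize_per_channel(weights: Matrix) -> Tuple[List[List[int]], List[float]]:
    """A separate scale for each output channel (row)."""
    scales = [symmetric_scale(row) for row in weights]
    return [quantize(row, s) for row, s in zip(weights, scales)], scales


def channel_errors(weights: Matrix, q_rows: List[List[int]],
                   scales: List[float]) -> List[float]:
    """Largest absolute reconstruction error in each channel."""
    return [
        max(abs(w - q * s) for w, q in zip(row, q_row))
        for row, q_row, s in zip(weights, q_rows, scales)
    ]


# Hypothetical weights: two output channels with very different ranges.
weights = [[0.01, -0.02, 0.015], [1.0, -2.0, 1.5]]

pt_q, pt_scales = quantize_per_tensor(weights)
pc_q, pc_scales = quantize_per_channel(weights)
pt_err = channel_errors(weights, pt_q, pt_scales)
pc_err = channel_errors(weights, pc_q, pc_scales)

print("per-tensor  max error per channel:", pt_err)
print("per-channel max error per channel:", pc_err)
```

With a single shared scale, the small-range channel is squeezed into only a few integer levels, while a per-channel scale lets it use the full 8-bit range. Assuming a 32-bit floating-point baseline, storing each weight in 8 bits is also consistent with the 4x model size reduction quoted in the abstract.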
Raghuraman Krishnamoorthi is a software engineer on the PyTorch team at Facebook, where he leads the effort to develop and optimize quantized deep networks for inference. Prior to that, he was part of the TensorFlow team at Google, working on quantization for mobile inference as part of TensorFlow Lite.
From 2001 to 2017, Raghu was at Qualcomm Research, working on several generations of wireless technologies. His work experience also includes computer vision for AR, ultra-low-power always-on vision, and hardware/software co-design for inference on mobile platforms. He is an inventor on more than 90 issued and filed patents. Raghu has a master's degree in EE from the University of Illinois, Urbana-Champaign and a bachelor's degree from the Indian Institute of Technology, Madras.
10:00 - 11:00
Efficient Machine Learning Architectures
T. N. Vijaykumar, Purdue University
Advances in machine learning (ML) are resulting in highly accurate recognition (e.g., image and speech recognition). ML models, however, impose high computational demands during both training and inference, requiring efficient architectures. The models' computations are fine-grained, regular, and highly parallel, have high data reuse, and use low-precision arithmetic for inference (e.g., int8). Modern ML architectures (e.g., GPGPUs, TPU, and FPGA-based) exploit these characteristics to achieve high performance and energy efficiency.
Recently, ML models have been shown to be sparse, prompting creative proposals for sparse architectures. Emerging technology trends of processing-in/near-memory match some of the ML workloads well, providing an opportunity for architectural innovation based on these technologies. In this talk, I will explore these exciting aspects of machine learning architectures.
T. N. Vijaykumar is a Professor in the School of Electrical and Computer Engineering at Purdue University. His research interests are in computer architecture targeting machine learning architectures, secure high-performance microprocessors, and verifiable architectures. He is also interested in hardware for data center networks and software-programmable microfluidics. His work has been recognized with an NSF CAREER Award in 1999 and IEEE Micro's Top Picks in 2003 and 2005. He is listed in the International Symposium on Computer Architecture (ISCA) Hall of Fame at http://research.cs.wisc.edu/arch/www/iscabibhall. With his colleagues, he received the first prize in the 2009 Burton D. Morgan Business Plan Competition for a business plan on commercializing software-programmable lab-on-a-chip technology. He received a Ph.D. in computer science from the University of Wisconsin-Madison in 1997.
11:00 - 12:00
Paper Session #1
Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Mobile CPUs
Partha Maji, Andrew Mundy, Ganesh Dasika, Jesse Beu, Matthew Mattina, Robert Mullins
University of Cambridge and ARM ML Research
On Merging MobileNets for Efficient Multitask Inference
Cheng-En Wu, Yi-Ming Chan and Chu-Song Chen
Institute of Information Science, Academia Sinica
13:30 - 14:15
Tensilica DNA 100 Processor: A High-Performance, Power-Efficient DNN Processor for On-Device Inference
Megha Daga, Cadence
Deep learning is influencing not only the technology itself but also our everyday lives. With the increasing demand for mobile artificial intelligence (AI), conventional hardware solutions are struggling because of their low energy efficiency on such power-hungry applications. For the past few years, dedicated DNN inference accelerators have been in the spotlight. However, with the rising emphasis on privacy and personalization, the ability to learn on mobile platforms is becoming the second hurdle for “on-device AI.” The Cadence® Tensilica® DNA 100 Processor IP is the first deep neural-network accelerator (DNA) AI processor IP to deliver both high performance and power efficiency across a full range of compute from 0.5 TeraMAC (TMAC) to 100 TMACs. As a result, the DNA 100 processor is well suited for on-device neural network inference applications spanning autonomous vehicles (AVs), ADAS, surveillance, robotics, drones, augmented reality (AR)/virtual reality (VR), smartphones, smart home, and IoT.
Megha Daga works at Cadence Design Systems, Inc. as Sr. Manager, Product Marketing and Management in the AI group. Megha’s focus and passion is to research the latest trends and requirements in AI and to create industry-leading solutions on Cadence AI IPs. Megha enjoys learning from customers’ experiences and fellow researchers in AI. Her R&D background coupled with her current marketing role gives her a unique perspective on the AI industry.
14:15 - 15:00
Beyond IPS: Toward A Wholistic Measure of Machine Learning Performance
Saurabh Tangri, Intel
Saurabh Tangri is a senior SW architect at Intel and leads AI enabling efforts across Intel HW for Microsoft solutions. His focus area is to make AI accessible and performant for everyone in a seamless manner. He has been with Intel for nearly 15 years.
15:30 - 17:00
Paper Session #2
Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim
Farzad Farshchi, Qijing Huang and Heechul Yun
University of Kansas, University of California, Berkeley
Bootstrapping Deep Neural Networks from Approximate Image Processing Pipelines
Sek Chai, Kilho Son and Jesse Hostetler
NNBench-X: A Benchmarking Methodology for Neural Network Accelerator Designs
Xinfeng Xie, Xing Hu, Peng Gu, Shuangchen Li, Yu Ji and Yuan Xie
University of California, Santa Barbara
17:00 - 17:30
Hardware Acceleration Opportunities in Bioinformatics and Computational Biology
Leonid Yavits, Technion
Advances in genomics have triggered a revolution in healthcare and our understanding of life. Recent years have seen an exponential increase in genomic data, far outpacing Moore’s Law. Coupled with the prohibitively high computational costs of bioinformatics tasks, this presents a challenge but also a great opportunity for hardware acceleration.
I will describe a typical genomic assembly pipeline and discuss the latest developments in the field of DNA sequencing, with an emphasis on hardware acceleration opportunities. Afterwards, I will make a brief excursion into the world of existing bioinformatics accelerators. I will conclude with insights from the Accelerator Architecture for Computational Biology and Bioinformatics (AACBB) 2019 workshop.
Leonid Yavits received his MSc (1996) and PhD in Electrical Engineering (2015) from the Technion, Israel Institute of Technology. After completing the MSc program, he co-founded VisionTech, where he co-designed the world’s first single-chip MPEG2 codec. Following VisionTech’s acquisition by Broadcom, he managed Broadcom Israel R&D and co-developed a number of video compression products. Later, Leonid co-founded Horizon Semiconductors, where he co-designed a Set Top Box-on-chip for cable and satellite TV. Horizon’s Set Top Box-on-chip was among the world’s earliest heterogeneous MPSoCs.
Leonid is a postdoctoral research fellow in Electrical Engineering at the Technion. He has co-authored a number of patents and research papers. His research interests include non-von Neumann computer architectures, processing in memory and resistive-memory-based computing, and architectures for computational biology and bioinformatics. Leonid’s research work has earned several awards, among them IEEE Computer Architecture Letters Best Paper Awards for 2015 and 2017 and best poster awards at ISC High Performance in 2017 and the ACM/IEEE Supercomputing Conference in 2018.