The 4th Workshop on Energy Efficient Machine Learning and Cognitive Computing
Co-located with the 46th International Symposium on Computer Architecture (ISCA 2019)
Workshop Objective
As artificial intelligence and other forms of cognitive computing continue to proliferate into new domains, many forums for dialogue and knowledge sharing have emerged. This workshop focuses on energy efficient techniques and architectures for cognitive computing and machine learning, particularly for applications and systems running at the edge. In such resource constrained environments, performance alone is never sufficient; system designers must carefully balance performance with power, energy, and area (the overall PPA metric).
The goal of this workshop is to provide a forum for researchers who are exploring novel ideas in the field of energy efficient machine learning and artificial intelligence for a variety of applications. We also hope to provide a solid platform for forging relationships and exchange of ideas between the industry and the academic world through discussions and active collaborations.
Call for Papers
A new wave of intelligent computing, driven by recent advances in machine learning and cognitive algorithms coupled with process technology and new design methodologies, has the potential to usher in unprecedented disruption in the way conventional computing solutions are designed and deployed. These new and innovative approaches often provide an attractive and efficient alternative not only in terms of performance but also power, energy, and area. This disruption is easily visible across the whole spectrum of computing systems -- ranging from low end mobile devices to large scale data centers and servers.
A key class of these intelligent solutions provides real-time, on-device cognition at the edge, enabling many novel applications including vision and image processing, language translation, autonomous driving, malware detection, and gesture recognition. Naturally, these applications have diverse requirements for performance, energy, reliability, accuracy, and security that demand a holistic approach to designing the hardware, software, and intelligence algorithms to achieve the best power, performance, and area (PPA).
Topics for the Workshop
- Architectures for the edge: IoT, automotive, and mobile
- Approximation, quantization, and reduced precision computing
- Hardware/software techniques for sparsity
- Neural network architectures for resource constrained devices
- Neural network pruning, tuning, and automatic architecture search
- Novel memory architectures for machine learning
- Communication/computation scheduling for better performance and energy
- Load balancing and efficient task distribution techniques
- Exploring the interplay between precision, performance, power and energy
- Exploration of new and efficient applications for machine learning
- Characterization of machine learning benchmarks and workloads
- Performance profiling and synthesis of workloads
- Simulation and emulation techniques, frameworks and platforms for machine learning
- Power, performance and area (PPA) based comparison of neural networks
- Verification, validation and determinism in neural networks
- Efficient on-device learning techniques
- Security, safety and privacy challenges and building secure AI systems
09:00 - 09:10
Introduction and Opening Remarks
09:10 - 10:00
Efficient Deep Learning: Quantizing Models Without Using Re-training
Harris Teague, Qualcomm AI Research
In this talk we’ll cover post-training quantization techniques that can significantly improve model accuracy at 8-bit precision. These techniques are especially useful when training or fine-tuning is not possible, a case that arises frequently in commercial applications. No training pipeline, optimized hyperparameters, or full training datasets are needed. We show the effectiveness of these techniques for popular models used for inference on resource constrained devices.
Harris Teague is Principal Engineer/Manager at Qualcomm and leads the Platform Systems group at Qualcomm AI Research. Prior to focusing on machine learning, he worked on a range of wireless projects including Bluetooth, WCDMA, ultra-wideband, and an OFDMA-based precursor to LTE called UMB. Harris is an inventor on over 60 granted US patents. He holds a BS degree in Aerospace Engineering from Virginia Tech, and MS and PhD in Aerospace from Stanford University.
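Post-training quantization of the kind this talk addresses maps trained floating-point weights to 8-bit integers without any re-training. As a point of reference for what such techniques improve upon, here is a minimal symmetric per-tensor quantizer in NumPy; this is only an illustrative baseline, and the accuracy-recovery methods discussed in the talk are not reproduced here.

```python
import numpy as np

def quantize_per_tensor(w, num_bits=8):
    """Symmetric per-tensor quantization: map float weights to signed ints."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for 8-bit
    scale = np.max(np.abs(w)) / qmax         # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_per_tensor(w)
# Round-to-nearest bounds the per-weight error by half a quantization step.
err = np.max(np.abs(w - dequantize(q, scale)))
```

Per-channel scales and range-clipping heuristics, common refinements in practice, reduce the error further at the same bit width.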
10:00 - 10:50
Machine Learning at Scale
Carole-Jean Wu, Facebook
Machine learning systems are being widely deployed in production datacenter infrastructure and over billions of edge devices. This talk seeks to address key system design challenges when scaling machine learning solutions to billions of people. What are key similarities and differences between cloud and edge infrastructure? The talk will conclude with open system research directions for deploying machine learning at scale.
Carole-Jean Wu is a Research Scientist at Facebook AI Infrastructure Research. She is also a tenured Associate Professor of CSE at Arizona State University. Carole-Jean’s research focuses on computer and system architectures. More recently, her research has pivoted to designing systems for machine learning. She is the lead author of “Machine Learning at Facebook: Understanding Inference at the Edge,” which presents the unique design challenges faced when deploying ML solutions at scale to the edge, from billions of smartphones to Facebook’s virtual reality platforms. Carole-Jean received her M.A. and Ph.D. from Princeton and B.Sc. from Cornell.
10:50 - 11:10
11:10 - 12:00
Enabling Continuous Learning through Synaptic Plasticity in Hardware
Tushar Krishna, Georgia Tech
Ever since modern computers were invented, the dream of creating artificial intelligence (AI) has captivated humanity. We are fortunate to live in an era when, thanks to deep learning (DL), computer programs have paralleled, and in many cases even surpassed, human-level accuracy in tasks like visual perception and speech synthesis. However, we are still far away from realizing general-purpose AI. The problem lies in the fact that the development of supervised learning based DL solutions today is mostly open loop. A typical DL model is created by hand-tuning the deep neural network (DNN) topology by a team of experts over multiple iterations, followed by training over petabytes of labeled data. Once trained, the DNN provides high accuracy for the task at hand; if the task changes, however, the DNN model needs to be re-designed and re-trained before it can be deployed. A general-purpose AI system, in contrast, needs to have the ability to constantly interact with the environment and learn by adding and removing connections within the DNN autonomously, just like our brain does. This is known as synaptic plasticity.
In this talk we present our research efforts towards enabling general-purpose AI. First, we present GeneSys (MICRO 2018), a HW-SW prototype of a closed loop learning system for continuously evolving the structure and weights of a DNN for the task at hand using genetic algorithms, providing 100-10000x higher performance and energy-efficiency over state-of-the-art embedded and desktop CPU and GPU systems. Next, we present a DNN accelerator substrate called MAERI (ASPLOS 2018), built using light-weight, non-blocking, reconfigurable interconnects, that supports efficient mapping of regular and irregular DNNs with arbitrary dataflows, providing ~100% utilization of all compute units, resulting in 3X speedup and energy-efficiency over state-of-the-art DNN accelerators.
Tushar Krishna is an Assistant Professor in the School of Electrical and Computer Engineering at Georgia Tech. He has a Ph.D. in Electrical Engineering and Computer Science from MIT (2014), a M.S.E in Electrical Engineering from Princeton University (2009), and a B.Tech in Electrical Engineering from the Indian Institute of Technology (IIT) Delhi (2007). Before joining Georgia Tech in 2015, Dr. Krishna spent a year as a post-doctoral researcher at Intel, Massachusetts.
Dr. Krishna’s research spans computer architecture, interconnection networks, networks-on-chip (NoC) and deep learning accelerators - with a focus on optimizing data movement in modern computing systems. He has 42 publications in leading conferences and journals, which have amassed over 5000 citations to date. Three of his papers have been selected for IEEE Micro’s Top Picks from Computer Architecture, one more received an honorable mention, and two have won best paper awards. He has received the National Science Foundation (NSF) CRII award, a Google Faculty Award, and a Facebook Faculty Award.
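GeneSys, as described above, uses genetic algorithms to evolve both the structure and the weights of a DNN in a closed hardware-software loop. The sketch below shows only the core evolutionary idea, not the GeneSys system itself: it evolves the weights of a fixed tiny network on a toy XOR task using mutation and truncation selection, with no gradients. All sizes and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: learn XOR with a fixed 2-4-1 network whose weights are evolved
# by a genetic algorithm (mutation + truncation selection), no backprop.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def forward(params, x):
    w1, b1, w2, b2 = params
    h = np.tanh(x @ w1 + b1)
    return 1 / (1 + np.exp(-(h @ w2 + b2)))

def fitness(params):
    return -np.mean((forward(params, X).ravel() - y) ** 2)   # negative MSE

def random_params():
    return [rng.normal(0, 1, (2, 4)), rng.normal(0, 1, 4),
            rng.normal(0, 1, (4, 1)), rng.normal(0, 1, 1)]

def mutate(params, sigma=0.1):
    return [p + rng.normal(0, sigma, p.shape) for p in params]

pop = [random_params() for _ in range(50)]
init_best = max(fitness(p) for p in pop)
for gen in range(200):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                                       # truncation selection
    pop = parents + [mutate(p) for p in parents for _ in range(4)]

best = max(pop, key=fitness)
```

Because the top individuals survive each generation unchanged (elitism), the best fitness is monotonically non-decreasing; GeneSys additionally evolves topology and maps these genetic operators onto dedicated hardware.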
12:00 - 13:30
13:30 - 15:00
Paper Session #1
Run-Time Efficient RNN Compression for Inference on Edge Devices
Urmish Thakker, Dibakar Gope, Jesse Beu, Ganesh Dasika and Matthew Mattina
ARM ML Research
PyRTLMatrix: an Object-Oriented Hardware Design Pattern for Prototyping ML Accelerators
Dawit Aboye, Dylan Kupsh, Maggie Lim, Jacqueline Mai, Deeksha Dangwal, Diba Mirza and Timothy Sherwood
UC Santa Barbara
Accelerated CNN Training Through Gradient Approximation
Ziheng Wang, Sree Harsha Nelaturu and Saman Amarasinghe
Massachusetts Institute of Technology and SRM Institute of Science and Technology
15:00 - 15:10
15:10 - 16:00
Structured and Systematic Approach for Energy Efficient DNN Acceleration
Xuehai Qian, University of Southern California
Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the size of DNNs continues to grow, it is critical to improve the energy efficiency and performance while maintaining accuracy. In this talk, I first present our principled approaches to performing model compression and acceleration using structured matrices. Compared to unstructured pruning, our methods (CirCNN and PermDNN) achieve significant model storage and computation reduction while maintaining accuracy. Thanks to the regular structure, the accelerators can achieve better performance and energy efficiency compared to the state-of-the-art designs. I will also present a unified solution framework for both unstructured and structured pruning and quantization based on Alternating Direction Method of Multipliers (ADMM). It ensures high solution quality while guaranteeing solution feasibility, consistently outperforming previous results. Finally, I will present HyPar, a systematic approach to search the best tensor partition for a given multi-layer DNN with an accelerator array. It optimizes performance and energy efficiency by reducing data movement between accelerators. We believe that the structured and systematic approach and algorithm/hardware co-design are crucial for designing energy efficient DNN accelerators.
Xuehai Qian is an Assistant Professor at the University of Southern California. His research interests include domain-specific systems and architectures with a focus on machine learning and graph processing, performance tuning and resource management of cloud systems, and parallel computer architecture. He received his Ph.D. from the University of Illinois at Urbana-Champaign and was a postdoc at UC Berkeley. He is the recipient of the W. J. Poppelbaum Memorial Award at UIUC, the NSF CRII and CAREER Awards, and the inaugural ACSIC (American Chinese Scholar In Computing) Rising Star Award.
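ADMM-based pruning of the kind the talk describes alternates between a loss-driven update of the weights and a Euclidean projection onto the constraint set (sparsity or quantization levels). The snippet below sketches only the projection step for an unstructured sparsity constraint, with a toy stand-in for the loss-gradient weight update; it is not the formulation from the talk, and the update coefficients are arbitrary.

```python
import numpy as np

def project_sparse(w, k):
    """Euclidean projection onto {W : ||W||_0 <= k}: zero all but the k
    largest-magnitude entries (the Z-update in ADMM-based pruning).
    Structured variants instead project away whole rows/columns/blocks."""
    flat = np.abs(w).ravel()
    if k >= flat.size:
        return w.copy()
    thresh = np.partition(flat, -k)[-k]        # k-th largest magnitude
    return np.where(np.abs(w) >= thresh, w, 0.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))                  # "trained" weights (toy)
U = np.zeros_like(W)                           # scaled dual variable
for _ in range(10):
    Z = project_sparse(W + U, 64)              # Z-update: exact projection
    U = U + W - Z                              # dual update
    W = 0.7 * W + 0.3 * (Z - U)                # toy stand-in for the loss-driven W-update
Z = project_sparse(W + U, 64)
```

The appeal of the ADMM splitting is that the hard combinatorial constraint is handled entirely by this closed-form projection, while the weight update remains ordinary differentiable training.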
16:00 - 16:50
Balancing Efficiency and Flexibility for DNN Acceleration
Vivienne Sze, MIT
There has been a significant amount of research on the topic of efficient processing of DNNs, from the design of efficient DNN algorithms to the design of efficient DNN accelerators. The wide range of techniques used for efficient DNN algorithm design has resulted in a more diverse set of DNNs; this creates a new challenge for DNN accelerators, as they now need to be sufficiently flexible to support a wide range of DNN workloads efficiently. However, many existing DNN accelerators rely on certain properties of the DNN which cannot be guaranteed (e.g., fixed weight sparsity, large number of channels, large batch size). In this talk, we will briefly discuss recent techniques that have been used to design efficient DNN algorithms and important properties to consider when applying these techniques. We will then present a systematic approach called Eyexam to identify the sources of inefficiencies in DNN accelerator designs for different DNN workloads. Finally, we will introduce a flexible accelerator called Eyeriss v2 that is computationally efficient across a wide range of diverse DNNs.
Vivienne Sze is an Associate Professor at MIT in the Electrical Engineering and Computer Science Department. Her research interests include energy-aware signal processing algorithms, and low-power circuit and system design for portable multimedia applications, including computer vision, deep learning, autonomous navigation, and video processing/coding. Prior to joining MIT, she was a Member of Technical Staff in the R&D Center at TI, where she designed low-power algorithms and architectures for video coding. She also represented TI in the JCT-VC committee of the ITU-T and ISO/IEC standards bodies during the development of High Efficiency Video Coding (HEVC), which received a Primetime Engineering Emmy Award. She is a co-editor of the book “High Efficiency Video Coding (HEVC): Algorithms and Architectures” (Springer, 2014).
Prof. Sze received the B.A.Sc. degree from the University of Toronto in 2004, and the S.M. and Ph.D. degree from MIT in 2006 and 2010, respectively. In 2011, she received the Jin-Au Kong Outstanding Doctoral Thesis Prize in Electrical Engineering at MIT. She is a recipient of the 2018 Facebook Faculty Award, the 2018 & 2017 Qualcomm Faculty Award, the 2018 & 2016 Google Faculty Research Award, the 2016 AFOSR Young Investigator Research Program (YIP) Award, the 2016 3M Non-Tenured Faculty Award, the 2014 DARPA Young Faculty Award, the 2007 DAC/ISSCC Student Design Contest Award, and a co-recipient of the 2017 CICC Outstanding Invited Paper Award, the 2016 IEEE Micro Top Picks Award and the 2008 A-SSCC Outstanding Design Award.
For more information about research in the Energy-Efficient Multimedia Systems Group at MIT visit: http://www.rle.mit.edu/eems/