The 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing

Co-located with the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019)

Friday, December 13, 2019
Vancouver, BC, Canada
Room: TBD
Full Day

Acceptance Notification Date Extended: The notification date for papers and posters has been extended to October 01, 2019.

Workshop Objective

As artificial intelligence and other forms of cognitive computing continue to proliferate into new domains, many forums for dialogue and knowledge sharing have emerged. The primary focus of this workshop is the exploration of energy efficient techniques and architectures for cognitive computing and machine learning, particularly for applications and systems running at the edge. For such resource-constrained environments, performance alone is never sufficient, requiring system designers to carefully balance performance with power, energy, and area (the overall PPA metric).

The goal of this workshop is to provide a forum for researchers who are exploring novel ideas in the field of energy efficient machine learning and artificial intelligence for a variety of applications. We also hope to provide a solid platform for forging relationships and exchanging ideas between industry and academia through discussions and active collaborations.

Call for Papers

A new wave of intelligent computing, driven by recent advances in machine learning and cognitive algorithms coupled with process technology and new design methodologies, has the potential to usher in unprecedented disruption in the way conventional computing solutions are designed and deployed. These new and innovative approaches often provide an attractive and efficient alternative not only in terms of performance but also power, energy, and area. This disruption is easily visible across the whole spectrum of computing systems, ranging from low-end mobile devices to large-scale data centers and servers.

A key class of these intelligent solutions provides real-time, on-device cognition at the edge to enable many novel applications, including vision and image processing, language translation, autonomous driving, malware detection, and gesture recognition. Naturally, these applications have diverse requirements for performance, energy, reliability, accuracy, and security that demand a holistic approach to designing the hardware, software, and intelligence algorithms to achieve the best power, performance, and area (PPA).

Topics for the Workshop

  • Architectures for the edge: IoT, automotive, and mobile
  • Approximation, quantization, and reduced-precision computing
  • Hardware/software techniques for sparsity
  • Neural network architectures for resource constrained devices
  • Neural network pruning, tuning, and automatic architecture search
  • Novel memory architectures for machine learning
  • Communication/computation scheduling for better performance and energy
  • Load balancing and efficient task distribution techniques
  • Exploring the interplay between precision, performance, power and energy
  • Exploration of new and efficient applications for machine learning
  • Characterization of machine learning benchmarks and workloads
  • Performance profiling and synthesis of workloads
  • Simulation and emulation techniques, frameworks and platforms for machine learning
  • Power, performance and area (PPA) based comparison of neural networks
  • Verification, validation and determinism in neural networks
  • Efficient on-device learning techniques
  • Security, safety and privacy challenges and building secure AI systems

Grand Keynote

Yann LeCun, New York University and Facebook

Yann LeCun is VP & Chief AI Scientist at Facebook and Silver Professor at NYU affiliated with the Courant Institute of Mathematical Sciences & the Center for Data Science. He was the founding Director of Facebook AI Research and of the NYU Center for Data Science. He received an Engineering Diploma from ESIEE (Paris) and a PhD from Sorbonne Université. After a postdoc in Toronto, he joined AT&T Bell Labs in 1988, and AT&T Labs in 1996 as Head of Image Processing Research. He joined NYU as a professor in 2003 and Facebook in 2013. His interests include AI, machine learning, computer perception, robotics, and computational neuroscience. He is a member of the National Academy of Engineering and the recipient of the 2018 ACM Turing Award (with Geoffrey Hinton and Yoshua Bengio) for “conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing”.

Keynote

Cheap, Fast, and Low Power Deep Learning: I need it now!

Edward Delp, Purdue University

In this talk I will describe the need for low-power machine learning systems. I will motivate this by describing several current projects at Purdue University that need energy efficient deep learning; in some cases, deploying these methods will not be possible without lower-power solutions. The applications include precision farming, health care monitoring, and edge-based surveillance.

Edward J. Delp is currently The Charles William Harrison Distinguished Professor of Electrical and Computer Engineering and Professor of Biomedical Engineering at Purdue University. His research interests include image and video processing, image analysis, computer vision, machine learning, image and video compression, multimedia security, medical imaging, multimedia systems, communication and information theory. Dr. Delp is a Life Fellow of the IEEE. In 2004 Dr. Delp received the Technical Achievement Award from the IEEE Signal Processing Society for his work in image and video compression and multimedia security. In 2008 Dr. Delp received the Society Award from the IEEE Signal Processing Society.

Keynote

Efficient Computing for AI and Robotics

Vivienne Sze, MIT

Computing near the sensor is preferred over the cloud due to privacy and/or latency concerns for a wide range of applications including robotics/drones, self-driving cars, smart Internet of Things, and portable/wearable electronics. However, at the sensor there are often stringent constraints on energy consumption and cost in addition to the throughput and accuracy requirements of the application. In this talk, we will describe how joint algorithm and hardware design can be used to reduce energy consumption while delivering real-time and robust performance for applications including deep learning, computer vision, autonomous navigation/exploration and video/image processing. We will show how energy-efficient techniques that exploit correlation and sparsity to reduce compute, data movement and storage costs can be applied to various tasks including image classification, depth estimation, super-resolution, localization and mapping.

Vivienne Sze is an Associate Professor at MIT in the Electrical Engineering and Computer Science Department. Her research interests include energy-aware signal processing algorithms, and low-power circuit and system design for portable multimedia applications, including computer vision, deep learning, autonomous navigation, and video processing/coding. Prior to joining MIT, she was a Member of Technical Staff in the R&D Center at TI, where she designed low-power algorithms and architectures for video coding. She also represented TI on the JCT-VC committee of the ITU-T and ISO/IEC standards bodies during the development of High Efficiency Video Coding (HEVC), which received a Primetime Engineering Emmy Award. She is a co-editor of the book entitled “High Efficiency Video Coding (HEVC): Algorithms and Architectures” (Springer, 2014).

Prof. Sze received the B.A.Sc. degree from the University of Toronto in 2004, and the S.M. and Ph.D. degree from MIT in 2006 and 2010, respectively. In 2011, she received the Jin-Au Kong Outstanding Doctoral Thesis Prize in Electrical Engineering at MIT. She is a recipient of the 2018 Facebook Faculty Award, the 2018 & 2017 Qualcomm Faculty Award, the 2018 & 2016 Google Faculty Research Award, the 2016 AFOSR Young Investigator Research Program (YIP) Award, the 2016 3M Non-Tenured Faculty Award, the 2014 DARPA Young Faculty Award, the 2007 DAC/ISSCC Student Design Contest Award, and a co-recipient of the 2018 Symposium on VLSI Circuits Best Student Paper Award, the 2017 CICC Outstanding Invited Paper Award, the 2016 IEEE Micro Top Picks Award and the 2008 A-SSCC Outstanding Design Award.

For more information about research in the Energy-Efficient Multimedia Systems Group at MIT visit: http://www.rle.mit.edu/eems/

Invited Talk

Adaptive Multi-Task Neural Networks for Efficient Inference

Rogerio Feris, IBM Research

Very deep convolutional neural networks have shown remarkable success in many computer vision tasks, yet their computational expense limits their impact in domains where fast inference is essential. While there has been significant progress on model compression and acceleration, most methods rely on a one-size-fits-all network, where the same set of features is extracted for all images or tasks, no matter their complexity. In this talk, I will first describe an approach called BlockDrop, which learns to dynamically choose which layers of a deep network to execute during inference, depending on the image complexity, so as to best reduce total computation without degrading prediction accuracy. Then, I will show how this approach can be extended to design compact multi-task networks, where a different set of layers is executed depending on the task complexity, and the level of feature sharing across tasks is automatically determined to maximize both the accuracy and efficiency of the model. Finally, I will conclude the talk by presenting an efficient multi-scale neural network model, which achieves state-of-the-art results in terms of accuracy and FLOPS reduction on standard benchmarks such as the ImageNet dataset.

Rogerio Schmidt Feris is the head of computer vision and multimedia research at IBM T.J. Watson Research Center. He joined IBM in 2006 after receiving a Ph.D. from the University of California, Santa Barbara. He has also worked as an Affiliate Associate Professor at the University of Washington and as an Adjunct Associate Professor at Columbia University. His work has not only been published in top AI conferences, but has also been integrated into multiple IBM products, including Watson Visual Recognition, Watson Media, and Intelligent Video Analytics. He currently serves as an Associate Editor of TPAMI, has served as a Program Chair of WACV 2017, and as an Area Chair of conferences such as NeurIPS, CVPR, and ICCV.

Invited Talk

Efficient Algorithms to Accelerate Deep Learning on Edge Devices

Song Han, MIT

Efficient deep learning computing requires algorithm and hardware co-design to enable specialization. However, the extra degree of freedom creates a much larger design space. We propose AutoML techniques to architect efficient neural networks. We investigate automatically designing small and fast models (ProxylessNAS), auto channel pruning (AMC), and auto mixed-precision quantization (HAQ). We demonstrate that such learning-based, automated design achieves better performance and efficiency than rule-based human design. Moreover, we shorten the design cycle by 200× compared to previous work when searching for efficient models, so that we can afford to design specialized neural network models for different hardware platforms. We accelerate computation-intensive AI applications, including the Temporal Shift Module (TSM) for efficient video recognition and PVCNN for efficient 3D recognition on point clouds. Finally, we will describe scalable distributed training and the potential security issues of efficient deep learning [1] [2].

Song Han is an assistant professor at MIT EECS. Dr. Han received his Ph.D. degree in Electrical Engineering from Stanford, advised by Prof. Bill Dally. Dr. Han’s research focuses on efficient deep learning computing. He proposed “Deep Compression” and the “EIE” accelerator, which impacted the industry. His work received best paper awards at ICLR’16 and FPGA’17. He was the co-founder and chief scientist of DeePhi Tech, which was acquired by Xilinx.

Invited Talk

Abandoning the Dark Arts: New Directions in Efficient DNN Design

Kurt Keutzer, UC Berkeley

Deep Neural Net models have provided the most accurate solutions to a very wide variety of problems in vision, language, and speech; however, the design, training, and optimization of efficient DNNs typically requires resorting to the “dark arts” of ad hoc methods and extensive hyperparameter tuning. In this talk we present our progress on abandoning these dark arts by using Differential Neural Architecture Search to guide the design of efficient DNNs and by using Hessian-based methods to guide the processes of training and quantizing those DNNs.

Kurt Keutzer’s research at the University of California, Berkeley, focuses on computational problems in Deep Learning. In particular, Kurt has worked to reduce the training time of ImageNet to minutes and, with the SqueezeNet family, to develop a family of Deep Neural Networks suitable for mobile and IoT applications.

Before joining Berkeley as a Full Professor in 1998, Kurt was CTO and SVP at Synopsys. Kurt’s contributions to Electronic Design Automation were recognized at the 50th Design Automation Conference where he was noted as a Top 10 most cited author, as an author of a Top 10 most cited paper, and as one of only three people to have won four Best Paper Awards at that conference. Kurt was named a Fellow of the IEEE in 1996.

Invited Talk

Putting the “Machine” Back in Machine Learning: The Case for Hardware-ML Model Co-design

Diana Marculescu, Carnegie Mellon University

Machine learning (ML) applications have entered and impacted our lives unlike any other technology advance from the recent past. Indeed, almost every aspect of how we live or interact with others relies on or uses ML for applications ranging from image classification and object detection, to processing multi-modal and heterogeneous datasets. While the holy grail for judging the quality of an ML model has largely been serving accuracy, and only recently its resource usage, neither of these metrics translates directly to energy efficiency, runtime, or mobile device battery lifetime. This talk will uncover the need for building accurate, platform-specific power and latency models for convolutional neural networks (CNNs) and efficient hardware-aware CNN design methodologies, thus allowing machine learners and hardware designers to identify not just the best accuracy NN configuration, but also those that satisfy given hardware constraints. Our proposed modeling framework is applicable to both high-end and mobile platforms and achieves 88.24% accuracy for latency, 88.34% for power, and 97.21% for energy prediction. Using similar predictive models, we demonstrate a novel differentiable neural architecture search (NAS) framework, dubbed Single-Path NAS, that uses one single-path over-parameterized CNN to encode all architectural decisions based on shared convolutional kernel parameters. Single-Path NAS achieves state-of-the-art top-1 ImageNet accuracy (75.62%), outperforming existing mobile NAS methods for similar latency constraints (∼80ms), and finds the final configuration up to 5,000× faster compared to prior work. Combined with our quantized CNNs (Flexible Lightweight CNNs, or FLightNNs) that customize precision level in a layer-wise fashion and achieve almost iso-accuracy at 5-10x energy reduction, such a modeling, analysis, and optimization framework is poised to lead to true co-design of hardware and ML models, orders of magnitude faster than the state of the art, while satisfying both accuracy and latency or energy constraints.

Diana Marculescu is the David Edward Schramm Professor of Electrical and Computer Engineering at Carnegie Mellon University and the incoming Chair of the Department of Electrical and Computer Engineering at the University of Texas at Austin (starting December 2019). Diana is the Founding Director of the College of Engineering Center for Faculty Success at Carnegie Mellon University (since 2015) and has served as Associate Department Head for Academic Affairs in Electrical and Computer Engineering (2014-2018). She received the Dipl.Ing. degree in computer science from the Polytechnic University of Bucharest, Bucharest, Romania (1991), and the Ph.D. degree in computer engineering from the University of Southern California, Los Angeles, CA (1998). Her research interests include energy- and reliability-aware computing, hardware-aware machine learning, and computing for sustainability and natural science applications. Diana was a recipient of the National Science Foundation Faculty Career Award (2000-2004), the ACM SIGDA Technical Leadership Award (2003), the Carnegie Institute of Technology George Tallman Ladd Research Award (2004), and several best paper awards. She was an IEEE Circuits and Systems Society Distinguished Lecturer (2004-2005) and the Chair of the Association for Computing Machinery (ACM) Special Interest Group on Design Automation (2005-2009). Diana chaired several conferences and symposia in her area and is currently an Associate Editor for IEEE Transactions on Computers. She was selected as an ELATE Fellow (2013-2014), and is the recipient of an Australian Research Council Future Fellowship (2013-2017), the Marie R. Pistilli Women in EDA Achievement Award (2014), and the Barbara Lazarus Award from Carnegie Mellon University (2018). Diana is an IEEE Fellow and an ACM Distinguished Scientist.

Invited Talk

Advances and Prospects for In-memory Computing

Naveen Verma, Princeton University

Edge AI applications retain the need for high-performing inference models, while driving platforms beyond their limits of energy efficiency and throughput. Digital hardware acceleration, enabling 10-100x gains over general-purpose architectures, is already widely deployed, but is ultimately restricted by data-movement and memory accessing that dominates deep-learning computations. In-memory computing, based on both SRAM and emerging memory, offers fundamentally new tradeoffs for overcoming these barriers, with the potential for 10x higher energy efficiency and area-normalized throughput demonstrated in recent designs. But, those tradeoffs instate new challenges, especially affecting scaling to the level of computations required, integration in practical heterogeneous architectures, and mapping of diverse software. This talk examines those tradeoffs to characterize the challenges. It then explores recent research that provides promising paths forward, making in-memory computing more of a practical reality than ever before.

Naveen Verma received the B.A.Sc. degree in Electrical and Computer Engineering from the University of British Columbia (UBC), Vancouver, Canada, in 2003, and the M.S. and Ph.D. degrees in Electrical Engineering from MIT in 2005 and 2009, respectively. Since July 2009 he has been a faculty member at Princeton University. His research focuses on advanced sensing systems, exploring how systems for learning, inference, and action planning can be enhanced by algorithms that exploit new sensing and computing technologies. This includes research on large-area, flexible sensors, energy-efficient statistical-computing architectures and circuits, and machine-learning and statistical-signal-processing algorithms. Prof. Verma has served as a Distinguished Lecturer of the IEEE Solid-State Circuits Society, and currently serves on the technical program committees for ISSCC, VLSI Symp., DATE, and IEEE Signal-Processing Society (DISPS).

Invited Talk

Algorithm-Accelerator Co-Design for Neural Network Specialization

Zhiru Zhang, Cornell University

In recent years, machine learning (ML) with deep neural networks (DNNs) has been widely deployed in diverse application domains. However, the growing complexity of DNN models, the slowdown of technology scaling, and the proliferation of edge devices are driving a demand for higher DNN performance and energy efficiency. ML applications have shifted from general-purpose processors to dedicated hardware accelerators in both academic and commercial settings. In line with this trend, there has been an active body of research on both algorithms and hardware architectures for neural network specialization.

This talk presents our recent investigation into DNN optimization and low-precision quantization, using a co-design approach featuring contributions to both algorithms and hardware accelerators. First, we review static network pruning techniques and show a fundamental link between group convolutions and circulant matrices – two previously disparate lines of research in DNN compression. Then we discuss channel gating, a dynamic, fine-grained, and trainable technique for DNN acceleration. Unlike static approaches, channel gating exploits input-dependent dynamic sparsity at run time. This results in a significant reduction in compute cost with a minimal impact on accuracy. Finally, we present outlier channel splitting, a technique to improve DNN weight quantization by removing outliers from the weight distribution without retraining.

Zhiru Zhang is an Associate Professor in the School of ECE at Cornell University. His current research investigates new algorithms, design methodologies, and automation tools for heterogeneous computing. His research has been recognized with a Google Faculty Research Award (2018), the DAC Under-40 Innovators Award (2018), the Rising Professional Achievement Award from the UCLA Henry Samueli School of Engineering and Applied Science (2018), a DARPA Young Faculty Award (2015), the IEEE CEDA Ernest S. Kuh Early Career Award (2015), an NSF CAREER Award (2015), the Ross Freeman Award for Technical Innovation from Xilinx (2012), and multiple best paper awards and nominations. Prior to joining Cornell, he was a co-founder of AutoESL, a high-level synthesis start-up later acquired by Xilinx.

Invited Talk

Configurable Cloud-Scale DNN Processor for Real-Time AI

Bita Rouhani, Microsoft

Growing computational demands from deep neural networks (DNNs), coupled with diminishing returns from general-purpose architectures, have led to a proliferation of Neural Processing Units (NPUs). In this talk, we will discuss Project Brainwave, a production-scale system for real-time (low latency and high throughput) DNN inference. The Brainwave NPU is reconfigurable and deployed at scale in production. This reconfigurability, in turn, eliminates costly silicon updates to accommodate evolving state-of-the-art models while enabling orders-of-magnitude performance improvements compared to highly optimized software solutions.

Bita Rouhani is a senior researcher at Microsoft Azure AI and Advanced Architecture group. Bita received her Ph.D. in Computer Engineering from the University of California San Diego in 2018. Bita’s research interests include algorithm and hardware co-design for succinct and assured deep learning, real-time data analysis, and safe machine learning. Her work has been published at top-tier computer architecture, electronic design, machine learning, and security conferences and journals including ISCA, ASPLOS, ISLPED, DAC, ICCAD, FPGA, FCCM, SIGMETRICS, S&P magazine, and ACM TRETS.

Invited Talk

Jinwon Lee, Qualcomm

Jinwon Lee is a Senior Staff Engineer at the Qualcomm AI Research lab, where he designs state-of-the-art deep learning models for edge devices. He received his Ph.D. in Computer Science from the Korea Advanced Institute of Science and Technology (KAIST) in 2009. Jinwon is currently focused on deep learning model optimizations for edge devices, including kernel/graph compiler optimization, model compression/quantization, and HW-aware neural architecture search. Previously, he developed a deep learning-based on-device speech enhancement/recognition engine for Qualcomm SoCs. He also developed low-power context-aware engines for mobile use cases based on GPS, WiFi, and motion sensors.

Paper Session #1

AutoSlim: Towards One-Shot Architecture Search for Channel Numbers

Jiahui Yu and Thomas Huang
University of Illinois at Urbana-Champaign

YOLO Nano: a Highly Compact You Only Look Once Convolutional Neural Network for Object Detection

Alexander Wong, Mahmoud Famuori, Mohammad Javad Shafiee, Francis Li, Brendan Chwyl and Jonathan Chung
University of Waterloo and DarwinAI Corp

Progressive Stochastic Binarization of Deep Networks

David Hartmann and Michael Wand
Johannes Gutenberg-University of Mainz

Exploring Bit-Slice Sparsity in Deep Neural Networks for Efficient ReRAM-Based Deployment

Jingyang Zhang, Huanrui Yang, Fan Chen, Yitu Wang and Hai Li
Duke University and Fudan University

Trained Rank Pruning for Efficient Deep Neural Networks

Yuhui Xu, Yuxi Li, Shuai Zhang, Wei Wen, Botao Wang, Wenrui Dai, Yingyong Qi, Yiran Chen, Weiyao Lin and Hongkai Xiong

Q8BERT: Quantized 8Bit BERT

Ofir Zafrir, Guy Boudoukh, Peter Izsak and Moshe Wasserblat
Intel AI Lab

Improving Efficiency in Neural Network Accelerator using Operands Hamming Distance Optimization

Meng Li, Yilei Li, Pierce Chuang, Liangzhen Lai and Vikas Chandra
Facebook

Poster Session #1

Bit Efficient Quantization for Deep Neural Networks

Prateeth Nayak, David Zhang and Sek Chai
SRI International and Latent AI

Supported-BinaryNet: Bitcell Array-based Weight Supports for Dynamic Accuracy-Latency Trade-offs in SRAM-based Binarized Neural Network

Shamma Nasrin, Srikanth Ramakrishna, Theja Tulabandhula and Amit Trivedi
University of Illinois at Chicago

Dynamic Channel Execution: on-device Learning Method for Finding Compact Networks

Simeon Spasov and Pietro Lio
University of Cambridge

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond and Thomas Wolf
Hugging Face

QPyTorch: A Low-Precision Arithmetic Simulation Framework

Tianyi Zhang, Zhiqiu Lin, Guandao Yang and Christopher De Sa
Cornell University

Separable Convolutions for Multiscale Dense Networks for Efficient Anytime Image Classification

Sven Peter, Nasim Rahaman, Ferran Diego and Fred Hamprecht
Heidelberg University and Telefonica Research

Paper Session #2

Training Compact Models for Low Resource Entity Tagging using Pre-trained Language Models

Peter Izsak, Shira Guskin and Moshe Wasserblat
Intel AI Lab

Algorithm-hardware Co-design for Deformable Convolution

Qijing Huang, Dequan Wang, Yizhao Gao, Yaohui Cai, Zhen Dong, Bichen Wu, Kurt Keutzer and John Wawrzynek
UC Berkeley, Peking University and University of Chinese Academy of Science

Instant Quantization of Neural Networks using Monte Carlo Methods

Gonçalo Mordido, Matthijs Van Keirsbilck and Alexander Keller
Hasso Plattner Institute and NVIDIA

Spoken Language Understanding on the Edge

Alaa Saade, Alice Coucke, Alexandre Caulier, Joseph Dureau, Adrien Ball, Théodore Bluche, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril and Maël Primet
Snips

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond and Thomas Wolf
Hugging Face

Energy-Aware Neural Architecture Optimization With Splitting Steepest Descent

Dilin Wang, Lemeng Wu, Meng Li, Vikas Chandra and Qiang Liu
UT Austin and Facebook

Doubly Sparse: Sparse Mixture of Sparse Experts for Efficient Softmax Inference

Shun Liao, Ting Chen, Tian Lin, Denny Zhou and Chong Wang
University of Toronto, Google and ByteDance

Poster Session #2

Pushing the limits of RNN Compression

Urmish Thakker, Igor Fedorov, Jesse Beu, Dibakar Gope, Chu Zhou, Ganesh Dasika and Matthew Mattina
Arm ML Research and AMD Research

On hardware-aware probabilistic frameworks for resource constrained embedded applications

Laura Isabel Galindez Olascoaga, Wannes Meert, Nimish Shah, Guy Van den Broeck and Marian Verhelst
KU Leuven and UCLA

Neural Networks Weights Quantization: Target None-retraining Ternary (TNT)

Tianyu Zhang, Lei Zhu, Qian Zhao and Kilho Shin
WeBank, Harbin Engineering University, University of Hyogo and Gakushuin University

Regularized Binary Network Training

Sajad Darabi, Mouloud Belbahri, Matthieu Courbariaux and Vahid Partovi Nia
UCLA, Université de Montréal and Huawei

Neuron ranking – an informed way to condense convolutional neural networks architecture

Kamil Adamczewski and Mijung Park
Max Planck Institute for Intelligent Systems

Fully Quantized Transformer for Improved Translation

Gabriele Prato, Ella Charlaix and Mehdi Rezagholizadeh
Université de Montréal and Huawei

EMC2 Model Compression Challenge (EMCC)

Deep learning has recently pushed the boundaries of the state of the art in many computer vision tasks. However, existing deep learning models are both computationally and memory intensive, making their deployment difficult on devices with low compute and memory resources. To fit these emerging models on such devices, novel compression techniques are needed that do not significantly decrease model accuracy.

The EMC2 Model Compression Challenge (EMCC) aims to identify the best technology in deep learning model compression. To win a prize in EMCC, a submission will be evaluated against the two metrics below, one per track:

  • Achieve the highest accuracy within the target model size. The submission will not be evaluated if the model size is outside the target range.
  • Achieve the smallest model size within the target accuracy. The submission will not be evaluated if the accuracy is outside the target range.

A participant (or a team) can submit a single model in TensorFlow (https://www.tensorflow.org/) or PyTorch (https://pytorch.org/). Final scores will be computed after the submission window closes.
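
The challenge page does not prescribe a serialization format, so the following is only an illustrative sketch of how a PyTorch entry might be packaged as a TorchScript trace that can be reloaded without the original model definition; the stand-in MobileNetV2 model, file name, and input resolution are placeholders rather than challenge requirements.

```python
import torch
import torchvision

# Stand-in for a participant's compressed model (placeholder, not a real entry).
model = torchvision.models.mobilenet_v2(pretrained=False)
model.eval()

# Trace with a dummy ImageNet-sized input so the serialized model can be
# reloaded later without its Python class definition.
example_input = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)
traced.save("submission.pt")  # placeholder file name

# Sanity check: reload the traced model and run a forward pass.
reloaded = torch.jit.load("submission.pt")
with torch.no_grad():
    logits = reloaded(example_input)
print(logits.shape)  # torch.Size([1, 1000])
```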

There are three tracks; each participant can submit to any or all of them.

  • ImageNet Classification: This category focuses on image classification models.
  • COCO Object Detection: This category focuses on object detection models.
  • PASCAL Object Segmentation: This category focuses on object segmentation models.

1. Data

Each track uses the corresponding standard public dataset: ImageNet 2012 for classification, COCO 2017 for object detection, and PASCAL 2012 for object segmentation.

2. Evaluation

Submissions are evaluated based on the following metrics:

  • Highest accuracy with a target model size. The target size is based on the state-of-the-art DL model’s size with an allowed 5% additional size budget, as listed below (a size-check sketch follows this list):
    • ImageNet 2012 Classification: Top-1 accuracy for image classification under a target model size of 3.5 MB, estimated from an 8-bit quantized MobileNetV2 model.
    • COCO 2017 Object Detection: COCO metrics, with a target model size of 6.2 MB, estimated from an 8-bit quantized MobileNetV2-SSD model.
    • PASCAL 2012 Object Segmentation: mIoU, with a target model size of 2.1 MB, based on an 8-bit quantized MobileNetV2-DeepLab model.
  • Smallest model size with a target accuracy. The target accuracy is based on the state-of-the-art DL model’s accuracy with an allowed 5% accuracy margin, as listed below:
    • ImageNet 2012 Classification: Smallest model size with a target Top-1 accuracy of 70%, estimated from an 8-bit quantized MobileNetV2 model.
    • COCO 2017 Object Detection: Smallest model size with a target mAP of 26% (COCO metrics), estimated from an 8-bit quantized MobileNetV2-SSD model.
    • PASCAL 2012 Object Segmentation: Smallest model size with a target mIoU of 70%, based on an 8-bit quantized MobileNetV2-DeepLab model.
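
The rules above give byte-size targets but do not state exactly how size is measured; one plausible reading is the size of the serialized model file. The sketch below is a hypothetical pre-submission check under that assumption; the track keys, helper name, and the use of 10^6 bytes per MB are illustrative choices, with the targets copied from the list above.

```python
import os

# Target sizes from the evaluation list above, in bytes (assuming 1 MB = 10^6 bytes).
SIZE_TARGETS = {
    "imagenet_classification": 3.5e6,
    "coco_detection": 6.2e6,
    "pascal_segmentation": 2.1e6,
}

def within_size_budget(model_path, track):
    """Hypothetical helper: compare a serialized model's on-disk size to the track target."""
    size_bytes = os.path.getsize(model_path)
    target = SIZE_TARGETS[track]
    print(f"{model_path}: {size_bytes / 1e6:.2f} MB (target {target / 1e6:.2f} MB)")
    return size_bytes <= target

# Example usage with the placeholder file from the export sketch above:
# within_size_budget("submission.pt", "imagenet_classification")
```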

3. Benchmark Environment and Input

Submissions will be evaluated using TensorFlow 1.14.0 or PyTorch 1.1.0. The inputs are ImageNet/COCO/PASCAL images.

4. Output and Model Conversion

  • ImageNet Classification: The output must be a tensor encoding probabilities of the 1000 classes (labels are available here); a shape-check sketch follows this list.
  • COCO Object Detection: The output must be the bounding boxes and probabilities of the 80 classes.
  • PASCAL Object Segmentation: The output must be the segmentation mask.
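
As an illustration of the classification output format only, the sketch below checks that a reloaded PyTorch submission produces one probability per ImageNet class; the preprocessing, batch size, and file name are assumptions, not challenge specifications.

```python
import torch

# Reload the traced submission from the export sketch above (placeholder name).
model = torch.jit.load("submission.pt")
model.eval()

# Stand-in for a batch of preprocessed ImageNet images.
batch = torch.randn(8, 3, 224, 224)
with torch.no_grad():
    scores = model(batch)

# The rules ask for class probabilities, so apply softmax if the model emits logits.
probs = torch.nn.functional.softmax(scores, dim=1)

assert probs.shape == (8, 1000), "expected one probability per ImageNet class"
assert torch.allclose(probs.sum(dim=1), torch.ones(8), atol=1e-4)
```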

5. Timeline and Submission

Please see the submission page for dates and details.

Workshop Sponsors

Qualcomm and Cadence