OctoML Announces Faster Machine Learning Model Inferencing on Apple's New M1 Than Apple's Core ML 4 Provides
OctoML, the MLOps automation company for superior model performance, portability and productivity, demonstrated better model performance on Apple's M1 chip than Apple's own Core ML inference engine. OctoML's results showed lower model latency than any of Apple's own software stacks, ranging from a 30% improvement over Apple's latest Core ML 4 inference engine to a 13x improvement over Apple's standard Core ML 3. All comparisons were based on BERT-base, a deep learning model widely used for natural language processing tasks, and were conducted on both the Mac Mini's CPU and GPU.

"Apple is great at showcasing their newest products for the most cutting-edge ML uses," said Luis Ceze, co-founder and CEO of OctoML. "But in practice, machine learning engineers can struggle to achieve good performance and may spend months trying to manually debug issues. In contrast, OctoML's work shows how an automated model optimization service can effortlessly add new hardware and immediately deliver superior model performance."

Apple's latest Core ML 4 produced 139 milliseconds of latency on the CPU and 59 milliseconds on the GPU. In contrast, OctoML's work delivered model latency of 108 milliseconds on the CPU and 42 milliseconds on the GPU. These gains represent a 22% improvement on the CPU and nearly a 30% improvement on the GPU, and they are especially notable because they were produced automatically, only weeks after Apple's public launch of the M1 chip.
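The article does not include OctoML's benchmark harness, but latency figures like these are commonly gathered with Apache TVM's built-in timer. The sketch below assumes a BERT-base module already compiled and exported by TVM (see the tuning sketch later in this article); the file name "bert_m1.dylib", the input names, and the sequence length are illustrative assumptions, not OctoML's actual setup.

```python
import numpy as np
import tvm
from tvm.contrib import graph_executor

# Load a module previously compiled with relay.build() and exported via
# export_library(); "bert_m1.dylib" is a placeholder name, not OctoML's.
lib = tvm.runtime.load_module("bert_m1.dylib")
dev = tvm.metal(0)  # swap in tvm.cpu(0) to reproduce the CPU numbers
module = graph_executor.GraphModule(lib["default"](dev))

# BERT-base takes token ids and an attention mask; batch size 1 and
# sequence length 128 are assumed here.
seq_len = 128
module.set_input("input_ids", np.zeros((1, seq_len), dtype="int64"))
module.set_input("attention_mask", np.ones((1, seq_len), dtype="int64"))

# time_evaluator runs the compiled graph repeatedly and reports wall time,
# which is how end-to-end latency is usually quoted.
timer = module.module.time_evaluator("run", dev, number=10, repeat=30)
print("mean latency: %.1f ms" % (timer().mean * 1000))
```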

Other performance comparisons included Keras with MLCompute and TensorFlow with GraphDef. With Keras, the M1's latency was 579 milliseconds on the CPU and 1,767 milliseconds on the GPU; with TensorFlow, it was 512 milliseconds on the CPU and 543 milliseconds on the GPU.

"How did we get better results than Apple's Core ML 4 in just a few weeks? Two reasons," said Bing Xu, Principal Engineer at OctoML. "First, the Apache TVM compiler uses a machine learning-based auto-scheduler to search out CPU and GPU code optimization. Second, the Apache TVM compiler is able to automatically fuse qualified subgraphs and directly generate code. Even the best engineers can't anticipate the interactions between model architectures, computation workloads and hardware target resource availability."

Published Wednesday, December 16, 2020 3:20 PM by David Marshall