OctoML, the MLOps automation company for superior model performance, portability and productivity, has demonstrated better model performance on Apple's M1 chip than Apple's own Core ML inference engine. OctoML's results showed lower model latency than any of Apple's software, ranging from a nearly 30% improvement over Apple's latest Core ML 4 inference engine to a 13x improvement over standard TensorFlow. All comparisons were based on BERT-base, a deep learning model widely used for natural language processing tasks, and were conducted on both the Mac Mini's CPU and GPU.
"Apple is great at showcasing their newest products for the most cutting-edge
ML uses," said Luis Ceze, co-founder and CEO of OctoML. "But in
practice, machine learning engineers can struggle to achieve good performance and may spend months manually debugging issues. In contrast,
OctoML's work shows how an automated model optimization service can
effortlessly add new hardware and immediately deliver superior model
performance."
Apple's latest Core ML 4 delivered 139 milliseconds of latency on the CPU and 59 milliseconds on the GPU. In contrast, OctoML's work delivered model latency of 108 milliseconds on the CPU and 42 milliseconds on the GPU. These results represent a 22% improvement on the CPU and a nearly 30% improvement on the GPU, and they are especially notable because they were produced automatically, only weeks after Apple's public launch of the M1 chip.
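The percentage improvements follow directly from the reported latencies, as this minimal sketch shows (the figures are the ones cited above):

```python
# Relative improvement of OctoML's latency over Core ML 4, per the figures above.
core_ml_cpu, octoml_cpu = 139, 108  # milliseconds on the Mac Mini CPU
core_ml_gpu, octoml_gpu = 59, 42    # milliseconds on the Mac Mini GPU

cpu_gain = (core_ml_cpu - octoml_cpu) / core_ml_cpu
gpu_gain = (core_ml_gpu - octoml_gpu) / core_ml_gpu
print(f"CPU: {cpu_gain:.0%} faster, GPU: {gpu_gain:.0%} faster")
# -> CPU: 22% faster, GPU: 29% faster
```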
Other performance comparisons included Keras with ML Compute and TensorFlow with GraphDef. With Keras, BERT-base latency on the M1 was 579 milliseconds on the CPU and 1,767 milliseconds on the GPU. With TensorFlow, the M1 showed 512 milliseconds of latency on the CPU and 543 milliseconds on the GPU.
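The article does not describe the benchmarking harness behind these numbers, but a per-inference latency measurement of this kind typically looks like the following sketch. The checkpoint name, batch size, sequence length, and iteration counts here are illustrative assumptions, not OctoML's published configuration:

```python
# A minimal latency-measurement sketch for BERT-base with TensorFlow/Keras.
import time
import numpy as np
import tensorflow as tf
from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-uncased")
# Random token IDs in BERT's vocabulary; batch size 1, sequence length 128.
input_ids = tf.constant(np.random.randint(0, 30522, size=(1, 128)), dtype=tf.int32)

# Warm-up runs so one-time graph construction is not counted as latency.
for _ in range(10):
    model(input_ids)

# Time repeated forward passes and report the median per-inference latency.
samples = []
for _ in range(100):
    start = time.perf_counter()
    model(input_ids)
    samples.append((time.perf_counter() - start) * 1000)  # milliseconds
print(f"median latency: {np.median(samples):.1f} ms")
```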
"How did we get better results than Apple's Core ML 4 in just a few weeks? Two
reasons," said Bing Xu, Principal Engineer at OctoML. "First, the
Apache TVM compiler uses a machine learning-based auto-scheduler to search
out CPU and GPU code optimization. Second, the Apache TVM compiler is able
to automatically fuse qualified subgraphs and directly generate code. Even
the best engineers can't anticipate the interactions between model
architectures, computation workloads and hardware target resource
availability."