Alluxio announced the
latest enhancements in Alluxio Enterprise AI. Version 3.5 showcases the
platform's capability to accelerate AI model training and streamline operations
with features such as a new Cache Only Write Mode, advanced cache management,
and enhanced Python SDK integrations. These updates empower organizations to
train models faster, handle massive datasets more efficiently, and reduce
the complexity of AI infrastructure operations.
AI-driven workloads face
significant challenges in managing the sheer volume and complexity of data,
which can lead to inefficiencies and increased training times. Ensuring fast,
prioritized access to critical data and seamless integration with common AI
frameworks is essential for optimizing performance and accelerating model
development.
"The latest release of Alluxio
Enterprise AI is packed with new capabilities designed to further accelerate AI
workload performance," said Haoyuan (HY) Li, Founder and CEO of Alluxio. "Our
customers are training AI models with enormous datasets that often span
billions of files. Alluxio Enterprise AI 3.5 was built to ensure workloads
perform at peak performance while also simplifying management and operations of
AI infrastructure."
Alluxio Enterprise AI version 3.5 includes the following key features:
- New caching mode accelerates AI checkpoints - Alluxio's new CACHE_ONLY Write Mode significantly improves the
performance of write operations, such as writing checkpoint files during AI
model training. When enabled, this mode writes data exclusively to the Alluxio
cache instead of the underlying file system (UFS). Bypassing the UFS improves
write performance by eliminating bottlenecks typically associated with
underlying storage systems. This feature is experimental.
- Advanced cache eviction policies provide fine-grained control - Alluxio's TTL Cache Eviction Policies allow administrators to
enforce time-to-live (TTL) settings on cached data, ensuring less frequently
accessed data is automatically evicted based on defined policies. Alluxio's
priority-based cache eviction policies enable administrators to define caching
priorities for specific data that override Alluxio's default Least Recently
Used (LRU) algorithm, ensuring critical data remains in cache even if it would
otherwise be evicted. This is ideal for workloads requiring consistent
low-latency access to key datasets. Both TTL and Priority-based Cache Eviction
Policies are generally available.
- Python SDK integrations enhance AI framework compatibility - Alluxio's Python SDK now supports leading AI frameworks,
including PyTorch, PyArrow, and Ray. These integrations provide a unified
Python filesystem interface, enabling applications to interact seamlessly with
various storage backends. This simplifies the adoption of Alluxio Enterprise AI
for Python applications, particularly those handling data-intensive workloads
and AI model training, by facilitating quick and repeated access to both local
and remote storage systems. This feature is experimental.
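To illustrate how TTL and priority settings can override plain LRU eviction, here is a minimal sketch of a toy in-memory cache. It is not Alluxio's implementation or API: the class name ToyCache, its parameters, and the explicit "now" timestamps are all hypothetical and exist only to show the idea.

```python
from collections import OrderedDict


class ToyCache:
    """Hypothetical cache: LRU eviction plus TTL expiry and priorities."""

    def __init__(self, capacity):
        self.capacity = capacity
        # key -> (value, expires_at or None, priority); dict order = recency
        self.entries = OrderedDict()

    def put(self, key, value, now, ttl=None, priority=0):
        expires_at = now + ttl if ttl is not None else None
        self.entries[key] = (value, expires_at, priority)
        self.entries.move_to_end(key)  # newest entry is most recently used
        self._evict(now)

    def get(self, key, now):
        self._expire(now)
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key][0]

    def _expire(self, now):
        # TTL policy: drop every entry whose time-to-live has elapsed.
        expired = [k for k, (_, exp, _) in self.entries.items()
                   if exp is not None and exp <= now]
        for k in expired:
            del self.entries[k]

    def _evict(self, now):
        self._expire(now)
        while len(self.entries) > self.capacity:
            # Priority policy: evict from the lowest priority level first,
            # and fall back to LRU order only within that level.
            lowest = min(p for _, _, p in self.entries.values())
            victim = next(k for k, (_, _, p) in self.entries.items()
                          if p == lowest)
            del self.entries[victim]


cache = ToyCache(capacity=2)
cache.put("critical", "A", now=0, priority=10)  # oldest, but high priority
cache.put("scratch", "B", now=1)
cache.put("other", "C", now=2)                  # forces one eviction

ttl_cache = ToyCache(capacity=2)
ttl_cache.put("tmp", "X", now=0, ttl=5)         # expires at time 5
```

In this sketch, pure LRU would have evicted "critical" because it is the least recently used entry. The priority setting keeps it cached, and "scratch" is evicted instead.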
This release also introduces several enhancements to Alluxio's S3 API, which are
immediately available:
- Support for HTTP persistent connections (HTTP keep-alive) - Alluxio
now supports HTTP persistent connections, which maintain a single TCP
connection for multiple requests. This reduces the overhead of opening new
connections for each request and decreases latency by approximately 40% for 4 KB
S3 ReadObject requests.
- TLS encryption for enhanced security - Communication between the
Alluxio S3 API and the Alluxio worker now supports TLS encryption, ensuring
secure data transmission.
- Multipart upload (MPU) support - The Alluxio S3 API now supports multipart
upload, which splits files into multiple parts and uploads each part
separately. This feature simplifies the upload process and improves throughput
for large files.
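As a rough illustration of how multipart upload divides a file, the sketch below plans the byte ranges for each part. The function name plan_parts and the sizes used are assumptions for illustration only. It does not call the Alluxio S3 API or any client library, and it relies only on the S3 convention that part numbers start at 1.

```python
def plan_parts(total_size, part_size):
    """Return (part_number, offset, length) tuples covering total_size bytes."""
    if total_size < 0 or part_size <= 0:
        raise ValueError("total_size must be >= 0 and part_size must be > 0")
    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < total_size:
        # The final part may be shorter than part_size.
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


# Hypothetical example: a 20-byte object split into 8-byte parts.
parts = plan_parts(20, 8)
```

Each tuple describes one part that a client could upload independently, which is what allows parts to be sent in parallel and retried individually.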
Other enhancements included in version 3.5 are:
- The Alluxio Index Service - A new caching
service that improves the performance of directory listings for directories
storing hundreds of millions of files and subdirectories. The Index Service
ensures scalability and returns listings 3-5x faster than listing directories
on the UFS by serving directory listing details from the cache.
This enhancement is experimental.
- UFS read rate limiter - Administrators
can now set a rate limit to control the maximum bandwidth an individual Alluxio
Worker can read from the UFS. By configuring the UFS Read Rate Limiter,
administrators ensure optimized resource utilization while maintaining system
stability. Alluxio supports rate limiting for various UFS types, including S3,
HDFS, GCS, OSS, and COS. This enhancement is generally available.
- Support for heterogeneous worker nodes - Alluxio now supports clusters with worker nodes that have
heterogeneous resource configurations (CPU, memory, disk, and network). This
enhancement provides administrators greater flexibility in configuring clusters
and offers improved opportunities to optimize resource allocation. This
enhancement is generally available.
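The announcement does not describe how the UFS Read Rate Limiter works internally. As a general illustration of bandwidth limiting, the sketch below uses a token bucket, a common technique chosen here as an assumption. The class name, parameters, and explicit "now" timestamps are hypothetical and are not Alluxio configuration options.

```python
class TokenBucket:
    """Hypothetical per-worker limiter: tokens represent readable bytes."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = 0.0            # time of the last refill, in seconds

    def try_read(self, nbytes, now):
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True   # the read may proceed
        return False      # the read must wait for more tokens


# Hypothetical limits: 100 bytes/s sustained, bursts of up to 200 bytes.
limiter = TokenBucket(rate_bytes_per_sec=100, burst_bytes=200)
first = limiter.try_read(150, now=0.0)   # fits within the initial burst
second = limiter.try_read(100, now=0.0)  # only 50 tokens remain
third = limiter.try_read(100, now=1.0)   # one second of refill adds 100
```

Capping the bucket at the burst size bounds how much a worker can read from the UFS in a short spike, while the refill rate bounds its sustained bandwidth.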
Availability
Alluxio Enterprise AI version 3.5 is available for download
here: https://www.alluxio.io/demo