🤖 AI Summary
Daft published head-to-head benchmarks against Ray Data and Spark on four realistic multimodal pipelines (audio transcription, PDF embedding, image classification, video object detection), run on identical AWS clusters (8 g6.xlarge nodes with NVIDIA L4 GPUs). Daft completed every job reliably and ran 2–7× faster than Ray Data and 4–18× faster than Spark. For example: audio transcription took 6m22s with Daft vs 29m20s with Ray Data and 25m46s with Spark; image classification 4m23s vs 23m30s vs 45m7s; video object detection 11m46s vs 25m54s vs 3h36m. All benchmark code and logs are open-sourced for reproducibility.
The significance is practical: multimodal pipelines (large blobs, decode inflation, mixed CPU+GPU bursts) break the assumptions behind traditional data engines. Spark and Ray Data tend to fuse operations into large in-memory partitions or rely on object stores, which leads to OOMs, disk spill, idle GPUs, and heavy manual tuning (executor cores, batch sizes, object store sizing). Daft's Swordfish engine instead streams bounded batches through a single worker that controls the whole machine, applying backpressure and dynamically shrinking batches on memory-heavy operations. That design avoids materializing full partitions, keeps CPU, GPU, and network saturated together, and minimizes user tuning, making multimodal data processing both faster and more reliable at scale.
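To make the streaming idea concrete, here is a minimal Python sketch of the general technique the summary describes: bounded batches flowing through pipeline stages connected by bounded queues (so a slow stage applies backpressure upstream), with an oversized output batch split rather than materialized whole. All names here (`MEMORY_BUDGET`, `MAX_INFLIGHT`, `decode`, `embed`, `estimated_size`) are illustrative assumptions; this is not Daft's or Swordfish's actual code or API.

```python
# Conceptual sketch of bounded-batch streaming with backpressure.
# Not Daft/Swordfish internals; all names and numbers are illustrative.
from queue import Queue
from threading import Thread

MEMORY_BUDGET = 64 * 1024 * 1024   # hypothetical per-batch memory target (bytes)
MAX_INFLIGHT = 4                   # bounded channel => backpressure on upstream stages

def decode(batch):
    # Stand-in for a memory-inflating op (e.g. decoding audio/images).
    return [item * 2 for item in batch]

def embed(batch):
    # Stand-in for a GPU op that consumes the decoded batch.
    return [sum(batch)]

def estimated_size(batch):
    return 8 * len(batch)          # crude stand-in for real memory accounting

def source(rows, out: Queue, batch_rows: int):
    """Emit bounded batches; put() blocks when downstream queues are full."""
    for start in range(0, len(rows), batch_rows):
        out.put(rows[start:start + batch_rows])
    out.put(None)                  # end-of-stream marker

def operator(fn, inp: Queue, out: Queue):
    """Pull one batch, process it, push results; never materializes a full partition."""
    while (batch := inp.get()) is not None:
        result = fn(batch)
        if estimated_size(result) > MEMORY_BUDGET:
            # A real engine would also shrink subsequent batch sizes;
            # here we just split the oversized output once.
            half = max(len(result) // 2, 1)
            out.put(result[:half])
            out.put(result[half:])
        else:
            out.put(result)
    out.put(None)

if __name__ == "__main__":
    rows = list(range(10_000))
    q1, q2, q3 = Queue(MAX_INFLIGHT), Queue(MAX_INFLIGHT), Queue(MAX_INFLIGHT)
    stages = [
        Thread(target=source, args=(rows, q1, 1024)),
        Thread(target=operator, args=(decode, q1, q2)),
        Thread(target=operator, args=(embed, q2, q3)),
    ]
    for t in stages:
        t.start()
    results = []
    while (batch := q3.get()) is not None:
        results.extend(batch)
    for t in stages:
        t.join()
    print(f"collected {len(results)} results without materializing the full dataset")
```

The key design point this sketch illustrates is that memory use is bounded by the queue depth times the batch size rather than by partition size, which is why, per the summary, this style of engine avoids OOMs and spill without per-job tuning.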