🤖 AI Summary
Apple researchers published "Using LLMs for Late Multimodal Sensor Fusion for Activity Recognition," showing that large language models can effectively perform "late fusion" of audio and motion signals to recognize everyday activities. Using curated 20-second clips from the Ego4D first‑person dataset spanning 12 activities (e.g., cooking, vacuuming, workout, playing sports), the team fed the text outputs of smaller audio captioning/classification and IMU (accelerometer/gyroscope) models into LLMs (Gemini‑2.5‑pro and Qwen‑32B). Without task‑specific training, the LLMs achieved zero‑ and one‑shot 12‑class F1 scores significantly above chance, with a single one‑shot example further improving accuracy; evaluations covered both a closed‑set setting (the label list was provided) and an open‑ended setting.
Technically, this late‑fusion approach means the LLM ingests modality‑specific textual summaries rather than raw waveforms or dense shared embeddings, enabling multimodal temporal reasoning in settings where aligned training data is scarce or heavyweight multimodal models are impractical. That can reduce memory and computation needs at deployment and simplify the integration of heterogeneous sensors. Apple published reproducibility assets (segment IDs, prompts, one‑shot examples). Practically, this could enable more precise activity and health/context awareness in devices, but it also raises privacy and robustness questions; notably, the LLMs never saw raw audio in the study, only generated captions, so safeguards and further evaluation will be essential before real‑world adoption.
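To make the idea concrete, here is a minimal sketch (not Apple's released code) of how such a late-fusion prompt could be assembled: the audio caption and IMU summary strings, the partial label list, and the prompt wording are all illustrative assumptions, and the LLM call is left as an abstract callable since the study used Gemini‑2.5‑pro and Qwen‑32B.

```python
# Minimal sketch of LLM-based late fusion for activity recognition.
# Captions, labels, and prompt wording are illustrative, not the exact
# assets Apple released; the LLM call is passed in as a stub.

from typing import Callable, Optional

# Closed-set label list (the paper uses 12 classes; only the activities
# named in the summary are shown here).
ACTIVITY_LABELS = ["cooking", "vacuuming", "workout", "playing sports"]


def build_late_fusion_prompt(
    audio_caption: str,
    imu_summary: str,
    labels: list[str],
    one_shot_example: Optional[str] = None,
) -> str:
    """Combine modality-specific text outputs into one classification prompt."""
    parts = []
    if one_shot_example:
        parts.append(f"Example:\n{one_shot_example}\n")
    parts.append(
        "You are given text descriptions of a 20-second clip from two sensors.\n"
        f"Audio caption: {audio_caption}\n"
        f"Motion (IMU) summary: {imu_summary}\n"
        f"Choose the single best activity label from: {', '.join(labels)}.\n"
        "Answer with the label only."
    )
    return "\n".join(parts)


def classify_activity(
    audio_caption: str,
    imu_summary: str,
    llm: Callable[[str], str],
) -> str:
    """Late fusion: the LLM sees only text, never raw audio or IMU waveforms."""
    prompt = build_late_fusion_prompt(audio_caption, imu_summary, ACTIVITY_LABELS)
    return llm(prompt).strip().lower()


if __name__ == "__main__":
    # Hypothetical upstream model outputs and a dummy LLM for demonstration.
    caption = "sizzling sounds and clattering of metal utensils"      # audio captioner
    motion = "low-intensity arm movement, subject mostly stationary"  # IMU classifier
    print(classify_activity(caption, motion, llm=lambda p: "cooking"))
```

The one‑shot variant would simply pass a worked example (caption, IMU summary, and correct label) via `one_shot_example`, mirroring the paper's finding that a single example improves accuracy.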