🤖 AI Summary
Lanturn is a hackathon prototype that links Google's Gemini Live multimodal API to an ESP32 Atoms3r‑CAM, demonstrating real‑time voice conversations with vision on a tiny microcontroller. The device captures mic audio and camera frames, streams Opus audio and H.264 video over WebRTC to a Pipecat SmallWebRTC server (which also handles signaling), which in turn forwards the media to Gemini Live for multimodal inference. Voice interaction works; vision is integrated but still a work in progress. The project shows how cloud multimodal LLMs can give constrained edge hardware voice + vision capabilities, opening the door to low‑cost, battery‑sensitive IoT assistants and accessible multimodal demos.
Technically, Lanturn runs on an M5Stack ESP32 Atoms3r‑CAM (GC0308, 0.3 MP) with 8 MB PSRAM (required), built with ESP‑IDF 5.5. It uses espressif/esp_h264 to convert QVGA (320×240) RGB565 frames to I420 and encode them as H.264 baseline at ≈1 FPS and ~200 kbps (an IDR frame every frame), while audio is encoded as Opus. The architecture is: ESP32 <—WiFi/WebRTC—> Pipecat server (/api/offer) <—server‑to‑server—> Gemini Live. Expect ~1–2 s of visual latency; video out to the ESP32 is disabled. Known constraints include the low frame rate, camera power/init issues, and potential hallucination when no camera feed is present. The repo documents the build requirements (Python 3.13+, the ESP toolchain) and invites contributions to improve camera stability and throughput, illustrating the tradeoffs and real‑world viability of multimodal AI on microcontrollers.
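To make the pixel-format step concrete, below is an illustrative sketch (in Python/numpy for readability; on‑device this runs in C) of the RGB565 → I420 conversion each frame needs before it reaches the esp_h264 encoder. The function name and the BT.601 coefficients are assumptions, not code from the repo:

```python
import numpy as np

def rgb565_to_i420(frame: np.ndarray) -> bytes:
    """frame: (240, 320) uint16 array of RGB565 pixels (QVGA)."""
    # Unpack the 5/6/5-bit channels and scale each to 8 bits
    r = ((frame >> 11) & 0x1F).astype(np.float32) * (255.0 / 31.0)
    g = ((frame >> 5) & 0x3F).astype(np.float32) * (255.0 / 63.0)
    b = (frame & 0x1F).astype(np.float32) * (255.0 / 31.0)

    # BT.601 full-range RGB -> YUV
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.169 * r - 0.331 * g + 0.500 * b + 128.0
    v = 0.500 * r - 0.419 * g - 0.081 * b + 128.0

    # I420 is planar: full-resolution Y, then U and V subsampled 2x2
    y_plane = y.round().clip(0, 255).astype(np.uint8)
    u_plane = (u.reshape(120, 2, 160, 2).mean(axis=(1, 3))
                .round().clip(0, 255).astype(np.uint8))
    v_plane = (v.reshape(120, 2, 160, 2).mean(axis=(1, 3))
                .round().clip(0, 255).astype(np.uint8))
    return y_plane.tobytes() + u_plane.tobytes() + v_plane.tobytes()
```

The numbers in the summary are self-consistent: a raw QVGA I420 frame is 320 × 240 × 1.5 = 115,200 bytes, and at ~200 kbps and 1 FPS the encoder gets about 200,000 / 8 = 25,000 bytes per all‑IDR frame, i.e. roughly 4.6× compression of the raw plane data.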