Gemini Robotics-ER 1.5 (ai.google.dev)

🤖 AI Summary
Google announced Gemini Robotics-ER 1.5, a purpose-built vision-language model that extends Gemini's agentic reasoning to physical robots. The model can interpret images, video, and audio, reason spatially and temporally, and natively call tools or robot APIs to sequence behaviors for long-horizon tasks. In practice, developers can give a natural-language instruction like "put the apple in the bowl" and Gemini Robotics-ER 1.5 will decompose it into subtasks, propose grasps and trajectories, and orchestrate function calls to existing robot controllers, enabling more autonomous, adaptive behavior in open-ended environments.

Technically, the model returns structured JSON outputs (points and 2D bounding boxes) with coordinates normalized to a 0–1000 [y, x] grid, supports multi-frame tracking, and can assign unique labels to multiple instances of the same object class (e.g., several pieces of bread). It is exposed through the generate_content API (preview model id gemini-robotics-er-1.5-preview) with a configurable thinking_budget to trade latency against accuracy, plus temperature controls. The documentation shows Python and REST integrations and explains how to wire the model to user-defined robot functions or third-party VLA modules.

Google highlights safety: the model was designed with safety in mind, but developers remain responsible for safe physical deployment and for mitigating model errors. Overall, Gemini Robotics-ER 1.5 packages robust perception, spatial reasoning, and agentic orchestration into a single VLM tailored for real-world robotic applications.
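
As a concrete illustration of the pointing output, here is a minimal sketch using the google-genai Python SDK, assuming an API key in the environment; the image file name, prompt wording, and image dimensions are illustrative rather than taken from Google's docs.

```python
import json

from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

MODEL_ID = "gemini-robotics-er-1.5-preview"

with open("workbench.jpg", "rb") as f:  # illustrative image of the robot's workspace
    image_bytes = f.read()

prompt = (
    "Point to the apple and the bowl. Answer with a JSON list of objects of the "
    'form {"point": [y, x], "label": <name>}, with coordinates normalized to 0-1000.'
)

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        prompt,
    ],
    config=types.GenerateContentConfig(
        temperature=0.5,
        # Smaller budgets lower latency; larger budgets favor accuracy.
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    ),
)

# The model may wrap the JSON in a Markdown fence; strip it before parsing.
raw = response.text.strip().removeprefix("```json").removesuffix("```")
points = json.loads(raw)

# Convert the normalized 0-1000 [y, x] grid back into pixel coordinates.
IMG_W, IMG_H = 1280, 720  # replace with the actual image size
for item in points:
    y_norm, x_norm = item["point"]
    print(f'{item["label"]}: ({int(x_norm / 1000 * IMG_W)}, {int(y_norm / 1000 * IMG_H)})')
```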
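
The orchestration side can be sketched in the same SDK by handing the model callable robot functions as tools. The move_to, grasp, and release functions below are hypothetical placeholders for a real controller API, and relying on the SDK's automatic function calling in this way is an assumption of the sketch, not something stated in the announcement.

```python
from google import genai
from google.genai import types

client = genai.Client()

def move_to(x_mm: float, y_mm: float, z_mm: float) -> str:
    """Hypothetical controller call: move the end effector to a Cartesian pose."""
    return "ok"  # forward to the real robot controller here

def grasp(width_mm: float) -> str:
    """Hypothetical controller call: close the gripper to the given width."""
    return "ok"

def release() -> str:
    """Hypothetical controller call: open the gripper."""
    return "ok"

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",
    contents="Put the apple in the bowl, calling the robot functions as needed.",
    config=types.GenerateContentConfig(
        # Passing Python callables enables the SDK's automatic function calling:
        # the model's tool calls are executed and the results fed back until done.
        tools=[move_to, grasp, release],
    ),
)
print(response.text)
```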