Visual-Language-Guided Task Planning for Horticultural Robots

Naveen Kumar Uppalapati

University of Illinois at Urbana Champaign (UIUC)

In Submission





Abstract

Crop monitoring is essential for precision agriculture, but current systems lack high-level reasoning. We introduce a novel, modular framework that uses a Visual Language Model (VLM) to guide robotic task planning, interleaving input queries with action primitives. We contribute a comprehensive benchmark for short- and long-horizon crop monitoring tasks in monoculture and polyculture environments. Our main results show that VLMs perform robustly for short-horizon tasks (comparable to human success), but exhibit significant performance degradation in challenging long-horizon tasks. Critically, the system fails when relying on noisy semantic maps, demonstrating a key limitation in current VLM context grounding for sustained robotic operations. This work offers a deployable framework and critical insights into VLM capabilities and shortcomings for complex agricultural robotics.



Simulation Environments. Three different agricultural environments were used to evaluate the proposed system.

Graphical Abstract. A high level overview of the system including a sample walkthrough of the proposed pipeline.

Overall System Pipeline. Given context and the assigned task, the agricultural agent selects from a library of perception and action primitives to achieve the task. The agent can switch between active vision and navigation modes. Once the task is complete or the agent requires help, a human can be prompted to intervene.


Simulation Video

Real World Demonstration Video