Overview of our system: an RGB-D camera mounted on the robot's wrist captures observations of the objects to be grasped. At inference time, our proposed LGGD generates 4-DoF grasp poses from the RGB image and the language query. The generated grasp poses are then passed to the control module, which plans and executes robot trajectories for pick-and-place tasks.
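To make the 4-DoF output concrete, the sketch below shows one way such a grasp pose could be decoded from per-pixel quality, angle, and width predictions together with the wrist camera's depth image. The map layout, function names, and camera intrinsics here are illustrative assumptions for exposition, not the released interface.

```python
import numpy as np

def decode_grasp(quality, angle, width, depth, intrinsics):
    """Decode a single 4-DoF grasp (x, y, z, yaw) plus gripper width
    from per-pixel prediction maps. Illustrative sketch only."""
    # Pick the pixel with the highest predicted grasp quality.
    v, u = np.unravel_index(np.argmax(quality), quality.shape)
    theta = angle[v, u]          # in-plane rotation (rad)
    w = width[v, u]              # gripper opening
    z = depth[v, u]              # depth from the wrist-mounted RGB-D camera
    fx, fy, cx, cy = intrinsics
    # Back-project the selected pixel into the camera frame.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z, theta]), w

# Example with random maps standing in for network output.
H, W = 480, 640
q = np.random.rand(H, W)
a = np.random.uniform(-np.pi / 2, np.pi / 2, (H, W))
wd = np.random.rand(H, W)
d = np.full((H, W), 0.5)
pose, grip_w = decode_grasp(q, a, wd, d, (600.0, 600.0, W / 2, H / 2))
```

The resulting camera-frame pose would then be transformed into the robot base frame before trajectory planning by the control module.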
Grasping is one of the most fundamental yet challenging capabilities in robotic manipulation, especially in unstructured, cluttered, and semantically diverse environments. Recent research has increasingly explored language-guided manipulation, where robots not only perceive the scene but also interpret task-relevant natural-language instructions. However, existing language-conditioned grasping methods typically rely on shallow fusion strategies, leading to limited semantic grounding and weak alignment between linguistic intent and visual grasp reasoning. In this work, we propose Language-Guided Grasp Detection (LGGD) with a coarse-to-fine learning paradigm for robotic manipulation. LGGD leverages CLIP-based visual and textual embeddings within a hierarchical cross-modal fusion pipeline, progressively injecting linguistic cues into the visual feature reconstruction process. This design enables fine-grained visual-semantic alignment and improves the feasibility of the predicted grasps with respect to the task instruction. In addition, we introduce a language-conditioned dynamic convolution head (LDCH) that mixes multiple convolution experts based on sentence-level features, enabling instruction-adaptive coarse mask and grasp predictions. A final refinement module further enhances grasp consistency and robustness in complex scenes. Experiments on the OCID-VLG and Grasp-Anything++ datasets show that LGGD surpasses existing language-guided grasping methods and generalizes well to unseen objects and diverse language queries. Moreover, deployment on a real robotic platform demonstrates the practical effectiveness of our approach in executing accurate, instruction-conditioned grasp actions. The code will be released publicly soon.
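As a rough illustration of the LDCH idea, the following sketch mixes several convolution experts with weights predicted from a sentence-level text embedding. All layer sizes, names, and the routing scheme are assumptions made for exposition, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageConditionedDynamicConv(nn.Module):
    """Sketch of a language-conditioned dynamic convolution: K convolution
    experts are mixed with weights predicted from the sentence-level text
    embedding (dimensions are illustrative)."""
    def __init__(self, in_ch=256, out_ch=256, text_dim=512, num_experts=4, k=3):
        super().__init__()
        self.num_experts, self.k = num_experts, k
        self.in_ch, self.out_ch = in_ch, out_ch
        # One weight/bias tensor per expert.
        self.weight = nn.Parameter(torch.randn(num_experts, out_ch, in_ch, k, k) * 0.02)
        self.bias = nn.Parameter(torch.zeros(num_experts, out_ch))
        # Router: sentence feature -> mixing coefficients over experts.
        self.router = nn.Linear(text_dim, num_experts)

    def forward(self, x, sent_feat):
        # x: (B, C, H, W); sent_feat: (B, text_dim)
        B = x.size(0)
        alpha = F.softmax(self.router(sent_feat), dim=-1)          # (B, K)
        # Build a per-sample kernel as a convex combination of experts.
        w = torch.einsum('bk,koihw->boihw', alpha, self.weight)    # (B, out, in, k, k)
        b = torch.einsum('bk,ko->bo', alpha, self.bias)            # (B, out)
        # Grouped-conv trick to apply a different kernel to each sample.
        x = x.reshape(1, B * self.in_ch, *x.shape[2:])
        w = w.reshape(B * self.out_ch, self.in_ch, self.k, self.k)
        out = F.conv2d(x, w, bias=b.reshape(-1), padding=self.k // 2, groups=B)
        return out.reshape(B, self.out_ch, *out.shape[2:])

head = LanguageConditionedDynamicConv()
feat = torch.randn(2, 256, 80, 80)
sent = torch.randn(2, 512)        # e.g. a CLIP sentence embedding
print(head(feat, sent).shape)     # torch.Size([2, 256, 80, 80])
```

Conditioning the kernels themselves, rather than only the input features, is what lets such a head adapt its filters to the instruction.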
Overview of the proposed LGGD framework: given an RGB image and a natural-language command, a CLIP-based image encoder and text encoder extract visual features and word/sentence embeddings. The Dual Cross Vision-Language Fusion (DCVLF) bottleneck aligns the two modalities, after which hierarchical language-guided upsampling progressively refines spatial details according to the textual intent. A coarse mask and grasp prediction head outputs the segmentation mask together with the grasp quality, angle, and gripper width. Finally, the mask refinement and grasp refinement modules sharpen boundaries and stabilize the predictions, producing accurate, instruction-consistent grasp poses.
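For intuition, a minimal bidirectional cross-attention sketch is given below, in which visual tokens attend to word embeddings and vice versa. It conveys only the general flavor of dual cross-modal fusion and is not the exact DCVLF implementation; dimensions and names are assumed.

```python
import torch
import torch.nn as nn

class DualCrossFusion(nn.Module):
    """Minimal bidirectional cross-attention between flattened visual
    tokens and word embeddings (dimensions are illustrative)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)  # vision queries text
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # text queries vision
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis_tokens, word_tokens):
        # vis_tokens: (B, H*W, dim), word_tokens: (B, L, dim)
        v, _ = self.v2t(vis_tokens, word_tokens, word_tokens)
        t, _ = self.t2v(word_tokens, vis_tokens, vis_tokens)
        # Residual connections keep the original modality content.
        return self.norm_v(vis_tokens + v), self.norm_t(word_tokens + t)

fusion = DualCrossFusion()
vis = torch.randn(2, 20 * 20, 256)   # flattened image features
txt = torch.randn(2, 12, 256)        # projected word embeddings
v_out, t_out = fusion(vis, txt)
print(v_out.shape, t_out.shape)      # (2, 400, 256) (2, 12, 256)
```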
We evaluate the proposed system in an interactive grasping setting, where a user issues an instruction specifying a target object and the robot must localize the correct target and execute a pick-and-place operation. Our experiments are conducted on a real robotic platform using a KUKA LBR iiwa 14 R820 manipulator equipped with a Robotiq 2F-85 adaptive gripper. Model inference and the control pipeline are executed on a workstation with an NVIDIA RTX 4090 (24GB) GPU and an AMD Ryzen Threadripper Pro 7955WX CPU.
Demonstration scenes: Isolated, Scattered, and Cluttered.
@article{jiang2025language,
title={Language-Guided Grasp Detection with Coarse-to-Fine Learning for Robotic Manipulation},
author={Jiang, Zebin and Jin, Tianle and Yao, Xiangtong and Knoll, Alois and Cao, Hu},
journal={arXiv preprint arXiv:2512.21065},
year={2025}
}
Thanks to Keunhong Park for the website template.