Abstract: Most prior robot learning methods focus on image-based observations, which limits their capability in 3-D robotic manipulation. Voxel representations naturally deliver rich spatial features but ...