Multimodal Learning for Visual Perception and Robotic Action