Grounding natural language phrases in images and video