Multimodal Learning from Videos: Self-supervised Pre-training, Post-training Alignment, and Benchmarks