Vision + text + audio in one model — how multimodal AI works, from patch tokenization and spectrograms to cross-attention and real-world architectures.