Current unimodal AI models that interpret either text or images and videos already benefit physicians by summarizing electronic health records 1, identifying patients at high risk of cancer 2, and ...