Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
CVPR 2024
Keywords
Multimodal Model, Benchmark, Natural Language, Autoregressive Model, Bounding Box, Image Generation, Tokenized, Robot Manipulator, Natural Language Understanding, Denoising, Input Image, Object Detection, Data Augmentation, Diffusion Model, Patch Size, Changes In Architecture, Line Of Work, Language Model, Efficient Implementation, Input Modalities, Vision Transformer, Text Generation, Pre-training Data, Input Text, Audio Segments, Special Token, View Synthesis, Text Output, Sparse Structure, Output Image