Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi

CVPR 2024 (2024)


Keywords

Multimodal Model, Benchmark, Natural Language, Autoregressive Model, Bounding Box, Image Generation, Tokenized, Robot Manipulator, Natural Language Understanding, Denoising, Input Image, Object Detection, Data Augmentation, Diffusion Model, Patch Size, Changes In Architecture, Line Of Work, Language Model, Efficient Implementation, Input Modalities, Vision Transformer, Text Generation, Pre-training Data, Input Text, Audio Segments, Special Token, View Synthesis, Text Output, Sparse Structure, Output Image
