MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices
Computing Research Repository (CoRR), 2024
The authors of this paper include:
- Yang Zhao (Google): research interests include representation learning, diffusion models, transfer learning, semi-supervised learning, and unsupervised learning.
- Zhisheng Xiao (Committee on Computational and Applied Mathematics, University of Chicago): research directions include deep learning, optimization, conditional generative models, image restoration, and texture synthesis.
- Yanwu Xu (Department of Electrical and Computer Engineering, Boston University): research focuses on generative models, particularly denoising diffusion models, generative adversarial learning, and large-scale multi-modal generative models.
- Haolin Jia (Google).
- Tingbo Hou: holds a Bachelor's degree from the University of Science and Technology of China, a Master's degree from the Chinese Academy of Sciences, and a Ph.D. in Computer Science from Stony Brook University; research interests include computer graphics, diffusion geometry, and differential geometry processing and analysis.
1. Introduction
- Applications and challenges of text-to-image generation models
- Limitations of existing research
- Objective of this paper: MobileDiffusion model
2. Related Work
- Architectural efficiency
- Sampling efficiency
3. MobileDiffusion
- Overview of diffusion models
- UNet Optimization
- Optimization of Transformer blocks
- Optimization of convolutional blocks
- Architectural candidates
- Sampling efficiency
4. Experiments
- Training Details
- UNet architecture
- Input SD
- Text encoder
- Datasets
- Optimization
- Training cost
- Evaluation metrics
- Main Results
- Quantitative evaluation
- Qualitative comparison
- Mobile device benchmarking
- Applications
- Lightweight controllable adapter
- LoRA fine-tuning
5. Conclusion
Q: What specific research methods were used in the paper?
Architecture Optimization:
- Conducted an in-depth analysis of the UNet architecture and optimized the Transformer blocks (see the attention sketch after this list), including:
- Placing more Transformer blocks in the middle, lowest-resolution layers of the UNet and reducing the channel dimensionality.
- Decoupling the self-attention layer from the cross-attention layer, retaining the cross-attention layer only at low resolutions.
- Sharing the key and value projections in the self-attention layer to reduce the number of parameters.
- Replacing the GELU activation function with Swish to improve computational efficiency.
- Replacing softmax in the attention computation with ReLU to further reduce computational cost.
- Reducing the expansion ratio of the feed-forward layers to cut the parameter count.
- Optimized the convolutional blocks (see the convolution sketch below), including:
- Replacing standard convolutional layers with depthwise separable convolutions to reduce the number of parameters.
- Pruning redundant residual blocks to improve computational efficiency.
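To make the attention-level changes concrete, below is a minimal PyTorch sketch of a self-attention layer that shares one key/value projection and replaces softmax with ReLU. This is an illustration under assumptions, not the authors' implementation: the module name `SharedKVReLUAttention`, the head count, and the sequence-length normalization of the ReLU scores are choices made here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVReLUAttention(nn.Module):
    """Illustrative self-attention with a shared key/value projection
    and ReLU in place of softmax. Hyperparameters are assumptions,
    not the paper's exact configuration."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        # One projection produces both keys and values (KV sharing),
        # halving the K/V parameter count.
        self.kv_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q = self.q_proj(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        kv = self.kv_proj(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ kv.transpose(-2, -1) / self.head_dim ** 0.5
        # ReLU instead of softmax; dividing by sequence length is one
        # common choice to keep the attention weights bounded.
        weights = F.relu(scores) / n
        out = (weights @ kv).transpose(1, 2).reshape(b, n, d)
        return self.out_proj(out)
```

Sharing the K/V projection halves the key/value parameters, and ReLU avoids the exponentials and row-wise normalization that make softmax comparatively expensive on mobile accelerators.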
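Likewise, the convolutional change can be sketched in a few lines. The module below is a generic depthwise separable convolution, not the paper's exact block; the channel sizes are arbitrary and chosen only to show the parameter savings.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable replacement for a standard k x k convolution:
    a per-channel spatial (depthwise) convolution followed by a 1x1
    pointwise convolution."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(
            in_ch, in_ch, kernel_size,
            padding=kernel_size // 2, groups=in_ch)  # one filter per channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# Quick parameter comparison for 320 -> 320 channels with a 3x3 kernel:
dense = nn.Conv2d(320, 320, 3, padding=1)
separable = SeparableConv2d(320, 320)
print(sum(p.numel() for p in dense.parameters()))      # 921,920
print(sum(p.numel() for p in separable.parameters()))  # 105,920
```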
Sampling Efficiency Improvement:
- Applied classifier-free-guidance-aware (cfg-aware) distillation to reduce the number of sampling steps to 8 (see the distillation sketch after this list).
- Applied UFOGen finetuning to reduce the number of sampling steps to 1 (see the one-step sampling sketch below).
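As a rough illustration of guidance-aware distillation, the sketch below shows one training step in which a student that takes the guidance weight w as extra conditioning learns to match the classifier-free-guided teacher output in a single forward pass. Function names and signatures (`student`, `teacher`, `text_emb`, `null_emb`) are assumptions for illustration, not the paper's code.

```python
import torch

def cfg_distillation_step(student, teacher, x_t, t, text_emb, null_emb, w):
    """One guidance-aware distillation step (sketch). `student` takes the
    guidance weight w as an extra conditioning input; all names here are
    illustrative assumptions."""
    with torch.no_grad():
        # Teacher target: the classifier-free-guided prediction, which
        # normally costs two teacher forward passes per sampling step.
        eps_cond = teacher(x_t, t, text_emb)
        eps_uncond = teacher(x_t, t, null_emb)
        target = eps_uncond + w * (eps_cond - eps_uncond)
    # The student mimics the guided teacher in a single forward pass,
    # with w folded in as conditioning.
    pred = student(x_t, t, text_emb, w)
    return torch.nn.functional.mse_loss(pred, target)
```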
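After UFOGen-style finetuning, sampling collapses to a single network evaluation from pure noise. The snippet below is a hypothetical sketch of that one-step inference path; `model`, `vae_decoder`, and the terminal timestep value are illustrative assumptions.

```python
import torch

@torch.no_grad()
def one_step_generate(model, vae_decoder, text_emb, latent_shape, T=999):
    """Single-step sampling after UFOGen-style finetuning (sketch)."""
    z_T = torch.randn(latent_shape)        # pure Gaussian noise
    t = torch.full((latent_shape[0],), T)  # terminal timestep for every sample
    z_0 = model(z_T, t, text_emb)          # directly predict the clean latent
    return vae_decoder(z_0)                # decode the latent to an image

# e.g. image = one_step_generate(model, vae.decode, emb, (1, 4, 64, 64))
```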
Q: What are the main research findings and achievements?
- Effective Architecture Optimization: Through optimization of the UNet architecture, MobileDiffusion substantially reduces the parameter count and computational load while maintaining generated-image quality, making it well suited to mobile devices.
- Substantially Improved Sampling Efficiency: cfg-aware distillation and UFOGen finetuning sharply cut the number of sampling steps, enabling sub-second image generation on mobile devices.
- Broad Application Potential: MobileDiffusion performs well on downstream tasks such as controllable text-to-image generation and LoRA finetuning, demonstrating wide applicability.
Q: What are the current limitations of this research?
- Model Size Limitation: To preserve computational efficiency, the parameter count of MobileDiffusion is capped at 400 million, which may limit the complexity and level of detail of the generated images.
- Cost of Fewer Sampling Steps: Although reducing the number of sampling steps greatly improves generation speed, it can also degrade the quality of the generated images.
- Device Performance Variability: Performance differences across mobile devices may affect MobileDiffusion's generation speed and image quality.
