MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices
Computing Research Repository (CoRR), 2024
The authors of this paper include:
- Yang Zhao (Google): research interests include representation learning, diffusion models, transfer learning, semi-supervised learning, and unsupervised learning.
- Zhisheng Xiao (Committee on Computational and Applied Mathematics, University of Chicago): research directions include deep learning, optimization, conditional generative models, image restoration, and texture synthesis.
- Yanwu Xu (Department of Electrical and Computer Engineering, Boston University): research focuses on generative models, particularly denoising diffusion models, generative adversarial learning, and large-scale multi-modal generative models.
- Haolin Jia (Google).
- Tingbo Hou: holds a Bachelor's degree from the University of Science and Technology of China, a Master's degree from the Chinese Academy of Sciences, and a Ph.D. in Computer Science from Stony Brook University; research interests include computer graphics, diffusion geometry, and differential geometry processing and analysis.
1. Introduction
- Applications and challenges of text-to-image generation models
- Limitations of existing research
- Objective of this paper: MobileDiffusion model
2. Related Work
- Architectural efficiency
- Sampling efficiency
3. MobileDiffusion
- Overview of diffusion models
- UNet Optimization
- Optimization of Transformer blocks
- Optimization of convolutional blocks
- Architectural candidates
- Sampling efficiency
4. Experiments
- Training Details
- UNet architecture
- Input SD
- Text encoder
- Datasets
- Optimization
- Training cost
- Evaluation metrics
- Main Results
- Quantitative evaluation
- Qualitative comparison
- Mobile device benchmarking
- Applications
- Lightweight controllable adapter
- LoRA fine-tuning
5. Conclusion
Q: What specific research methods were used in the paper?
Architecture Optimization:
- Conducted an in-depth analysis of the UNet architecture and optimized the Transformer blocks (see the attention sketch after this list), including:
- Placing more Transformer blocks in the middle, lowest-resolution layers of the UNet and reducing the channel dimensionality.
- Decoupling the self-attention layer from the cross-attention layer, retaining the cross-attention layer only at low resolutions.
- Sharing the key and value projections in the self-attention layer to reduce the number of parameters.
- Replacing the GELU activation function with Swish to improve computational efficiency.
- Replacing softmax in the attention computation with ReLU to further reduce computational cost.
- Reducing the expansion ratio of the feed-forward layers to cut the parameter count.
- Optimized the convolutional blocks (see the convolution sketch below), including:
- Replacing standard convolutional layers with depthwise separable convolutions to reduce the number of parameters.
- Pruning redundant residual blocks to improve computational efficiency.
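To make the attention-level changes concrete, below is a minimal PyTorch sketch of a self-attention layer that shares one key/value projection and replaces softmax with ReLU. This is an illustration under assumptions, not the authors' implementation: the module name `SharedKVReLUAttention`, the head count, and the sequence-length normalization of the ReLU scores are choices made here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVReLUAttention(nn.Module):
    """Illustrative self-attention with a shared key/value projection
    and ReLU in place of softmax. Hyperparameters are assumptions,
    not the paper's exact configuration."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        # One projection produces both keys and values (KV sharing),
        # halving the K/V parameter count.
        self.kv_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q = self.q_proj(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        kv = self.kv_proj(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ kv.transpose(-2, -1) / self.head_dim ** 0.5
        # ReLU instead of softmax; dividing by sequence length is one
        # common choice to keep the attention weights bounded.
        weights = F.relu(scores) / n
        out = (weights @ kv).transpose(1, 2).reshape(b, n, d)
        return self.out_proj(out)
```

Sharing the K/V projection halves the key/value parameters, and ReLU avoids the exponentials and row-wise normalization that make softmax comparatively expensive on mobile accelerators.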
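Likewise, the convolutional change can be sketched in a few lines. The module below is a generic depthwise separable convolution, not the paper's exact block; the channel sizes are arbitrary and chosen only to show the parameter savings.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise separable replacement for a standard k x k convolution:
    a per-channel spatial (depthwise) convolution followed by a 1x1
    pointwise convolution."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(
            in_ch, in_ch, kernel_size,
            padding=kernel_size // 2, groups=in_ch)  # one filter per channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# Quick parameter comparison for 320 -> 320 channels with a 3x3 kernel:
dense = nn.Conv2d(320, 320, 3, padding=1)
separable = SeparableConv2d(320, 320)
print(sum(p.numel() for p in dense.parameters()))      # 921,920
print(sum(p.numel() for p in separable.parameters()))  # 105,920
```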
Sampling Efficiency Improvement:
- Applied classifier-free-guidance-aware (cfg-aware) distillation to reduce the number of sampling steps to 8 (see the distillation sketch after this list).
- Applied UFOGen finetuning to reduce the number of sampling steps to 1 (see the one-step sampling sketch below).
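As a rough illustration of guidance-aware distillation, the sketch below shows one training step in which a student that takes the guidance weight w as extra conditioning learns to match the classifier-free-guided teacher output in a single forward pass. Function names and signatures (`student`, `teacher`, `text_emb`, `null_emb`) are assumptions for illustration, not the paper's code.

```python
import torch

def cfg_distillation_step(student, teacher, x_t, t, text_emb, null_emb, w):
    """One guidance-aware distillation step (sketch). `student` takes the
    guidance weight w as an extra conditioning input; all names here are
    illustrative assumptions."""
    with torch.no_grad():
        # Teacher target: the classifier-free-guided prediction, which
        # normally costs two teacher forward passes per sampling step.
        eps_cond = teacher(x_t, t, text_emb)
        eps_uncond = teacher(x_t, t, null_emb)
        target = eps_uncond + w * (eps_cond - eps_uncond)
    # The student mimics the guided teacher in a single forward pass,
    # with w folded in as conditioning.
    pred = student(x_t, t, text_emb, w)
    return torch.nn.functional.mse_loss(pred, target)
```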
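After UFOGen-style finetuning, sampling collapses to a single network evaluation from pure noise. The snippet below is a hypothetical sketch of that one-step inference path; `model`, `vae_decoder`, and the terminal timestep value are illustrative assumptions.

```python
import torch

@torch.no_grad()
def one_step_generate(model, vae_decoder, text_emb, latent_shape, T=999):
    """Single-step sampling after UFOGen-style finetuning (sketch)."""
    z_T = torch.randn(latent_shape)        # pure Gaussian noise
    t = torch.full((latent_shape[0],), T)  # terminal timestep for every sample
    z_0 = model(z_T, t, text_emb)          # directly predict the clean latent
    return vae_decoder(z_0)                # decode the latent to an image

# e.g. image = one_step_generate(model, vae.decode, emb, (1, 4, 64, 64))
```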
Q: What are the main research findings and achievements?
- Effective Architecture Optimization: Through optimization of the UNet architecture, MobileDiffusion substantially reduces the parameter count and computational load while maintaining generated-image quality, making it well suited to mobile devices.
- Substantially Improved Sampling Efficiency: cfg-aware distillation and UFOGen finetuning sharply cut the number of sampling steps, enabling sub-second image generation on mobile devices.
- Broad Application Potential: MobileDiffusion performs well on downstream tasks such as controllable text-to-image generation and LoRA finetuning, demonstrating wide applicability.
Q: What are the current limitations of this research?
- Model Size Limitation: To preserve computational efficiency, the parameter count of MobileDiffusion is capped at 400 million, which may limit the complexity and level of detail of the generated images.
- Cost of Fewer Sampling Steps: Although reducing the number of sampling steps greatly improves generation speed, it can also degrade the quality of the generated images.
- Device Performance Variability: Performance differences across mobile devices may affect MobileDiffusion's generation speed and image quality.
