Paper notes — TAIYI-DIFFUSION-XL: ADVANCING BILINGUAL TEXT-TO-IMAGE GENERATION WITH LARGE VISION-LANGUAGE MODEL SUPPORT
Introduction
Recent developments in diffusion models have showcased their potential for generating high-quality images from text descriptions. However, most current open-source text-to-image models support only English, with very few offering bilingual support for Chinese and English.
In particular, works such as Taiyi-Diffusion, Pai-Diffusion and Alt-Diffusion have made significant strides in adapting text-to-image models for Chinese scenarios, demonstrating the feasibility and importance of native language support in such models.
However, these models typically obtain Chinese understanding by replacing the original text encoder with a multilingual or Chinese-only one while retaining the UNet; this methodology discards the model's original English understanding capabilities.
Building on these advancements, Taiyi-Diffusion-XL (Taiyi-XL) specifically focuses on augmenting these models for Chinese text-to-image generation while preserving the original English ability, addressing the unique linguistic and cultural aspects of both languages.
- Efficient Algorithms for Bilingual Expansion: We develop algorithms for expanding the vocabulary and position encoding of text-to-image models, tailored to bilingual contexts (see the sketch after this list).
- Enrichment of Text Prompts by Large Vision-Language Models: We use large vision-language models to generate detailed synthetic captions that enrich the text prompts used for training.
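To make the first contribution concrete, here is a minimal sketch of the two expansion steps, assuming Hugging Face `transformers` and the `openai/clip-vit-large-patch14` checkpoint (both assumptions, not details from the paper): adding Chinese tokens to the vocabulary and lengthening the learned position-embedding table beyond CLIP's default 77 tokens.

```python
# Sketch of bilingual expansion; not the paper's actual code.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 1) Vocabulary expansion: add Chinese tokens, then grow the token-embedding
#    matrix so the new ids get (randomly initialised) rows to train.
chinese_tokens = ["图片", "描述"]  # placeholder; the real list would be much larger
tokenizer.add_tokens(chinese_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))

# 2) Position-encoding expansion: lengthen the learned position-embedding table
#    and copy the pretrained rows into the new, longer table.
new_max_len = 256  # assumed target length
old_pos = text_encoder.text_model.embeddings.position_embedding
new_pos = torch.nn.Embedding(new_max_len, old_pos.embedding_dim)
new_pos.weight.data[: old_pos.num_embeddings] = old_pos.weight.data
text_encoder.text_model.embeddings.position_embedding = new_pos
text_encoder.text_model.embeddings.register_buffer(
    "position_ids", torch.arange(new_max_len).unsqueeze(0), persistent=False
)
text_encoder.config.max_position_embeddings = new_max_len
tokenizer.model_max_length = new_max_len
```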
Methodology
Dataset
- Curating a dataset consisting of high-quality image-text pairs (X, Y), where X represents an image, and Y is a descriptive text.
- Employing vision-language large models (Lyrics) to generate synthetic captions that more accurately describe the images.
- Using the image, its web-crawled caption, and an instruction for generating the description as inputs to Lyrics (a sketch of this recaptioning step follows the instruction examples below).
In Chinese, we select “請詳細描述圖片內容。” (“Please describe the image content in detail.”) as the instruction.
In English, we select “Write a detailed description of the given image.” as the instruction.
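A hedged sketch of the recaptioning step is below; `model.generate(...)` is a hypothetical interface standing in for the Lyrics vision-language model, whose actual API these notes do not specify.

```python
# Sketch only: produce a synthetic caption from the image, its web-crawled
# caption, and a language-specific instruction.
from PIL import Image

INSTRUCTIONS = {
    "zh": "請詳細描述圖片內容。",  # "Please describe the image content in detail."
    "en": "Write a detailed description of the given image.",
}

def recaption(model, image_path: str, web_caption: str, lang: str = "zh") -> str:
    """Generate a synthetic caption for one image-text pair."""
    image = Image.open(image_path).convert("RGB")
    prompt = f"{INSTRUCTIONS[lang]}\nWeb caption: {web_caption}"
    return model.generate(image=image, prompt=prompt)  # hypothetical call
```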
CLIP Training
The foundation of our model is a vision-language large model, similar to CLIP, which aligns images and text representations effectively. We start with the pre-trained English-only CLIP model and extend its training to accommodate bilingual adaptation and the requirements of high-quality image-text data.
- The first stage of training processes a large-scale bilingual dataset, including Laion and Wukong, with a focus on data cleaning and quality enhancement.
- The second stage continues training on our enriched dataset, emphasizing the diverse perspectives and details captured in high-quality image-text pairs (a sketch of the contrastive objective follows this list).
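Both stages optimise the usual CLIP-style alignment objective. A minimal sketch of that symmetric contrastive (InfoNCE) loss follows; PyTorch and the batch-of-matched-pairs setup are assumptions, not details taken from the paper.

```python
# Symmetric contrastive loss over a batch of matched (image, caption) pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE in both directions: image -> text and text -> image."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)
```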
TAIYI-XL Training
Each data instance is represented as a pair (x, y), where x is an image and y is its corresponding textual descriptor. For the training phase at mixed resolutions of 512 × 512 and 1024 × 1024, we define a loss function L to guide the image denoising process:
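Presumably this is the standard noise-prediction objective of latent diffusion models; a hedged reconstruction, where $z_t$ is the noised latent of $x$ at timestep $t$, $\epsilon_\theta$ is the denoising UNet, and $\tau$ is the bilingual text encoder:

$$
\mathcal{L} = \mathbb{E}_{(x,y),\;\epsilon \sim \mathcal{N}(0, I),\;t}\left[\,\big\|\epsilon - \epsilon_\theta\big(z_t,\, t,\, \tau(y)\big)\big\|_2^2\,\right]
$$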