Paper notes — TAIYI-DIFFUSION-XL: ADVANCING BILINGUAL TEXT-TO-IMAGE GENERATION WITH LARGE VISION-LANGUAGE MODEL SUPPORT

Watson Wang
3 min read · Feb 1, 2024

Introduction

Recent developments in diffusion models have showcased their potential in generating high-quality images from text descriptions. However, most current open-source text-to-image models support only English, with very few offering bilingual support for Chinese and English.

In particular, works such as Taiyi-Diffusion, Pai-Diffusion and Alt-Diffusion have made significant strides in adapting text-to-image models for Chinese scenarios, demonstrating the feasibility and importance of native language support in such models.

However, these models typically obtain Chinese understanding by replacing the text encoder with a multilingual one while keeping the UNet unchanged, a methodology that discards the original English understanding capabilities.

Building on these advancements, Taiyi-Diffusion-XL (Taiyi-XL) specifically focuses on augmenting these models for Chinese text-to-image generation while preserving the original English ability, addressing the unique linguistic and cultural aspects of both languages.

  1. Efficient Algorithms for Bilingual Expansion: We develop algorithms for expanding the vocabulary and position encoding of text-to-image models, tailored to bilingual contexts (a sketch of this expansion follows this list).
  2. Enrichment of Text Prompts by Large Vision-Language Models: We employ large vision-language models to enrich text prompts, yielding more accurate and detailed image descriptions.
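As a rough sketch of what point 1 involves, the snippet below adds new tokens to a pretrained CLIP text encoder and lengthens its learned position embeddings. The model name, token list, and target length of 256 are illustrative assumptions, not the paper's actual configuration.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Sketch of bilingual expansion: add new (Chinese) tokens, grow the token
# embedding table, and lengthen the learned position embeddings so longer
# prompts fit. Model name, token list, and target length are placeholders.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 1) Vocabulary expansion: register new tokens and resize the embedding matrix;
#    the freshly added rows are then learned during fine-tuning.
new_tokens = ["水墨画", "山水", "祥云"]  # example additions only
tokenizer.add_tokens(new_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))

# 2) Position-encoding expansion: keep the trained position slots and append
#    newly initialized ones up to the longer context length.
new_max_len = 256
emb = text_encoder.text_model.embeddings
old_pos = emb.position_embedding.weight.data              # (77, hidden_dim)
new_pos = torch.nn.Embedding(new_max_len, old_pos.shape[1])
new_pos.weight.data[: old_pos.shape[0]] = old_pos          # reuse trained slots
emb.position_embedding = new_pos
emb.position_ids = torch.arange(new_max_len).unsqueeze(0)  # refresh cached ids
text_encoder.config.max_position_embeddings = new_max_len
tokenizer.model_max_length = new_max_len                   # stop truncating at 77
```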

Methodology

Dataset

  1. Curating a dataset of high-quality image-text pairs (X, Y), where X is an image and Y is its descriptive text.
  2. Employing a large vision-language model (Lyrics) to generate synthetic captions that describe the images more accurately.
  3. Feeding the image, its web-crawled caption, and an instruction for generating the description to Lyrics as inputs (a sketch of this prompt assembly follows the instruction examples below).

In Chinese, we select “請詳細描述圖片內容。” (“Please describe the image content in detail.”) as the instruction.

In English, we select “Write a detailed description of the given image.” as the instruction.
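As a rough illustration of the recaptioning step, the sketch below assembles those inputs. The `captioner` callable and `recaption` helper are hypothetical stand-ins; Lyrics' actual interface is not described in these notes.

```python
from PIL import Image

# Hypothetical illustration of the recaptioning step. `captioner` stands in for
# the Lyrics model; the point is simply what goes in: the image, its
# web-crawled caption, and the fixed instruction.

INSTRUCTIONS = {
    "zh": "請詳細描述圖片內容。",
    "en": "Write a detailed description of the given image.",
}

def build_prompt(web_caption: str, lang: str) -> str:
    # Pair the instruction with the noisy web-crawled caption so the model can
    # correct and expand it rather than describe the image from scratch.
    return f"{INSTRUCTIONS[lang]}\nReference caption: {web_caption}"

def recaption(captioner, image_path: str, web_caption: str, lang: str = "zh") -> str:
    image = Image.open(image_path).convert("RGB")
    # Any VLM wrapper taking (image, prompt) -> text can be plugged in here.
    return captioner(image, build_prompt(web_caption, lang))
```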

CLIP training

The foundation of our model is a vision-language large model, similar to CLIP, which aligns images and text representations effectively. We start with the pre-trained English-only CLIP model and extend its training to accommodate bilingual adaptation and the requirements of high-quality image-text data.

  1. The first stage processes a large-scale bilingual dataset, including LAION and Wukong, with a focus on data cleaning and quality enhancement.
  2. The second stage continues training on our enriched dataset, emphasizing the diverse perspectives and details captured in high-quality image-text pairs (see the contrastive-loss sketch after this list).
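Both stages optimize a CLIP-style symmetric contrastive objective. The sketch below is a generic PyTorch version with stand-in embeddings, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Normalize so the similarity matrix holds cosine scores.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarities; matching image-text pairs sit on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image->text and text->image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random stand-in embeddings (batch of 8, dim 512):
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```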

TAIYI-XL Training

Each data instance is represented as a pair (x, y), where x is an image and y is its corresponding textual descriptor. For the training phase, at mixed resolutions of 512 × 512 and 1024 × 1024, we define a loss function L to guide the image denoising process:
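Assuming the standard latent-diffusion noise-prediction objective (with a text-conditioned denoiser ε_θ and text encoder τ_θ), the loss would take roughly this form:

```latex
\mathcal{L} \;=\; \mathbb{E}_{(x, y) \sim \mathcal{D},\; \epsilon \sim \mathcal{N}(0, I),\; t}
\Big[ \big\lVert \epsilon - \epsilon_\theta\big(x_t,\, t,\, \tau_\theta(y)\big) \big\rVert_2^2 \Big]
```

where x_t is the noised version of x at timestep t and τ_θ(y) is the encoding of the text prompt y.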

Evaluation

CLIP

Diffusion model

Comparison
