MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale

Abstract

A scalable framework for region-aware, fully-editable visual layer generation.

Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content—analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts.

To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks (text-to-layers, image-to-layers, and layers-to-layers) within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, producing complete editable layers that extend beyond the visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step real-time multi-layer generation with minimal quality degradation.
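One way to picture how selective masking could unify the three tasks is the minimal sketch below. It is an illustration under stated assumptions, not the model's actual implementation: the helper `region_mask`, the region names, and the convention that `True` means "noised and generated" while `False` means "kept clean as a condition" are all hypothetical.

```python
from enum import Enum


class Task(Enum):
    T2L = "text-to-layers"
    I2L = "image-to-layers"
    L2L = "layers-to-layers"


def region_mask(task, num_layers, edit_layers=frozenset()):
    """Map each region name to True (noised, to be generated) or False (clean condition).

    Hypothetical helper: the region names and the mask convention are assumptions
    made for illustration only.
    """
    layers = [f"layer_{i}" for i in range(num_layers)]
    if task is Task.T2L:
        # Only the text prompt conditions the model; every layer is generated.
        return {name: True for name in layers}
    if task is Task.I2L:
        # The composite image is a clean condition; every layer is generated.
        mask = {"composite": False}
        mask.update({name: True for name in layers})
        return mask
    if task is Task.L2L:
        # Layers selected for editing are noised; the remaining layers stay clean.
        return {name: (i in edit_layers) for i, name in enumerate(layers)}
    raise ValueError(f"unknown task: {task!r}")
```

Under this convention, the three tasks differ only in which regions are marked for generation, so a single denoiser could in principle serve all of them.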

Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, our model significantly outperforms the concurrent Qwen-Image-Layered in image-to-layers quality according to user-study results, while achieving 10–100× faster inference and reducing activation GPU memory by 50–90%.

20B
Diffusion transformer built on Qwen-Image

10M+
Multilingual layered design samples

43M+
Unique transparent layers

3-in-1
T2L · I2L · L2L unified

8 steps
Distilled real-time inference

10–100×
Faster than Qwen-Image-Layered on I2L

Results

A single MRT model handles all three layered generation & editing tasks.

Qualitative results on text-to-layers. MRT generates multi-layer transparent designs directly from text prompts, producing fully-editable, semi-transparent layers with high-fidelity multilingual typography and varied aspect ratios.

User study comparison with previous SOTA on text-to-layers. MRT significantly outperforms ART across overall preference, aesthetics, typography, and layout dimensions.

T2L visualization with prompt, layout, generated, and individual layers

Text-to-layers generation examples. For each row we visualize the input prompt, predicted layout, generated merged composition, and the individual transparent RGBA layers—showing strong alignment between textual descriptions and layered output.

Text-to-layers with overflow layer generation. MRT uniquely generates complete full-size RGBA layers that extend beyond the background boundary, preserving editability and reusability that previous methods sacrifice by truncating pixels at canvas boundaries.
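To make the overflow idea concrete, the sketch below computes which part of a full-size layer is visible on the canvas when its bounding box extends past the boundary. The function `visible_crop` and its coordinate conventions (top-left origin, pixel units) are assumptions for illustration; the point is that the layer itself stays complete and only the intersection is used when compositing.

```python
def visible_crop(x, y, w, h, canvas_w, canvas_h):
    """Intersect a layer box with the canvas.

    (x, y) is the layer's top-left corner in canvas coordinates and may be
    negative; (w, h) is the full layer size. Returns None when the layer is
    entirely off-canvas, otherwise a pair of (left, top, right, bottom) boxes:
    the visible area in canvas coordinates and the matching crop in layer
    coordinates. Hypothetical helper, not taken from the MRT code base.
    """
    left, top = max(x, 0), max(y, 0)
    right, bottom = min(x + w, canvas_w), min(y + h, canvas_h)
    if right <= left or bottom <= top:
        return None
    canvas_box = (left, top, right, bottom)
    layer_box = (left - x, top - y, right - x, bottom - y)
    return canvas_box, layer_box


# A 200x100 layer placed 50 px to the left of a 512x512 canvas.
result = visible_crop(-50, 100, 200, 100, 512, 512)
```

Because the off-canvas pixels are kept in the layer rather than truncated, moving the layer later reveals real content instead of a hard edge.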

Text-to-layers with multilingual support. Trained on over 10M multilingual designs, MRT renders visually-grounded text in multiple languages including Chinese, handling typography across different writing systems while maintaining design quality.

Beyond flat designs. Our model generates richly-textured images along with high-quality transparent layers, going well beyond simple poster-style flat designs.

T2L diverse layouts and layer decompositions for one prompt (1)

T2L diverse layouts and layer decompositions for one prompt (2)

Diverse generations from a single prompt. Each panel takes one text prompt and shows three sampled layouts (A/B/C), the corresponding generated compositions, and the resulting individual transparent layers—demonstrating that MRT produces varied, well-composed multi-layer designs from the same text input.

I2L qualitative comparison with LayerD, Lovart, RoboNeo, Qwen-Image-Layered

Image→Layers (I2L) qualitative comparison. Each panel's top-left shows the composed image with its decomposed layers. MRT outperforms all baselines: Lovart shows poor decomposition quality, RoboNeo exhibits artifacts, and LayerD and Qwen-Image-Layered produce overly grouped layers.

I2L user study vs SOTA and commercial systems

User study on image-to-layers. Blind evaluations against the strongest open-source and commercial systems across (i) quality (semantic correctness + transparency), (ii) integrity (faithful reconstruction), and (iii) granularity (appropriate decomposition).

Detailed comparison with Qwen-Image-Layered on AI-generated images

Detailed comparison with Qwen-Image-Layered. Left: input image generated with Nano-Banana-Pro. Middle: Qwen-Image-Layered outputs with 4 layers (the officially recommended setting) and with the layer count matched to ours. Right: decomposition from MRT.

I2L generalization on real natural images

Generalization to out-of-domain natural images. Despite being trained exclusively on poster-style flat designs, MRT generalizes well to natural scenes and still outperforms Qwen-Image-Layered.

Attention-map visualization. Left: input composite image and its layout. Right: decomposed layers (top) with the corresponding attention maps overlaid on the input image (bottom). Red regions indicate high activation, showing that the model's attention is semantically selective—accurately aligning with text, foreground objects, and background patterns.

I2L decomposition with 16 layers. MRT scales gracefully to high layer counts (2–50 layers), producing coherent decompositions without architectural modifications.

Additional I2L comparison with Qwen-Image-Layered (1)

Additional I2L comparison with Qwen-Image-Layered (2)

Additional comparisons with Qwen-Image-Layered. Two more sets of test designs showing consistent advantages of MRT in layer quality, integrity, and granularity.

More I2L results on out-of-distribution AI-generated images (Pinterest-style, left; Qwen-Image-Layered test set, right). MRT maintains high decomposition quality on designs that lie outside the training distribution.

Image→Layers: merged image vs. layout visualization. Three examples showing the correspondence between the input raster image and the extracted layer layout structure (bounding boxes + z-order) that guides decomposition.
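A layout of this kind (bounding boxes plus z-order) could be represented roughly as follows. The `LayerBox` fields and the convention that a larger `z` is painted later are assumptions for illustration, not the format used by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerBox:
    """One layout entry: a bounding box in canvas pixels plus a stacking index."""
    name: str
    x: int
    y: int
    width: int
    height: int
    z: int  # assumed convention: larger z is painted later (closer to the viewer)


def paint_order(layout):
    """Return layer names back to front, the order in which they would be composited."""
    return [box.name for box in sorted(layout, key=lambda box: box.z)]


layout = [
    LayerBox("title", 64, 40, 384, 96, z=2),
    LayerBox("background", 0, 0, 512, 512, z=0),
    LayerBox("photo", 96, 160, 320, 240, z=1),
]
order = paint_order(layout)
```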

Layers→Layers (L2L). MRT takes an existing layered design and produces a regenerated, edited, or restyled version—supporting layer addition, restylization, and multi-image fusion.

Layer addition. MRT seamlessly integrates user-supplied images or assets into existing designs, producing harmonized layouts with consistent lighting and typography.

Restylization. Given a layered design plus a style reference, MRT regenerates each foreground layer in the target style while keeping the original layout.

Generation quality of distilled models. For each design we compare the baseline (50 NFE) against DMD2-distilled variants at 16 NFE and 8 NFE. We achieve up to a 6× speedup with minimal loss in image quality or fidelity, enabling real-time multi-layer synthesis on a single GPU.
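The quoted speedup is consistent with simple step-count arithmetic. The sketch below assumes that every function evaluation costs the same and that fixed overheads (for example text encoding and decoding) are ignored, so it gives an idealized ratio rather than a measured one.

```python
def nfe_speedup(baseline_nfe, distilled_nfe):
    """Idealized speedup from reducing the number of function evaluations (NFE).

    Assumes constant cost per evaluation and no fixed overhead.
    """
    if baseline_nfe <= 0 or distilled_nfe <= 0:
        raise ValueError("NFE values must be positive")
    return baseline_nfe / distilled_nfe


speedup_16 = nfe_speedup(50, 16)  # 3.125
speedup_8 = nfe_speedup(50, 8)    # 6.25
```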

Efficiency vs. Qwen-Image-Layered

Regional diffusion avoids the K× token blow-up of processing every layer at full resolution, a cost that grows linearly with the number of layers K.
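A rough token count illustrates the difference. The sketch assumes each token covers a 16×16 pixel patch (a hypothetical value, not one stated on this page) and that a regional scheme only spends tokens on each layer's own bounding box.

```python
import math


def full_resolution_tokens(canvas_w, canvas_h, num_layers, patch=16):
    """Tokens when every layer is encoded at full canvas resolution."""
    per_layer = math.ceil(canvas_w / patch) * math.ceil(canvas_h / patch)
    return num_layers * per_layer


def regional_tokens(layer_sizes, patch=16):
    """Tokens when each layer only covers its own (width, height) bounding box."""
    return sum(math.ceil(w / patch) * math.ceil(h / patch) for w, h in layer_sizes)


# Illustrative 1024x1024 design: one full background and two small foreground layers.
sizes = [(1024, 1024), (256, 128), (100, 50)]
full = full_resolution_tokens(1024, 1024, num_layers=len(sizes))  # 12288
regional = regional_tokens(sizes)                                 # 4252
```

In this toy case the full-resolution count triples with three layers, while the regional count stays close to that of the background alone, since small layers contribute few tokens.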

108.5×
Peak speedup at ~20 layers vs. Qwen-Image-Layered

~5 s
1K image, ~20 layers, single H100 GPU

~3 s
Same workload on 4× H100 GPUs

10.5–23.6×
Peak GPU memory reduction (scales with layer count)

Inference efficiency comparison between MRT and Qwen-Image-Layered. (a) Latency scaling with the number of layers. MRT maintains near-constant latency (~5 s) while the latency of Qwen-Image-Layered grows linearly, yielding up to a 108.5× speedup at ~20 layers. (b) MRT inference time vs. token count on H200 and B200 GPUs, demonstrating linear scaling behavior. (c) Peak GPU memory consumption across varying layer configurations; the shaded region indicates the baseline memory allocated to model weights. MRT reduces memory consumption by 10.5× to 23.6×, with the reduction growing as the number of layers increases. All results are measured over 100 samples on a single GPU with identical layer counts.
