CVPR 2026 · Poster

MRT: Masked Region Transformer
for Layered Image Generation and Editing at Scale

A unified 20B-parameter diffusion framework for text-to-layers, image-to-layers, and layers-to-layers synthesis & editing.

Zhicong Tang*, Zhao Zhang*, Jingye Chen, Mohan Zhou, Yifan Pu, Yuchi Liu, Yalong Bai, Ethan Smith, Yuhui Yuan✉

Canva Research

*Equal contribution  ·  ✉ Corresponding author: ryanyuan@canva.com

MRT teaser figure showing four task capabilities
Overview of Masked Region Transformer capabilities. A single unified model supports four capabilities: (1) text-to-layers generation, (2) image-to-layers decomposition, (3) layer addition with new user-specified elements, and (4) layer restylization with user-provided assets. Capabilities (3) and (4) are both layers-to-layers edits.

Abstract

A scalable framework for region-aware, fully editable visual layer generation.

Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content—analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts.

To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks (text-to-layers, image-to-layers, and layers-to-layers) within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, producing complete editable layers that extend beyond the visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step real-time multi-layer generation with minimal quality degradation.

Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including several commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, in our user study the model significantly outperforms the concurrent Qwen-Image-Layered in image-to-layers quality, while running inference 10–100× faster and using 50–90% less GPU activation memory.

20B
Diffusion transformer built on Qwen-Image
10M+
Multilingual layered design samples
43M+
Unique transparent layers
3-in-1
T2L · I2L · L2L unified
8 steps
Distilled real-time inference
10–100×
Faster than Qwen-Image-Layered on I2L

Method

A masked region transformer unifies three layered-generation tasks through selective token masking.

MRT framework diagram
The MRT framework. Multi-layer transparent images are represented as a canvas layer, a semi-transparent background layer, and K foreground RGBA layers. A WAN-2.1-VAE encoder produces regional latents, and an anonymous region transformer with 20B parameters (built on Qwen-Image) performs full attention jointly across canvas, background, and foreground tokens. An adaptive masking mechanism selects whether each layer is initialized from clean latents or noise, allowing a single model to handle text→layers, image→layers, and layers→layers tasks.
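The adaptive masking described in this caption can be pictured as a per-layer choice between clean latents and noise. The sketch below is only an illustration under stated assumptions, not the released implementation: the function `init_layer_latents`, the slot names, the toy float lists standing in for latents, and in particular which slots stay clean for each task are all hypothetical.

```python
import random

def init_layer_latents(clean_latents, known_slots, seed=0):
    """Pick clean latents for known slots and Gaussian noise for the rest.

    Returns (latents, generate_mask); generate_mask[slot] is True when the
    slot starts from noise and therefore has to be denoised.
    """
    rng = random.Random(seed)
    latents, generate_mask = {}, {}
    for slot, clean in clean_latents.items():
        if slot in known_slots:
            latents[slot] = list(clean)  # conditioning input, kept as-is
            generate_mask[slot] = False
        else:
            latents[slot] = [rng.gauss(0.0, 1.0) for _ in clean]  # start from noise
            generate_mask[slot] = True
    return latents, generate_mask

# Toy latents: short float lists stand in for real latent tensors.
# Hypothetical slot layout: canvas, background, and K = 2 foreground layers.
layers = {
    "canvas": [0.1, 0.2],
    "background": [0.3, 0.4],
    "fg_0": [0.5, 0.6],
    "fg_1": [0.7, 0.8],
}

# ASSUMPTION: which slots stay clean for each task is a guess for illustration.
TASK_KNOWN_SLOTS = {
    "text_to_layers": set(),                     # nothing given, generate everything
    "image_to_layers": {"canvas"},               # composite given, layers generated
    "layers_to_layers": {"background", "fg_0"},  # keep some layers, regenerate others
}

for task, known in TASK_KNOWN_SLOTS.items():
    _, generate_mask = init_layer_latents(layers, known)
    to_generate = [slot for slot, flag in generate_mask.items() if flag]
    print(task, "->", to_generate)
```

The point of the sketch is that the three tasks differ only in the set of slots passed as `known_slots`; the denoising loop itself would be shared.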

Overflow-Aware Canvas Layer

Over 60% of professional designs contain layers that extend beyond the canvas—cropping them destroys reusability.

Overflow layer demo
Why overflow layers matter. Row 1 visualizes the canvas layer with a fully transparent background, exposing pixels that extend beyond the visible region. Rows 2–3 compare multi-layer generation without overflow support (baseline) and with overflow support (ours). Full-size overflow generation is essential for complete editability and reusability—layers that would otherwise be truncated at background boundaries remain fully editable in downstream design workflows.
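To make the truncation problem concrete, the following sketch measures how much of a layer lies outside the canvas. It is a simplified geometric illustration, not part of MRT: the helper names `visible_region` and `overflow_fraction`, the `(x0, y0, x1, y1)` box convention, and the example coordinates are assumptions.

```python
def visible_region(layer_box, canvas_w, canvas_h):
    """Intersect a layer box (x0, y0, x1, y1) with the canvas; None if disjoint."""
    x0, y0, x1, y1 = layer_box
    cx0, cy0 = max(x0, 0), max(y0, 0)
    cx1, cy1 = min(x1, canvas_w), min(y1, canvas_h)
    if cx0 >= cx1 or cy0 >= cy1:
        return None
    return (cx0, cy0, cx1, cy1)

def overflow_fraction(layer_box, canvas_w, canvas_h):
    """Fraction of the layer's area that lies outside the canvas."""
    x0, y0, x1, y1 = layer_box
    total_area = (x1 - x0) * (y1 - y0)
    visible = visible_region(layer_box, canvas_w, canvas_h)
    if visible is None:
        return 1.0
    vx0, vy0, vx1, vy1 = visible
    return 1.0 - ((vx1 - vx0) * (vy1 - vy0)) / total_area

# Hypothetical example: a 500 x 500 layer hanging 200 px off the left edge
# of a 1000 x 1000 canvas.
layer = (-200, 100, 300, 600)
print(visible_region(layer, 1000, 1000))               # (0, 100, 300, 600)
print(round(overflow_fraction(layer, 1000, 1000), 2))  # 0.4
```

In this example 40% of the layer sits outside the canvas. A pipeline that stores only the visible region would lose those pixels, so dragging the layer back into view would expose a missing strip, whereas a full-size layer stays complete.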

Results

A single MRT model handles all three layered generation & editing tasks.

Qualitative T2L results
Qualitative results on text-to-layers. MRT generates multi-layer transparent designs directly from text prompts, producing fully editable, semi-transparent layers with high-fidelity multilingual typography and varied aspect ratios.
T2L user study vs. ART
User study comparison with previous SOTA on text-to-layers. MRT significantly outperforms ART across overall preference, aesthetics, typography, and layout dimensions.
T2L visualization with prompt, layout, generated, and individual layers
Text-to-layers generation examples. For each row we visualize the input prompt, predicted layout, generated merged composition, and the individual transparent RGBA layers—showing strong alignment between textual descriptions and layered output.
T2L with overflow layer generation
Text-to-layers with overflow layer generation. MRT uniquely generates complete full-size RGBA layers that extend beyond the background boundary, preserving editability and reusability that previous methods sacrifice by truncating pixels at canvas boundaries.
T2L multilingual examples
Text-to-layers with multilingual support. Trained on over 10M multilingual designs, MRT renders visually grounded text in multiple languages, including Chinese, handling typography across different writing systems while maintaining design quality.
Rich T2L results beyond flat designs
Beyond flat designs. Our model generates richly textured images along with high-quality transparent layers, going well beyond simple poster-style flat designs.
T2L diverse layouts and layer decompositions for one prompt (1) T2L diverse layouts and layer decompositions for one prompt (2)
Diverse generations from a single prompt. Each panel takes one text prompt and shows three sampled layouts (A/B/C), the corresponding generated compositions, and the resulting individual transparent layers—demonstrating that MRT produces varied, well-composed multi-layer designs from the same text input.
I2L qualitative comparison with LayerD, Lovart, RoboNeo, Qwen-Image-Layered
Image→Layers (I2L) qualitative comparison. In each panel, the top-left image is the composed input, shown alongside its decomposed layers. MRT outperforms all baselines: Lovart shows poor decomposition quality, RoboNeo exhibits artifacts, and LayerD and Qwen-Image-Layered produce overly grouped layers.
I2L user study vs SOTA and commercial systems
User study on image-to-layers. Blind evaluations against the strongest open-source and commercial systems across (i) quality (semantic correctness + transparency), (ii) integrity (faithful reconstruction), and (iii) granularity (appropriate decomposition).
Detailed comparison with Qwen-Image-Layered on AI-generated images
Detailed comparison with Qwen-Image-Layered. Left: input image generated with Nano-Banana-Pro. Middle: Qwen-Image-Layered outputs with 4 layers (the officially recommended setting) and with the layer count matched to ours. Right: decomposition from MRT.
I2L generalization on real natural images
Generalization to out-of-domain natural images. Despite being trained exclusively on poster-style flat designs, MRT generalizes well to natural scenes and still outperforms Qwen-Image-Layered.
Attention map visualization for I2L
Attention-map visualization. Left: input composite image and its layout. Right: decomposed layers (top) with the corresponding attention maps overlaid on the input image (bottom). Red regions indicate high activation, showing that the model's attention is semantically selective—accurately aligning with text, foreground objects, and background patterns.
I2L decomposition with 16 layers. MRT scales gracefully to high layer counts (2–50 layers), producing coherent decompositions without architectural modifications.
Additional I2L comparison with Qwen-Image-Layered (1) Additional I2L comparison with Qwen-Image-Layered (2)
Additional comparisons with Qwen-Image-Layered. Two more sets of test designs showing consistent advantages of MRT in layer quality, integrity, and granularity.
I2L on Pinterest-style designs I2L on Qwen-Image-Layered OOD designs
More I2L results on out-of-distribution AI-generated images (Pinterest-style, left; Qwen-Image-Layered test set, right). MRT maintains high decomposition quality on designs that lie outside the training distribution.
I2L input vs layout (1) I2L input vs layout (2) I2L input vs layout (3)
Image→Layers: merged image vs. layout visualization. Three examples showing the correspondence between the input raster image and the extracted layer layout structure (bounding boxes + z-order) that guides decomposition.
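A layout of the kind described here (bounding boxes plus z-order) can be represented with a small record type. The sketch below is an assumed representation for illustration only: the `LayerBox` class, its field names, and the example coordinates are hypothetical and do not reflect the actual MRT data format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerBox:
    name: str
    x0: int
    y0: int
    x1: int
    y1: int
    z: int  # larger z is painted later, i.e. closer to the viewer

def paint_order(boxes):
    """Return boxes back to front, the order in which they would be composited."""
    return sorted(boxes, key=lambda b: b.z)

def topmost_at(boxes, x, y):
    """Name of the front-most box covering pixel (x, y), or None if uncovered."""
    hits = [b for b in boxes if b.x0 <= x < b.x1 and b.y0 <= y < b.y1]
    return max(hits, key=lambda b: b.z).name if hits else None

# Hypothetical three-layer layout on a 1000 x 1000 canvas.
layout = [
    LayerBox("title", 200, 50, 800, 200, z=2),
    LayerBox("background", 0, 0, 1000, 1000, z=0),
    LayerBox("photo", 100, 100, 600, 600, z=1),
]

print([b.name for b in paint_order(layout)])  # ['background', 'photo', 'title']
print(topmost_at(layout, 300, 150))           # title
print(topmost_at(layout, 300, 400))           # photo
```

Keeping the z-order explicit means overlapping boxes, such as the title and the photo above, can still be assigned to distinct layers.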
Layers-to-Layers results
Layers→Layers (L2L). MRT takes an existing layered design and produces a regenerated, edited, or restyled version—supporting layer addition, restylization, and multi-image fusion.
L2L layer addition
Layer addition. MRT seamlessly integrates user-supplied images or assets into existing designs, producing harmonized layouts with consistent lighting and typography.
L2L restylization
Restylization. Given a layered design plus a style reference, MRT regenerates each foreground layer in the target style while keeping the original layout.
Generation quality of distilled models. For each design we compare the baseline (50 NFE) against DMD2-distilled variants at 16 NFE and 8 NFE. We achieve up to 6× speedup without sacrificing image quality or fidelity, enabling real-time multi-layer synthesis on a single GPU.
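As a rough sanity check on the reported speedup, the ratio of function evaluations gives an upper bound when every evaluation costs the same. The sketch below is back-of-the-envelope arithmetic, not a measurement: the function `estimated_speedup` and the `fixed_overhead` term (standing in for costs such as encoding and decoding that do not shrink with fewer steps) are assumptions.

```python
def estimated_speedup(baseline_nfe, distilled_nfe, fixed_overhead=0.0):
    """Speedup if each function evaluation costs 1 unit plus a fixed overhead.

    fixed_overhead is expressed in units of one function evaluation.
    """
    if baseline_nfe <= 0 or distilled_nfe <= 0:
        raise ValueError("NFE must be positive")
    return (baseline_nfe + fixed_overhead) / (distilled_nfe + fixed_overhead)

# Pure step-count ratios for the two distilled variants against 50 NFE.
print(estimated_speedup(50, 16))  # 3.125
print(estimated_speedup(50, 8))   # 6.25

# With an illustrative overhead worth one evaluation, the gain is smaller.
print(estimated_speedup(50, 8, fixed_overhead=1.0) < 6.25)  # True
```

Under these assumptions the 8 NFE variant can be at most 6.25× faster than the 50 NFE baseline, and any fixed per-image cost pulls the realized figure below that bound.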

Efficiency vs. Qwen-Image-Layered

Regional diffusion avoids encoding all K layers at full resolution, a token blow-up that grows linearly with the number of layers.

108.5×
Peak speedup at ~20 layers vs. Qwen-Image-Layered
~5 s
1K image, ~20 layers, single H100 GPU
~3 s
Same workload on 4× H100 GPUs
10.5–23.6×
Peak GPU memory reduction (scales with layer count)
Inference efficiency comparison between MRT and Qwen-Image-Layered. (a) Latency scaling with the number of layers. MRT maintains near-constant latency (~5 s) while Qwen-Image-Layered's latency grows linearly, yielding up to 108.5× speedup at ~20 layers. (b) MRT inference time vs. token count on H200 and B200 GPUs, demonstrating linear scaling. (c) Peak GPU memory consumption across layer configurations; the shaded region indicates the baseline memory allocated to model weights. MRT reduces peak memory by 10.5× to 23.6×, with the reduction growing as the number of layers increases. All results are measured over 100 samples on a single GPU with identical layer counts.
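The token-count argument behind this comparison can be illustrated with simple arithmetic. The sketch below is a toy model under loud assumptions: one token per 16 × 16 pixel block, canvas and background kept at full size, and every foreground layer occupying a 256 × 128 box are all hypothetical choices, and the resulting ratio is not meant to reproduce the measured speedups.

```python
import math

def tokens(width_px, height_px, stride=16):
    """Tokens for a region, assuming one token per stride x stride pixel block."""
    return math.ceil(width_px / stride) * math.ceil(height_px / stride)

def full_resolution_tokens(canvas_w, canvas_h, num_layers, stride=16):
    """Every layer is encoded at the full canvas size."""
    return num_layers * tokens(canvas_w, canvas_h, stride)

def regional_tokens(canvas_w, canvas_h, fg_sizes, stride=16):
    """Canvas and background at full size; each foreground only over its own box."""
    shared = 2 * tokens(canvas_w, canvas_h, stride)
    return shared + sum(tokens(w, h, stride) for w, h in fg_sizes)

# Hypothetical workload: 1024 x 1024 canvas with 20 small foreground layers.
fg_sizes = [(256, 128)] * 20
full = full_resolution_tokens(1024, 1024, num_layers=20)
regional = regional_tokens(1024, 1024, fg_sizes)
print(full)      # 81920
print(regional)  # 10752
```

In this toy setting the full-resolution count grows by 4096 tokens per extra layer, while the regional count grows by only 128. Since self-attention cost rises faster than linearly with sequence length, the gap in compute is larger than the gap in token count.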

BibTeX

If you find this work useful, please consider citing us.

@inproceedings{tang2026mrt,
  title     = {MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale},
  author    = {Tang, Zhicong and Zhang, Zhao and Chen, Jingye and Zhou, Mohan and
               Pu, Yifan and Liu, Yuchi and Bai, Yalong and Smith, Ethan and Yuan, Yuhui},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}