A unified 20B-parameter diffusion framework for text-to-layers, image-to-layers, and layers-to-layers synthesis & editing.
Canva Research
*Equal contribution · ✉ Corresponding author: ryanyuan@canva.com
A scalable framework for region-aware, fully-editable visual layer generation.
Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content—analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts.
To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks (text-to-layers, image-to-layers, and layers-to-layers) within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, producing complete editable layers that extend beyond the visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step real-time multi-layer generation with minimal quality degradation.
Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, our model significantly outperforms the concurrent Qwen-Image-Layered in image-to-layers quality according to user-study results, while achieving 10–100× faster inference and using 50–90% less GPU activation memory.
A masked region transformer unifies three layered-generation tasks through selective token masking.
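As a rough illustration of how selective masking could express all three tasks in one model, the sketch below marks which token groups are denoised and which serve as clean conditioning. It is not the MRT implementation: the names `MaskPlan`, `plan_masks`, and `edit_layers` are hypothetical, and the assumption that layers-to-layers keeps untouched layers as clean context is ours, since the page does not specify how the masks are constructed.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class MaskPlan:
    """Hypothetical description of conditioning vs. denoising targets."""
    use_image_condition: bool   # True if a flattened input image conditions the model
    generate_layer: List[bool]  # generate_layer[i] is True if layer i is denoised


def plan_masks(task: str, num_layers: int,
               edit_layers: Optional[Sequence[int]] = None) -> MaskPlan:
    """Map a task name to a mask plan (assumed semantics, not from the paper)."""
    if num_layers < 1:
        raise ValueError("num_layers must be positive")
    if task == "text-to-layers":
        # Only the text prompt conditions the model; every layer is generated.
        return MaskPlan(False, [True] * num_layers)
    if task == "image-to-layers":
        # The input image stays clean; every layer is generated from it.
        return MaskPlan(True, [True] * num_layers)
    if task == "layers-to-layers":
        # Selected layers are regenerated; the others remain clean context.
        targets = set(edit_layers or [])
        if not targets:
            raise ValueError("layers-to-layers needs at least one layer to edit")
        if any(i < 0 or i >= num_layers for i in targets):
            raise ValueError("edit_layers index out of range")
        return MaskPlan(False, [i in targets for i in range(num_layers)])
    raise ValueError(f"unknown task: {task}")


plan = plan_masks("layers-to-layers", num_layers=4, edit_layers=[1, 3])
print(plan.generate_layer)
```

Under these assumptions, the three tasks differ only in which flags are set, which is the sense in which a single masked model could cover all of them.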
Over 60% of professional designs contain layers that extend beyond the canvas—cropping them destroys reusability.
A single MRT model handles all three layered generation & editing tasks.
Regional diffusion avoids the K× full-resolution token blow-up that grows linearly with the number of layers.
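The token-count argument can be made concrete with a small calculation. The sketch below compares encoding each of K layers at full canvas resolution against encoding each layer only within its own bounding region. The patch size of 16 pixels per token, the 1024×1024 canvas, and the layer box sizes are illustrative assumptions, not values reported for MRT.

```python
import math
from typing import Sequence, Tuple


def tokens(height: int, width: int, patch: int = 16) -> int:
    """Tokens for one region, assuming one token per patch x patch pixels."""
    return math.ceil(height / patch) * math.ceil(width / patch)


def full_resolution_tokens(canvas_hw: Tuple[int, int], num_layers: int,
                           patch: int = 16) -> int:
    """Every layer is encoded at full canvas size: cost is K times the canvas."""
    height, width = canvas_hw
    return num_layers * tokens(height, width, patch)


def regional_tokens(layer_boxes_hw: Sequence[Tuple[int, int]],
                    patch: int = 16) -> int:
    """Each layer is encoded only inside its own (height, width) region."""
    return sum(tokens(h, w, patch) for h, w in layer_boxes_hw)


canvas = (1024, 1024)
# Hypothetical design: one full-canvas background plus three small elements.
boxes = [(1024, 1024), (256, 512), (128, 128), (320, 320)]

print(full_resolution_tokens(canvas, num_layers=len(boxes)))  # 16384
print(regional_tokens(boxes))                                 # 5072
```

In this toy case the full-resolution count is four times the canvas cost, while the regional count is dominated by the background and grows only with the area of each added layer.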
If you find this work useful, please consider citing us.
@inproceedings{tang2026mrt,
title = {MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale},
author = {Tang, Zhicong and Chen, Jingye and Zhang, Zhao and Zhou, Mohan and
Liu, Yuchi and Pu, Yifan and Bai, Yalong and Smith, Ethan and Yuan, Yuhui},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}