From Inpainting to Layer Decomposition:
Repurposing Generative Inpainting Models for Image Layer Decomposition

1University of Maryland, 2Amazon

* This work was done while Jingxi Chen was an applied scientist intern on the Amazon Prime Video team.


Key Insight: Adapting Pre-trained Inpainting Models for Layer Decomposition


The core idea of our method is that image layer decomposition can be reformulated as a combination of inpainting and outpainting tasks. Rather than designing a model from scratch, we show that a single inpainting model can be efficiently fine-tuned to handle this task.
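Concretely, a single binary object mask already defines both generation problems: the background layer has a hole inside the mask that must be inpainted, while the foreground layer is known only inside the mask and must be outpainted (completed) elsewhere. The sketch below is a minimal illustration of this split using numpy; the function and key names are illustrative, not the paper's actual API.

```python
import numpy as np

def split_decomposition_targets(image, mask):
    """Given an RGB image (H, W, 3) and a binary object mask (H, W),
    derive the two generation tasks that together yield a layer
    decomposition:
      - background layer: pixels outside the mask are known; the hole
        inside the mask must be inpainted.
      - foreground layer: pixels inside the mask are known; occluded
        regions outside it must be outpainted (completed).
    """
    mask = mask.astype(bool)
    bg_known = image * (~mask)[..., None]   # visible background context
    fg_known = image * mask[..., None]      # visible foreground context
    return {"background_context": bg_known, "background_hole": mask,
            "foreground_context": fg_known, "foreground_hole": ~mask}

# toy example: 4x4 image, object occupies the top-left 2x2 block
img = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
m = np.zeros((4, 4)); m[:2, :2] = 1
t = split_decomposition_targets(img, m)
assert np.all(t["background_context"][:2, :2] == 0)  # hole to inpaint
assert np.all(t["foreground_context"][2:, :] == 0)   # region to outpaint
```

The point of the reformulation is that both targets are conditioned on complementary regions of the same image, which is exactly the conditioning pattern a pre-trained inpainting model already understands.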


Abstract

Images can be viewed as layered compositions: foreground objects over a background, with potential occlusions between them. This layered representation enables independent editing of individual elements, offering greater flexibility for content creation. Despite progress in large generative models, decomposing a single image into layers remains challenging due to the scarcity of dedicated methods and training data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition via lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modal context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.


Detailed Diagram of the Proposed Adaptation Method


Detailed diagram of the key components in our proposed adaptation method: Light blue boxes denote components from the original pre-trained inpainting DiT model, while orange boxes represent our modifications or additions. Our approach efficiently incorporates both Image-Mask Context and Multi-Modal Context tokens to guide generation. After adaptation, the model can simultaneously output an extracted and outpainted foreground along with a clean, object-removed background.
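The fusion module's linear attention complexity refers to the standard kernelized reformulation of attention, in which associativity lets the key-value summary be computed once instead of forming an N-by-N attention matrix. The sketch below only illustrates that generic trick in numpy; the choice of ELU+1 as the feature map and all names are assumptions for illustration, not the paper's actual module.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized attention: phi(Q) @ (phi(K)^T @ V), computed in
    O(N * d^2) instead of softmax attention's O(N^2 * d).
    Feature map phi(x) = elu(x) + 1, which keeps features positive."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                     # (d, d_v) summary of all key/value pairs
    z = Qp @ Kp.sum(axis=0) + eps     # (N,) per-query normalizer
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
out = linear_attention(Q, K, V)
assert out.shape == (N, d)
```

Because the `(d, d_v)` summary is independent of sequence length, fusing long multi-modal context token sequences stays linear in the number of tokens.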




Object Removal Task: Comparison with Baseline Methods


Object Removal Task: Comparison with Baseline Methods. We present examples comparing our method against baselines on our collected real-world image test set for the object removal task. These qualitative results highlight the visual differences in foreground removal accuracy, background reconstruction quality, and consistency across various challenging scenes. Please zoom in for the best viewing quality.




Foreground Extraction Task: Comparison with Baseline Methods


Foreground Extraction Task: Comparison with Baseline Methods. We present additional comparisons of foreground extraction using two matting methods, Matting-Anything and DiffMatte, both of which produce RGBA foreground layers.




Layer-based Image Editing Examples


Layer-based Image Editing Examples. Layer decomposition enables independent manipulation of individual image elements. Left: the original image; right: the edited image.




Training Data Curation from Public Sources


Training Data Curation from Public Sources. We propose a hybrid data strategy that combines two complementary foreground sources: 1) real foregrounds with rich detail but incomplete shapes, and 2) generated foregrounds with complete shapes but limited texture fidelity. By merging these two types, we construct a more balanced and effective training dataset.
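Given a curated RGBA foreground and a background image, a supervised training pair can be assembled by alpha compositing: the composite becomes the model input, and the two source layers serve as ground-truth decomposition targets. The minimal sketch below shows this construction; the function and key names are illustrative, not the paper's data pipeline.

```python
import numpy as np

def composite_training_pair(fg_rgba, bg_rgb):
    """Alpha-composite an RGBA foreground (H, W, 4, values in [0, 1])
    over an RGB background (H, W, 3). Returns the composite image
    (the model's input) plus the source layers (supervision targets)."""
    rgb, alpha = fg_rgba[..., :3], fg_rgba[..., 3:4]
    composite = alpha * rgb + (1.0 - alpha) * bg_rgb
    return composite, {"foreground": fg_rgba, "background": bg_rgb}

H, W = 2, 2
fg = np.zeros((H, W, 4)); fg[0, 0] = [1, 0, 0, 1]  # one opaque red pixel
bg = np.full((H, W, 3), 0.5)                       # gray background
comp, layers = composite_training_pair(fg, bg)
assert np.allclose(comp[0, 0], [1, 0, 0])          # foreground shows through
assert np.allclose(comp[1, 1], [0.5, 0.5, 0.5])    # background elsewhere
```

Because both layers are known exactly by construction, no manual annotation is needed: the model learns to invert the compositing operation.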


Acknowledgment

We acknowledge and appreciate the inspiration of prior work on generative image layer decomposition, especially "Generative Image Layer Decomposition with Visual Effects" by Jinrui Yang et al., which has significantly influenced our research direction and methodology. We also thank the open-source community for providing valuable resources and datasets that have facilitated our research.