Figure 1. Design Compositing with GIST: We introduce GIST (Grounded Identity-preserving Stylized composiTion), a compositing stage that transforms and harmonizes visual elements (images and SVGs) at their predicted layout positions rather than naively pasting them, while preserving their semantic identity. Note how the image is stylized while retaining the dining room scene, and how it feels cohesive with the surrounding elements. The SVGs are similarly transformed for visual harmony. This contrasts with existing methods, which focus on layout prediction (spatial arrangement) and typography (font, color, and placement of text) without transforming the content itself.
Graphic design creation involves harmoniously assembling multimodal components, such as images, text, logos, and other visual assets collected from diverse sources, into a visually appealing and cohesive design. Recent methods have largely focused on layout prediction or complementary element generation while retaining input elements exactly, implicitly assuming that the provided components are already stylistically harmonious. In practice, inputs often come from disparate sources and exhibit visual mismatch, making this assumption limiting. We argue that identity-preserving stylization and compositing of input elements is a critical missing ingredient for truly harmonized components-to-design pipelines. To this end, we propose GIST, a training-free, identity-preserving image compositor that sits between layout prediction and typography generation and can be plugged into any existing components-to-design or design-refining pipeline without modification. We demonstrate this by integrating GIST with two substantially different existing methods, LaDeCo [33] and Design-o-meter [15]. GIST yields significant improvements in visual harmony and aesthetic quality across both pipelines, as validated by LLaVA-OV and GPT-4V on aspect-wise ratings and pairwise preference over naive pasting.
Figure 2. Overview of the Components-to-Design pipeline with GIST. GIST sits between layout prediction and typography as a plug-and-play compositing stage. Given a foreground element and its predicted position, the background canvas is inverted via Flow Matched Euler Discrete Sampling to initialize the denoising latent. In parallel, Emu-2's frozen visual encoder produces identity tokens from the foreground, corrected via our CA-guided token injection before conditioning SDXL's UNet alongside the generative tokens from Emu-2's LLaMA decoder. This process repeats sequentially for each element; the final composed backing image is passed to typography prediction to yield the complete design.
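The per-element loop described in the caption above can be sketched as follows. This is a minimal structural sketch only: every function name here (`invert_background`, `encode_identity_tokens`, `ca_guided_injection`, `denoise`) is a hypothetical stand-in, not the authors' actual API, and the model calls (SDXL's UNet, Emu-2's encoder and decoder, flow-matching inversion) are stubbed with toy data so the control flow is visible.

```python
# Hypothetical sketch of GIST's sequential compositing loop.
# All names are illustrative stand-ins; a real implementation would
# call SDXL, Emu-2, and a flow-matching inversion scheduler here.

def invert_background(canvas):
    """Stub: invert the current canvas into an initial denoising latent."""
    return {"latent": list(canvas)}

def encode_identity_tokens(element):
    """Stub: identity tokens from Emu-2's frozen visual encoder."""
    return {"id": element["name"]}

def ca_guided_injection(tokens):
    """Stub: cross-attention-guided correction of the identity tokens."""
    tokens["corrected"] = True
    return tokens

def denoise(latent, tokens, position):
    """Stub: conditioned denoising that composites one element onto the latent."""
    return latent + [(tokens["id"], position)]

def compose(canvas, elements):
    """Composite each element at its predicted position, one pass at a time."""
    composited = list(canvas)
    for el in elements:  # sequential: each pass conditions on the updated canvas
        latent = invert_background(composited)["latent"]
        tokens = ca_guided_injection(encode_identity_tokens(el))
        composited = denoise(latent, tokens, el["pos"])
    return composited  # final backing image, handed to typography prediction
```

The key structural point the sketch captures is that elements are composited sequentially, so each inversion starts from the canvas already containing previously harmonized elements rather than the raw background.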
Figure 4. Qualitative comparison on Crello test samples. Columns: Input components, LaDeCo (paste-only), Ours (LaDeCo w/ GIST), and Ground Truth. In paste-only results, elements appear as flat cut-outs with visible edge discontinuities and color mismatches. Our method harmonizes lighting, color palette, and texture across elements, producing compositions that are visually cohesive and closer to GT.
Table. Aspect-wise design ratings: Content Relevance, Graphics & Imagery, Innovation & Originality, and the overall mean across all five aspects (including Design & Layout and Typography & Color). Higher is better.
| Method | Content ↑ | Graphics ↑ | Innovation ↑ | Mean ↑ |
|---|---|---|---|---|
| FlexDM | 5.29 | 5.09 | 4.54 | 5.13 |
| GPT-4o | 6.49 | 6.27 | 5.69 | 6.32 |
| LaDeCo | 7.96 | 7.74 | 6.93 | 7.77 |
| Ours (w/ LaDeCo) | 7.89 | 7.83 | 7.05 | 7.79 |
| GT | 8.17 | 7.93 | 7.15 | 7.95 |
Author's note
If you like this work and are interested in collaborating, please reach out to me at abhinavm@andrew.cmu.edu. I am doing my Master's in Computer Vision at the Robotics Institute (CMU). My primary interests are generative models and 3D vision, though I have also published in NLP, recommender systems, and robotics.
I am also actively applying for PhD programs; if you know of any opportunities that might be a fit, please let me know!