2026-05-25 · vision · multimodal · code

Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li, Igor Gilitschenski

Key claim

Improved identity preservation in subject-driven image generation.

This paper presents a new method for subject-driven image generation that effectively preserves identity while following textual instructions. By conditioning diffusion models on Multimodal Large Language Models and incorporating a VAE-based identity conditioning, the approach mitigates common issues like copy-paste artifacts. The key result shows significant improvement in human preference for generated images.

Novelty
8.0/10

The paper conditions diffusion models on Multimodal Large Language Models and adds VAE-based identity conditioning to improve identity preservation in subject-driven image generation.

Reliability
7.5/10

The claims are supported by extensive experiments demonstrating superior performance, although specific baseline comparisons could be more robust.

Deep reliability assessment

The methodology supports improved identity preservation and multimodal understanding through the proposed Dual Layer Aggregation (DLA) and multi-stage denoising strategies, but claims of superior performance may overstate the generalizability across all subject-driven generation tasks.

Reproducibility

Yes, the paper mentions the use of a public dataset (UNO-1M) for training and provides a project website for further details.

Discussion questions

  1. What assumptions underlie the effectiveness of the DLA module in all multimodal contexts?
  2. How can builders leverage the findings to enhance existing subject-driven generation models?
  3. What experimental conditions would lead to a failure of the proposed identity preservation claims?

Key figure

Figure 1 illustrates the benefits of leveraging MLLMs for subject-driven generation, highlighting the reduction of copy-paste issues and improved multimodal understanding.

Benchmark results

DreamBench DINO-I: 0.7482 (+0.01 vs UNO, SOTA)
DreamBench CLIP-I: 0.8443 (+0.01 vs UNO, SOTA)
DreamBench CLIP-T: 0.301 (+0.00 vs UNO, SOTA)
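
For readers unfamiliar with these metrics: as commonly defined in subject-driven generation benchmarks, DINO-I and CLIP-I are average cosine similarities between image embeddings of generated and reference images (from DINO and CLIP encoders respectively), and CLIP-T is the cosine similarity between the prompt's text embedding and the generated image's embedding. The sketch below shows only that scoring arithmetic, under the assumption that the paper follows this standard protocol. The embeddings are toy vectors, and the function names `cosine_similarity` and `mean_pairwise_similarity` are hypothetical, not taken from the paper's code.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(
    generated: Sequence[Vector], references: Sequence[Vector]
) -> float:
    """Average cosine similarity over every (generated, reference) pair."""
    if not generated or not references:
        raise ValueError("need at least one embedding on each side")
    scores = [cosine_similarity(g, r) for g in generated for r in references]
    return sum(scores) / len(scores)


# Toy 2-D stand-ins for encoder outputs (assumption: real embeddings would
# come from a DINO or CLIP image encoder and have hundreds of dimensions).
generated_embeddings = [[1.0, 0.0], [0.0, 1.0]]
reference_embeddings = [[1.0, 0.0]]

image_score = mean_pairwise_similarity(generated_embeddings, reference_embeddings)
print(f"toy image-similarity score: {image_score:.4f}")
```

On this scale, a gap of +0.01 between two methods is a small difference in average cosine similarity, which is worth keeping in mind when reading the comparisons against UNO.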
Code link
zsh2000.github.io/squeeze-mllm-subject-gen (Official)