Guanning Zeng¹*, Xiang Zhang², Zirui Wang³, Haiyang Xu²,
Zeyuan Chen², Bingnan Li², Zhuowen Tu²
¹Tsinghua University, ²UC San Diego, ³UC Berkeley

Abstract

We propose YOLO-Count, a differentiable open-vocabulary object counting model that tackles both general counting challenges and enables precise quantity control for text-to-image (T2I) generation. A core contribution is the ‘cardinality’ map, a novel regression target that accounts for variations in object size and spatial distribution. Leveraging representation alignment and a hybrid strong-weak supervision scheme, YOLO-Count bridges the gap between open-vocabulary counting and T2I generation control. Its fully differentiable architecture facilitates gradient-based optimization, enabling accurate object count estimation and fine-grained guidance for generative models. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art counting accuracy while providing robust and effective quantity control for T2I systems.

Figure 1: Demonstrations of YOLO-Count’s object quantity controllability for text-to-image generation. Incorporating YOLO-Count as a differentiable guidance module over a strong baseline (SDXL [41]) substantially improves alignment between text-specified object quantities and the generated images.

* Work done during internship at UC San Diego.

1 Introduction

Text-to-image (T2I) generative models have achieved remarkable success in producing high-fidelity images from natural language descriptions. However, ensuring precise alignment with textual specifications, particularly regarding object quantity, remains a significant challenge. While prior research has improved adherence to object layout, attributes, and style through conditional training and guidance mechanisms, accurately controlling the number of objects synthesized within an image remains difficult. Unlike localized attributes, object quantity constitutes a global constraint, requiring models to establish numerical correspondence between language tokens and compositional objects. Consequently, conventional conditional training approaches such as ControlNet [52] are ill-suited for explicit quantity control. Moreover, the stochastic nature of the denoising process in T2I models introduces ambiguity in object differentiation, further complicating count consistency. Recent conditional guidance methods, such as BoxDiff [46] and Ranni [12], address aspects of spatial layout, object attributes, and semantic panel conditioning. However, these methods lack a direct and principled mechanism for precise quantity control, leaving a critical gap in bridging linguistic numeracy and visual synthesis.

In this work, we propose YOLO-Count, an open-vocabulary object counting model built on the YOLO architecture. YOLO-Count is a fully differentiable, regression-based model that demonstrates high accuracy, computational efficiency, and open-vocabulary capabilities. A key contribution is the introduction of the cardinality map, a novel representation that encodes object quantity while preserving awareness of object size and spatial location. Unlike traditional density maps, which apply Gaussian kernels at object centers, the cardinality map distributes quantity scores across object instances, improving accuracy and robustness to scale variation. Furthermore, YOLO-Count leverages representation alignment and a hybrid strong-weak supervision strategy, enabling the use of large-scale instance segmentation datasets without reliance on computationally expensive pre-trained visual encoders.
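The contrast with density maps can be illustrated with a minimal sketch. It assumes each instance is given as a set of pixel coordinates and that its single unit of count is spread uniformly over those pixels; the exact construction is not given in this section, and `cardinality_map` and `total_count` are hypothetical names used only for illustration.

```python
def cardinality_map(height, width, instance_masks):
    """Spread one unit of count uniformly over each instance's pixels.

    Assumption: uniform spreading over the mask; the paper's exact
    construction may differ.
    """
    cmap = [[0.0] * width for _ in range(height)]
    for mask in instance_masks:  # mask: set of (row, col) pixels
        if not mask:
            continue
        share = 1.0 / len(mask)  # every instance contributes exactly 1 in total
        for r, c in mask:
            cmap[r][c] += share
    return cmap


def total_count(cmap):
    """The predicted quantity is the sum over the whole map."""
    return sum(sum(row) for row in cmap)


# One tiny object (1 pixel) and one large object (4x4 pixels).
small = {(0, 0)}
large = {(r, c) for r in range(2, 6) for c in range(2, 6)}
cmap = cardinality_map(8, 8, [small, large])
count = total_count(cmap)
```

Under this assumption, both objects contribute exactly one unit to the sum even though they differ in area by a factor of 16, which is the size-awareness property described above.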

Beyond generic object counting, we are motivated to apply YOLO-Count for precise control of object quantities in text-to-image (T2I) generation. This is achieved by employing YOLO-Count as a differentiable guidance module [5], where gradient signals from the counting model steer the generative process toward numerical consistency. While prior research has predominantly focused on guidance algorithms for attributes and layout, explicit quantity control remains underexplored. We argue that an ideal object counting model for T2I applications should possess four key properties: (1) full differentiability w.r.t. the input image; (2) open-vocabulary capability for diverse object categories; (3) cross-scale generalization to varying object sizes; (4) computational efficiency for practical deployment.
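To make the role of property (1) concrete, the toy sketch below shows gradient-based steering toward a target count. It is not the guidance procedure of this work: `soft_count` is a hypothetical stand-in for a differentiable counter (a sum of sigmoids over a small latent vector), and the gradient is derived by hand rather than backpropagated through a real counting model and generator.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_count(latent):
    """Hypothetical differentiable counter: each entry contributes in (0, 1)."""
    return sum(sigmoid(z) for z in latent)


def guidance_step(latent, target, step_size):
    """One gradient-descent step on the loss (soft_count(latent) - target)^2."""
    err = soft_count(latent) - target
    # d/dz_i (count - target)^2 = 2 * err * s(z_i) * (1 - s(z_i))
    return [
        z - step_size * 2.0 * err * sigmoid(z) * (1.0 - sigmoid(z))
        for z in latent
    ]


latent = [0.0] * 6      # initial soft count is 6 * 0.5 = 3.0
initial = soft_count(latent)
target = 5.0
for _ in range(200):
    latent = guidance_step(latent, target, step_size=0.5)
final = soft_count(latent)
```

Because the counter is differentiable with respect to its input, the count error yields a usable update direction at every step; a counter that returns a hard integer would provide no such signal.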

Constructing such a model introduces several challenges. First, state-of-the-art counting approaches [35, 2] are often detection-based, producing outputs that preclude gradient propagation. Second, existing counting datasets such as FSC147 [37] or CARPK [18] are limited in scale and category diversity, hindering open-vocabulary generalization. Third, while large-scale vision encoders (e.g., CLIP [36] or GroundingDINO [32, 38]) can alleviate data limitations, they impose significant computational overhead.

To address these issues, we integrate YOLO-Count with textual inversion [13, 50] to achieve precise quantity control in T2I generation. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art accuracy on counting benchmarks, outperforms density-based and detection-based counting models, and substantially improves object quantity controllability in T2I generation.

Our contributions are summarized as follows:

• We introduce the cardinality map, a novel regression target that improves object counting accuracy compared to density maps.

• We develop YOLO-Count, an efficient, open-vocabulary, and fully differentiable counting model that achieves state-of-the-art performance and enhances quantity control for T2I generation.

• We propose hybrid strong-weak supervision with representation alignment, enabling effective training using large-scale segmentation datasets without reliance on heavy visual encoders.

2 Related Works

2.1 Object Counting Models and Datasets

Object counting models can be broadly classified according to their category scope into fixed-category counting models [40, 44, 14] and open-vocabulary counting models [2, 35, 11]. For controlling object quantities in generative tasks, open-vocabulary counting is essential, as it supports arbitrary object categories without retraining. Based on the type of supervision or guidance, counting models can be further divided into text-guided models [47, 58], visual-exemplar-guided models [37, 20], multimodal-guided models [2], and reference-less models [17, 31, 45]. For T2I integration, a purely text-guided counting model is preferable to ensure compatibility with prompt-driven generation.

From a methodological perspective, counting models are typically divided into detection-based and regression-based approaches. Detection-based models [18, 2, 34] rely on explicit object detection, filtering instances via thresholds and enumerating discrete counts, which inherently produce non-differentiable integer outputs. In contrast, regression-based models [3, 4, 27] predict continuous-valued maps such as density maps [10, 33] that represent pixel-wise contributions to the final count. This direct differentiability makes regression-based models particularly suitable for gradient-based control in generative pipelines.
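The difference in differentiability can be checked numerically. The sketch below is illustrative only: `detection_count` and `regression_count` are hypothetical simplifications of the two families (a confidence threshold followed by enumeration, versus a sum over continuous per-cell values), and the gradient is approximated by a finite difference.

```python
def detection_count(scores, threshold=0.5):
    """Detection-style counting: threshold confidences, then enumerate."""
    return sum(1 for s in scores if s > threshold)


def regression_count(cell_values):
    """Regression-style counting: sum continuous per-cell contributions."""
    return sum(cell_values)


def finite_difference(fn, values, index, eps=1e-4):
    """Approximate d fn / d values[index] with a forward difference."""
    bumped = list(values)
    bumped[index] += eps
    return (fn(bumped) - fn(values)) / eps


scores = [0.9, 0.8, 0.2]
grad_detection = finite_difference(detection_count, scores, 0)
grad_regression = finite_difference(regression_count, scores, 0)
```

Away from the threshold, a small change in a score leaves the integer count unchanged, so the detection-style gradient is zero, while the summed output responds to every input value.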

Finally, training datasets for object counting are categorized into fixed-category datasets [18, 43, 21] and open-vocabulary datasets [37, 1]. Open-vocabulary datasets provide images containing diverse object categories and instance counts, but are expensive to collect and annotate [37]. For example, the widely used FSC147 dataset includes only 3,659 training images, which limits scale and diversity. To address this, recent works [22, 2] incorporate large-scale pre-trained visual backbones (e.g., CLIP [36] and GroundingDINO [32]) and fine-tune them on smaller counting datasets to enhance open-vocabulary generalization.

Figure 2: YOLO-Count Model Overview. YOLO-Count comprises a YOLO backbone, CLIP text encoder, vision-language path aggregation network (VLPAN), cardinality regression head and classification head. Built upon the YOLO-World [9] architecture, the cardinality head predicts a cardinality map. The final object quantity is obtained by summing over the cardinality map.

2.2 Controllable Text-to-Image Generation

Controllable text-to-image (T2I) generation methods can be broadly categorized into two paradigms: training-based methods [52, 54, 19] and guidance-based methods [53, 5, 49]. Training-based approaches, such as ControlNet [52], IP-Adapter [48], and GLIGEN [28], inject conditional inputs directly into the generative model through additional network branches or adapters. While effective, these methods rely on large-scale training datasets annotated with the corresponding conditions. In contrast, guidance-based approaches, including BoxDiff [46], Attend-and-Excite [8], and Separate-and-Enhance [6], control generation by manipulating the diffusion process at inference time, eliminating the need for retraining. Many of these methods exploit the interpretability of cross-attention mechanisms [30] to steer image synthesis. However, cross-attention is primarily effective for distinguishing object categories rather than differentiating multiple instances of the same category. As a result, existing controllable T2I techniques excel at localized attribute binding [15, 55] and layout control [57, 56], but struggle with enforcing global constraints such as precise object quantity.

2.3 Object Quantity Control for T2I Models

Research on explicit object quantity control in text-to-image (T2I) models remains limited. [25] pioneers the use of universal diffusion guidance for quantity control, representing the first attempt to directly address this challenge. [7] introduces an attention-based representation for counting objects, but the approach is constrained to controlling small quantities (ranging from 1 to 10). More recently, prompt-tuning approaches [50, 42] have been proposed to incorporate numerical cues into the text embedding space, enabling limited quantity control without modifying the underlying diffusion model. However, these methods still struggle with accurate control over larger counts.

3 Methods

3.1 Model Overview

Our proposed YOLO-Count builds upon the YOLO-World architecture [9] and consists of three primary components: (1) a vision backbone, (2) a vision-language path aggregation network (VLPAN), and (3) prediction heads. Fig. 2 illustrates the overall pipeline and highlights our key architectural modifications.

Vision Backbone.

The vision backbone in YOLO-Count follows the design of YOLOv8l [23] and YOLO-World-L [9]. It comprises five stages of convolutional modules (ConvModules) and cross-stage partial layers (CSPLayers). Given an input image $I\in\mathbb{R}^{640\times 640\times 3}$, the backbone extracts multiscale visual features at three resolutions:

$$f^{0}=[f_{80\times 80},f_{40\times 40},f_{20\times 20}]=\mathrm{VisualBackbone}(I)$$
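The three resolutions follow from simple stride arithmetic. The sketch below assumes downsampling strides of 8, 16, and 32, which are consistent with the stated $80\times 80$, $40\times 40$, and $20\times 20$ maps for a $640\times 640$ input but are not stated explicitly in this section; `feature_map_sizes` is a hypothetical helper.

```python
def feature_map_sizes(input_size=640, strides=(8, 16, 32)):
    """Spatial size of each feature level for a square input.

    Assumption: strides of 8, 16, and 32, chosen to match the
    80/40/20 resolutions given in the text.
    """
    return [(input_size // s, input_size // s) for s in strides]


sizes = feature_map_sizes()
```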

Vision-Language Path Aggregation Network (VLPAN).

The VLPAN is designed to fuse visual features with textual semantics and aggregate information across scales. Inheriting from YOLO-World, it employs both top-down and bottom-up pathways, but with key enhancements: (1) T-CSPLayers: standard CSPLayers are replaced by T-CSPLayers, which integrate sigmoid attention blocks to modulate visual features based on precomputed CLIP text embeddings [36]. (2) Extended Top-Down Fusion: to better preserve fine-grained spatial details, an additional top-down pathway is introduced following the initial bidirectional aggregation, maximizing high-resolution feature utilization, which is critical for accurate counting regression. The enhanced VLPAN is formulated as:

$$[f^{1},f^{2}]=\mathrm{VLPAN}(f^{0},f_{\mathrm{T}})$$

where $f_{\mathrm{T}}$ denotes the CLIP text embedding of the category, and $f^{1}$ and $f^{2}$ represent the multimodal features for classification and counting regression, respectively.
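The text-conditioned modulation inside a T-CSPLayer can be pictured with a simplified sketch. It assumes a single text embedding and a gate computed as the sigmoid of the dot product between each spatial feature vector and that embedding; the actual block's projections, normalization, and multi-scale handling are not described in this section, and `text_modulate` is a hypothetical name.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def text_modulate(features, text_embedding):
    """Gate each spatial feature vector by its similarity to the text.

    Assumption: gate = sigmoid(<feature, text_embedding>), with no
    learned projections.
    """
    gated, gates = [], []
    for vec in features:
        score = sum(v * t for v, t in zip(vec, text_embedding))
        g = sigmoid(score)
        gates.append(g)
        gated.append([g * v for v in vec])
    return gated, gates


text = [1.0, 0.0]                       # toy stand-in for a CLIP text embedding
features = [[4.0, 1.0], [-4.0, 1.0]]    # two spatial locations
gated, gates = text_modulate(features, text)
```

In this toy setting, the location whose feature aligns with the text embedding keeps most of its magnitude, while the misaligned location is suppressed, which is the intended effect of conditioning visual features on the queried category.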

Prediction Heads.

Following the VLPAN, several ConvModules are applied to text-aware visual features to aggregate multi-scale signals into a unified $80\times 80$ resolution. The prediction stage then produces two parallel outputs: (1) a cardinality regression head, which predicts a dense cardinality map for differentiable counting, and (2) a classification head, trained with contrastive supervision to ensure robust open-vocabulary capability. These two outputs jointly enable YOLO-Count to provide accurate, differentiable count estimates while maintaining strong category generalization, as shown on the right side of Fig. 2.
