Guanning Zeng1*, Xiang Zhang2, Zirui Wang3, Haiyang Xu2,
Zeyuan Chen2, Bingnan Li2, Zhuowen Tu2
1Tsinghua University, 2UC San Diego, 3UC Berkeley
Abstract
We propose YOLO-Count, a differentiable open-vocabulary object counting model that tackles both general counting challenges and enables precise quantity control for text-to-image (T2I) generation. A core contribution is the ‘cardinality’ map, a novel regression target that accounts for variations in object size and spatial distribution. Leveraging representation alignment and a hybrid strong-weak supervision scheme, YOLO-Count bridges the gap between open-vocabulary counting and T2I generation control. Its fully differentiable architecture facilitates gradient-based optimization, enabling accurate object count estimation and fine-grained guidance for generative models. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art counting accuracy while providing robust and effective quantity control for T2I systems.

Figure 1: Demonstrations of YOLO-Count’s object quantity controllability for text-to-image generation. Incorporating YOLO-Count as a differentiable guidance module over a strong baseline (SDXL [41]) substantially improves alignment between text-specified object quantities and the generated images.
* Work done during internship at UC San Diego.

1 Introduction

Text-to-image (T2I) generative models have achieved remarkable success in producing high-fidelity images from natural language descriptions. However, ensuring precise alignment with textual specifications, particularly regarding object quantity, remains a significant challenge. While prior research has improved adherence to object layout, attributes, and style through conditional training and guidance mechanisms, accurately controlling the number of objects synthesized within an image remains difficult. Unlike localized attributes, object quantity constitutes a global constraint, requiring models to establish numerical correspondence between language tokens and compositional objects. Consequently, conventional conditional training approaches such as ControlNet [52] are ill-suited for explicit quantity control. Moreover, the stochastic nature of the denoising process in T2I models introduces ambiguity in object differentiation, further complicating count consistency. Recent conditional guidance methods, such as BoxDiff [46] and Ranni [12], address aspects of spatial layout, object attributes, and semantic panel conditioning. However, these methods lack a direct and principled mechanism for precise quantity control, leaving a critical gap in bridging linguistic numeracy and visual synthesis.

In this work, we propose YOLO-Count, an open-vocabulary object counting model built on the YOLO architecture. YOLO-Count is a fully differentiable, regression-based model that demonstrates high accuracy, computational efficiency, and open-vocabulary capabilities. A key contribution is the introduction of the cardinality map, a novel representation that encodes object quantity while preserving awareness of object size and spatial location. Unlike traditional density maps, which apply Gaussian kernels at object centers, the cardinality map distributes quantity scores across object instances, improving accuracy and robustness to scale variation. Furthermore, YOLO-Count leverages representation alignment and a hybrid strong-weak supervision strategy, enabling the use of large-scale instance segmentation datasets without reliance on computationally expensive pre-trained visual encoders.
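To make the contrast between the two regression targets concrete, the following is a minimal sketch rather than the authors' implementation. It assumes one simple form of the cardinality map, in which each instance spreads a total score of 1 uniformly over its segmentation mask; the exact definition is not given in this section. The helper names, the grid size, and the two toy objects are hypothetical.

```python
import numpy as np

def density_map(centers, shape, sigma=2.0):
    """Classic density target: one unit-mass Gaussian per object center."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    out = np.zeros(shape, dtype=np.float64)
    for cy, cx in centers:
        g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
        out += g / g.sum()  # normalise so each object contributes exactly 1
    return out

def cardinality_map(masks):
    """ASSUMED form: each instance spreads a total score of 1 uniformly over its mask."""
    out = np.zeros(masks[0].shape, dtype=np.float64)
    for m in masks:
        out += m.astype(np.float64) / m.sum()
    return out

# Two hypothetical objects of very different size on a 32x32 grid.
small = np.zeros((32, 32), dtype=bool)
small[4:8, 4:8] = True
large = np.zeros((32, 32), dtype=bool)
large[12:30, 10:30] = True
masks = [small, large]
centers = [(5.5, 5.5), (20.5, 19.5)]

dens = density_map(centers, (32, 32))
card = cardinality_map(masks)

print("density sum:", dens.sum())
print("cardinality sum:", card.sum())
```

Both maps sum to the object count, but under this assumption the cardinality scores are supported only on the instance masks, so a large object's contribution follows its actual extent instead of a fixed-width kernel around its center.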

Beyond generic object counting, we are motivated to apply YOLO-Count for precise control of object quantities in text-to-image (T2I) generation. This is achieved by employing YOLO-Count as a differentiable guidance module [5], where gradient signals from the counting model steer the generative process toward numerical consistency. While prior research has predominantly focused on guidance algorithms for attributes and layout, explicit quantity control remains underexplored. We argue that an ideal object counting model for T2I applications should possess four key properties: (1) full differentiability w.r.t. the input image; (2) open-vocabulary capability for diverse object categories; (3) cross-scale generalization to varying object sizes; (4) computational efficiency for practical deployment.
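The idea of steering generation with gradients from a counter can be illustrated with a toy example. This sketch does not involve a diffusion model or YOLO-Count itself: the "counter" is a hypothetical sum of sigmoids over a vector of logits, and the optimised variable `x` merely stands in for whatever quantity the guidance procedure updates. The step size and iteration count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_count(x):
    """Toy differentiable counter: each entry of x is a logit for 'one object here'."""
    return sigmoid(x).sum()

def guidance_step(x, target, lr=0.2):
    """One gradient-descent step on L(x) = (soft_count(x) - target)^2."""
    s = sigmoid(x)
    grad = 2.0 * (s.sum() - target) * s * (1.0 - s)  # chain rule through the sigmoid
    return x - lr * grad

rng = np.random.default_rng(0)
x = rng.normal(size=16)   # hypothetical stand-in for the optimised quantity
target = 5.0              # requested object count
start = soft_count(x)
for _ in range(200):
    x = guidance_step(x, target)
final = soft_count(x)

print(f"count before: {start:.3f}, after: {final:.3f}, target: {target}")
```

The loop only works because `soft_count` has a usable gradient everywhere, which is the reason property (1) in the list above matters for guidance.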

Constructing such a model introduces several challenges. First, state-of-the-art counting approaches [35, 2] are often detection-based, producing outputs that preclude gradient propagation. Second, existing counting datasets such as FSC147 [37] or CARPK [18] are limited in scale and category diversity, hindering open-vocabulary generalization. Third, while large-scale vision encoders (e.g., CLIP [36] or GroundingDINO [32, 38]) can alleviate data limitations, they impose significant computational overhead.

To address these issues, we integrate YOLO-Count with textual inversion [13, 50] to achieve precise quantity control in T2I generation. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art accuracy on counting benchmarks, outperforms density-based and detection-based counting models, and substantially improves object quantity controllability in T2I generation.

Our contributions are summarized as follows:

  • We introduce the cardinality map, a novel regression target that improves object counting accuracy compared to density maps.

  • We develop YOLO-Count, an efficient, open-vocabulary, and fully differentiable counting model that achieves state-of-the-art performance and enhances quantity control for T2I generation.

  • We propose hybrid strong-weak supervision with representation alignment, enabling effective training using large-scale segmentation datasets without reliance on heavy visual encoders.

2 Related Works

2.1 Object Counting Models and Datasets

Object counting models can be broadly classified according to their category scope into fixed-category counting models [40, 44, 14] and open-vocabulary counting models [2, 35, 11]. For controlling object quantities in generative tasks, open-vocabulary counting is essential, as it supports arbitrary object categories without retraining. Based on the type of supervision or guidance, counting models can be further divided into text-guided models [47, 58], visual-exemplar-guided models [37, 20], multimodal-guided models [2], and reference-less models [17, 31, 45]. For T2I integration, a purely text-guided counting model is preferable to ensure compatibility with prompt-driven generation.

From a methodological perspective, counting models are typically divided into detection-based and regression-based approaches. Detection-based models [18, 2, 34] rely on explicit object detection, filtering instances via thresholds and enumerating discrete counts, which inherently produce non-differentiable integer outputs. In contrast, regression-based models [3, 4, 27] predict continuous-valued maps such as density maps [10, 33] that represent pixel-wise contributions to the final count. This direct differentiability makes regression-based models particularly suitable for gradient-based control in generative pipelines.
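The difference in differentiability can be checked numerically. The sketch below is illustrative only: it uses hypothetical confidence scores and a hypothetical four-element count map, and it estimates gradients with central finite differences instead of an autodiff framework.

```python
import numpy as np

def detection_count(scores, threshold=0.5):
    """Detection-style counting: keep instances above a threshold, then enumerate."""
    return int((scores > threshold).sum())

def regression_count(count_map):
    """Regression-style counting: sum a continuous-valued map."""
    return float(count_map.sum())

def finite_diff(fn, x, index, eps=1e-4):
    """Central finite difference of fn with respect to x[index]."""
    hi, lo = x.copy(), x.copy()
    hi[index] += eps
    lo[index] -= eps
    return (fn(hi) - fn(lo)) / (2.0 * eps)

scores = np.array([0.9, 0.8, 0.3, 0.1])     # hypothetical detection confidences
count_map = np.array([0.9, 0.8, 0.3, 0.1])  # hypothetical per-pixel contributions

g_det = finite_diff(detection_count, scores, index=0)
g_reg = finite_diff(regression_count, count_map, index=0)

print("detection count:", detection_count(scores), "gradient:", g_det)
print("regression count:", regression_count(count_map), "gradient:", g_reg)
```

Away from the threshold, small changes to a score leave the thresholded count unchanged, so its gradient is zero and carries no guidance signal, while the summed map responds linearly to every entry.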

Finally, training datasets for object counting are categorized into fixed-category datasets [18, 43, 21] and open-vocabulary datasets [37, 1]. Open-vocabulary datasets provide images containing diverse object categories and instance counts, but are expensive to collect and annotate [37]. For example, the widely used FSC147 dataset includes only 3,659 training images, which limits scale and diversity. To address this, recent works [22, 2] incorporate large-scale pre-trained visual backbones (e.g., CLIP [36] and GroundingDINO [32]) and fine-tune them on smaller counting datasets to enhance open-vocabulary generalization.

Figure 2: YOLO-Count Model Overview. YOLO-Count comprises a YOLO backbone, a CLIP text encoder, a vision-language path aggregation network (VLPAN), a cardinality regression head, and a classification head. The model is built upon the YOLO-World [9] architecture. The cardinality head predicts a cardinality map, and the final object quantity is obtained by summing over this map.

2.2 Controllable Text-to-Image Generation

Controllable text-to-image (T2I) generation methods can be broadly categorized into two paradigms: training-based methods [52, 54, 19] and guidance-based methods [53, 5, 49]. Training-based approaches, such as ControlNet [52], IP-Adapter [48], and GLIGEN [28], inject conditional inputs directly into the generative model through additional network branches or adapters. While effective, these methods rely on large-scale training datasets annotated with the corresponding conditions. In contrast, guidance-based approaches, including BoxDiff [46], Attend-and-Excite [8], and Separate-and-Enhance [6], control generation by manipulating the diffusion process at inference time, eliminating the need for retraining. Many of these methods exploit the interpretability of cross-attention mechanisms [30] to steer image synthesis. However, cross-attention is primarily effective for distinguishing object categories rather than differentiating multiple instances of the same category. As a result, existing controllable T2I techniques excel at localized attribute binding [15, 55] and layout control [57, 56], but struggle with enforcing global constraints such as precise object quantity.

2.3 Object Quantity Control for T2I Models

Research on explicit object quantity control in text-to-image (T2I) models remains limited. [25] pioneers the use of universal diffusion guidance for quantity control, representing the first attempt to directly address this challenge. [7] introduces an attention-based representation for counting objects, but their approach is constrained to controlling small quantities (ranging from 1 to 10). More recently, prompt-tuning approaches [50, 42] have been proposed to incorporate numerical cues into the text embedding space, enabling limited quantity control without modifying the underlying diffusion model. However, these methods still struggle with accurate control over larger counts.

3 Methods

3.1 Model Overview

Our proposed YOLO-Count builds upon the YOLO-World architecture [9] and consists of three primary components: (1) a vision backbone, (2) a vision-language path aggregation network (VLPAN), and (3) prediction heads. Fig. 2 illustrates the overall pipeline and highlights our key architectural modifications.

Vision Backbone.

The vision backbone in YOLO-Count follows the design of YOLOv8l [23] and YOLO-World-L [9]. It comprises five stages of convolutional modules (ConvModules) and cross-stage partial layers (CSPLayers). Given an input image $I\in\mathbb{R}^{640\times 640\times 3}$, the backbone extracts multiscale visual features at three resolutions:

$$f^{0}=[f_{80\times 80},f_{40\times 40},f_{20\times 20}]=\mathrm{VisualBackbone}(I)$$

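As a quick check on the three resolutions, the snippet below derives them from the input size. The strides 8, 16, and 32 are an inference from the stated 640-pixel input and the 80, 40, and 20 feature sizes; they are not stated explicitly in the text, and the function name is hypothetical.

```python
def feature_map_sizes(input_size=640, strides=(8, 16, 32)):
    """Spatial side length of each feature level for a square input.

    The strides are ASSUMED from 640/80, 640/40 and 640/20.
    """
    return [input_size // s for s in strides]

sizes = feature_map_sizes()
cells = sum(s * s for s in sizes)  # total number of spatial locations across levels

print("feature map sizes:", sizes)
print("total spatial cells:", cells)
```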
Vision-Language Path Aggregation Network (VLPAN).

The VLPAN is designed to fuse visual features with textual semantics and aggregate information across scales. Inheriting from YOLO-World, it employs both top-down and bottom-up pathways, but with key enhancements: (1) T-CSPLayers: standard CSPLayers are replaced by T-CSPLayers, which integrate sigmoid attention blocks to modulate visual features based on precomputed CLIP text embeddings [36]. (2) Extended Top-Down Fusion: to better preserve fine-grained spatial details, an additional top-down pathway is introduced following the initial bidirectional aggregation, maximizing high-resolution feature utilization, which is critical for accurate counting regression. The enhanced VLPAN is formulated as:

$$[f^{1},f^{2}]=\mathrm{VLPAN}(f^{0},f_{\mathrm{T}})$$

where $f_{\mathrm{T}}$ denotes the CLIP text embedding of the category, and $f^{1}$ and $f^{2}$ represent multimodal features for classification and counting regression, respectively.
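One way to picture text-conditioned modulation is a sigmoid gate driven by the similarity between each spatial feature vector and the text embedding. The sketch below is an assumption about the general shape of such a block, not the T-CSPLayer itself: it uses a single text embedding already projected to the visual channel dimension, omits all learned projections, and the function name, channel count, and random inputs are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def text_gated_features(visual, text_emb):
    """Scale each spatial feature vector by sigmoid(similarity to the text embedding).

    visual:   (H, W, C) visual feature map
    text_emb: (C,) text embedding, ASSUMED already projected to C channels
    """
    sim = visual @ text_emb          # (H, W) dot-product similarity
    gate = sigmoid(sim)[..., None]   # (H, W, 1), values in (0, 1)
    return visual * gate

rng = np.random.default_rng(0)
visual = rng.normal(size=(80, 80, 8))  # hypothetical 8-channel features
text_emb = rng.normal(size=8)
gated = text_gated_features(visual, text_emb)

print("gated shape:", gated.shape)
```

Because the gate lies strictly between 0 and 1, locations that match the text are passed through nearly unchanged while mismatched locations are attenuated.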

Prediction Heads.

Following the VLPAN, several ConvModules are applied to text-aware visual features to aggregate multi-scale signals into a unified $80\times 80$ resolution. The prediction stage then produces two parallel outputs: (1) a cardinality regression head, which predicts a dense cardinality map for differentiable counting, and (2) a classification head, trained with contrastive supervision to ensure robust open-vocabulary capability. These two outputs jointly enable YOLO-Count to provide accurate, differentiable count estimates while maintaining strong category generalization, as shown on the right side of Fig. 2.
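The flow from multi-scale features to a scalar count can be sketched as follows. This is a simplified stand-in, not the paper's head: nearest-neighbour upsampling plus channel concatenation replaces the ConvModules, and a fixed random linear map followed by ReLU replaces the learned cardinality regression head. All names, channel counts, and weights are hypothetical.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of an (H, W, C) map by an integer factor."""
    return feat.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse_to_80(f80, f40, f20):
    """Bring all levels to 80x80 and concatenate along channels (ASSUMED fusion)."""
    return np.concatenate(
        [f80, upsample_nearest(f40, 2), upsample_nearest(f20, 4)], axis=-1
    )

def count_from_map(cardinality_map):
    """The predicted quantity is the sum over the cardinality map."""
    return float(cardinality_map.sum())

rng = np.random.default_rng(0)
f80, f40, f20 = (rng.normal(size=(s, s, 4)) for s in (80, 40, 20))
fused = fuse_to_80(f80, f40, f20)

# Hypothetical regression head: linear map plus ReLU, so scores are non-negative.
w = rng.normal(size=fused.shape[-1]) * 0.01
card = np.maximum(fused @ w, 0.0)
count = count_from_map(card)

print("fused shape:", fused.shape, "map shape:", card.shape)
```

Every operation here (repeat, concatenate, linear map, ReLU, sum) is differentiable almost everywhere, which is what lets a count computed this way feed gradients back to the input.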
