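Experimental tier

LibreVLM

Point a vision-language model at an image, hand it a list of words, and get detection boxes back. LibreVLM turns models such as Qwen3-VL and Florence-2 into open-vocabulary object detectors that use exactly the same Results API as every other LibreYOLO model.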

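Overview

Traditional detectors bake a fixed class list into the detection head. LibreVLM drops that constraint. It wraps modern instruction-tuned vision-language models, prompts them to output bounding boxes, parses the generated text, and returns the same Results object you already use with YOLO9 and RF-DETR. The class list is just a set of words you supply at runtime, so adding a class costs nothing and works zero-shot.

Open vocabulary. Detect "pink car", "license plate", or "the small island" without training a detection head for any of them.
One factory, one contract. LibreVLM(...) returns a standard Results with boxes.xyxy, boxes.cls, and boxes.conf, plus .plot() and .save().
Swappable backends. Six model families sit behind one alias string, from the 230M Florence-2 to the 8B Qwen3-VL.
A raw escape hatch. When you need more than boxes, chat() offers free-form image question answering.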

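Why a separate tier (and a separate page)

LibreVLM is deliberately kept out of the closed-vocabulary LibreYOLO(...) factory and its .pt registry. These models are prompt-driven, open-vocabulary, and report synthetic confidence, so they follow a different contract. Keeping them in their own tier keeps the core detection docs tidy and honest about what is being measured.

On the dev branch

LibreVLM currently lives on the dev branch and is planned for the v1.3 release; it is not part of v1.2.0. It is a pure-Python inference tier: there is no training, validation, export, or CLI path yet, and confidence scores are placeholders. Read the Limitations section before building on it.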

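Installation

LibreVLM sits behind the optional vlm extra, which pulls in a recent transformers release plus the helper libraries some processors need. Without the extra, importing a VLM family raises an ImportError that points you here.

bash

pip install 'libreyolo[vlm]'

Weights are downloaded from the Hugging Face Hub into a local weights/ folder on first use. A few families use non-OSI licenses and log a one-time notice before downloading. A GPU is recommended for the larger backends, but every model also runs on CPU with device="cpu".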

Quick start

Construct a model, declare the words you care about, then predict. The default backend is Qwen3-VL-4B, from the strongest detector family in this tier, released under Apache-2.0.

python

from libreyolo import LibreVLM

# Qwen3-VL-4B by default; weights autodownload on first use
model = LibreVLM()

# The vocabulary is just words. Any words.
model.set_classes(["pink car", "wheel"])

result = model.predict("street.jpg")

print(result.boxes.xyxy)   # pixel [x1, y1, x2, y2]
print(result.boxes.cls)    # ids into ["pink car", "wheel"]
result.plot()              # same drawing helpers as any LibreYOLO model
result.save("out.jpg")

That is the whole workflow. Everything after predict() behaves like an ordinary detector, so existing visualization, cropping, and tracking code keeps working.
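
Because the result follows the standard contract, ordinary post-processing needs no VLM-specific code. The following is a minimal sketch rather than part of LibreYOLO: `count_by_label` is a hypothetical helper, and it assumes that iterating `result.boxes.cls` yields values that `int()` can convert and that the vocabulary is the list passed to `set_classes()`.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_by_label(class_ids: Iterable, vocabulary: Sequence[str]) -> dict[str, int]:
    """Count detections per vocabulary label (hypothetical helper).

    class_ids: ids as found in result.boxes.cls (assumed int-convertible).
    vocabulary: the label list given to set_classes(), in the same order.
    """
    counts = Counter(vocabulary[int(class_id)] for class_id in class_ids)
    # Report every label, including those with zero detections.
    return {label: counts.get(label, 0) for label in vocabulary}


# Stand-in ids; with a real model these would come from result.boxes.cls.
print(count_by_label([0, 1, 1, 1], ["pink car", "wheel"]))
```

The helper takes plain sequences on purpose, so it can be exercised without downloading any weights.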

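Supported models

Select a backend with the alias passed to LibreVLM(...). A bare family name resolves to that family's default size. The overall default backend is qwen3-vl-4b. In practice the strongest detectors are Qwen3-VL, LFM2-VL, and Florence-2.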

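Family	Alias	Size (parameters)	License	Notes
Qwen3-VL	`qwen3-vl-2b / -4b / -8b`	2B / 4B / 8B	Apache-2.0	Default and strongest. The recommended starting point.
LFM2-VL	`lfm2-vl-450m / -1.6b`	450M / 1.6B	LFM Open License	Edge-sized; a surprisingly capable small detector. Notice shown before download.
InternVL3	`internvl3-1b / -2b / -8b`	1B / 2B / 8B	Qwen License	The 8B model grounds well; smaller sizes are weaker. Notice shown before download.
Florence-2	`florence-2-base / -large`	0.23B / 0.77B	MIT	Purpose-built for grounding. Tight boxes, no `chat()`.
SmolVLM2	`smolvlm2-500m / -2.2b`	500M / 2.2B	Apache-2.0	Small and fast; weaker detection. Good for quick trials.
Kosmos-2	`kosmos-2`	~1.6B	MIT	A 2023 grounding model. Coarser boxes, no `chat()`.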

Choosing a backend

Best quality: qwen3-vl-8b or qwen3-vl-4b (the default).
Tight boxes, small footprint: florence-2-large.
Edge / CPU: lfm2-vl-450m or smolvlm2-500m.
Fully permissive license: any size of Qwen3-VL, SmolVLM2, Florence-2, or Kosmos-2.

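Licenses

Qwen3-VL and SmolVLM2 are Apache-2.0; Florence-2 and Kosmos-2 are MIT. LFM2-VL and InternVL3 use non-OSI licenses and emit a one-time notice before the first download, so you can make an informed choice for commercial use.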
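
If a project needs to enforce a permissive-only policy in code, the license column can be turned into a small lookup. This is an illustrative sketch: the dictionary simply restates the table above, the family keys are labels chosen for this example (not confirmed aliases), and `permissive_families` is a hypothetical helper that LibreYOLO does not provide.

```python
# License per family, restated from the Supported models table.
# The keys are illustrative family labels, not verified LibreVLM aliases.
FAMILY_LICENSES = {
    "qwen3-vl": "Apache-2.0",
    "lfm2-vl": "LFM Open License",
    "internvl3": "Qwen License",
    "florence-2": "MIT",
    "smolvlm2": "Apache-2.0",
    "kosmos-2": "MIT",
}

# The two OSI-approved licenses that appear in the table.
PERMISSIVE = {"Apache-2.0", "MIT"}


def permissive_families(licenses: dict[str, str] = FAMILY_LICENSES) -> list[str]:
    """Return the families whose license is in PERMISSIVE, sorted by name."""
    return sorted(family for family, name in licenses.items() if name in PERMISSIVE)


print(permissive_families())
```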

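Setting the vocabulary

The vocabulary is the heart of open-vocabulary detection. Call set_classes() with a list of label strings. It is sticky: it persists across every subsequent predict() and track() call until you set it again. It returns self, so calls can be chained.

python

# Sticky and chainable
model = LibreVLM("qwen3-vl-2b").set_classes(["person", "dog", "cat"])

# Set it once at construction instead
model = LibreVLM("lfm2-vl-450m", names=["boat"], device="cpu")

# Re-set any time to change what you are looking for
model.set_classes(["a red car", "a blue truck"])

Labels can be any phrase. They must be unique when compared case-insensitively, and you must pass a list, not a single string. If you never call set_classes(), the model falls back to the COCO-80 vocabulary, so even a bare predict() gives sensible results.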
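
The two label rules are easy to check before a model is even constructed. The sketch below is one way to pre-validate a vocabulary; `validate_vocabulary` is a hypothetical helper, and the exception types are an assumption, since this page does not say what set_classes() itself raises.

```python
def validate_vocabulary(labels) -> list[str]:
    """Check the documented label rules and return the labels as a list.

    Hypothetical pre-check; the exception types are assumptions,
    not LibreVLM's documented behavior.
    """
    if isinstance(labels, str):
        # A bare string would otherwise be iterated character by character.
        raise TypeError("pass a list of labels, not a single string")
    labels = list(labels)
    seen: dict[str, str] = {}
    for label in labels:
        key = label.casefold()  # case-insensitive comparison key
        if key in seen:
            raise ValueError(
                f"labels must be unique ignoring case: {seen[key]!r} and {label!r}"
            )
        seen[key] = label
    return labels


print(validate_vocabulary(["a red car", "a blue truck"]))
```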

Prediction

predict() (and the equivalent model(...) call) accepts the same input types as any LibreYOLO detector: a path, PIL image, numpy array, URL, folder, or video. stream=True and track() work as well.

python

result = model.predict(
    source="image.jpg",  # path | PIL | ndarray | URL | folder | video
    conf=0.25,           # see note below: scoring is synthetic
    classes=[0],         # optional: keep only these vocabulary ids
    max_det=300,
)

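Return structure

You get the standard Results object, identical to the one closed-vocabulary detectors return:

Field	Shape / type	Meaning
`result.boxes.xyxy`	N x 4	Pixel boxes [x1, y1, x2, y2], scaled to the original image size.
`result.boxes.cls`	N	Class ids, indexing into your set_classes() vocabulary.
`result.boxes.conf`	N	Synthetic confidence: 1.0 for every box (see Limitations).
`result.plot() / .save()`	-	The usual drawing and saving helpers.

Under the hood, LibreVLM parses the model output leniently (handling markdown code fences, stray prose, duplicate boxes, and truncated arrays), maps free-text labels back to your class ids, and discards any label that is not in your vocabulary. That last step is what makes a freely generating model behave like a closed-set detector.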
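
To make the parsing step concrete, here is a rough sketch of the idea, not LibreVLM's actual parser. It assumes a hypothetical output schema of flat JSON objects with "label" and "bbox" keys; the real prompt format is not documented on this page. Scanning for complete objects instead of parsing the whole array is one simple way to tolerate code fences, surrounding prose, and a truncated tail.

```python
import json
import re

# Matches flat JSON objects (no nested braces). Incomplete trailing
# objects have no closing brace and are therefore never matched.
_OBJECT = re.compile(r"\{[^{}]*\}")

Detection = tuple[int, tuple[float, ...]]


def parse_boxes(text: str, vocabulary: list[str]) -> list[Detection]:
    """Extract (class_id, box) pairs from free-form model text.

    Assumes a hypothetical {"label": ..., "bbox": [x1, y1, x2, y2]} schema.
    """
    label_to_id = {label.casefold(): i for i, label in enumerate(vocabulary)}
    detections: list[Detection] = []
    seen: set[Detection] = set()
    for match in _OBJECT.finditer(text):
        try:
            item = json.loads(match.group())
        except json.JSONDecodeError:
            continue  # malformed fragment: skip it
        label = str(item.get("label", "")).strip().casefold()
        box = item.get("bbox")
        if label not in label_to_id:
            continue  # label outside the vocabulary: discard
        if not isinstance(box, list) or len(box) != 4:
            continue  # not a four-number box
        try:
            coords = tuple(float(value) for value in box)
        except (TypeError, ValueError):
            continue  # non-numeric coordinate
        detection = (label_to_id[label], coords)
        if detection in seen:
            continue  # exact duplicate box
        seen.add(detection)
        detections.append(detection)
    return detections


FENCE = "`" * 3  # a markdown code fence, built here to keep this example readable
raw = (
    "Sure! Here are the boxes:\n"
    + FENCE + "json\n"
    + '[{"label": "Wheel", "bbox": [10, 20, 30, 40]},\n'
    + ' {"label": "tree", "bbox": [0, 0, 5, 5]},\n'
    + ' {"label": "wheel", "bbox": [10, 20, 30, 40]},\n'
    + ' {"label": "pink car", "bbox": [50, 60, '
)
print(parse_boxes(raw, ["pink car", "wheel"]))
```

In this sample only the first wheel survives: "tree" is outside the vocabulary, the second wheel is a duplicate, and the truncated "pink car" entry never forms a complete object.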

Examples

Detect objects of a specific color

python

from libreyolo import LibreVLM

model = LibreVLM("qwen3-vl-4b")
model.set_classes(["red car"])

result = model.predict("parking_lot.jpg")
print(f"Found {len(result.boxes.cls)} red car(s)")
result.save("red_cars.jpg")

Tight boxes with Florence-2

python

# Florence-2 is a purpose-built grounder: very tight pixel boxes.
model = LibreVLM("florence-2-large")
model.set_classes(["a red car", "license plate"])

result = model.predict("car.jpg")
result.plot()

Filter to a single class on the fly

python

model = LibreVLM("qwen3-vl-2b").set_classes(["person", "dog", "cat"])

# classes= filters the configured vocabulary by id
people_only = model.predict("street.jpg", classes=[0])

Run on CPU with the built-in sample image

python

from libreyolo import LibreVLM, SAMPLE_IMAGE

model = LibreVLM("lfm2-vl-450m", device="cpu")
# No set_classes() -> falls back to the COCO-80 vocabulary
result = model.predict(SAMPLE_IMAGE)
# Guard against an empty result; int() turns the class id into a plain index
if len(result.boxes.cls):
    print(model.names[int(result.boxes.cls[0])])  # e.g. "person"

Batches, folders, and video

python

model = LibreVLM().set_classes(["forklift", "pallet"])

# A whole folder
for result in model.predict("warehouse_frames/", stream=True):
    result.save()

# A video file (frames are processed one at a time)
model.predict("warehouse.mp4", save=True)

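Raw chat

Sometimes you want the model itself rather than a detector. Families that use a chat template expose chat(), which takes an image and a free-form prompt and returns the decoded text as is. Use it for counting, captioning, or quick visual question answering.

python

model = LibreVLM("qwen3-vl-4b")

answer = model.chat("harbor.jpg", "How many boats are docked? Answer with a number.")
print(answer)

chat() is available on the chat-template families (Qwen3-VL, LFM2-VL, SmolVLM2, InternVL3). Florence-2 and Kosmos-2 are task-token grounding models and raise NotImplementedError; use predict() with them.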

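Backend differences

Every family returns the same Results but gets there differently. You rarely need to care, though it helps to know why some backends behave as they do. Chat families are prompted to output a JSON array of boxes; grounding models use dedicated task tokens.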

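Family	Prompting	Coordinate space	chat()
Qwen3-VL	JSON box prompt	0 to 1000, rescaled	Yes
LFM2-VL	JSON box prompt	Normalized 0 to 1	Yes
SmolVLM2	JSON box prompt	Normalized 0 to 1	Yes
InternVL3	JSON box prompt	0 to 1000, rescaled	Yes
Florence-2	Task token	Native pixels	No
Kosmos-2	Grounding prompt	Normalized, rescaled	No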
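
The coordinate-space column boils down to one rescaling rule. The sketch below shows the arithmetic for mapping a model-space box to pixels; `to_pixels` is a hypothetical helper, and any rounding or clamping that LibreVLM applies internally is not covered on this page and is left out here.

```python
def to_pixels(box, width: int, height: int, scale: float) -> tuple[float, ...]:
    """Map a model-space [x1, y1, x2, y2] box to pixel coordinates.

    scale is the upper bound of the model's coordinate range:
    1000 for a 0 to 1000 grid, 1 for normalized 0 to 1 output.
    """
    x1, y1, x2, y2 = box
    return (
        x1 * width / scale,
        y1 * height / scale,
        x2 * width / scale,
        y2 * height / scale,
    )


# The same region of a 640 x 480 image, expressed in both conventions.
print(to_pixels((250, 500, 750, 1000), 640, 480, scale=1000))
print(to_pixels((0.25, 0.5, 0.75, 1.0), 640, 480, scale=1))
```

Backends that emit native pixels, such as Florence-2 in the table, would skip this step.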

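For chat families, you can override the detection prompt with the prompt= constructor argument and cap generation length with max_new_tokens=. Device and dtype are resolved automatically: bf16 or fp16 on CUDA, fp32 on CPU.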

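Limitations

LibreVLM is powerful but still immature. Knowing its boundaries up front avoids surprises later.

Synthetic confidence. Every box scores 1.0, so conf= filtering is all-or-nothing rather than a real threshold.
No mAP / validation. val() raises, because synthetic scores would make COCO mAP misleading.
No training or export. train() and export() raise. Fine-tune the VLM upstream, then load the resulting weights.
Limited tracking. track() runs, but the uniform scores disable the tracker's low-confidence recovery stage.
One image at a time. Generation is serial in v1, so larger batch= values bring no speedup.
Python API only. The libreyolo CLI cannot resolve VLM aliases yet.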
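
The first limitation is easy to see with a toy filter. This sketch assumes a threshold comparison of score >= conf, which is an assumption about the filtering rule rather than something this page states; `keep_by_conf` is a hypothetical helper.

```python
def keep_by_conf(scores: list[float], conf: float) -> list[int]:
    """Return the indices of boxes whose score passes the threshold.

    Assumes score >= conf; the library's exact comparison is not documented here.
    """
    return [index for index, score in enumerate(scores) if score >= conf]


synthetic_scores = [1.0, 1.0, 1.0]  # every LibreVLM box reports 1.0
for threshold in (0.25, 0.9, 1.01):
    kept = keep_by_conf(synthetic_scores, threshold)
    print(f"conf={threshold}: kept {len(kept)} of {len(synthetic_scores)} boxes")
```

Under that assumption, any threshold up to 1.0 keeps every box and anything above it drops them all, which is why conf= cannot be used to trade precision for recall here.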

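Where it shines

Use LibreVLM when the class set is open-ended, changes often, or is hard to label ahead of time: rapid prototyping, long-tail or rare classes, and "describe what to find in words" workflows. When you need calibrated confidence, throughput, or a deployable artifact, train a closed-vocabulary YOLO9 or RF-DETR as described in the core docs.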
