Experimental tier

LibreVLM

Point a vision language model at an image, hand it a list of words, and get back boxes. LibreVLM turns Qwen3-VL, Florence-2 and friends into open-vocabulary object detectors that speak the exact same Results API as every other LibreYOLO model.

Introduction

A classic detector ships with a fixed list of classes baked into its head. LibreVLM throws that constraint away. It wraps modern instruction-tuned vision language models, prompts them to emit bounding boxes, parses the generated text, and returns the same Results object you already use for YOLO9 and RF-DETR. The class list is just a list of words you supply at runtime, so adding a new category costs nothing and works zero-shot.

Open vocabulary. Detect "pink car", "license plate", or "the small island" without ever training a head for them.
One factory, one contract. LibreVLM(...) returns the standard Results with boxes.xyxy, boxes.cls, boxes.conf, plus .plot() and .save().
Swappable backends. Six model families behind one alias string, from a 230M Florence-2 to an 8B Qwen3-VL.
A raw escape hatch. chat() gives you free-form image question answering when you need more than boxes.

Why a separate tier (and a separate page)

LibreVLM is deliberately kept out of the closed-vocabulary LibreYOLO(...) factory and its .pt registry. These models are prompt-driven, open vocabulary, and report synthetic confidence, so they honor a different contract. Treating them as their own tier keeps the core detection docs clean and honest about what is being measured.

On the dev branch

LibreVLM currently lives on the dev branch and is targeted for the v1.3 release; it is not part of v1.2.0. It is a Python-only inference tier: there is no training, validation, export, or CLI path yet, and confidence scores are placeholders. Read the Limitations section before you build on top of it.

Installation

LibreVLM lives behind the optional vlm extra, which pulls in a recent transformers release and the helper packages that a couple of processors need. Without the extra, importing a VLM family raises an ImportError that points you here.

bash

pip install 'libreyolo[vlm]'

Weights are downloaded from the Hugging Face Hub on first use into a local weights/ folder. A few families ship under non-OSI licenses and log a one-time notice before downloading. A GPU is recommended for the larger backends, but every model also runs on CPU with device="cpu".
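
A quick pre-flight check can tell you whether the optional dependency is present before you construct a model. The sketch below uses only the standard library. It assumes the extra makes a module importable as transformers (as the text above implies); the helper name has_module is hypothetical and not part of LibreYOLO.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


# Assumption: the vlm extra installs a module importable as "transformers".
if not has_module("transformers"):
    print("transformers not found; try: pip install 'libreyolo[vlm]'")
```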

Quickstart

Construct a model, declare the words you care about, and predict. The default backend is Qwen3-VL-4B, which comes from the strongest detector family in the tier and is Apache-2.0 licensed.

python

from libreyolo import LibreVLM

# Qwen3-VL-4B by default; weights autodownload on first use
model = LibreVLM()

# The vocabulary is just words. Any words.
model.set_classes(["pink car", "wheel"])

result = model.predict("street.jpg")

print(result.boxes.xyxy)   # pixel [x1, y1, x2, y2]
print(result.boxes.cls)    # ids into ["pink car", "wheel"]
result.plot()              # same drawing helpers as any LibreYOLO model
result.save("out.jpg")

That is the whole loop. Everything downstream of predict() behaves like a normal detector, so existing visualization, cropping, and tracking code keeps working.

Supported Models

Pick a backend with the alias you pass to LibreVLM(...). A bare family name resolves to its default size. The default backend overall is qwen3-vl-4b. In practice the strongest detectors are Qwen3-VL, LFM2-VL, and Florence-2.

Family	Alias	Sizes (params)	License	Notes
Qwen3-VL	`qwen3-vl-2b / -4b / -8b`	2B / 4B / 8B	Apache-2.0	Default and strongest. Recommended starting point.
LFM2-VL	`lfm2-vl-450m / -1.6b`	450M / 1.6B	LFM Open License	Edge sized, surprisingly strong small detector. Notice gated.
InternVL3	`internvl3-1b / -2b / -8b`	1B / 2B / 8B	Qwen License	Good grounding at 8B; small sizes are weak. Notice gated.
Florence-2	`florence-2-base / -large`	0.23B / 0.77B	MIT	Purpose-built grounding model. Tight boxes, no `chat()`.
SmolVLM2	`smolvlm2-500m / -2.2b`	500M / 2.2B	Apache-2.0	Tiny and fast; weaker detector. Good for quick trials.
Kosmos-2	`kosmos-2`	~1.6B	MIT	2023 grounder. Coarser boxes, no `chat()`.

Choosing a backend

Best quality: qwen3-vl-8b or qwen3-vl-4b (the default).
Tight boxes, small footprint: florence-2-large.
Edge / CPU: lfm2-vl-450m or smolvlm2-500m.
Fully permissive license: any Qwen3-VL, SmolVLM2, Florence-2 or Kosmos-2 size.

Licensing

Qwen3-VL and SmolVLM2 are Apache-2.0; Florence-2 and Kosmos-2 are MIT. LFM2-VL and InternVL3 carry non-OSI licenses and emit a one-time notice before their first download, so you can make an informed choice for commercial use.
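
If you need to enforce a license policy in code, the table above can be turned into a small lookup. This is only a sketch: the license strings are copied from this page, while FAMILY_LICENSE and is_permissive are hypothetical names, not part of LibreYOLO.

```python
# License per family, copied from the Supported Models table on this page.
FAMILY_LICENSE = {
    "qwen3-vl": "Apache-2.0",
    "lfm2-vl": "LFM Open License",
    "internvl3": "Qwen License",
    "florence-2": "MIT",
    "smolvlm2": "Apache-2.0",
    "kosmos-2": "MIT",
}

PERMISSIVE = {"Apache-2.0", "MIT"}


def is_permissive(alias: str) -> bool:
    """True if the alias belongs to an Apache-2.0 or MIT family."""
    for family, license_name in FAMILY_LICENSE.items():
        if alias.startswith(family):
            return license_name in PERMISSIVE
    raise ValueError(f"unknown alias: {alias!r}")


print(is_permissive("qwen3-vl-4b"))   # True
print(is_permissive("lfm2-vl-450m"))  # False
```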

Setting the Vocabulary

The vocabulary is the heart of open-vocabulary detection. Call set_classes() with a list of label strings. It is sticky: it persists across every later predict() and track() call until you set it again. It returns self, so it chains.

python

from libreyolo import LibreVLM

# Sticky and chainable
model = LibreVLM("qwen3-vl-2b").set_classes(["person", "dog", "cat"])

# Set it once at construction instead
model = LibreVLM("lfm2-vl-450m", names=["boat"], device="cpu")

# Re-set any time to change what you are looking for
model.set_classes(["a red car", "a blue truck"])

Labels can be any phrase. No two labels may match when compared case-insensitively (so "Car" and "car" count as duplicates), and you must pass a list, not a bare string. If you never call set_classes(), the model falls back to the COCO-80 vocabulary so a bare predict() still does something sensible.
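
To catch vocabulary mistakes before a model is even loaded, the two rules above can be checked up front. The helper below is a hypothetical sketch, not LibreYOLO code; in particular, the exception types are assumptions, since this page does not say how set_classes() reports a bad vocabulary.

```python
def validate_classes(names):
    """Check a vocabulary: must be a list-like of case-insensitively unique labels.

    Hypothetical helper; the exception types are assumptions.
    """
    if isinstance(names, str):
        raise TypeError("pass a list of labels, not a bare string")
    names = list(names)
    seen = {}
    for label in names:
        key = label.casefold()  # case-insensitive comparison key
        if key in seen:
            raise ValueError(f"duplicate label: {label!r} vs {seen[key]!r}")
        seen[key] = label
    return names


print(validate_classes(["pink car", "wheel"]))  # ['pink car', 'wheel']
```

Calling validate_classes("pink car") raises TypeError, and validate_classes(["Car", "car"]) raises ValueError.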

Prediction

predict() (and the equivalent model(...) call) accepts the same source types as any LibreYOLO detector: a path, a PIL image, a numpy array, a URL, a folder, or a video. stream=True and track() work too.

python

result = model.predict(
    source="image.jpg",  # path | PIL | ndarray | URL | folder | video
    conf=0.25,           # see note below: scoring is synthetic
    classes=[0],         # optional: keep only these vocabulary ids
    max_det=300,
)

Return shape

You get back the standard Results object, identical to a closed-vocabulary detector:

Field	Shape / type	Meaning
`result.boxes.xyxy`	N x 4	Pixel boxes [x1, y1, x2, y2], scaled to the original image.
`result.boxes.cls`	N	Class ids indexing into your set_classes() vocabulary.
`result.boxes.conf`	N	Synthetic confidence: 1.0 for every box (see Limitations).
`result.plot() / .save()`	-	The usual drawing and saving helpers.

Under the hood, LibreVLM tolerantly parses the model output (handling markdown fences, stray prose, duplicated boxes, and truncated arrays), maps free-text labels back to your class ids, and drops any label that is not in your vocabulary. That last step is what makes a free-form generator behave like a closed-set detector.
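
The parsing step described above can be sketched in a few lines. This is not LibreVLM's actual parser: the output schema (objects with "label" and "bbox" keys) and the function name parse_boxes are assumptions chosen for illustration. The idea is to scan for flat JSON objects one at a time, so markdown fences, surrounding prose, and a truncated final object are simply never matched.

```python
import json
import re

OBJECT_RE = re.compile(r"\{[^{}]*\}")  # flat (non-nested) JSON objects only


def parse_boxes(text, names):
    """Recover (class_id, [x1, y1, x2, y2]) pairs from noisy model output.

    Assumed schema: {"label": str, "bbox": [x1, y1, x2, y2]}.
    """
    lookup = {name.casefold(): i for i, name in enumerate(names)}
    seen = set()
    detections = []
    for match in OBJECT_RE.finditer(text):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # malformed fragment
        label = str(obj.get("label", "")).strip().casefold()
        box = obj.get("bbox")
        if label not in lookup or not isinstance(box, list) or len(box) != 4:
            continue  # label outside the vocabulary, or unusable box
        try:
            coords = tuple(float(v) for v in box)
        except (TypeError, ValueError):
            continue  # non-numeric coordinate
        key = (lookup[label], coords)
        if key in seen:
            continue  # duplicated box
        seen.add(key)
        detections.append((lookup[label], list(coords)))
    return detections


fence = "`" * 3  # a markdown fence, built here to keep the example readable
raw = (
    "Sure, here are the boxes:\n"
    + fence + "json\n"
    '[{"label": "Wheel", "bbox": [10, 20, 30, 40]},\n'
    ' {"label": "wheel", "bbox": [10, 20, 30, 40]},\n'
    ' {"label": "tree", "bbox": [0, 0, 5, 5]},\n'
    ' {"label": "pink car", "bbox": [100, 50, 400, 3'
)
print(parse_boxes(raw, ["pink car", "wheel"]))  # [(1, [10.0, 20.0, 30.0, 40.0])]
```

In the sample, the duplicate wheel is dropped, "tree" is discarded because it is not in the vocabulary, and the truncated final object is never matched.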

Examples

Detect a specific colored object

python

from libreyolo import LibreVLM

model = LibreVLM("qwen3-vl-4b")
model.set_classes(["red car"])

result = model.predict("parking_lot.jpg")
print(f"Found {len(result.boxes.cls)} red car(s)")
result.save("red_cars.jpg")

Tight boxes with Florence-2

python

from libreyolo import LibreVLM

# Florence-2 is a purpose-built grounder: very tight pixel boxes.
model = LibreVLM("florence-2-large")
model.set_classes(["a red car", "license plate"])

result = model.predict("car.jpg")
result.plot()

Filter to a single class on the fly

python

from libreyolo import LibreVLM

model = LibreVLM("qwen3-vl-2b").set_classes(["person", "dog", "cat"])

# classes= filters the configured vocabulary by id (0 is "person" here)
people_only = model.predict("street.jpg", classes=[0])

Run on CPU with a built-in sample image

python

from libreyolo import LibreVLM, SAMPLE_IMAGE

model = LibreVLM("lfm2-vl-450m", device="cpu")
# No set_classes() -> falls back to the COCO-80 vocabulary
result = model.predict(SAMPLE_IMAGE)
if len(result.boxes.cls) > 0:  # guard: the image may yield no detections
    # int() because cls may hold tensor or float ids
    print(model.names[int(result.boxes.cls[0])])  # e.g. "person"

Batches, folders, and video

python

from libreyolo import LibreVLM

model = LibreVLM().set_classes(["forklift", "pallet"])

# A whole folder
for result in model.predict("warehouse_frames/", stream=True):
    result.save()

# A video file (frames are processed one at a time)
model.predict("warehouse.mp4", save=True)

Raw Chat

Sometimes you want the model, not the detector. The chat-template families expose chat(), which takes an image and a free-form prompt and returns the decoded text verbatim. Use it for counting, captioning, or quick visual questions.

python

from libreyolo import LibreVLM

model = LibreVLM("qwen3-vl-4b")

answer = model.chat("harbor.jpg", "How many boats are docked? Answer with a number.")
print(answer)

chat() is available on the chat-template families (Qwen3-VL, LFM2-VL, SmolVLM2, InternVL3). Florence-2 and Kosmos-2 are task-token grounders and raise NotImplementedError; use predict() with them.
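
Because chat() returns the decoded text verbatim, a counting workflow usually needs a small post-processing step. The sketch below pulls the first integer out of a free-form answer. The helper first_int is hypothetical, and it assumes the model answers with digits; a reply spelled out in words ("seven") would return None.

```python
import re


def first_int(answer: str):
    """Return the first integer found in a free-form answer, or None."""
    match = re.search(r"-?\d+", answer)
    return int(match.group(0)) if match else None


print(first_int("There are 7 boats docked."))  # 7
print(first_int("I cannot tell."))             # None
```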

How Backends Differ

Every family returns the same Results, but they reach it differently. You rarely need to care, yet it helps to know why some backends behave the way they do. The chat families are prompted for a JSON array of boxes; the grounders use dedicated task tokens.

Family	Prompting	Coordinate space	chat()
Qwen3-VL	JSON box prompt	0 to 1000, rescaled	Yes
LFM2-VL	JSON box prompt	Normalized 0 to 1	Yes
SmolVLM2	JSON box prompt	Normalized 0 to 1	Yes
InternVL3	JSON box prompt	0 to 1000, rescaled	Yes
Florence-2	Task token	Native pixels	No
Kosmos-2	Grounding prompt	Normalized, rescaled	No
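
The "rescaled" entries in the table come down to one multiplication per coordinate. The sketch below maps a box from a 0 to 1000 grid or a normalized 0 to 1 grid onto pixel coordinates. The function name to_pixels is hypothetical, and the clamping step is an assumption rather than documented LibreVLM behavior.

```python
def to_pixels(box, width, height, scale):
    """Map [x1, y1, x2, y2] from a 0..scale grid to pixel coordinates.

    Use scale=1000 for the 0 to 1000 convention and scale=1 for normalized 0 to 1.
    """
    def clamp(value, upper):
        # Assumption: slightly out-of-range model output is clipped to the image.
        return min(max(float(value), 0.0), float(upper))

    x1, y1, x2, y2 = box
    return [
        clamp(x1 * width / scale, width),
        clamp(y1 * height / scale, height),
        clamp(x2 * width / scale, width),
        clamp(y2 * height / scale, height),
    ]


# The same box expressed in both conventions, on a 640 x 480 image.
print(to_pixels([250, 500, 750, 1000], 640, 480, scale=1000))  # [160.0, 240.0, 480.0, 480.0]
print(to_pixels([0.25, 0.5, 0.75, 1.0], 640, 480, scale=1))    # [160.0, 240.0, 480.0, 480.0]
```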

For the chat families you can override the detection prompt with the prompt= constructor argument, and cap generation length with max_new_tokens=. Device and dtype are resolved automatically: bf16 or fp16 on CUDA, fp32 on CPU.

Limitations

LibreVLM is powerful but young. Knowing the boundaries up front saves surprises later.

Synthetic confidence. Every box is scored 1.0. The conf= filter therefore behaves as all-or-nothing rather than a real threshold.
No mAP / validation. val() raises, because synthetic scores would make COCO mAP misleading.
No training or export. train() and export() raise. Fine-tune the VLM upstream and load the resulting weights instead.
Tracking is degraded. track() runs, but uniform scores make the tracker's low-confidence recovery stage inert.
One image at a time. Generation is sequential in v1, so larger batch= values give no speedup.
Python API only. The libreyolo CLI does not resolve VLM aliases yet.
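
The first and fourth limitations follow directly from the uniform score, which a few lines of plain Python make concrete. This is an illustration, not LibreYOLO code. The >= comparison, the thresholds, and the two-pool split (in the style of ByteTrack-like trackers) are assumptions, since this page does not name the tracker or its settings.

```python
scores = [1.0, 1.0, 1.0]  # every LibreVLM box is scored 1.0


def keep(scores, conf):
    """Indices that survive a score >= conf filter (comparison is an assumption)."""
    return [i for i, s in enumerate(scores) if s >= conf]


def split_for_tracker(scores, high=0.5, low=0.1):
    """Split detections into a high-confidence pool and a low-confidence pool.

    Illustrative thresholds; a two-stage tracker matches the low pool second.
    """
    high_pool = [i for i, s in enumerate(scores) if s >= high]
    low_pool = [i for i, s in enumerate(scores) if low <= s < high]
    return high_pool, low_pool


print(keep(scores, 0.25))         # [0, 1, 2]  any threshold up to 1.0 keeps everything
print(keep(scores, 1.01))         # []         anything above 1.0 drops everything
print(split_for_tracker(scores))  # ([0, 1, 2], [])  the low pool is always empty
```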

Where it shines

Use LibreVLM when the class set is open ended, changes often, or is hard to label up front: rapid prototyping, long-tail or rare categories, and "find the thing I describe in words" workflows. When you need calibrated confidence, throughput, or a deployable artifact, train a closed-vocabulary YOLO9 or RF-DETR from the core docs.
