LibreVLM
Point a vision language model at an image, hand it a list of words, and get back boxes. LibreVLM turns Qwen3-VL, Florence-2 and friends into open-vocabulary object detectors that speak the exact same Results API as every other LibreYOLO model.
Introduction
A classic detector ships with a fixed list of classes baked into its head. LibreVLM throws that constraint away. It wraps modern instruction-tuned vision language models, prompts them to emit bounding boxes, parses the generated text, and returns the same Results object you already use for YOLO9 and RF-DETR. The class list is just a list of words you supply at runtime, so adding a new category costs nothing and works zero-shot.
- Open vocabulary. Detect "pink car", "license plate", or "the small island" without ever training a head for them.
- One factory, one contract. LibreVLM(...) returns the standard Results with boxes.xyxy, boxes.cls, boxes.conf, plus .plot() and .save().
- Swappable backends. Six model families behind one alias string, from a 230M Florence-2 to an 8B Qwen3-VL.
- A raw escape hatch. chat() gives you free-form image question answering when you need more than boxes.
Why a separate tier (and a separate page)
LibreVLM is deliberately kept out of the closed-vocabulary LibreYOLO(...) factory and its .pt registry. These models are prompt-driven, open vocabulary, and report synthetic confidence, so they honor a different contract. Treating them as their own tier keeps the core detection docs clean and honest about what is being measured.
On the dev branch
LibreVLM currently lives on the dev branch and is targeted for the v1.3 release; it is not part of v1.2.0. It is a Python-only inference tier: there is no training, validation, export, or CLI path yet, and confidence scores are placeholders. Read the Limitations section before you build on top of it.
Installation
LibreVLM lives behind the optional vlm extra. It pulls in a recent transformers release and the helper packages that a couple of processors need. Without the extra, importing a VLM family raises an ImportError that points you here.
```bash
pip install 'libreyolo[vlm]'
```
Weights are downloaded from the Hugging Face Hub on first use into a local weights/ folder. A few families ship under non-OSI licenses and log a one-time notice before downloading. A GPU is recommended for the larger backends, but every model also runs on CPU with device="cpu".
Quickstart
Construct a model, declare the words you care about, and predict. The default backend is Qwen3-VL-4B, which belongs to the strongest family in the tier and is Apache-2.0 licensed.
```python
from libreyolo import LibreVLM

# Qwen3-VL-4B by default; weights autodownload on first use
model = LibreVLM()

# The vocabulary is just words. Any words.
model.set_classes(["pink car", "wheel"])

result = model.predict("street.jpg")

print(result.boxes.xyxy)  # pixel [x1, y1, x2, y2]
print(result.boxes.cls)   # ids into ["pink car", "wheel"]
result.plot()             # same drawing helpers as any LibreYOLO model
result.save("out.jpg")
```
That is the whole loop. Everything downstream of predict() behaves like a normal detector, so existing visualization, cropping, and tracking code keeps working.
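Because boxes.xyxy holds pixel coordinates, cropping downstream of predict() mostly comes down to rounding and clamping. Here is a minimal sketch, assuming the boxes have already been converted to plain Python lists; the helper name crop_windows is hypothetical and not part of LibreYOLO.

```python
import math

def crop_windows(boxes_xyxy, width, height):
    """Turn float pixel boxes into integer crop windows clipped to the image."""
    windows = []
    for x1, y1, x2, y2 in boxes_xyxy:
        # Round outward so the crop never cuts into the box, then clamp.
        left = max(0, math.floor(x1))
        top = max(0, math.floor(y1))
        right = min(width, math.ceil(x2))
        bottom = min(height, math.ceil(y2))
        if right > left and bottom > top:  # skip boxes that fall outside
            windows.append((left, top, right, bottom))
    return windows

# Illustrative values standing in for result.boxes.xyxy
boxes = [[10.4, 20.6, 110.2, 220.9], [-5.0, 300.0, 50.0, 500.0]]
windows = crop_windows(boxes, width=640, height=480)
print(windows)
```

Each tuple follows the (left, upper, right, lower) order that Pillow's Image.crop() expects, so it can be passed straight through if you work with PIL images.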
Supported Models
Pick a backend with the alias you pass to LibreVLM(...). A bare family name resolves to its default size. The default backend overall is qwen3-vl-4b. In practice the strongest detectors are Qwen3-VL, LFM2-VL, and Florence-2.
| Family | Alias | Sizes (params) | License | Notes |
|---|---|---|---|---|
| Qwen3-VL | qwen3-vl-2b / -4b / -8b | 2B / 4B / 8B | Apache-2.0 | Default and strongest. Recommended starting point. |
| LFM2-VL | lfm2-vl-450m / -1.6b | 450M / 1.6B | LFM Open License | Edge-sized, surprisingly strong small detector. Notice-gated. |
| InternVL3 | internvl3-1b / -2b / -8b | 1B / 2B / 8B | Qwen License | Good grounding at 8B; small sizes are weak. Notice-gated. |
| Florence-2 | florence-2-base / -large | 0.23B / 0.77B | MIT | Purpose-built grounding model. Tight boxes, no chat(). |
| SmolVLM2 | smolvlm2-500m / -2.2b | 500M / 2.2B | Apache-2.0 | Tiny and fast; weaker detector. Good for quick trials. |
| Kosmos-2 | kosmos-2 | ~1.6B | MIT | 2023 grounder. Coarser boxes, no chat(). |
Choosing a backend
- Best quality: qwen3-vl-8b or qwen3-vl-4b (the default).
- Tight boxes, small footprint: florence-2-large.
- Edge / CPU: lfm2-vl-450m or smolvlm2-500m.
- Fully permissive license: any Qwen3-VL, SmolVLM2, Florence-2 or Kosmos-2 size.
Licensing
Qwen3-VL and SmolVLM2 are Apache-2.0; Florence-2 and Kosmos-2 are MIT. LFM2-VL and InternVL3 carry non-OSI licenses and emit a one-time notice before their first download, so you can make an informed choice for commercial use.
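If you need to enforce the license split in your own code, a small lookup is enough. The sketch below copies the licenses from the table above; the names FAMILY_LICENSE and is_permissive are hypothetical, and nothing here calls into LibreYOLO itself.

```python
# License per family, copied from the Supported Models table.
FAMILY_LICENSE = {
    "qwen3-vl": "Apache-2.0",
    "lfm2-vl": "LFM Open License",
    "internvl3": "Qwen License",
    "florence-2": "MIT",
    "smolvlm2": "Apache-2.0",
    "kosmos-2": "MIT",
}
PERMISSIVE = {"Apache-2.0", "MIT"}

def is_permissive(alias):
    """Match an alias such as "qwen3-vl-8b" to its family by prefix."""
    for family, license_name in FAMILY_LICENSE.items():
        if alias == family or alias.startswith(family + "-"):
            return license_name in PERMISSIVE
    raise KeyError(f"unknown alias: {alias}")

print(is_permissive("florence-2-large"))
print(is_permissive("lfm2-vl-450m"))
```

A check like this can sit in front of model construction in a commercial pipeline, so a non-OSI backend is never downloaded by accident.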
Setting the Vocabulary
The vocabulary is the heart of open-vocabulary detection. Call set_classes() with a list of label strings. It is sticky: it persists across every later predict() and track() call until you set it again. It returns self, so it chains.
```python
from libreyolo import LibreVLM

# Sticky and chainable
model = LibreVLM("qwen3-vl-2b").set_classes(["person", "dog", "cat"])

# Set it once at construction instead
model = LibreVLM("lfm2-vl-450m", names=["boat"], device="cpu")

# Re-set any time to change what you are looking for
model.set_classes(["a red car", "a blue truck"])
```
Labels can be any phrase. They must be unique when compared case-insensitively ("Dog" and "dog" count as duplicates), and you must pass a list, not a bare string. If you never call set_classes(), the model falls back to the COCO-80 vocabulary so a bare predict() still does something sensible.
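You can check these rules yourself before handing labels to the model. This is a minimal sketch of such a check, not LibreVLM's own validation: the function name validate_labels and the exception types it raises are assumptions.

```python
def validate_labels(labels):
    """Return the labels as a list, or raise on a bare string or duplicates."""
    if isinstance(labels, str):
        # A bare string would otherwise be iterated character by character.
        raise TypeError("pass a list of labels, not a bare string")
    cleaned = list(labels)
    seen = set()
    for label in cleaned:
        key = label.casefold()  # case-insensitive comparison key
        if key in seen:
            raise ValueError(f"duplicate label (ignoring case): {label!r}")
        seen.add(key)
    return cleaned

print(validate_labels(["a red car", "a blue truck"]))
```

Running the check early gives you a clear error at configuration time instead of a surprise on the first predict() call.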
Prediction
predict() (and the equivalent model(...) call) accepts the same source types as any LibreYOLO detector: a path, a PIL image, a NumPy array, a URL, a folder, or a video. stream=True and track() work too.
```python
result = model.predict(
    source="image.jpg",  # path | PIL | ndarray | URL | folder | video
    conf=0.25,           # see note below: scoring is synthetic
    classes=[0],         # optional: keep only these vocabulary ids
    max_det=300,
)
```
Return shape
You get back the standard Results object, identical to a closed-vocabulary detector:
| Field | Shape / type | Meaning |
|---|---|---|
| result.boxes.xyxy | N x 4 | Pixel boxes [x1, y1, x2, y2], scaled to the original image. |
| result.boxes.cls | N | Class ids indexing into your set_classes() vocabulary. |
| result.boxes.conf | N | Synthetic confidence: 1.0 for every box (see Limitations). |
| result.plot() / .save() | - | The usual drawing and saving helpers. |
Under the hood, LibreVLM tolerantly parses the model output (handling markdown fences, stray prose, duplicated boxes, and truncated arrays), maps free-text labels back to your class ids, and drops any label that is not in your vocabulary. That last step is what makes a free-form generator behave like a closed-set detector.
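To make the parsing step concrete, here is a simplified sketch of the idea. It is not LibreVLM's actual parser: the function name parse_detections and the JSON keys "label" and "bbox_2d" are assumptions chosen for illustration. It scans for flat JSON objects one at a time, so fences, prose, and a truncated final object are skipped rather than fatal.

```python
import json
import re

def parse_detections(text, names):
    """Extract (class_id, [x1, y1, x2, y2]) pairs from free-form model output."""
    lookup = {name.casefold(): idx for idx, name in enumerate(names)}
    detections, seen = [], set()
    for match in re.finditer(r"\{[^{}]*\}", text):
        try:
            item = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # not valid JSON, ignore this fragment
        label = str(item.get("label", "")).strip().casefold()
        box = item.get("bbox_2d")
        if label not in lookup:
            continue  # label is not in the vocabulary, drop it
        if not (isinstance(box, list) and len(box) == 4
                and all(isinstance(v, (int, float)) for v in box)):
            continue  # malformed box
        key = (lookup[label], tuple(box))
        if key in seen:
            continue  # exact duplicate of an earlier box
        seen.add(key)
        detections.append((lookup[label], [float(v) for v in box]))
    return detections

fence = "`" * 3  # a markdown fence, built here to keep this example readable
raw = (
    "Sure, here are the boxes:\n" + fence + "json\n"
    '[{"label": "Pink Car", "bbox_2d": [100, 200, 400, 600]},\n'
    ' {"label": "pink car", "bbox_2d": [100, 200, 400, 600]},\n'
    ' {"label": "tree", "bbox_2d": [0, 0, 50, 50]},\n'
    ' {"label": "wheel", "bbox_2d": [120, 500, 180, 560]},\n'
    ' {"label": "wheel", "bbox_2d": [300, 500, 3'
)
detections = parse_detections(raw, ["pink car", "wheel"])
print(detections)
```

In this input the duplicated "pink car" box collapses to one detection, "tree" is dropped because it is not in the vocabulary, and the truncated last object is ignored, leaving one box for each of the two class ids.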
Examples
Detect a specific colored object
```python
from libreyolo import LibreVLM

model = LibreVLM("qwen3-vl-4b")
model.set_classes(["red car"])

result = model.predict("parking_lot.jpg")
print(f"Found {len(result.boxes.cls)} red car(s)")
result.save("red_cars.jpg")
```
Tight boxes with Florence-2
```python
from libreyolo import LibreVLM

# Florence-2 is a purpose-built grounder: very tight pixel boxes.
model = LibreVLM("florence-2-large")
model.set_classes(["a red car", "license plate"])

result = model.predict("car.jpg")
result.plot()
```
Filter to a single class on the fly
```python
from libreyolo import LibreVLM

model = LibreVLM("qwen3-vl-2b").set_classes(["person", "dog", "cat"])

# classes= filters the configured vocabulary by id
people_only = model.predict("street.jpg", classes=[0])
```
Run on CPU with a built-in sample image
```python
from libreyolo import LibreVLM, SAMPLE_IMAGE

model = LibreVLM("lfm2-vl-450m", device="cpu")
# No set_classes() -> falls back to the COCO-80 vocabulary
result = model.predict(SAMPLE_IMAGE)

# Cast the class id to int before indexing, and guard against zero detections
if len(result.boxes.cls) > 0:
    print(model.names[int(result.boxes.cls[0])])  # e.g. "person"
```
Batches, folders, and video
```python
from libreyolo import LibreVLM

model = LibreVLM().set_classes(["forklift", "pallet"])

# A whole folder
for result in model.predict("warehouse_frames/", stream=True):
    result.save()

# A video file (frames are processed one at a time)
model.predict("warehouse.mp4", save=True)
```
Raw Chat
Sometimes you want the model, not the detector. The chat-template families expose chat(), which takes an image and a free-form prompt and returns the decoded text verbatim. Use it for counting, captioning, or quick visual questions.
```python
from libreyolo import LibreVLM

model = LibreVLM("qwen3-vl-4b")

answer = model.chat("harbor.jpg", "How many boats are docked? Answer with a number.")
print(answer)
```
chat() is available on the chat-template families (Qwen3-VL, LFM2-VL, SmolVLM2, InternVL3). Florence-2 and Kosmos-2 are task-token grounders and raise NotImplementedError; use predict() with them.
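If your code has to work across backends, you can treat NotImplementedError as the signal that chat() is unavailable. The sketch below is one way to do that; the helper name ask is hypothetical, and the stub class only stands in for a task-token backend so the example runs without downloading weights.

```python
def ask(model, image, prompt):
    """Return the chat() text, or None when the backend has no chat support."""
    try:
        return model.chat(image, prompt)
    except NotImplementedError:
        return None

class GrounderStub:
    """Stand-in for a task-token backend; not a real LibreYOLO class."""
    def chat(self, image, prompt):
        raise NotImplementedError("chat() is not supported by this backend")

answer = ask(GrounderStub(), "harbor.jpg", "How many boats are docked?")
print(answer)
```

With a real model you would pass the LibreVLM instance instead of the stub and fall back to predict() whenever the helper returns None.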
How Backends Differ
Every family returns the same Results, but they reach it differently. You rarely need to care, yet it helps to know why some backends behave the way they do. The chat families are prompted for a JSON array of boxes; the grounders use dedicated task tokens.
| Family | Prompting | Coordinate space | chat() |
|---|---|---|---|
| Qwen3-VL | JSON box prompt | 0 to 1000, rescaled | Yes |
| LFM2-VL | JSON box prompt | Normalized 0 to 1 | Yes |
| SmolVLM2 | JSON box prompt | Normalized 0 to 1 | Yes |
| InternVL3 | JSON box prompt | 0 to 1000, rescaled | Yes |
| Florence-2 | Task token | Native pixels | No |
| Kosmos-2 | Grounding prompt | Normalized, rescaled | No |
For the chat families you can override the detection prompt with the prompt= constructor argument, and cap generation length with max_new_tokens=. Device and dtype are resolved automatically: bf16 or fp16 on CUDA, fp32 on CPU.
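The coordinate-space column boils down to one linear rescale. As a rough sketch (the function name to_pixels is hypothetical, and LibreVLM's internal conversion may differ in rounding or clamping), a box on a 0 to 1000 grid or a normalized 0 to 1 grid maps to pixels like this:

```python
def to_pixels(box, width, height, scale):
    """Rescale [x1, y1, x2, y2] from a 0..scale grid to pixel coordinates."""
    x1, y1, x2, y2 = box
    return [
        x1 / scale * width,
        y1 / scale * height,
        x2 / scale * width,
        y2 / scale * height,
    ]

# 0 to 1000 grid, as listed for Qwen3-VL and InternVL3
grid_box = to_pixels([250, 500, 750, 1000], width=1280, height=720, scale=1000)

# Normalized 0 to 1, as listed for LFM2-VL and SmolVLM2
unit_box = to_pixels([0.25, 0.5, 0.75, 1.0], width=1280, height=720, scale=1)

print(grid_box)  # [320.0, 360.0, 960.0, 720.0]
print(unit_box)  # [320.0, 360.0, 960.0, 720.0]
```

Both inputs describe the same region, which is why the two results match once they are expressed in pixels.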
Limitations
LibreVLM is powerful but young. Knowing the boundaries up front saves surprises later.
- Synthetic confidence. Every box is scored 1.0. The conf= filter therefore behaves as all-or-nothing rather than a real threshold.
- No mAP / validation. val() raises, because synthetic scores would make COCO mAP misleading.
- No training or export. train() and export() raise. Fine-tune the VLM upstream and load the resulting weights instead.
- Tracking is degraded. track() runs, but uniform scores make the tracker's low-confidence recovery stage inert.
- One image at a time. Generation is sequential in v1, so larger batch= values give no speedup.
- Python API only. The libreyolo CLI does not resolve VLM aliases yet.
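The effect of uniform scores is easy to see in a few lines. This sketch assumes a score >= conf comparison and a ByteTrack-style split into high and low confidence pools; both are assumptions about the internals, and the thresholds 0.5 and 0.1 are illustrative only.

```python
def keep_by_conf(scores, conf):
    """Indices of boxes whose score passes the threshold."""
    return [i for i, score in enumerate(scores) if score >= conf]

def split_for_tracker(scores, high=0.5, low=0.1):
    """Split detections into high and low confidence pools for two-stage matching."""
    high_pool = [i for i, score in enumerate(scores) if score >= high]
    low_pool = [i for i, score in enumerate(scores) if low <= score < high]
    return high_pool, low_pool

scores = [1.0, 1.0, 1.0]  # synthetic: every box is scored 1.0

print(keep_by_conf(scores, 0.25))  # [0, 1, 2]
print(keep_by_conf(scores, 0.90))  # [0, 1, 2]
print(keep_by_conf(scores, 1.01))  # []
print(split_for_tracker(scores))   # ([0, 1, 2], [])
```

Any threshold up to 1.0 keeps every box and anything above it keeps none, while the low confidence pool that a two-stage tracker would use for recovery is always empty.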
Where it shines
Use LibreVLM when the class set is open ended, changes often, or is hard to label up front: rapid prototyping, long-tail or rare categories, and "find the thing I describe in words" workflows. When you need calibrated confidence, throughput, or a deployable artifact, train a closed-vocabulary YOLO9 or RF-DETR from the core docs.