Experimental tier

LibreVLM

Point a vision language model at an image, hand it a list of words, and get back boxes. LibreVLM turns Qwen3-VL, Florence-2 and friends into open-vocabulary object detectors that speak the exact same Results API as every other LibreYOLO model.

Introduction

A classic detector ships with a fixed list of classes baked into its head. LibreVLM throws that constraint away. It wraps modern instruction-tuned vision language models, prompts them to emit bounding boxes, parses the generated text, and returns the same Results object you already use for YOLO9 and RF-DETR. The class list is just a list of words you supply at runtime, so adding a new category costs nothing and works zero-shot.

  • Open vocabulary. Detect "pink car", "license plate", or "the small island" without ever training a head for them.
  • One factory, one contract. Models built with LibreVLM(...) return the standard Results from predict(), with boxes.xyxy, boxes.cls, boxes.conf, plus .plot() and .save().
  • Swappable backends. Six model families behind one alias string, from a 230M Florence-2 to an 8B Qwen3-VL.
  • A raw escape hatch. chat() gives you free-form image question answering when you need more than boxes.

Why a separate tier (and a separate page)

LibreVLM is deliberately kept out of the closed-vocabulary LibreYOLO(...) factory and its .pt registry. These models are prompt-driven, open vocabulary, and report synthetic confidence, so they honor a different contract. Treating them as their own tier keeps the core detection docs clean and honest about what is being measured.

On the dev branch

LibreVLM currently lives on the dev branch and is targeted for the v1.3 release; it is not part of v1.2.0. It is a Python-only inference tier: there is no training, validation, export, or CLI path yet, and confidence scores are placeholders. Read the Limitations section before you build on top of it.

Installation

LibreVLM lives behind the optional vlm extra. The extra pulls in a recent transformers release plus the helper packages that a couple of the processors need. Without the extra, importing a VLM family raises an ImportError that points you here.

```bash
pip install 'libreyolo[vlm]'
```

Weights are downloaded from the Hugging Face Hub on first use into a local weights/ folder. A few families ship under non-OSI licenses and log a one-time notice before downloading. A GPU is recommended for the larger backends, but every model also runs on CPU with device="cpu".
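
If you would rather check for the extra up front than catch the ImportError later, a small guard like the one below can help. It is only a sketch: the function name is made up for this example, and looking for transformers is a proxy for "the vlm extra is installed", not a check that LibreYOLO itself performs.

```python
import importlib.util


def vlm_extra_available(module_name: str = "transformers") -> bool:
    """Return True if the given top-level module can be found in this environment.

    Hypothetical helper: transformers is used as a stand-in for the vlm extra.
    """
    return importlib.util.find_spec(module_name) is not None


if vlm_extra_available():
    print("transformers found; the vlm extra looks installed")
else:
    print("transformers missing; run: pip install 'libreyolo[vlm]'")
```

The check only looks the module up; it does not import it, so it stays cheap even when transformers is present.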

Quickstart

Construct a model, declare the words you care about, and predict. The default backend is Qwen3-VL-4B, an Apache-2.0 licensed model from the strongest detector family in the tier.

```python
from libreyolo import LibreVLM

# Qwen3-VL-4B by default; weights autodownload on first use
model = LibreVLM()

# The vocabulary is just words. Any words.
model.set_classes(["pink car", "wheel"])

result = model.predict("street.jpg")

print(result.boxes.xyxy)  # pixel [x1, y1, x2, y2]
print(result.boxes.cls)   # ids into ["pink car", "wheel"]
result.plot()             # same drawing helpers as any LibreYOLO model
result.save("out.jpg")
```

That is the whole loop. Everything downstream of predict() behaves like a normal detector, so existing visualization, cropping, and tracking code keeps working.

Supported Models

Pick a backend with the alias you pass to LibreVLM(...). A bare family name resolves to its default size. The default backend overall is qwen3-vl-4b. In practice the strongest detectors are Qwen3-VL, LFM2-VL, and Florence-2.

| Family | Alias | Sizes (params) | License | Notes |
|---|---|---|---|---|
| Qwen3-VL | qwen3-vl-2b / -4b / -8b | 2B / 4B / 8B | Apache-2.0 | Default and strongest. Recommended starting point. |
| LFM2-VL | lfm2-vl-450m / -1.6b | 450M / 1.6B | LFM Open License | Edge sized, surprisingly strong small detector. Notice gated. |
| InternVL3 | internvl3-1b / -2b / -8b | 1B / 2B / 8B | Qwen License | Good grounding at 8B; small sizes are weak. Notice gated. |
| Florence-2 | florence-2-base / -large | 0.23B / 0.77B | MIT | Purpose-built grounding model. Tight boxes, no chat(). |
| SmolVLM2 | smolvlm2-500m / -2.2b | 500M / 2.2B | Apache-2.0 | Tiny and fast; weaker detector. Good for quick trials. |
| Kosmos-2 | kosmos-2 | ~1.6B | MIT | 2023 grounder. Coarser boxes, no chat(). |

Choosing a backend

  • Best quality: qwen3-vl-8b or qwen3-vl-4b (the default).
  • Tight boxes, small footprint: florence-2-large.
  • Edge / CPU: lfm2-vl-450m or smolvlm2-500m.
  • Fully permissive license: any Qwen3-VL, SmolVLM2, Florence-2 or Kosmos-2 size.

Licensing

Qwen3-VL and SmolVLM2 are Apache-2.0; Florence-2 and Kosmos-2 are MIT. LFM2-VL and InternVL3 carry non-OSI licenses and emit a one-time notice before their first download, so you can make an informed choice for commercial use.
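
If license is a hard constraint in your project, it can be convenient to encode the table once and filter on it. The snippet below is an illustrative sketch only: the dictionary is copied by hand from the Supported Models table, and neither it nor the helper is part of LibreYOLO.

```python
# License per family, copied from the Supported Models table above.
FAMILY_LICENSES = {
    "Qwen3-VL": "Apache-2.0",
    "LFM2-VL": "LFM Open License",
    "InternVL3": "Qwen License",
    "Florence-2": "MIT",
    "SmolVLM2": "Apache-2.0",
    "Kosmos-2": "MIT",
}

# The two OSI-approved licenses that appear in the table.
PERMISSIVE = {"Apache-2.0", "MIT"}


def permissive_families(licenses):
    """Return the family names whose license is in PERMISSIVE, in table order."""
    return [family for family, lic in licenses.items() if lic in PERMISSIVE]


print(permissive_families(FAMILY_LICENSES))
# ['Qwen3-VL', 'Florence-2', 'SmolVLM2', 'Kosmos-2']
```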

Setting the Vocabulary

The vocabulary is the heart of open-vocabulary detection. Call set_classes() with a list of label strings. It is sticky: it persists across every later predict() and track() call until you set it again. It returns self, so it chains.

```python
# Sticky and chainable
model = LibreVLM("qwen3-vl-2b").set_classes(["person", "dog", "cat"])

# Set it once at construction instead
model = LibreVLM("lfm2-vl-450m", names=["boat"], device="cpu")

# Re-set any time to change what you are looking for
model.set_classes(["a red car", "a blue truck"])
```

Labels can be any phrase. They must be unique when compared case-insensitively (so "Dog" and "dog" count as the same label), and you must pass a list, not a bare string. If you never call set_classes(), the model falls back to the COCO-80 vocabulary so a bare predict() still does something sensible.
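
The two rules above (a list rather than a bare string, and no case-insensitive duplicates) are easy to check before you hand a vocabulary to the model. The helper below is a hypothetical sketch of such a check; the function name and the exception types are assumptions for this example, and set_classes() may report violations differently.

```python
def validate_labels(labels):
    """Check a vocabulary against the two rules above and return it as a list.

    Hypothetical helper, not part of LibreYOLO.
    """
    if isinstance(labels, str):
        # A bare string would otherwise be iterated character by character.
        raise TypeError("pass a list of labels, not a bare string")
    labels = list(labels)
    seen = {}
    for label in labels:
        key = label.casefold()  # case-insensitive comparison key
        if key in seen:
            raise ValueError(f"duplicate label: {label!r} clashes with {seen[key]!r}")
        seen[key] = label
    return labels


print(validate_labels(["a red car", "a blue truck"]))
# ['a red car', 'a blue truck']
```

Calling it with "person" raises TypeError, and calling it with ["Dog", "dog"] raises ValueError.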

Prediction

predict() (and the equivalent model(...) call) accepts the same source types as any LibreYOLO detector: a path, a PIL image, a numpy array, a URL, a folder, or a video. stream=True and track() work too.

```python
result = model.predict(
    source="image.jpg",  # path | PIL | ndarray | URL | folder | video
    conf=0.25,           # see note below: scoring is synthetic
    classes=[0],         # optional: keep only these vocabulary ids
    max_det=300,
)
```

Return shape

You get back the standard Results object, identical to a closed-vocabulary detector:

| Field | Shape / type | Meaning |
|---|---|---|
| result.boxes.xyxy | N x 4 | Pixel boxes [x1, y1, x2, y2], scaled to the original image. |
| result.boxes.cls | N | Class ids indexing into your set_classes() vocabulary. |
| result.boxes.conf | N | Synthetic confidence: 1.0 for every box (see Limitations). |
| result.plot() / .save() | - | The usual drawing and saving helpers. |
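
Because class ids index into your vocabulary, turning a result into per-label counts takes only a few lines. The sketch below uses plain Python lists as stand-ins for result.boxes.cls and the vocabulary so that it runs without any model; count_by_label is a name invented for this example.

```python
from collections import Counter


def count_by_label(class_ids, names):
    """Count detections per label, given class ids and the vocabulary."""
    # int() turns tensor or numpy scalars into plain indices.
    return Counter(names[int(i)] for i in class_ids)


# Stand-ins for the set_classes() vocabulary and result.boxes.cls.
names = ["pink car", "wheel"]
class_ids = [1, 0, 1, 1]

print(count_by_label(class_ids, names))
# Counter({'wheel': 3, 'pink car': 1})
```

With a real result you would presumably call count_by_label(result.boxes.cls, model.names); whether the ids arrive as a tensor or an array depends on the backend, which is why the helper casts each one to int.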

Under the hood, LibreVLM tolerantly parses the model output (handling markdown fences, stray prose, duplicated boxes, and truncated arrays), maps free-text labels back to your class ids, and drops any label that is not in your vocabulary. That last step is what makes a free-form generator behave like a closed-set detector.
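
To make those steps concrete, here is a minimal sketch of a tolerant parser. It is not LibreVLM's actual parser: the "label" and "bbox" keys are an assumed output schema, and scanning for flat JSON objects with a regular expression is just one simple way to survive fences, prose, and truncation.

```python
import json
import re

# Matches one flat JSON object such as {"label": "dog", "bbox": [1, 2, 3, 4]}.
_OBJECT_RE = re.compile(r"\{[^{}]*\}")


def parse_detections(text, names):
    """Parse generated text into a list of (class_id, [x1, y1, x2, y2]).

    Assumes each detection is a JSON object with "label" and "bbox" keys;
    that schema is illustrative, not LibreVLM's documented format.
    """
    lookup = {name.casefold(): i for i, name in enumerate(names)}
    detections, seen = [], set()
    # Scanning object by object skips fences and prose, and keeps the
    # complete objects of an array that was cut off mid-generation.
    for match in _OBJECT_RE.finditer(text):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        label = str(obj.get("label", "")).strip().casefold()
        box = obj.get("bbox")
        if label not in lookup or not isinstance(box, list) or len(box) != 4:
            continue  # label outside the vocabulary, or malformed box
        try:
            coords = [float(v) for v in box]
        except (TypeError, ValueError):
            continue
        key = (lookup[label], tuple(coords))
        if key in seen:
            continue  # exact duplicate of an earlier box
        seen.add(key)
        detections.append((lookup[label], coords))
    return detections


names = ["person", "dog", "cat"]
raw = (
    'Here are the boxes:\n'
    '[{"label": "Dog", "bbox": [10, 20, 110, 220]},\n'
    ' {"label": "dog", "bbox": [10, 20, 110, 220]},\n'
    ' {"label": "unicorn", "bbox": [0, 0, 5, 5]},\n'
    ' {"label": "cat", "bbox": [300, 40, '
)
print(parse_detections(raw, names))
# [(1, [10.0, 20.0, 110.0, 220.0])]
```

In the sample, the duplicate dog box is dropped, "unicorn" is discarded because it is not in the vocabulary, and the truncated cat entry never forms a complete object, so a single detection survives.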

Examples

Detect a specific colored object

```python
from libreyolo import LibreVLM

model = LibreVLM("qwen3-vl-4b")
model.set_classes(["red car"])

result = model.predict("parking_lot.jpg")
print(f"Found {len(result.boxes.cls)} red car(s)")
result.save("red_cars.jpg")
```

Tight boxes with Florence-2

```python
# Florence-2 is a purpose-built grounder: very tight pixel boxes.
model = LibreVLM("florence-2-large")
model.set_classes(["a red car", "license plate"])

result = model.predict("car.jpg")
result.plot()
```

Filter to a single class on the fly

```python
model = LibreVLM("qwen3-vl-2b").set_classes(["person", "dog", "cat"])

# classes= filters the configured vocabulary by id
people_only = model.predict("street.jpg", classes=[0])
```

Run on CPU with a built-in sample image

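```python
from libreyolo import LibreVLM, SAMPLE_IMAGE

model = LibreVLM("lfm2-vl-450m", device="cpu")
# No set_classes() -> falls back to the COCO-80 vocabulary
result = model.predict(SAMPLE_IMAGE)

# Guard against an empty result, and cast the class id to a plain int
# before using it to look up the label in model.names
if len(result.boxes.cls) > 0:
    print(model.names[int(result.boxes.cls[0])])  # e.g. "person"
```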

Batches, folders, and video

```python
model = LibreVLM().set_classes(["forklift", "pallet"])

# A whole folder
for result in model.predict("warehouse_frames/", stream=True):
    result.save()

# A video file (frames are processed one at a time)
model.predict("warehouse.mp4", save=True)
```

Raw Chat

Sometimes you want the model, not the detector. The chat-template families expose chat(), which takes an image and a free-form prompt and returns the decoded text verbatim. Use it for counting, captioning, or quick visual questions.

```python
model = LibreVLM("qwen3-vl-4b")

answer = model.chat("harbor.jpg", "How many boats are docked? Answer with a number.")
print(answer)
```

chat() is available on the chat-template families (Qwen3-VL, LFM2-VL, SmolVLM2, InternVL3). Florence-2 and Kosmos-2 are task-token grounders and raise NotImplementedError; use predict() with them.
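
If your code has to work across backends, you may want to treat a missing chat() as "no answer" and pull a number out of the free-form reply. The two helpers below are hypothetical conveniences written for this page, not LibreYOLO functions; they rely only on the behavior described above (chat() returns text, or raises NotImplementedError on the grounders).

```python
import re


def ask(model, image, prompt):
    """Return the chat() reply, or None for backends that do not support chat()."""
    try:
        return model.chat(image, prompt)
    except NotImplementedError:
        return None


def first_int(text):
    """Return the first integer found in a free-form answer, or None."""
    match = re.search(r"-?\d+", text or "")
    return int(match.group()) if match else None


print(first_int("I can see 7 boats docked."))  # 7
print(first_int(None))                         # None
```

A counting call then reads first_int(ask(model, "harbor.jpg", "How many boats are docked? Answer with a number.")), and yields None on Florence-2 or Kosmos-2 instead of raising.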

How Backends Differ

Every family returns the same Results, but they reach it differently. You rarely need to care, yet it helps to know why some backends behave the way they do. The chat families are prompted for a JSON array of boxes; the grounders use dedicated task tokens.

| Family | Prompting | Coordinate space | chat() |
|---|---|---|---|
| Qwen3-VL | JSON box prompt | 0 to 1000, rescaled | Yes |
| LFM2-VL | JSON box prompt | Normalized 0 to 1 | Yes |
| SmolVLM2 | JSON box prompt | Normalized 0 to 1 | Yes |
| InternVL3 | JSON box prompt | 0 to 1000, rescaled | Yes |
| Florence-2 | Task token | Native pixels | No |
| Kosmos-2 | Grounding prompt | Normalized, rescaled | No |

For the chat families you can override the detection prompt with the prompt= constructor argument, and cap generation length with max_new_tokens=. Device and dtype are resolved automatically: bf16 or fp16 on CUDA, fp32 on CPU.
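
The coordinate spaces in the table all end up as pixel boxes on the original image. As a rough illustration of that rescaling, the sketch below maps a box from a 0-to-scale grid to pixels. The function name and the clamping step are choices made for this example; the library's own conversion may differ in detail.

```python
def to_pixels(box, width, height, scale=1.0):
    """Map [x1, y1, x2, y2] from a 0..scale grid to pixel coordinates."""

    def clamp(value, upper):
        # Keep slightly out-of-range generations inside the image.
        return max(0.0, min(float(upper), value))

    x1, y1, x2, y2 = box
    return [
        clamp(x1 * width / scale, width),
        clamp(y1 * height / scale, height),
        clamp(x2 * width / scale, width),
        clamp(y2 * height / scale, height),
    ]


# 0 to 1000 grid (the Qwen3-VL and InternVL3 rows) on a 1920x1080 image
print(to_pixels([250, 500, 750, 1000], 1920, 1080, scale=1000))
# [480.0, 540.0, 1440.0, 1080.0]

# Normalized 0 to 1 (the LFM2-VL and SmolVLM2 rows)
print(to_pixels([0.25, 0.5, 0.75, 1.0], 1920, 1080))
# [480.0, 540.0, 1440.0, 1080.0]
```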

Limitations

LibreVLM is powerful but young. Knowing the boundaries up front saves surprises later.

  • Synthetic confidence. Every box is scored 1.0. The conf= filter therefore behaves as all-or-nothing rather than a real threshold.
  • No mAP / validation. val() raises, because synthetic scores would make COCO mAP misleading.
  • No training or export. train() and export() raise. Fine-tune the VLM upstream and load the resulting weights instead.
  • Tracking is degraded. track() runs, but uniform scores make the tracker's low-confidence recovery stage inert.
  • One image at a time. Generation is sequential in v1, so larger batch= values give no speedup.
  • Python API only. The libreyolo CLI does not resolve VLM aliases yet.
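
The first limitation is easy to see with a toy filter. The sketch below assumes the threshold is applied as score >= conf, which is an assumption about the comparison rather than something this page specifies; the point is only that uniform scores leave the filter with nothing to separate.

```python
def keep_by_conf(scores, conf):
    """Return the indices of boxes whose score is at least the threshold."""
    return [i for i, score in enumerate(scores) if score >= conf]


# Every LibreVLM box carries the same placeholder score.
scores = [1.0, 1.0, 1.0]

print(keep_by_conf(scores, 0.25))  # [0, 1, 2]
print(keep_by_conf(scores, 0.90))  # [0, 1, 2]
print(keep_by_conf(scores, 1.01))  # []
```

The same uniformity is why a tracker stage that relies on a band of low-confidence detections has nothing to work with.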

Where it shines

Use LibreVLM when the class set is open ended, changes often, or is hard to label up front: rapid prototyping, long-tail or rare categories, and "find the thing I describe in words" workflows. When you need calibrated confidence, throughput, or a deployable artifact, train a closed-vocabulary YOLO9 or RF-DETR from the core docs.

Inference only | dev branch / targeting v1.3 | Source on GitHub