Vision language models help cities get safer
As cities prepare for massive global events like the 2026 World Cup and the general growth of travel and tourism, pressure is mounting to maintain public safety in environments that are sprawling, complex, and constantly in motion. At this point, the question isn’t whether to use AI. It’s how to use it responsibly. That’s where vision language models come in: they notice patterns of behavior, and they do so with human oversight.
Vision language models interpret scenes the way humans do
Vision language models (VLMs) sit at the intersection of computer vision and natural language processing. Think of them as AI systems trained to see and describe what’s happening in visual environments, just like a human would.
Where traditional computer vision models detect and label isolated objects (like “bottle,” “person,” or “car”), VLMs can interpret scenes in a more holistic, nuanced way. They don’t just see a “person with a suitcase.” They might note that “a child has become separated from a group of passengers near Gate 7,” or “a bag has been left unattended at a busy terminal entrance.”
They accomplish this by fusing vision encoders (such as convolutional neural networks or vision transformers) with large language models trained on textual data. The result is a model that goes beyond parsing images and explains what it sees in natural language—generating human-readable insights about context, risk, or action.
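To make that architecture concrete, here is a minimal sketch of the pipeline using an open-source VLM (BLIP-2, via the Hugging Face transformers library). The model choice, prompt, and frame file are illustrative assumptions; the systems described in this article would use purpose-built models and live video streams rather than a single image file.

```python
# Minimal sketch: natural-language scene description with an open VLM.
# BLIP-2 pairs a vision encoder with a language model, as described above.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# A single frame pulled from a camera feed (illustrative file name).
frame = Image.open("station_frame.jpg")

# The vision encoder embeds the frame; the language model conditions on
# those embeddings plus a text prompt to generate a description.
prompt = "Question: What is happening in this scene? Answer:"
inputs = processor(images=frame, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```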
This means they’re not designed to identify who someone is. They’re designed to help responders understand what’s unfolding and what needs attention. In other words, VLMs replace blanket surveillance with targeted situational awareness.
And while the VLM provides rich, human-readable context, it doesn’t act on that information. When autonomous responses are needed, such as investigating a suspicious object or rerouting pedestrian flow, agentic AI systems step in. These systems build on what the VLM observes and are designed to take action based on that input, always under human-defined rules and oversight.
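As a rough illustration of that handoff, the sketch below maps VLM descriptions to a bounded set of responses, with a sign-off flag for anything consequential. Every name here (the rules, actions, and function) is hypothetical, not a real product API.

```python
# Hypothetical sketch: an agentic layer acting on VLM output under
# human-defined rules. Action names and rules are illustrative only.
ALLOWED_ACTIONS = {"notify_staff", "update_signage", "reroute_foot_traffic"}

RULES = [
    # (keyword in the VLM's description, proposed action, needs human sign-off?)
    ("unattended", "notify_staff", True),
    ("overcrowding", "reroute_foot_traffic", False),
]

def propose_actions(vlm_description: str):
    """Map a natural-language scene description to bounded, auditable actions."""
    for keyword, action, needs_approval in RULES:
        if keyword in vlm_description.lower():
            assert action in ALLOWED_ACTIONS  # the agent can never invent actions
            yield {"action": action,
                   "needs_human_approval": needs_approval,
                   "evidence": vlm_description}

for proposal in propose_actions("A bag has been left unattended near Gate 7."):
    print(proposal)
```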
Real-time understanding makes crowded spaces safer
Let’s take the example of a major train station, such as New York’s Penn Station or London’s King’s Cross. Thousands of people pass through each hour. There are platforms, escalators, kiosks, luggage, and constant motion captured by thousands of cameras. Humans watching screens simply can’t absorb and act on every signal.
With a VLM operating behind the scenes, video feeds from various cameras are processed in real time, not for face recognition, but for scene interpretation. The model might detect that a wheelchair-accessible path is blocked by cleaning equipment, or that someone has fallen near the ticket machines. It can identify overcrowding before it becomes dangerous, flag packages left in odd locations, or notice when someone is lying down and not moving.
What makes this so powerful is not just the detection, but the ability to describe the situation in plain language and triage it appropriately. Instead of flagging generic motion, a VLM can generate outputs like:
“There is a motionless person lying near exit 3. No one is interacting with them. Flagged as a potential medical emergency.”
“A stroller is left unattended at the foot of a descending escalator. Possible obstruction risk.”
This contextual awareness reduces false alarms and improves response times, because staff aren’t sifting through raw video. They’re reading concise, meaningful descriptions of scenes worth checking out.
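To show what that triage might look like downstream of the descriptions above, here is a small sketch that wraps the VLM’s free text in a structured alert. The severity heuristic and the Alert fields are assumptions for illustration, not any product’s schema.

```python
# Hypothetical sketch: turning a VLM's free-text description into a
# structured, triaged alert. Severity rules and fields are illustrative.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Alert:
    camera_id: str
    description: str  # the VLM's natural-language output, verbatim
    severity: str     # "info" | "warning" | "critical"
    timestamp: str

def triage(camera_id: str, description: str) -> Alert:
    text = description.lower()
    if "motionless" in text or "medical" in text:
        severity = "critical"
    elif "unattended" in text or "obstruction" in text:
        severity = "warning"
    else:
        severity = "info"
    return Alert(camera_id, description, severity,
                 datetime.now(timezone.utc).isoformat())

print(triage("cam-17", "There is a motionless person lying near exit 3."))
```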
Privacy is built into how vision language models work
This is where the line between public safety and surveillance is most important. Surveillance implies a focus on individuals—tracking where someone goes, what they look like, what they do.
But with VLMs, the emphasis is on patterns, behaviors, and anomalies in context. The model isn’t storing identity profiles or following people from one frame to the next. It’s analyzing the environment to alert operators when something unusual happens that might need human attention.
And that difference matters, not just technically, but ethically.
Because VLMs operate at the level of contextual understanding rather than identity resolution, they can be deployed in compliance with privacy regulations like GDPR or U.S. state laws. Their use can be further bounded by governance layers like redaction, on-premises edge computing, and no-retention policies for raw footage. These technical guardrails ensure that public safety applications do not morph into surveillance platforms over time.
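What a no-retention guardrail can mean in practice is sketched below: frames are processed in memory at the edge, and only the derived text leaves the device. This is one plausible design under assumed function names, not a description of any specific deployment.

```python
# Hypothetical sketch of a no-retention edge loop: raw frames are never
# written to disk; only the VLM's text description is forwarded onward.
def process_stream(frames, describe_scene, publish_alert):
    """frames: an in-memory camera feed; both callables are assumed stand-ins."""
    for frame in frames:
        description = describe_scene(frame)  # VLM inference on the edge device
        if description:
            publish_alert(description)       # forward derived text only
        del frame  # drop the reference so raw pixels can be reclaimed immediately
```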
Cities like Singapore are already building on these principles: AI systems there monitor crowd density at transit hubs to dynamically route foot traffic, using real-time video analytics to detect overcrowding and trigger automated responses such as redirecting passenger flow, adjusting signage, or notifying staff to manage congestion more effectively.
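A crowd-density trigger of that kind can be approximated with an off-the-shelf person detector. The sketch below uses the ultralytics YOLO API; the threshold, zone handling, and response are illustrative assumptions, not how any deployed system works.

```python
# Sketch: a per-zone crowd-density trigger with an off-the-shelf detector
# (ultralytics YOLO). The threshold and response here are assumptions.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")   # small pretrained COCO detector
MAX_PEOPLE_PER_ZONE = 40     # illustrative safety threshold

def count_people(frame) -> int:
    result = model(frame)[0]
    # Class 0 is "person" in the COCO label set this model was trained on.
    return sum(1 for box in result.boxes if int(box.cls) == 0)

def check_zone(frame, zone_name: str) -> None:
    n = count_people(frame)
    if n > MAX_PEOPLE_PER_ZONE:
        # Stand-in for redirecting flow, updating signage, or paging staff.
        print(f"Overcrowding in {zone_name}: {n} people detected.")
```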
Humans stay in control of every decision
Even the most advanced VLM should never be the sole decision-maker in a safety response. That’s why many of today’s deployments follow a human-in-the-loop model. The VLM analyzes video feeds, identifies context-rich anomalies, and generates text-based insights. But it’s human operators who validate and act on those insights, deciding whether to dispatch a medical team, redirect pedestrian flow, or investigate further.
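One simple way to encode that division of labor is to log both the model’s insight and the operator’s decision, so accountability stays with a person. The record shape and names below are hypothetical.

```python
# Hypothetical sketch of the human-in-the-loop step: the model proposes,
# an operator decides, and both sides of the exchange are logged.
import json
import time

def review(insight: dict, decision: str, operator_id: str) -> None:
    record = {
        "insight": insight,        # the VLM's text-based finding
        "decision": decision,      # e.g. "dispatch_medical" or "dismiss"
        "operator": operator_id,   # a person, not the model, owns the call
        "ts": time.time(),
    }
    print(json.dumps(record))      # stand-in for an append-only audit log

review({"camera": "cam-03", "text": "Motionless person lying near exit 3."},
       "dispatch_medical", "op-112")
```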
This approach allows cities to scale their response capabilities without automating judgment calls that should remain human-led. It’s a partnership between machine perception and human discernment, which augments awareness without replacing accountability.
Cities need smarter safety tools that earn public trust
As cities and transit systems face mounting pressure to secure their environments without provoking public backlash, VLMs offer a responsible path forward. They provide clarity without coercion, efficiency without intrusion, and awareness without surveillance.
And as large-scale events, from global sports tournaments to political conventions, continue to draw massive crowds into shared spaces, that balance becomes not just preferable, but essential.
Centific has designed Verity VLM to work at the speed of context
Centific’s own vision language model, Verity VLM, was designed with exactly this distinction in mind. It enhances safety in complex public spaces, like airports, streets, and train stations, by combining real-time visual analytics with language-based scene interpretation. It doesn’t track people. It sees patterns. And it’s built on edge computing infrastructure to keep data private, ephemeral, and contextually focused.
Our aim with Verity VLM is to help cities, operators, and responders focus on what matters, faster, smarter, and more responsibly.
Learn more about advanced vision language models.
Peter Schultz is a seasoned marketing, business development, and product strategy executive with more than 30 years of experience building companies around disruptive technologies. He has a proven track record of driving revenue by developing innovative products, leading digital transformations, and crafting campaigns for global brands. A hybrid business development and product leader, Peter excels at aligning products with customer needs through a consultative, solution-focused approach. His expertise spans product strategy, AI and machine learning, digital marketing, emerging media, customer engagement, and strategic partnerships, with a commitment to driving business growth and fostering social responsibility.