WorldVision

Editor's note. This page is groomed from the 6-file WorldVision suite in lumina5/docs-site/docs/roadmap/world-vision/ (overview, lean MVP, legacy MVP plan, full implementation plan, post-MVP roadmap). It distills vision and ambition first, with implementation specifics treated as supporting evidence — not the centerpiece. Source documents are kept verbatim in lumina5 for engineers; this page is for strategy and shared understanding.

What it is

WorldVision turns the mobile camera into an input device for Bike4Mind. Point a phone at a chess board, a Tarot spread, a D&D battle map, a cluttered workshop, a robot's field of view — and the AI on the other end understands what it's looking at and acts on it. Capture, reason, respond. The cognitive workshop graduates from text-and-files to text-and-files-and-the-world-around-you.

This is also the bridge from the screen to the room. WorldVision is the perceptual layer that makes the Physical AI roadmap — Magic Mirrors, Mobile Mirrors, Full Robotics — real.

The capabilities

Capability | What it does
Visual understanding | GPT-4V (primary) + Gemini (fallback) over images captured on phone. The AI reads the scene, not just the pixels.
Game-aware modes | Chess (board recognition, FEN notation), Tarot (card identification, interpretation), D&D (dice, miniatures, eventually the whole table). Each mode is a system prompt + a function-call kit.
Robotics API | External REST API any robot can call. Phone-on-rover, Raspberry Pi, Jetson, ESP32-CAM, USB webcam: same endpoint, same auth, same response shape.
Continuous capture | Timer-based or motion-triggered capture for autonomous systems. Webhook out when something interesting changes.
Multimodal session | Vision analyses link back to the chat session that asked for them. The conversation remembers what it saw.
Privacy-first | GDPR right to deletion, CCPA data export, soft-delete + S3 lifecycle expiry. The user owns their images and can pull the plug.
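
The motion-triggered variant of continuous capture can be illustrated with a small frame-differencing check. This is a minimal sketch, not the WorldVision implementation: the flattened-grayscale frame representation, the function names, and the threshold value are all assumptions made for illustration.

```python
from typing import Optional, Sequence

Frame = Sequence[int]  # assumed: flattened grayscale pixels, each 0-255


def mean_abs_diff(prev: Frame, curr: Frame) -> float:
    """Average per-pixel absolute difference between two same-sized frames."""
    if len(prev) != len(curr):
        raise ValueError("frames must have the same number of pixels")
    if not prev:
        return 0.0
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(prev)


def should_notify(prev: Optional[Frame], curr: Frame, threshold: float = 10.0) -> bool:
    """Decide whether the scene changed enough to justify a webhook call.

    The first frame always notifies because there is nothing to compare against.
    The threshold is a hypothetical tuning knob, not a documented default.
    """
    if prev is None:
        return True
    return mean_abs_diff(prev, curr) >= threshold


still = [100] * 16
moved = [100] * 8 + [180] * 8  # half the pixels change by 80, so the mean diff is 40

print(should_notify(None, still))   # True
print(should_notify(still, still))  # False
print(should_notify(still, moved))  # True
```

A real capture loop would likely downsample frames before comparing them and debounce repeated triggers, but the decision itself stays this small.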

The architecture leans on infrastructure Bike4Mind already runs: UserApiKey (with a new VISION_ANALYZE scope), apiKeyAuth, rateLimit, fabFileService (S3 + presigned URLs), and the standard service pattern. WorldVision is mostly new prompts and one new table on top of the existing B4M substrate — not a parallel stack.

The MVP

The lean MVP ships Robotics + Chess with the minimum new surface area:

  • One new table: vision_analyses (user, session, image reference, mode, prompt, response, tokens, soft-delete timestamp).
  • One new API scope: ApiKeyScope.VISION_ANALYZE.
  • One new service: visionAnalysisService, following the existing service pattern.
  • Two new endpoints: /api/vision/analyze and /api/vision/history. The upload endpoint is the existing /api/files/generate-presigned-url; no duplication.
  • One new UI component: VisionAnalyzer, using Capacitor's native camera on iOS/Android.
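
To make the shape of the new table concrete, here is a hedged sketch of one vision_analyses row and a history query over it. The column list comes from the bullet above; the exact field names, types, and the `visible_history` helper are assumptions, and the real schema may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class VisionAnalysis:
    """One row of vision_analyses (field names are illustrative, not the real schema)."""
    user_id: str
    session_id: str
    image_key: str   # reference to the stored image, e.g. an object key
    mode: str        # e.g. "chess" or "robotics"
    prompt: str
    response: str
    tokens: int
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    deleted_at: Optional[datetime] = None  # soft-delete timestamp

    @property
    def is_deleted(self) -> bool:
        return self.deleted_at is not None

    def soft_delete(self, now: Optional[datetime] = None) -> None:
        """Mark the row deleted without removing it; calling it twice is harmless."""
        if self.deleted_at is None:
            self.deleted_at = now or datetime.now(timezone.utc)


def visible_history(rows: List[VisionAnalysis], user_id: str) -> List[VisionAnalysis]:
    """What a history endpoint might return: the user's non-deleted rows, newest first."""
    mine = [r for r in rows if r.user_id == user_id and not r.is_deleted]
    return sorted(mine, key=lambda r: r.created_at, reverse=True)
```

Keeping the soft-delete timestamp on the row lets the history query hide deleted analyses while the row itself remains available for audit.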

Target users for the MVP:

  • Hobbyist roboticists who want a one-line vision call from a phone-mounted rover, a Raspberry Pi, an ESP32-CAM, or a Jetson. The pitch is “pip install b4m-vision, point your robot at the world, ask the AI what to do.”
  • Chess players who want to photograph any board (real, on a screen, in a magazine), get the FEN, get a position read.
  • B4M users who want any object, any scene, any document held up to the camera to be readable inside their conversation.
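
For the robotics case, a call to the analyze endpoint might look roughly like the following. Only the path /api/vision/analyze comes from this page; the host, the JSON field names (`imageKey`, `mode`, `prompt`), and the Bearer-style header are assumptions, and the sketch builds the request without sending it.

```python
import json
import urllib.request

BASE_URL = "https://example.invalid"  # placeholder host, not a real B4M URL


def build_analyze_request(api_key: str, image_key: str, mode: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/vision/analyze.

    The payload fields and the auth header scheme are hypothetical.
    """
    body = json.dumps({"imageKey": image_key, "mode": mode, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/vision/analyze",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )


req = build_analyze_request("test-key", "uploads/frame-001.jpg", "robotics", "What is in front of me?")
print(req.get_method(), req.full_url)
# Sending would be urllib.request.urlopen(req), followed by JSON-decoding the response body.
```

The image itself would be uploaded first through the existing presigned-URL endpoint, so the analyze call only needs to carry a reference to it.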

Tarot and D&D follow on the same primitive — they are different system prompts and different toolkits against the same pipeline. The 22 major arcana ship after Chess proves the pattern; D&D in its full form is the flagship below.
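
Because Chess mode is expected to hand back FEN, one cheap guard is to validate the piece-placement field before trusting a model's answer. The function below is a sketch of that check under standard FEN rules; its name and its place in the pipeline are assumptions, and it does not verify that a position is legally reachable.

```python
def validate_fen_placement(placement: str) -> bool:
    """Check the piece-placement field of a FEN string (the part before the first space).

    Checks: exactly 8 ranks separated by '/', each rank covers 8 squares,
    only piece letters and digits 1-8 appear, no two digits are adjacent,
    and each side has exactly one king.
    """
    ranks = placement.split("/")
    if len(ranks) != 8:
        return False
    for rank in ranks:
        squares = 0
        prev_was_digit = False
        for ch in rank:
            if ch in "12345678":
                if prev_was_digit:
                    return False  # "44" should have been written as "8"
                squares += int(ch)
                prev_was_digit = True
            elif ch in "pnbrqkPNBRQK":
                squares += 1
                prev_was_digit = False
            else:
                return False
        if squares != 8:
            return False
    return placement.count("K") == 1 and placement.count("k") == 1


START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
print(validate_fen_placement(START))              # True
print(validate_fen_placement("8/8/8/8/8/8/8/8"))  # False, no kings
```

A failed check is a natural trigger for re-prompting the model or asking the user for a clearer photo.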

Quest chain

Milestone | What ships | Depends on
M1: Backend Foundation | vision_analyses table, VISION_ANALYZE scope, visionAnalysisService, /api/vision/analyze + /api/vision/history | none
M2: Mobile UI + Testing | VisionAnalyzer component, Capacitor camera wired, Chess + Robotics modes, on-device testing | M1
M3: GDPR + Polish + Launch | Delete + export endpoints, audit logging, staging, beta, production | M2

Target window: Q3 2026 MVP, with the rest of the post-MVP roadmap (Tarot, D&D, Game Template Studio, full ROS2, visual SLAM) sequenced as separate quest chains afterward.

The D&D Room — flagship use case

This is the demonstration that proves what WorldVision plus the Deep Agents stack can deliver.

The scene. A dedicated room set up to wrap a real Dungeons & Dragons session in physical-AR. The players sit at a smart table. 8K monitors are arrayed around the room — the world wraps around the party. Tablets and mobile phones in players' hands serve as character sheets, spell books, inventory, party chat. The smart table itself is the battle map — terrain, minis, line-of-sight, fog-of-war, all live. Cameras throughout the room watch the table and the players.

The physical mechanics. The game isn't just on screens. In-world actions are wired to physical mechanics in the room:

  • Beanbag toss for a thrown-weapon attack roll
  • Axe throwing for a barbarian's swing
  • Darts for a ranged attack or a critical-hit confirmation
  • ...plus a long, growing list of room-scale skill challenges that feel like the thing the character is doing

WorldVision sees the throw land, scores it, feeds the result into the rules engine, and the AI Dungeon Master narrates the consequence.

The AI behind the curtain. The DM is an extremely strong agentic AI running on B4M Deep Agents — DM voice, NPC voices, world simulation, narrative continuity, encounter management, balance, surprise. It calls into thousands of bespoke function calls and tools: the rules engine, the bestiary, spell effects, item effects, the campaign world state, NPC memory, faction politics, weather, lighting cues for the room, monitor scene management, music selection, sound effects. The room is the body; the agent is the mind.

Why this matters. The D&D Room is not a side project — it is the proof-of-experience for an entire B4M thesis: that the right combination of WorldVision (eyes) + Deep Agents (mind) + bespoke tool kits (hands) produces experiences that are simply unavailable from any other AI stack. Generic chat is a commodity. Running a D&D game across an 8K-monitor physical-AR room with axe-throwing as the attack roll is not. The room becomes a demo no one else can put on, and it's all stitched together from Bike4Mind primitives.

Privacy and policy posture

WorldVision processes images of people, places, and things. The posture is privacy-first from day one, not bolted on later.

  • Right to deletion (GDPR). One-click delete per analysis or full account purge. Hard-deletes the S3 object, soft-deletes the database row, writes an audit log entry.
  • Right to export (CCPA / GDPR portability). User can pull a ZIP of every analysis they've ever run — images, prompts, responses, timestamps — in machine-readable form.
  • Data retention. S3 lifecycle rules expire images on a 30-day default; metadata can outlive the image but is itself purgeable on request.
  • CCPA ADMT disclosure. Automated Decision-Making Technology notice covers the AI-vision processing path, third-party model providers (OpenAI, Google), and opt-out mechanics.
  • Content moderation. Automated flagging for inappropriate content + a user reporting mechanism + a human review queue for edge cases.
  • Encryption. TLS 1.2+ in flight, AES-256 at rest, bcrypt for API key hashes.
  • Rate limiting and abuse controls. Reuse of the existing Redis-backed rateLimit middleware; tier-aware limits at the user, organization, and API-key level; AWS Shield / WAF in front for DDoS.
  • WCAG 2.2 Level AA accessibility for all user-facing surfaces.
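
The per-analysis deletion path described in the first bullet (hard-delete the image, soft-delete the row, write an audit entry) can be sketched against an in-memory stand-in. Everything here is hypothetical scaffolding: `InMemoryStore`, the row fields, and the audit entry shape are assumptions, and a real implementation would go through the existing B4M file service and database rather than dictionaries.

```python
from datetime import datetime, timezone


class InMemoryStore:
    """Stand-in for object storage plus the database; for illustration only."""

    def __init__(self):
        self.objects = {}    # image_key -> bytes
        self.rows = {}       # analysis_id -> dict with user_id, image_key, deleted_at
        self.audit_log = []  # append-only list of audit entries


def delete_analysis(store: InMemoryStore, analysis_id: str, requested_by: str) -> bool:
    """Hard-delete the image, soft-delete the row, append an audit entry.

    Returns False when the row is missing, belongs to another user,
    or was already deleted; in those cases nothing is changed.
    """
    row = store.rows.get(analysis_id)
    if row is None or row["user_id"] != requested_by or row["deleted_at"] is not None:
        return False
    now = datetime.now(timezone.utc)
    store.objects.pop(row["image_key"], None)  # tolerate an image already expired by lifecycle rules
    row["deleted_at"] = now                    # the row survives so the audit trail stays meaningful
    store.audit_log.append({
        "action": "vision.delete",             # hypothetical action name
        "analysis_id": analysis_id,
        "user_id": requested_by,
        "at": now.isoformat(),
    })
    return True
```

Treating a missing image as success matters because the 30-day lifecycle rule can remove the object before the user asks for deletion.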

The privacy and compliance work isn't a separate phase — it's wired into Milestone 3 of the MVP itself. The feature is not “GA” until deletion, export, retention, audit log, and ADMT notice are all live.

Place in the broader roadmap

WorldVision is the perception layer for everything physical Bike4Mind is building. The progression:

Stage | Form factor | What it adds | Target
WorldVision | Phone (you already own it) | Eyes. The cognitive workshop sees. | Q3 2026 MVP
Magic Mirrors (Phantasia) | 4K display + Jetson + depth camera + mic array + speakers | Stationary in-room presence. Kitchen, reception, retail kiosk, healthcare check-in. | Q3 2026
Mobile Mirrors | Mirror + mecanum wheel base + LiDAR | Mobility. “Rosie the Robot, Jetsons-style”: follows you, comes when called. | Q4 2026
Full Robotics | + xArm 6 + gripper + safety | Manipulation. The brain can now do things in the world. | 2027

WorldVision is the common substrate. The mirror is just a phone glued to a wall; the mobile mirror is a phone with wheels; the robot is a phone with wheels and arms. Same perceptual API, same Deep Agents mind, same B4M tool chest — the body keeps growing. See the 2026 roadmap strawman for the full physical-AI sequencing.

The D&D Room is a special case in this lineage: it's a physical-AR room where the room itself is the body. Not a single mirror or a single robot — an entire instrumented space.

Lean approach

The WorldVision suite went through one important reset: the original MVP plan duplicated infrastructure that already exists in Bike4Mind (custom robot-API-key table, custom rate-limiting, custom S3 integration, custom auth middleware). The lean MVP plan replaced all of that with one new database table, one enum value, one service, two endpoints, one UI component. Roughly 500 LOC instead of 2,000.

The principle is the same one that runs through the rest of the catalog: don't reinvent B4M primitives — extend them. The platform already has hardened API key management with bcrypt, Redis-backed rate limiting, presigned S3 uploads via fabFileService, a service pattern with adapter injection for testability, and CloudWatch monitoring. WorldVision becomes a thin feature on a thick substrate, which means:

  • Less code to write, less code to maintain.
  • Same auth, same rate limits, same monitoring as every other B4M surface.
  • Future features (D&D mode, Tarot mode, video, edge processing, ROS2) extend the same pipeline rather than spinning up parallel ones.
  • Other products on the B4M platform get camera-input as a feature for free.

The same lean discipline applies to game modes: each mode is a system prompt + a function-call kit, not a new microservice. The Game Template Studio in the post-MVP roadmap takes this even further — community-authored game modes as data, not code.
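
The "system prompt + function-call kit" idea can be sketched as a plain registry, which is also roughly what modes-as-data would need. This is an illustration under assumptions: the `GameMode` type, the toy tools, the prompt wording, and the request shape are all invented for the example and are not the B4M implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class GameMode:
    """A game mode expressed as data: a system prompt plus a function-call kit."""
    name: str
    system_prompt: str
    tools: Dict[str, Callable[..., object]]


def sum_dice(faces: List[int]) -> int:
    """Toy tool: total the faces of the dice the model reports seeing."""
    return sum(faces)


def count_pieces(placement: str) -> int:
    """Toy tool: count the pieces in a FEN piece-placement string."""
    return sum(1 for ch in placement if ch.isalpha())


# Hypothetical registry; adding a mode means adding an entry, not a service.
MODES: Dict[str, GameMode] = {
    "chess": GameMode("chess", "Identify the position and answer in FEN.", {"count_pieces": count_pieces}),
    "dnd": GameMode("dnd", "Read the dice and miniatures on the table.", {"sum_dice": sum_dice}),
}


def build_model_request(mode_name: str, user_prompt: str) -> dict:
    """Assemble what would be sent to the vision model for a given mode."""
    mode = MODES[mode_name]
    return {"system": mode.system_prompt, "user": user_prompt, "tools": sorted(mode.tools)}


def call_tool(mode_name: str, tool_name: str, *args):
    """Dispatch a model-requested tool call, restricted to the mode's own kit."""
    tools = MODES[mode_name].tools
    if tool_name not in tools:
        raise KeyError(f"{tool_name!r} is not in the {mode_name!r} kit")
    return tools[tool_name](*args)
```

Restricting dispatch to the active mode's kit keeps a Tarot prompt from reaching D&D tools, which becomes more important once modes are community-authored.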