36. The Vision Prism

Context and Problem Statement

Interpreting terminal output, structural diagrams, and graphical user interfaces depends on the ingestion and analysis of pixel data. Vision Language Models (VLMs) impose significant VRAM demands, creating a physical resource conflict with high-tier reasoning models on consumer-grade hardware. A static infrastructure model forces a choice between recurring out-of-memory (OOM) failures, when both models are kept resident, and permanent "blindness," when the vision model is never loaded. Additionally, visual containers operate in a dual capacity: they provide both raw inference (Animators) and specialized logic (Tools). This duality necessitates an orchestration strategy that manages sight as a stateful, dynamically dispatched capability without destabilizing the machine’s primary cognitive loop.

Requirements

Atomic Coven Manifestation: Mandatory grouping of VLM, OCR, and pre-processing units into a single operational state to ensure hardware synchronicity.
Provider-Tool Segmentation: Provision of a mechanism to distinguish between Animators (Inference Providers) and Tools (Capability Functions) during the discovery phase.
The Stasis Trigger: Mandatory integration with the Dispatcher (22). When a vision tool is invoked while the hardware is "Cold," it must raise the HardwareTransitionRequired signal to freeze the cognitive thread via the Stasis Protocol (22).
Multimodal Context Integration: Utilization of Pydantic AI’s native BinaryContent to facilitate the passage of pixel buffers into the reasoning cortex.
Dynamic VRAM Budgeting: Support for model tiering to enable the concurrent manifestation of small Vision models alongside Reasoning models, minimizing full coven swaps.
Pre-Inference Optimization: Provision of a pipeline to normalize and resize raw binary data to match model-specific resolutions, ensuring token efficiency.
Sovereign Optic Wall: Mandatory physical restriction of sensitive visual data to local covens, with summarization logic acting as a gateway for optional cloud-bursting.

Considered Options

Option 1: Specialized Vision Sidecars

Running a separate, permanent vision container alongside the primary reasoning model.

Cons: Catastrophic VRAM Contention. Keeping two large models resident at once (e.g., a 70B Reasoner and a 13B VLM) exceeds the VRAM available on consumer-grade hardware, and it violates the Law of Exclusivity (08).

Option 2: Pure Cloud Vision (GPT-4o / Claude 3.5)

Offloading all visual processing to external Portals.

Cons: The Breach of Privacy. Sending screenshots of private code or internal infrastructure to the cloud is a violation of the Iron Pact (00).

Option 3: The Vision Coven (Stateful Sight)

Treating the entire vision capability as a dynamically activated operational state managed by the Sovereign.

Pros:
- Hardware Safety: The Orchestrator ensures the heavy Vision Coven is only resident when needed.
- Logical Parallelism: Utilizes the Stasis Protocol to allow the mind to "pause" while the eyes open, preserving thought continuity across hardware swaps.
- Unified Interface: To the Agent, the vision-analysis capability works identically whether provided by a local Coven or an OpenAI Portal.

Decision Outcome

The Prism is adopted as the Vision Extension. It serves as the reference implementation of the vision.coven, a stateful capability for structural visual reasoning.

1. The Vision Coven (Body)

The Prism manifests as a collection of Runes (08) managed as a mutually exclusive state.

The Eye (vlm.container): The primary Soulstone providing the VLM (e.g., LLaVA, Yi-VL), tagged with the vision-analysis capability.
The Scribe (ocr.container): An optional, lightweight Rune for pure text extraction (e.g., Tesseract).
Functional Overlap: A powerful VLM Rune may declare both vision-analysis (Provider) and ocr (Tool) capabilities.
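
The Provider/Tool distinction above can be sketched as a small data model. This is an illustrative sketch only: the `Role`, `Capability`, and `Rune` classes and the `discover` helper are hypothetical names, not part of any existing Prism API. Only the container names and capability tags come from this record.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    PROVIDER = "provider"  # Animator: raw inference
    TOOL = "tool"          # capability function


@dataclass(frozen=True)
class Capability:
    tag: str
    role: Role


@dataclass
class Rune:
    """Hypothetical container descriptor used during discovery."""
    name: str
    capabilities: list[Capability] = field(default_factory=list)

    def offers(self, tag: str, role: Role) -> bool:
        return any(c.tag == tag and c.role == role for c in self.capabilities)


# The Eye declares both roles (Functional Overlap); the Scribe is OCR-only.
eye = Rune("vlm.container", [
    Capability("vision-analysis", Role.PROVIDER),
    Capability("ocr", Role.TOOL),
])
scribe = Rune("ocr.container", [Capability("ocr", Role.TOOL)])


def discover(runes: list[Rune], tag: str, role: Role) -> list[str]:
    """Return the names of all Runes offering `tag` in the given role."""
    return [r.name for r in runes if r.offers(tag, role)]
```

With this shape, a discovery query for the ocr Tool returns both containers, which leaves the choice of the cheaper one to the Dispatcher.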

2. Optic Dispatching & The Stasis Protocol

The Prism utilizes the Dispatcher (22) to manage the physical reality of sight:

The Animator (Provider): When an Agent requires a Vision Model, the Dispatcher resolves the vision-analysis tag.
The Handshake:
1. The Dispatcher queries the Orchestrator.
2. If the vision.coven is COLD, the Dispatcher raises HardwareTransitionRequired.
3. The Freeze: The Agent's state is serialized to the Phylactery (06).
4. The Swap: The Orchestrator banishes the current coven and summons the Vision Coven.
5. The Thaw: Once the Vision Rune is warm, the Agent rehydrates and proceeds with the vision-analysis model.
The Tool (Capability): Specialized tasks (e.g., extract_text_from_image) follow the exact same Stasis logic, ensuring the Agent never attempts to use a tool that doesn't physically exist.
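
The Handshake steps can be sketched as follows. This is a minimal model under stated assumptions: `HardwareTransitionRequired` is named in this record, but its constructor, the `Orchestrator` and `Dispatcher` classes, the dict-based Phylactery, and `run_with_stasis` are all hypothetical stand-ins for the real components.

```python
from dataclasses import dataclass
from enum import Enum


class CovenState(Enum):
    COLD = "cold"
    WARM = "warm"


class HardwareTransitionRequired(Exception):
    """Signal raised when a capability's coven is not resident."""

    def __init__(self, coven: str):
        super().__init__(f"{coven} is cold")
        self.coven = coven


@dataclass
class Orchestrator:
    resident: str = "reasoning.coven"  # assumed name for the current coven

    def state_of(self, coven: str) -> CovenState:
        return CovenState.WARM if coven == self.resident else CovenState.COLD

    def swap_to(self, coven: str) -> None:
        # Stand-in for banishing the current coven and summoning the new one.
        self.resident = coven


@dataclass
class Dispatcher:
    orchestrator: Orchestrator

    def resolve(self, tag: str) -> str:
        coven = "vision.coven"  # assumption: every vision tag maps here
        if self.orchestrator.state_of(coven) is CovenState.COLD:
            raise HardwareTransitionRequired(coven)
        return coven


def run_with_stasis(dispatcher: Dispatcher, phylactery: dict,
                    agent_state: dict, tag: str) -> str:
    try:
        return dispatcher.resolve(tag)
    except HardwareTransitionRequired as signal:
        phylactery["frozen"] = dict(agent_state)       # The Freeze
        dispatcher.orchestrator.swap_to(signal.coven)  # The Swap
        restored = phylactery.pop("frozen")            # The Thaw
        agent_state.clear()
        agent_state.update(restored)
        return dispatcher.resolve(tag)
```

The same wrapper would serve both the Animator path and the Tool path, since both resolve a tag through the Dispatcher before any work starts.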

3. The Pixel Pipeline (`BinaryContent`)

The extension implements a pre-inference pipeline to ensure high-fidelity "Observations":

Ingest: The system receives raw binary data via the interface or a background Ghoul (14).
Transmute: The Prism resizes the image to the optimal resolution for the active Rune, minimizing token overhead.
Observation: The processed artifact is injected into the Agent's context as Pydantic AI BinaryContent.
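
The Transmute step comes down to choosing a target size before resampling. The sketch below covers only that calculation, assuming each Rune advertises a maximum side length in pixels. The `fit_within` helper and the `max_side` parameter are hypothetical. The actual resampling would be done by an imaging library, and the encoded bytes would then be wrapped in Pydantic AI's BinaryContent.

```python
def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Scale (width, height) down so the longest side is at most max_side.

    Aspect ratio is preserved and images that already fit are never upscaled.
    """
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("dimensions and max_side must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    # Multiply before dividing to limit rounding drift; clamp to at least 1 px.
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height
```

For example, with an illustrative limit of 768 pixels, a 1920x1080 screenshot maps to 768x432, while a 640x480 image passes through unchanged.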

4. Orchestration of Sight

In the logic of the Orchestrator (22), visual intents are treated with high priority.

Tiered Sight: If VRAM is constrained, the Orchestrator may manifest a lower-tier Vision Soulstone (e.g., Moondream) to allow a reasoning model to remain resident, avoiding a full coven swap.
The Transition: If a high-tier visual ritual is required, the Orchestrator executes the Drain protocol on the Reasoning Titan before manifesting the Vision Eye.
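
The choice between Tiered Sight and a full Transition can be sketched as a VRAM budget check. The `Soulstone` record, the `plan_sight` function, the action labels, and every VRAM figure in the usage note are hypothetical placeholders, not measurements of any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Soulstone:
    name: str
    vram_gb: float  # placeholder footprint, not a measured value


def plan_sight(total_gb: float, reasoner: Soulstone,
               vision_tiers: list[Soulstone],
               need_high_tier: bool) -> tuple[str, Soulstone]:
    """Return (action, vision model) for a visual intent.

    "co_resident": the vision model fits next to the reasoning model.
    "drain_and_swap": the reasoning model must be drained first.
    """
    if not vision_tiers:
        raise ValueError("at least one vision tier is required")
    tiers = sorted(vision_tiers, key=lambda s: s.vram_gb, reverse=True)
    # A high-tier ritual only accepts the largest model.
    candidates = tiers[:1] if need_high_tier else tiers
    free_gb = total_gb - reasoner.vram_gb
    for stone in candidates:
        if stone.vram_gb <= free_gb:
            return ("co_resident", stone)
    return ("drain_and_swap", candidates[0])
```

With a 24 GB budget, a 20 GB reasoner, and vision tiers of 10 GB and 3 GB, a routine request yields a co-resident low-tier model, while a high-tier request yields a drain and swap.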

Consequences

Positive

Structural Awareness: The Lych can interpret terminal output, UI errors, and diagrams as if it possessed a biological optic nerve.
Resource Purity: The distinction between Providers and Tools allows the Dispatcher to choose the most VRAM-efficient container for a specific task.
Thought Continuity: The Stasis Protocol ensures that "opening the eyes" does not kill the thought process, even if it takes 30 seconds to load the model.

Negative

State Swap Latency: Activating the Vision Coven is a heavy operation, potentially introducing friction into interactive scrying rituals.
Context Pressure: Visual tokens are expensive. Ingesting multiple artifacts can rapidly saturate the context window.