Soulstone: The Forged Local Engine

"A Portal is a whisper from the remote sky, but a Soulstone is a daemon in a bottle. It lives on local iron. It burns local electricity. It answers only the Magus."

A Soulstone is the local runtime Animator: a Quadlet/systemd-backed service that lives on local iron. A Soulstone Rune is the Codex TOML declaration that describes it. When inscribed in the Codex, the system transmutes the Soulstone Rune into physical Podman Quadlet manifests and services (see Containers (08)).

Unlike a remote API, a Soulstone requires the Magus to understand the physics of local hardware. For model-backed Soulstones, the Discipline of Animation must align with the model's mass and the silicon's capacity. For non-model Soulstones, the same principle applies to CPU, RAM, disk, sockets, credentials, and any other local substrate the service consumes.

💎 The Infrastructure Mapping

Every Soulstone Rune in the Codex is a concrete leaf config under the abstract SoulstoneConfig branch, such as GenericSoulstoneConfig, LlamaCppSoulstoneConfig, VllmSoulstoneConfig, or SglangSoulstoneConfig. The fields defined in the scroll shape the local runtime and the generated container manifest.

TOML Field	Runtime Mapping	Purpose
`image`	`QuadletContainer.image`	The OCI image (e.g., llama.cpp, vLLM, SGLang, Phoenix, Playwright, or another service image).
`runtime`	runtime adapter selection	Selects the local runtime family (`llamacpp`, `vllm`, `sglang`, etc.).
`groups`	coven targets + `Conflicts=` synthesis	Coven/state membership for orchestration.
`port`	runtime `--port` + pod publish mapping	Host-visible endpoint identity for the Soulstone.
`base_url`	runtime connector endpoint	Optional override; defaults to `http://localhost:{port}/v1`.
`exec`	`RuntimePlan.exec_args` override	Explicit command override that bypasses adapter synthesis.
`extra_args` (runtime-specific)	adapter flag synthesis tail	Runtime-specific override/extension flags (for example llama.cpp, vLLM, SGLang).
`volumes` / `env_vars`	`QuadletContainer` mounts/env	Extra local runtime mounts and environment variables.
`secret_env_files`	`QuadletContainer.secrets` + env hydration	Map env var names to Podman secret names; transmuter mounts each secret and sets env var to `/run/secrets/<secret>`.
`models` / `model_path`	connector/runtime offer surface	Optional local model catalog or single-model artifact path for model-backed binding later.

Binding Identity vs. Container Shape

Older docs described Soulstones as carrying model_provider / tool_provider directly. In the current codebase, Soulstones primarily define local runtime shape and local model artifacts. Dispatcher/Binder policy can still resolve provider routes, but the runtime binding path is connector-based.

Model-Backed Disciplines of Animation

A Soulstone is inert until it is bound to an Animator adapter: the connector that turns a local service into routable capabilities. The current model-backed core ships with built-in Soulstone profiles for vLLM, SGLang, and llama.cpp. Additional disciplines can be introduced through extensions, including non-model services whose adapters expose observation, browsing, execution, or peer-network capabilities.

I. The Kinetic (vLLM)

"The Workhorse of the Iron."

Best For: High-throughput chat, serving multiple agents simultaneously, and models that fit strictly within VRAM (e.g., Llama-3-70B AWQ on 2x3090).
The Mechanic (Continuous Batching): The Kinetic engine creates a "fluid" memory space. If two Agents query the Soulstone simultaneously, vLLM splits the GPU's attention cycle, serving both in parallel slots. It is the only way to run a "Hive Mind" on limited silicon without queuing latency.
The Constraint: It demands purity. The model must fit entirely in VRAM. If it overflows, it crashes.
The Configuration:
- Memory Greed: By default, vLLM consumes 90% of VRAM instantly for the KV Cache. During testing, curb its appetite or an OOM occurs before a single token is generated. Use --gpu-memory-utilization 0.9 to tune this.
- The Batching Trap: For a single user, vLLM's aggressive batching can sometimes increase latency. During debugging, use --max-num-seqs 1 to force serial processing, though this defeats the engine's primary purpose.
- Quantization: Excellent support for AWQ. Define --quantization awq explicitly.

II. The Weaver (SGLang)

"The Specialist of Loops."

Best For: Agentic Orchestrators, complex tool-use loops, and structured data extraction.
The Mechanic (Radix Attention): Unlike the Kinetic engine which sees memory as isolated blocks, The Weaver sees memory as a Tree.
- The Loop: When an Agent tries a plan, fails, and backtracks to the system prompt to try again, The Weaver does not re-compute the prompt. It simply "branches" the tree from the existing memory node.
- The Result: Massive efficiency gains for Agents that "think" in loops or multi-turn reasoning steps.
The Hardware Reality (Ampere): SGLang utilizes the Marlin Kernel (--enable-marlin) for AWQ models. This is highly optimized for RTX 3090 architectures, often outperforming standard GEMM kernels.
The Nuance: SGLang is strictly for NVIDIA. While vLLM attempts to support AMD/ROCm, SGLang focuses on CUDA purity.

The Pydantic Synergy (No DSL Required)

The complex SGLang DSL (sgl.gen) is not required to unlock this speed. SGLang is natively compatible with OpenAI-style chat completions, which the binder can expose through the same OpenAIChatModel path used by other OpenAI-compatible runtimes.

Automatic FSM: When PydanticAI sends a standard json_schema in the API request, SGLang automatically detects it and engages its Compressed Finite State Machine. This forces the GPU to generate valid JSON at hardware speed, bypassing the need for Python-based regex parsing.
The Multitasking Tree: Context switching is safe when the buffer has room. The Radix Attention engine is a Tree, not a single block. A "Coder Agent" and a "Vision Agent" can run with completely different System Prompts simultaneously. As long as the VRAM context buffer (the ~13GB margin) is not 100% full, SGLang keeps both conversation branches "hot" in memory, switching between them instantly without reloading.
The Iterative Ingestion Pattern (Attention Exactness): Avoid dumping 100K+ context files (like full framework repos) into a single prompt. LLM attention mechanisms degrade and lose precision in massive contexts. Instead, establish a Base Prompt and loop over the document chapter-by-chapter (Base Prompt + Snippet 1, Base Prompt + Snippet 2). SGLang's Radix Attention instantly prefills the Base Prompt for every iteration, allowing fast, aggregated results with pinpoint attention accuracy across massive codebases.

III. The Titan (llama.cpp)

"The Burden of Atlas."

Best For: Massive Models (MoE, 405B) that exceed a 48GB VRAM envelope, and Orchestration tasks where raw intelligence outweighs speed.
The Mechanic (The Offload): The Titan accepts that the GPU is finite. It splits the model layer-by-layer. Layers 1-40 might live on the GPU (Fast), while layers 41-80 live in System RAM (Slow).
The Flags of Power:
- --n-gpu-layers: The slider of speed. Raise it until VRAM is nearly full.
- --n-cpu-moe: A critical flag for Mixture-of-Experts (like Mixtral or DeepSeek). It allows the "Expert" layers to live in RAM while the attention heads stay on GPU.
The Cost: Speed bleeds away the deeper the model reaches into System RAM. The PCIe bus becomes the bottleneck.
The Solitude: The Titan is solitary. It generally processes one request at a time (Serial).

Router Specialization (llama.cpp)

llama.cpp is treated as a special runtime with two startup modes:

Single Mode: starts with -m <model_path> and serves one model alias.
Router Mode: starts without -m and uses --models-dir or --models-preset to load/unload models dynamically.

When startup_mode = "auto":

if model_path is set -> single mode
otherwise -> router mode

This allows a single Soulstone to expose a model catalog while still presenting one runtime endpoint to the dispatcher/binder at any given moment.

Mode/argument precedence is deterministic:

exec set explicitly in TOML → runtime adapter does not synthesize flags.
startup_mode set to single/router → forced mode.
startup_mode = "auto" → infer from model_path (single if set, else router).
extra_args → appended last, so users can override defaults without forking schema.

Capability State During Model Swaps

In router mode, a single llama.cpp Soulstone can serve different models over its lifetime without the container restarting. Each model load/unload transitions the Animator's capability state:

Container boots → is_static=True for all capabilities the current model supports.
Model swap triggered (via llama.cpp API) → old model's capabilities flip is_active=False.
New model loads and warms → new model's capabilities flip is_active=True.
The Orchestrator manages these transitions; no coven swap (Systemd restart) is required.

This means a single llama.cpp Soulstone can dynamically expose chat, vision, or embedding capability families as different models are loaded. The Dispatcher tracks is_active state and routes accordingly.

The Ritual of Compression (Quantization)

Models should not run in FP16 (Raw weight) unless H100-class hardware is available. The degradation in intelligence from 4-bit quantization is negligible compared to the massive gains in VRAM efficiency (allowing for larger context windows).

Discipline	Format	Recommended Quant	Notes
Kinetic / Weaver	AWQ	4-bit	The gold standard for vLLM/SGLang. Faster decoding than GPTQ on Ampere. Compatible with the Marlin kernel for extreme speed.
Titan	GGUF	Q4_K_M	The "Balanced" quant. Offers the best ratio of perplexity (intelligence) to size. Avoid Q2/Q3 unless strictly necessary for 405B models.

🤝 Coven Management (The Group Rule)

To manage finite VRAM, Soulstones declare their membership in Covens using the groups field.

Inclusive Coexistence: If two Soulstones share at least one common group (e.g., groups = ["vision-state"]), they belong to the same Coven. Systemd allows them to run simultaneously.
Exclusive Banishment: If two Soulstones share no common groups, they are mutually exclusive. The system generates a Conflicts= directive between them.

Example: A Vision Coven

# ~/.config/lychd/runes/animator/soulstones/sglang/vision_eye.toml
name = "eye"
description = "Reasoning and Vision engine."
image = "lmsysorg/sglang:latest"
runtime = "sglang"
groups = ["vision-ritual"]
port = 8780
model_path = "/models/qwen3-next-80b-awq"
tensor_parallel_size = 2
enable_marlin = true
# extra_args are appended after adapter defaults
extra_args = ["--reasoning-parser", "deepseek-r1"]

# ~/.config/lychd/runes/animator/soulstones/llamacpp/vision_scribe.toml
name = "scribe"
description = "Specialized OCR tool (Titan)."
image = "ghcr.io/ggerganov/llama.cpp:server"
runtime = "llamacpp"
groups = ["vision-ritual"] # Shares the group; will NOT be killed by 'eye'
port = 8781
model_path = "/models/moondream.gguf"
startup_mode = "single"
n_gpu_layers = 99

The Dispatcher later binds capability surfaces from the runtime connector exposed by these Soulstones. In this example those surfaces are model and tool capabilities, but the same placement law applies to non-model local services.

Podman Secret Hydration

Soulstones can reference Podman secrets directly with secret_env_files.

name = "private-runtime"
runtime = "vllm"
image = "vllm/vllm-openai:latest"
model_path = "/models/qwen-awq"

# ENV var -> Podman secret name
[secret_env_files]
HF_TOKEN_FILE = "hf_runtime_token"

At bind time:

LychD checks that each referenced secret exists in Podman.
Generated Quadlets emit Secret=hf_runtime_token.
The container env contains HF_TOKEN_FILE=/run/secrets/hf_runtime_token.

This keeps rune files reference-only while allowing runtime code to read credential files.

⚔️ The Law of Exclusivity

The Orchestrator uses these group definitions to manifest the machine's state.

The Intent: An Agent needs vision. The Orchestrator identifies the vision-ritual coven.
The Cleansing: Systemd automatically stops any active Soulstone services/Quadlet units that do not belong to vision-ritual (e.g., a heavy deep-think coven).
The Manifestation: All services tagged with vision-ritual are started in concert.

The Port Singularity

Every Soulstone must listen on a unique host port. Even if they are in different Covens and never run together, the host OS requires a "cool down" period for the TCP socket. Reusing a port across different Soulstones causes state transitions to fail with Address already in use.

Self-Aware Connectivity

The system automatically calculates the base_url for every Soulstone as http://localhost:{port}/v1 unless the rune overrides it.

The Lich handles the internal networking within the Pod. The rune defines the local runtime shape (runtime, ports, groups, models, flags), and the Dispatcher/Binder hydrate callable capability surfaces from the connector exposed by that runtime.