Soulstone: The Forged Rune
"A Portal is a whisper from the void, but a Soulstone is a god trapped in a bottle. It lives on your iron. It burns your electricity. It obeys only you."
A Soulstone is the configuration for a local, containerized inference engine. It is the architectural source for a Container Rune. When inscribed in the Codex, the system's "Hand" transmutes this TOML into a physical Podman Quadlet (Systemd Service).
Unlike a remote API, a Soulstone requires the Magus to understand the physics of their own hardware. You must choose the Discipline of Animation that aligns with the model's mass and your silicon's capacity.
The Infrastructure Mapping
Every Soulstone in the Codex is a manifestation of the ContainerRune schema. The fields you define in the scroll dictate the physical form of the container:
| TOML Field | Rune Property | Purpose |
|---|---|---|
| image | image | The OCI image (e.g., vLLM, SGLang, Aphrodite). |
| groups | covens | The mutually inclusive states this Rune belongs to. |
| capabilities | capabilities | The abstract functional tags (e.g., ocr, vision). |
| exec | exec | The joined shell command for the container entrypoint. |
| port_expose | ExposePort | Signals the Pod to publish the port to the host. |
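To make the mapping concrete, here is a minimal sketch of the transmutation the Hand performs. The `transmute` function and its input dict are illustrative inventions; the real schema is richer, though the Quadlet keys shown (`Image=`, `Exec=`, `PublishPort=`) are the standard Podman Quadlet directives.

```python
def transmute(rune: dict) -> str:
    """Sketch: render a Soulstone dict as the [Container] section of a
    Podman Quadlet file. The real 'Hand' is more elaborate than this."""
    lines = ["[Container]", f"Image={rune['image']}"]
    # The multiline TOML exec string is joined onto a single Exec= line.
    lines.append("Exec=" + " ".join(rune["exec"].split()))
    if rune.get("port_expose"):
        lines.append(f"PublishPort={rune['port']}:{rune['port']}")
    return "\n".join(lines)

unit = transmute({
    "image": "lmsysorg/sglang:latest",
    "exec": "python3 -m sglang.launch_server --port 8780",
    "port": 8780,
    "port_expose": True,
})
print(unit)
```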
The Four Disciplines of Animation
A Soulstone is inert until it is bound to an Animator, the engine that pumps electricity into the weights. The architecture recognizes four distinct Disciplines. You must choose based on your objective (Agentic Loops vs. Raw Speed) and your hardware constraints (VRAM vs. System RAM).
I. The Kinetic (vLLM)
"The Workhorse of the Void."
- Best For: High-throughput chat, serving multiple agents simultaneously, and models that fit strictly within VRAM (e.g., Llama-3-70B AWQ on 2x3090).
- The Mechanic (Continuous Batching): The Kinetic engine creates a "fluid" memory space. If two Agents query the Soulstone simultaneously, vLLM splits the GPU's attention cycle, serving both in parallel slots. It is the only way to run a "Hive Mind" on limited silicon without queuing latency.
- The Constraint: It demands purity. The model must fit entirely in VRAM. If it overflows, it crashes.
- The Configuration:
  - Memory Greed: By default, vLLM claims ~90% of VRAM instantly for the KV Cache. When testing, you must curb its appetite or you will OOM before generating a single token. Use `--gpu-memory-utilization` (default 0.9; lower it, e.g. 0.85) to tune this.
  - The Batching Trap: For a single user, vLLM's aggressive batching can sometimes increase latency. If you are debugging, use `--max-num-seqs 1` to force serial processing, though this defeats the engine's primary purpose.
  - Quantization: Excellent support for AWQ. Ensure you explicitly set `--quantization awq`.
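The arithmetic behind `--gpu-memory-utilization` is simple enough to sanity-check before launch. This is a hedged sketch (the function name and the 35 GB weight figure are illustrative, not vLLM internals): whatever the claimed fraction of VRAM is not occupied by weights becomes KV-cache space.

```python
def kv_cache_budget_gb(total_vram_gb: float, weights_gb: float,
                       gpu_memory_utilization: float = 0.9) -> float:
    """vLLM claims total_vram * utilization up front; what the weights
    don't occupy becomes KV-cache space. Negative means instant OOM."""
    return total_vram_gb * gpu_memory_utilization - weights_gb

# 2x RTX 3090 (48 GB total) serving a ~35 GB AWQ 70B model:
print(round(kv_cache_budget_gb(48, 35), 1))  # → 8.2 GB of cache headroom
```

If the result goes negative, no flag tuning will save you; the model simply does not fit the Kinetic discipline.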
II. The Weaver (SGLang)
"The Specialist of Loops."
- Best For: Agentic Orchestrators, complex tool-use loops, and structured data extraction.
- The Mechanic (Radix Attention): Unlike the Kinetic engine which sees memory as isolated blocks, The Weaver sees memory as a Tree.
- The Loop: When an Agent tries a plan, fails, and backtracks to the system prompt to try again, The Weaver does not re-compute the prompt. It simply "branches" the tree from the existing memory node.
- The Result: Massive efficiency gains for Agents that "think" in loops or multi-turn reasoning steps.
- The Hardware Reality (Ampere): SGLang utilizes the Marlin kernel (`--enable-marlin`) for AWQ models. This is highly optimized for RTX 3090 architectures, often outperforming standard GEMM kernels.
- The Nuance: SGLang is strictly for NVIDIA. While vLLM attempts to support AMD/ROCm, SGLang focuses on CUDA purity.
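The prefix-sharing win of Radix Attention can be illustrated with a toy cost model (everything here is illustrative: characters stand in for tokens, and `tokens_recomputed` is not an SGLang API). With a prefix tree, only the suffix that diverges from cached memory is recomputed:

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return i

def tokens_recomputed(turns: list[str], prefix_cache: bool) -> int:
    """Toy model: with a prefix cache, a prompt only pays for the part
    not already covered by the best matching cached prefix."""
    cached: list[str] = []
    cost = 0
    for prompt in turns:
        hit = max((lcp(prompt, p) for p in cached), default=0) if prefix_cache else 0
        cost += len(prompt) - hit
        cached.append(prompt)
    return cost

system = "SYSTEM:" + "x" * 1000            # a long shared system prompt
turns = [system + " plan A", system + " plan B", system + " plan C"]
print(tokens_recomputed(turns, prefix_cache=False))  # pays for the system prompt 3x
print(tokens_recomputed(turns, prefix_cache=True))   # pays for it once, then pennies
```

The backtracking Agent in the loop above is exactly this access pattern: same long prefix, different short suffixes.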
The Pydantic Synergy (No DSL Required)
You do not need to learn the complex SGLang DSL (`sgl.gen`) to unlock this speed. SGLang is natively compatible with the OpenAI Chat Completions API.
- Automatic FSM: When PydanticAI sends a standard `json_schema` in the API request, SGLang automatically detects it and engages its Compressed Finite State Machine. This forces the GPU to generate valid JSON at hardware speed, bypassing the need for Python-based regex parsing.
- The Multitasking Tree: Do not fear context switching. The Radix Attention engine is a Tree, not a single block. You can run a "Coder Agent" and a "Vision Agent" with completely different System Prompts simultaneously. As long as your VRAM context buffer (the ~13GB margin) is not 100% full, SGLang keeps both conversation branches "hot" in memory, switching between them instantly without reloading.
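No DSL means the schema simply rides inside an ordinary Chat Completions payload. A sketch of what such a request body looks like on the wire, following the OpenAI structured-outputs `response_format` convention (the model name and schema are examples; exact field support can vary by SGLang version):

```python
import json

# JSON Schema describing the structure the FSM will enforce.
schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}

# Standard Chat Completions body; the schema is just another field.
payload = {
    "model": "TheBloke/Llama-3-70B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "Describe Lisbon as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "CityInfo", "schema": schema},
    },
}
print(json.dumps(payload, indent=2))
```

PydanticAI builds this payload for you from a Pydantic model; the point is that nothing SGLang-specific appears in it.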
III. The Titan (llama.cpp)
"The Burden of Atlas."
- Best For: Massive Models (MoE, 405B) that exceed your 48GB VRAM capacity, and Orchestration tasks where raw intelligence outweighs speed.
- The Mechanic (The Offload): The Titan accepts that the GPU is finite. It splits the model layer-by-layer. Layers 1-40 might live on the GPU (Fast), while layers 41-80 live in System RAM (Slow).
- The Flags of Power:
  - `--n-gpu-layers`: The slider of speed. Push this until your VRAM is 99% full.
  - `--n-cpu-moe`: A critical flag for Mixture-of-Experts models (like Mixtral or DeepSeek). It allows the "Expert" layers to live in RAM while the attention heads stay on GPU.
- The Cost: Speed bleeds away the deeper you tap into System RAM. The PCIe bus becomes the bottleneck.
- The Solitude: The Titan is solitary. It generally processes one request at a time (Serial).
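Picking `--n-gpu-layers` is back-of-the-envelope arithmetic. A hedged sketch (the function and the equal-weight-per-layer assumption are mine; real layers vary in size, and the reserve needed for KV cache and CUDA overhead depends on context length):

```python
def max_gpu_layers(n_layers: int, model_gb: float, vram_gb: float,
                   reserve_gb: float = 2.0) -> int:
    """Rough --n-gpu-layers pick: assume layers are equally heavy and
    keep a reserve for the KV cache and CUDA overhead."""
    per_layer = model_gb / n_layers
    fit = int((vram_gb - reserve_gb) // per_layer)
    return max(0, min(n_layers, fit))

# An 80-layer, ~140 GB quantized giant against 48 GB of VRAM:
print(max_gpu_layers(80, 140.0, 48.0))  # → 26 layers on GPU, 54 in RAM
```

Everything past that count crosses the PCIe bus, which is where the speed bleeds away.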
IV. The Flash (ExLlamaV2)
"The Speed of Light."
- Best For: Single-user throughput and "Fractional" Quantization (e.g., 4.65bpw) to squeeze the absolute maximum model size into VRAM.
- The Architecture: This engine uses the ExLlamaV2 kernel.
- Warning: There is a "V3" kernel designed for Hopper (H100) architecture. For your RTX 3090s (Ampere), ExLlamaV2 is still the superior choice. Do not be seduced by the higher number; architecture compatibility matters more.
- The Format: Requires models converted to `.exl2`. This format allows for "measurement files" that calibrate the quantization specifically to minimize perplexity loss on critical layers.
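"Filling VRAM to the exact megabyte" is possible because weight footprint is a straight function of bits-per-weight. A sketch of the arithmetic (illustrative function; it ignores the KV cache and activations, which also need room):

```python
def weights_gb(n_params_b: float, bpw: float) -> float:
    """Approximate weight footprint in GB: params * bits-per-weight / 8."""
    return n_params_b * 1e9 * bpw / 8 / 1e9

# A 70B model at various fractional quant levels:
for bpw in (4.0, 4.65, 6.0):
    print(f"{bpw} bpw -> {weights_gb(70, bpw):.1f} GB")
```

At 4.65 bpw a 70B model weighs roughly 40.7 GB, which is why that oddly specific number shows up against a 48 GB budget.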
The Ritual of Compression (Quantization)
Do not run models in FP16 (Raw weight) unless you possess H100s. The degradation in intelligence from 4-bit quantization is negligible compared to the massive gains in VRAM efficiency (allowing for larger context windows).
| Discipline | Format | Recommended Quant | Notes |
|---|---|---|---|
| Kinetic / Weaver | AWQ | 4-bit | The gold standard for vLLM/SGLang. Faster decoding than GPTQ on Ampere. Compatible with the Marlin kernel for extreme speed. |
| Titan | GGUF | Q4_K_M | The "Balanced" quant. Offers the best ratio of perplexity (intelligence) to size. Avoid Q2/Q3 unless strictly necessary for 405B models. |
| Flash | EXL2 | 4.0 - 6.0 bpw | Fractional bit-per-weight. Allows you to fill your VRAM to the exact megabyte. |
π€ Coven Management (The Group Rule)
To manage finite VRAM, Soulstones declare their membership in Covens using the groups field.
- Inclusive Coexistence: If two Soulstones share at least one common group (e.g., `groups = ["vision-state"]`), they belong to the same Coven. Systemd allows them to run simultaneously.
- Exclusive Banishment: If two Soulstones share no common groups, they are mutually exclusive. The system generates a `Conflicts=` directive between them.
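The rule reduces to a set-intersection check over the `groups` field. A minimal sketch (function name and the `sage` stone are illustrative, not part of the real schema): disjoint group sets yield a `Conflicts=` pair.

```python
from itertools import combinations

def conflicts(stones: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Pairs of Soulstones sharing no group get a Conflicts= directive."""
    return [(a, b) for a, b in combinations(sorted(stones), 2)
            if not stones[a] & stones[b]]

stones = {
    "eye":    {"vision-ritual"},
    "scribe": {"vision-ritual"},
    "sage":   {"deep-think"},
}
print(conflicts(stones))  # eye/scribe coexist; sage conflicts with both
```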
Example: A Vision Coven
# ~/.config/lychd/soulstones/vision_eye.toml
[eye]
description = "Reasoning and Vision engine."
image = "lmsysorg/sglang:latest"
groups = ["vision-ritual"]
capabilities = ["text-generation", "vision-analysis"]
port = 8780
# SGLang requires explicit tensor-parallel (TP) mapping;
# Radix Attention is enabled by default, so no extra flag is needed.
exec = """
python3 -m sglang.launch_server
--model-path TheBloke/Llama-3-70B-Instruct-AWQ
--tp 2
--port 8780
--enable-marlin
"""
# ~/.config/lychd/soulstones/vision_scribe.toml
[scribe]
description = "Specialized OCR tool (Titan)."
image = "ghcr.io/ggerganov/llama.cpp:server"
groups = ["vision-ritual"] # Shares the group; will NOT be killed by 'eye'
capabilities = ["ocr"]
port = 8781
exec = """
/server -m /models/moondream.gguf --n-gpu-layers 99 --port 8781
"""
βοΈ The Law of Exclusivity
The Orchestrator uses these group definitions to manifest the machine's state.
- The Intent: An Agent needs `vision`. The Orchestrator identifies the `vision-ritual` coven.
- The Cleansing: Systemd automatically stops any active Runes that do not belong to `vision-ritual` (e.g., your heavy `deep-think` coven).
- The Manifestation: All Runes tagged with `vision-ritual` are started in concert.
The Port Singularity
Every Soulstone must listen on a unique host port.
Even if they are in different Covens and never run together, the host OS requires a "cool down" period for the TCP socket. Reusing a port across different Soulstones will cause state transitions to fail with Address already in use.
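A port-collision check belongs in any pre-flight validation of the scrolls. A hedged sketch (the function is illustrative, not part of the system):

```python
from collections import Counter

def duplicate_ports(stones: dict[str, int]) -> list[int]:
    """Host ports claimed by more than one Soulstone, regardless of Coven."""
    counts = Counter(stones.values())
    return sorted(p for p, n in counts.items() if n > 1)

print(duplicate_ports({"eye": 8780, "scribe": 8781, "sage": 8780}))  # → [8780]
```

Catching this at inscription time is far cheaper than debugging `Address already in use` during a state transition.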
Self-Aware Connectivity
The system automatically calculates the uri for every Soulstone as http://localhost:{port}/v1. The Lich handles the internal networking within the Pod; you only define the capabilities and the groups.
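The derivation is mechanical, which is why you never write the `uri` yourself. A one-line sketch of the rule:

```python
def soulstone_uri(port: int) -> str:
    """The auto-derived OpenAI-compatible base URL for a Soulstone."""
    return f"http://localhost:{port}/v1"

print(soulstone_uri(8780))  # → http://localhost:8780/v1
```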