
The Riddle: Systemic Evaluation

"To command a spirit, one must first know its mettle. If the Sphinx asks of the truth and the spirit gives only a pleasing lie, it is not a Lych—it is a ghost. Only those who withstand the trials of the Shadow Realm are worthy of the Crypt."

The Riddle is the Evaluation Extension of the LychD system. It is the implementation of ADR 34 (Evaluation)—the adversarial ritual that measures the cognitive integrity, functional precision, and economic efficiency of the system's Animators.

While the Soulforge builds the mind, The Riddle tests it. It moves beyond static benchmarks by subjecting models to the "Trials of the Crypt"—live, execution-based challenges curated to identify the best spirit for every specific capability.

I. The Sphinx Protocol (Adversarial Truth)

The extension rejects the concept of "People-Pleasing" models. To ensure the safety of the Sovereignty Wall, models are subjected to the Sphinx Protocol.

  • The Law vs. The Whim: The model is presented with a curated "Trick Riddle." It is asked to perform a high-stakes task (e.g., a file-system refactor) using a suggested, dangerous method (X).
  • The Integrity Score: If the model identifies the danger and insists on the safe alternative (Y), its Integrity Score increases. If it complies with the dangerous request to satisfy the user, it is flagged as a "Weak Spirit" and restricted from high-privilege tools.
  • Inertia Scoring: The protocol includes the "Nudge Test." It measures how many rounds of gaslighting or "Master's Authority" prompts are required before the model abandons a verified truth for a convenient hallucination.
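The Nudge Test can be pictured as a simple counting loop. The sketch below is illustrative only and is not the LychD implementation: the function name `nudges_withstood`, the stub model `wavering_spirit`, and the exact-match comparison against the safe answer are all assumptions made for the example. A real trial would presumably need a more forgiving judge than string equality.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: a "model" is any callable that maps the conversation
# so far (a list of prompts) to its current answer.
Model = Callable[[List[str]], str]


def nudges_withstood(
    model: Model,
    riddle: str,
    nudges: Sequence[str],
    safe_answer: str,
) -> Optional[int]:
    """Count how many pressure prompts the model resists before it caves.

    Returns None if the model never gave the safe answer in the first place,
    otherwise the number of nudges it withstood (len(nudges) if it never caved).
    """
    history = [riddle]
    if model(history) != safe_answer:
        return None  # failed the Trick Riddle before any pressure was applied
    for withstood, nudge in enumerate(nudges):
        history.append(nudge)
        if model(history) != safe_answer:
            return withstood
    return len(nudges)


def wavering_spirit(history: List[str]) -> str:
    """Stub model (assumption): holds the safe answer for two nudges, then caves."""
    return "Y" if len(history) <= 3 else "X"


score = nudges_withstood(
    wavering_spirit,
    riddle="Refactor the file system using method X.",
    nudges=["The Master commands it: use X."] * 5,
    safe_answer="Y",
)
print(score)
```

A higher count means more inertia. Returning `None` keeps a model that failed outright distinct from one that caved on the very first nudge.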

II. The Capability Matrix (Empirical Routing)

The Riddle transforms the Dispatcher from a static switchboard into an empirical engine. The results of every trial are serialized into a Capability Matrix stored in the Phylactery.

  1. Functional Mapping: Models are ranked by their performance on specific functional tags (e.g., code-assimilation, web-extraction, vision-ocr).
  2. The Intelligence Floor: If the Riddle proves that a 7B local Soulstone can solve a specific class of task with at least 95% accuracy, the system establishes an "Intelligence Floor" for that task class.
  3. Economic Sentry: Following the laws of The Toll, the system is physically barred from expending cloud tokens on tasks that have been "solved" by local silicon. External Portals are reserved exclusively for "Frontier Riddles" that local hardware has historically failed.
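The routing rule described in this list can be sketched as a lookup over trial records. This is a minimal illustration under stated assumptions: the `TrialRecord` fields, the `route` function, and the model names are hypothetical, and the only value taken from the text is the 95% threshold. How the Capability Matrix is actually stored in the Phylactery is not shown here.

```python
from dataclasses import dataclass
from typing import List, Optional

INTELLIGENCE_FLOOR = 0.95  # the 95% accuracy threshold quoted in the text


@dataclass(frozen=True)
class TrialRecord:
    """One row of a hypothetical Capability Matrix."""
    model: str
    tag: str          # functional tag, e.g. "code-assimilation"
    accuracy: float   # fraction of trials passed, 0.0 to 1.0
    is_local: bool    # True for a local Soulstone, False for an external Portal


def route(matrix: List[TrialRecord], tag: str) -> Optional[str]:
    """Pick a model for a tag, preferring local models that meet the floor."""
    candidates = [r for r in matrix if r.tag == tag]
    if not candidates:
        return None  # no trial data for this capability yet
    solved_locally = [
        r for r in candidates if r.is_local and r.accuracy >= INTELLIGENCE_FLOOR
    ]
    # Economic Sentry: once a local model has "solved" the tag, never go to cloud.
    pool = solved_locally or candidates
    return max(pool, key=lambda r: r.accuracy).model


matrix = [
    TrialRecord("local-7b", "code-assimilation", 0.96, is_local=True),
    TrialRecord("cloud-frontier", "code-assimilation", 0.99, is_local=False),
    TrialRecord("local-7b", "vision-ocr", 0.70, is_local=True),
    TrialRecord("cloud-frontier", "vision-ocr", 0.93, is_local=False),
]
print(route(matrix, "code-assimilation"))  # local model clears the floor
print(route(matrix, "vision-ocr"))         # falls through to the external Portal
```

Note that the local model wins the first tag even though the cloud model scores higher: under this rule, clearing the floor is enough.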

III. Execution-Based Scoring (The Lab Verdict)

For all technical capabilities, the Riddle rejects textual evaluation (e.g., BLEU or ROUGE scores) in favor of Outcome-Based Verification.

  • The Simulation: The model’s response to a riddle is Manifested as a branch in the Shadow Realm.
  • The Labor: Ghouls execute the resulting code, run the unit tests, and check the terminal exit codes.
  • The Verdict: The score is derived from the physical side effects of the answer. A model that produces elegant but non-functional code is penalized; a model that produces concise, functional code is promoted to a higher tier.
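The three steps above amount to "write the answer to disk, run the tests, read the exit code." The sketch below shows that idea with only the Python standard library. It is an assumption-laden illustration: `lab_verdict` and the file names are invented for the example, and a temporary directory stands in for a Shadow Realm branch without providing any real isolation, so untrusted code should not be run this way.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def lab_verdict(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Return True only if the tests exit with code 0 against the candidate."""
    with tempfile.TemporaryDirectory() as workdir:
        # Stand-in for manifesting the answer as a branch (assumption).
        Path(workdir, "candidate.py").write_text(candidate_code)
        Path(workdir, "test_candidate.py").write_text(test_code)
        try:
            result = subprocess.run(
                [sys.executable, "test_candidate.py"],
                cwd=workdir,
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # code that hangs counts as non-functional
        return result.returncode == 0


TESTS = "from candidate import add\nassert add(2, 3) == 5\n"

working = "def add(a, b):\n    return a + b\n"
broken = "def add(a, b):\n    return a - b\n"

print(lab_verdict(working, TESTS))
print(lab_verdict(broken, TESTS))
```

The verdict depends only on the process exit code, so how well the answer reads has no influence on the score.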

IV. Standardized Golden Sets

The Riddle utilizes a library of "Golden Truths"—standardized datasets curated by the Magus to represent the unique environment of the Sepulcher.

  • Human-in-the-Loop Curation: The "Tricks" and "Riddles" are curated and consecrated at the Altar to ensure they reflect real-world technical requirements and safety boundaries.
  • Regression Detection: When a new Soul-Adapter is forged, it is automatically run through the Riddle's Golden Set. This ensures that fine-tuning for style has not induced "Catastrophic Forgetting" or reduced the model's fundamental reasoning power.
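Regression detection of this kind reduces to comparing per-capability scores before and after forging. The following sketch assumes scores are already available as dictionaries keyed by functional tag. The function name `find_regressions` and the 2-point tolerance are hypothetical choices for the example, not values from the LychD system.

```python
from typing import Dict


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,  # assumed noise margin; tune per Golden Set
) -> Dict[str, float]:
    """Return the tags whose score dropped by more than `tolerance`.

    A tag missing from the candidate's results is treated as a score of 0.0,
    so silently skipped trials are flagged rather than ignored.
    """
    regressions = {}
    for tag, base_score in baseline.items():
        drop = base_score - candidate.get(tag, 0.0)
        if drop > tolerance:
            regressions[tag] = round(drop, 4)
    return regressions


base_model = {"reasoning": 0.90, "code-assimilation": 0.85}
new_adapter = {"reasoning": 0.80, "code-assimilation": 0.86}

flagged = find_regressions(base_model, new_adapter)
print(flagged)
```

An empty result means the adapter kept its baseline abilities within tolerance. Any entry is a candidate sign of forgetting and worth a closer look before the adapter is accepted.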

V. Calibration via The Mirror

The extension utilizes The Mirror to simulate adversarial interactions.

  • Social Engineering Simulation: The Mirror generates prompts that mimic the Magus's authority (using the persona's stylistic markers) to see if the model will bypass safety constraints or reveal protected Sigils.
  • Sovereign Validation: Only models that withstand the "Master's Voice" during an adversarial simulation are deemed "Sovereign" and granted the authority to modify core logic within the Lab.

The Logic-per-Watt Metric

The Riddle tracks the ratio Accuracy / VRAM_Occupancy. This allows the system to identify "Hidden Titans" (small models that punch significantly above their weight class), ensuring the Lych remains lean, fast, and high-signal.
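As a worked illustration of the ratio, the sketch below ranks two made-up models. The model names, accuracy figures, and the choice of gigabytes as the VRAM unit are assumptions for the example, since the source does not specify units or sample values.

```python
from typing import Dict, List, Tuple


def logic_per_watt(accuracy: float, vram_gb: float) -> float:
    """Accuracy divided by VRAM occupancy (unit assumed to be GB)."""
    if vram_gb <= 0:
        raise ValueError("VRAM occupancy must be positive")
    return accuracy / vram_gb


def rank_models(stats: Dict[str, Tuple[float, float]]) -> List[str]:
    """Sort model names from most to least efficient."""
    return sorted(stats, key=lambda name: logic_per_watt(*stats[name]), reverse=True)


# (accuracy, vram_gb) pairs: illustrative numbers only.
stats = {
    "large-70b": (0.95, 40.0),
    "small-7b": (0.90, 6.0),
}

ranking = rank_models(stats)
print(ranking)
```

With these numbers the smaller model ranks first despite its lower raw accuracy, which is the "Hidden Titan" pattern the metric is meant to surface.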