LLM Audit Log + Replay: from 'cannot reproduce' to 'locate in 60s'

LLM Audit Log + Replay for indie debugging loops

When a user says "your AI replied something weird again", you should be able to answer these questions quickly:

  • which request?
  • which model + parameters?
  • what input/context?
  • which prompt version?
  • why this output?

Why LLM bugs are hard to debug

  • prompt construction is dynamic
  • the parameter space is large
  • retries and routing introduce hidden branches
  • outputs are nondeterministic

Two building blocks you need

  1. Audit Log for every call (evidence chain)
  2. Replay with same or modified params

Minimal fields to store

Request
- request_id
- timestamp
- app_id / user_id / session_id
- provider / model
- temperature / top_p / max_tokens
- input_messages (redacted or hashed)
- system_prompt_version

Response
- status (ok/error)
- latency_ms
- output_text (redacted)
- tool_calls
- tokens_in / tokens_out / cost
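The fields above can be captured in a single flat record. A minimal sketch in Python, assuming a dataclass-based schema; the names mirror the lists above, and `hash_messages` is a hypothetical helper for the "redacted or hashed" option on `input_messages`:

```python
import hashlib, json, time, uuid
from dataclasses import dataclass, field

@dataclass
class AuditRecord:
    # request side
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)
    app_id: str = ""
    user_id: str = ""
    session_id: str = ""
    provider: str = ""
    model: str = ""
    temperature: float = 0.0
    top_p: float = 1.0
    max_tokens: int = 0
    input_hash: str = ""            # hash of input_messages if raw text can't be stored
    system_prompt_version: str = ""
    # response side
    status: str = "ok"
    latency_ms: int = 0
    output_text: str = ""           # store the redacted form
    tool_calls: list = field(default_factory=list)
    tokens_in: int = 0
    tokens_out: int = 0
    cost: float = 0.0

def hash_messages(messages):
    """Stable hash so identical inputs can be matched without storing raw text."""
    blob = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

record = AuditRecord(
    model="gpt-x",
    system_prompt_version="v12",
    input_hash=hash_messages([{"role": "user", "content": "hi"}]),
)
```

A flat record like this is easy to write to any log sink and easy to index by `request_id` later; nested schemas make search and retention policies harder than they need to be.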

Always include system_prompt_version.
Without it, you can no longer match a historical incident to the prompt that actually produced it once the prompt changes.

Replay modes

  • Strict replay: same model and params
  • Compare replay: same input, different model/temperature

Recommended dashboard actions:

  • Replay (strict)
  • Replay with a different model
  • Replay with a different temperature
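Both modes can share one code path: strict replay re-sends the stored parameters unchanged, and compare replay overrides just the fields you want to vary. A sketch, assuming a flat audit record dict; `call_llm` is a placeholder for your actual provider client, not a real API:

```python
def call_llm(model, messages, temperature, top_p, max_tokens):
    # placeholder: route to your actual provider SDK here
    return {"text": f"[{model} @ T={temperature}] ..."}

def replay(record, **overrides):
    """Strict replay by default; pass overrides (model=..., temperature=...)
    to turn it into a compare replay."""
    params = {
        "model": record["model"],
        "messages": record["input_messages"],
        "temperature": record["temperature"],
        "top_p": record["top_p"],
        "max_tokens": record["max_tokens"],
    }
    params.update(overrides)  # compare replay changes only what you override
    return call_llm(**params)

stored = {"model": "m1", "input_messages": [{"role": "user", "content": "hi"}],
          "temperature": 0.7, "top_p": 1.0, "max_tokens": 256}
strict = replay(stored)                    # same model and params
compare = replay(stored, temperature=0.0)  # same input, different temperature
```

Keeping overrides explicit means every dashboard button ("Replay", "Replay with a different model", ...) maps to one keyword argument, and the replayed call can itself be written back as a new audit record.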

Fast debugging Q&A template

  • Why did hallucinations suddenly spike?
    Check for temperature drift and model routing changes.

  • Why did secret-like output appear?
    Check for context contamination and redaction toggles.

  • Why did yesterday's call work while today's fails?
    Compare the prompt version and tool output versions.
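The last question is mostly a diff over two audit records. An illustrative helper, assuming flat record dicts; the field names are examples and should be adjusted to your own log format:

```python
# Fields worth diffing when a call that worked yesterday fails today.
DIFF_FIELDS = ["model", "temperature", "top_p", "system_prompt_version"]

def diff_records(old, new, fields=DIFF_FIELDS):
    """Return {field: (old_value, new_value)} for every field that changed."""
    return {f: (old.get(f), new.get(f))
            for f in fields if old.get(f) != new.get(f)}

yesterday = {"model": "m1", "temperature": 0.2, "top_p": 1.0,
             "system_prompt_version": "v11"}
today     = {"model": "m1", "temperature": 0.9, "top_p": 1.0,
             "system_prompt_version": "v12"}
changed = diff_records(yesterday, today)
# here: temperature and system_prompt_version changed; model did not
```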

Minimal implementation checklist

  • [ ] request_id shown in logs + support tickets
  • [ ] redaction enabled by default
  • [ ] project-level retention and search
  • [ ] one-click replay for each audit record
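For the "redaction enabled by default" item, a minimal regex-based sketch; the two patterns are illustrative examples only, and a real deployment needs provider-specific rules and review:

```python
import re

# Example patterns: an sk-style API key and an email address.
PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]{10,}"), "[REDACTED_API_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
]

def redact(text):
    """Apply every pattern in order; call this before any record is persisted."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(redact("contact a@b.com with key sk-abcdefghij123"))
# → contact [REDACTED_EMAIL] with key [REDACTED_API_KEY]
```

Applying redaction at write time, rather than at display time, means a leaked log dump never contains the raw secrets in the first place.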

Next steps