LLM Audit Log + Replay for indie debugging loops
When a user reports "your AI replied something weird again," you should be able to answer quickly:
- which request?
- which model + parameters?
- what input/context?
- which prompt version?
- why this output?
Why LLM bugs are hard to debug
- prompt construction is dynamic
- parameter space is large
- retries and routing introduce hidden branches
- sampling makes outputs nondeterministic
Two building blocks you need
- Audit Log for every call (evidence chain)
- Replay with same or modified params
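The audit-log half can be sketched as a thin wrapper around whatever client function you already call. Here `call_model` and `log_sink` are placeholders for your provider client and log destination, not a specific SDK:

```python
import json
import time
import uuid

def audited_call(call_model, messages, *, model, temperature, log_sink):
    """Run an LLM call and write one audit record per request, even on error."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "temperature": temperature,
        "input_messages": messages,  # redact or hash before storing in production
    }
    start = time.monotonic()
    try:
        output = call_model(messages, model=model, temperature=temperature)
        record.update(status="ok", output_text=output)
        return output
    except Exception as exc:
        record.update(status="error", output_text=repr(exc))
        raise
    finally:
        # the finally block guarantees an evidence trail for failed calls too
        record["latency_ms"] = round((time.monotonic() - start) * 1000, 1)
        log_sink(json.dumps(record))
```

The `finally` block is the important design choice: failed and retried calls are exactly the ones you will need evidence for later.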
Minimal fields to store
Request
- request_id
- timestamp
- app_id / user_id / session_id
- provider / model
- temperature / top_p / max_tokens
- input_messages (redacted or hashed)
- system_prompt_version
Response
- status (ok/error)
- latency_ms
- output_text (redacted)
- tool_calls
- tokens_in / tokens_out / cost
Always include system_prompt_version.
Without it, you cannot tell which prompt actually produced a historical incident once the prompt has changed.
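The field list above maps directly onto a record type. A minimal sketch as a Python dataclass, with the field names taken from this section (any types are assumptions):

```python
from dataclasses import asdict, dataclass, field

@dataclass
class AuditRecord:
    # request side
    request_id: str
    timestamp: float
    app_id: str
    user_id: str
    session_id: str
    provider: str
    model: str
    temperature: float
    top_p: float
    max_tokens: int
    input_messages_hash: str       # store a redacted or hashed form, not raw text
    system_prompt_version: str     # required to align old incidents with old prompts
    # response side
    status: str = "ok"             # "ok" or "error"
    latency_ms: float = 0.0
    output_text: str = ""          # redacted
    tool_calls: list = field(default_factory=list)
    tokens_in: int = 0
    tokens_out: int = 0
    cost: float = 0.0
```

Response-side fields get defaults so a record can be created at request time and filled in when the call returns.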
Replay modes
- Strict replay: same model and params
- Compare replay: same input, different model/temperature
Recommended dashboard actions:
- Replay (strict)
- Replay with a different model
- Replay with a different temperature
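Both modes fit in one function: with no overrides it is a strict replay, and any override turns it into a compare replay. Note that even a strict replay is only exactly reproducible at temperature 0 or with a provider-supported seed. A sketch, assuming the stored record is a dict shaped like the fields above:

```python
def replay(record, call_model, *, model=None, temperature=None):
    """Re-run a stored request.

    No overrides  -> strict replay (same model and params).
    Any override  -> compare replay (same input, different model/temperature).
    """
    params = {
        "model": model or record["model"],
        # explicit None check so temperature=0.0 is a valid override
        "temperature": record["temperature"] if temperature is None else temperature,
    }
    return call_model(record["input_messages"], **params)
```

The explicit `is None` check matters: `temperature or record[...]` would silently ignore a `0.0` override, which is exactly the value you want for deterministic comparisons.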
Fast debugging Q&A template
- Why sudden hallucinations? Check temperature drift and model routing changes.
- Why did secret-like output appear? Check context contamination and redaction toggles.
- Why did yesterday work but today fail? Compare prompt versions and tool output versions.
Minimal implementation checklist
- [ ] request_id shown in logs + support tickets
- [ ] redaction enabled by default
- [ ] project-level retention and search
- [ ] one-click replay for each audit record
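The "redaction enabled by default" item can start as simple pattern scrubbing applied before any text reaches the log. The patterns below are illustrative only, not an exhaustive redactor:

```python
import re

# crude examples for illustration; a real redactor would be configurable
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"(?i)\b(sk|api|key|token)[-_][A-Za-z0-9]{8,}\b"), "<SECRET>"),
]

def redact(text: str) -> str:
    """Replace email- and secret-looking substrings before logging."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Running the scrub at write time (rather than at read time) means raw PII never lands in storage, which keeps retention and search safe by construction.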