LLM Audit Log + Replay for indie debugging loops
When a user reports "your AI replied something weird again," you should be able to answer quickly:
- which request?
- which model + parameters?
- what input/context?
- which prompt version?
- why this output?
Why LLM bugs are hard to debug
- prompt construction is dynamic
- parameter space is large
- retries and routing introduce hidden branches
- sampling makes outputs nondeterministic
Two building blocks you need
- Audit Log for every call (evidence chain)
- Replay with same or modified params
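The audit-log half can be sketched as a thin wrapper around whatever client function you already call. Here `call_model` and `log_sink` are placeholders for your provider client and log destination, not a specific SDK:

```python
import json
import time
import uuid

def audited_call(call_model, messages, *, model, temperature, log_sink):
    """Run an LLM call and write one audit record per request, even on error."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "temperature": temperature,
        "input_messages": messages,  # redact or hash before storing in production
    }
    start = time.monotonic()
    try:
        output = call_model(messages, model=model, temperature=temperature)
        record.update(status="ok", output_text=output)
        return output
    except Exception as exc:
        record.update(status="error", output_text=repr(exc))
        raise
    finally:
        # the finally block guarantees an evidence trail for failed calls too
        record["latency_ms"] = round((time.monotonic() - start) * 1000, 1)
        log_sink(json.dumps(record))
```

The `finally` block is the important design choice: failed and retried calls are exactly the ones you will need evidence for later.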
Minimal fields to store
Request
- request_id
- timestamp
- app_id / user_id / session_id
- provider / model
- temperature / top_p / max_tokens
- input_messages (redacted or hashed)
- system_prompt_version
Response
- status (ok/error)
- latency_ms
- output_text (redacted)
- tool_calls
- tokens_in / tokens_out / cost
Always include system_prompt_version.
Without it, you cannot tell which prompt actually produced a historical incident once the prompt has changed.
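The field list above maps directly onto a record type. A minimal sketch as a Python dataclass, with the field names taken from this section (any types are assumptions):

```python
from dataclasses import asdict, dataclass, field

@dataclass
class AuditRecord:
    # request side
    request_id: str
    timestamp: float
    app_id: str
    user_id: str
    session_id: str
    provider: str
    model: str
    temperature: float
    top_p: float
    max_tokens: int
    input_messages_hash: str       # store a redacted or hashed form, not raw text
    system_prompt_version: str     # required to align old incidents with old prompts
    # response side
    status: str = "ok"             # "ok" or "error"
    latency_ms: float = 0.0
    output_text: str = ""          # redacted
    tool_calls: list = field(default_factory=list)
    tokens_in: int = 0
    tokens_out: int = 0
    cost: float = 0.0
```

Response-side fields get defaults so a record can be created at request time and filled in when the call returns.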
Replay modes
- Strict replay: same model and params
- Compare replay: same input, different model/temperature
Recommended dashboard actions:
- Replay (strict)
- Replay with a different model
- Replay with a different temperature
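Both modes fit in one function: with no overrides it is a strict replay, and any override turns it into a compare replay. Note that even a strict replay is only exactly reproducible at temperature 0 or with a provider-supported seed. A sketch, assuming the stored record is a dict shaped like the fields above:

```python
def replay(record, call_model, *, model=None, temperature=None):
    """Re-run a stored request.

    No overrides  -> strict replay (same model and params).
    Any override  -> compare replay (same input, different model/temperature).
    """
    params = {
        "model": model or record["model"],
        # explicit None check so temperature=0.0 is a valid override
        "temperature": record["temperature"] if temperature is None else temperature,
    }
    return call_model(record["input_messages"], **params)
```

The explicit `is None` check matters: `temperature or record[...]` would silently ignore a `0.0` override, which is exactly the value you want for deterministic comparisons.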
Fast debugging Q&A template
- Why sudden hallucinations? Check temperature drift and model routing changes.
- Why did secret-like output appear? Check context contamination and redaction toggles.
- Why did yesterday work but today fail? Compare prompt versions and tool output versions.
Minimal implementation checklist
- [ ] request_id shown in logs + support tickets
- [ ] redaction enabled by default
- [ ] project-level retention and search
- [ ] one-click replay for each audit record
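The "redaction enabled by default" item can start as simple pattern scrubbing applied before any text reaches the log. The patterns below are illustrative only, not an exhaustive redactor:

```python
import re

# crude examples for illustration; a real redactor would be configurable
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"(?i)\b(sk|api|key|token)[-_][A-Za-z0-9]{8,}\b"), "<SECRET>"),
]

def redact(text: str) -> str:
    """Replace email- and secret-looking substrings before logging."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Running the scrub at write time (rather than at read time) means raw PII never lands in storage, which keeps retention and search safe by construction.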