LLM Evaluation

Agentic AI evals: lessons from real life

AI products can change under your feet. Here’s what I learned about measuring whether they do what you think they should.

Automatic failure diagnosis

An eval score going down tells you something broke. It doesn’t tell you what. ProbeLLM is a new approach to automatic failure diagnosis that treats AI evaluation like an oral exam.

Grading the graders: how do we know if an AI judge is any good?

We use AI systems to evaluate other AI systems. But validating those judges is harder than it looks, especially when the right answer isn’t as clear as it seems.