Agentic AI evals: lessons from real life

AI products can change under your feet. Here’s what I learned about measuring whether they do what you think they should.

May 7, 2026 · 7 min · Zili Shen

Automatic failure diagnosis

An eval score going down tells you something broke. It doesn’t tell you what. ProbeLLM is a new approach to automatic failure diagnosis that treats AI evaluation like an oral exam.

April 28, 2026 · 5 min · Zili Shen

Grading the graders: how do we know if an AI judge is any good?

We use AI systems to evaluate other AI systems. But validating those judges is harder than it looks — especially when the right answer isn’t as clear as it seems.

January 23, 2026 · 6 min · Zili Shen