Agentic AI evals: lessons from real life

AI products can change under your feet. Here’s what I learned about measuring whether they do what you think they should.

May 7, 2026 · 7 min · Zili Shen

Automatic failure diagnosis

An eval score going down tells you something broke. It doesn’t tell you what. ProbeLLM is a new approach to automatic failure diagnosis that treats AI evaluation like an oral exam.

April 28, 2026 · 5 min · Zili Shen

Grading the graders: how do we know if an AI judge is any good?

We use AI systems to evaluate other AI systems. But validating those judges is harder than it looks — especially when the right answer isn’t as clear as it seems.

January 23, 2026 · 6 min · Zili Shen

Cloud Computing for (Observational) Astronomy

You’ve used the cloud, but have you thought about using it for astronomy? A roundup from a panel at #AAS241.

January 26, 2023 · 1 min · Zili Shen

How Not to Bury Ourselves Under Space Trash

Our planet is already blanketed by space debris. As small commercial satellites rapidly multiply, will we block ourselves from space?

February 24, 2022 · 1 min · Zili Shen

From Star Parties to Observatories: An Astronomer's Journey

Reflecting on observatory trips and what I love about observing the night sky.

November 12, 2021 · 1 min · Zili Shen