Blog | Zili Shen

Agentic AI evals: lessons from real life

AI products can change under your feet. Here’s what I learned about measuring whether they do what you think they should.

Automatic failure diagnosis

An eval score going down tells you something broke. It doesn’t tell you what. ProbeLLM is a new approach to automatic failure diagnosis that treats AI evaluation like an oral exam.

Grading the graders: how do we know if an AI judge is any good?

We use AI systems to evaluate other AI systems. But validating those judges is harder than it looks — especially when the right answer isn’t as clear as it seems.

Cloud Computing for (Observational) Astronomy

You’ve used the cloud, but have you thought about using it for astronomy? A roundup from a panel at #AAS241.

How Not to Bury Ourselves Under Space Trash

Our planet is already blanketed by space debris. As small commercial satellites rapidly multiply, will we block ourselves from space?

From Star Parties to Observatories: An Astronomer's Journey

Reflecting on observatory trips and what I love about observing the night sky.