Zili Shen

Agentic AI evals: lessons from real life

Thu, 07 May 2026 00:00:00 +0000

AI products can change under your feet. Here’s what I learned about measuring whether they do what you think they should.

Automatic failure diagnosis

Tue, 28 Apr 2026 00:00:00 +0000

An eval score going down tells you something broke. It doesn’t tell you what. ProbeLLM is a new approach to automatic failure diagnosis that treats AI evaluation like an oral exam.

Grading the graders: how do we know if an AI judge is any good?

Fri, 23 Jan 2026 00:00:00 +0000

We use AI systems to evaluate other AI systems. But validating those judges is harder than it looks — especially when the right answer isn’t as clear as it seems.

Cloud Computing for (Observational) Astronomy

Thu, 26 Jan 2023 00:00:00 +0000

You’ve used the cloud, but have you thought about using it for astronomy? A roundup from a panel at #AAS241.

How Not to Bury Ourselves Under Space Trash

Thu, 24 Feb 2022 00:00:00 +0000

Our planet is already blanketed by space debris. As small commercial satellites rapidly multiply, will we block ourselves from space?

From Star Parties to Observatories: An Astronomer's Journey

Fri, 12 Nov 2021 00:00:00 +0000

Reflecting on observatory trips and what I love about observing the night sky.

About

Zili is a Member of Technical Staff at P-1 AI, where she works as an AI eval research engineer specializing in LLM-based agents.

She graduated from Yale in 2025 with a Ph.D. in astrophysics. For her thesis, she led the science analysis of the Dragonfly Ultrawide Survey, mapping 10,000 square degrees of the northern sky using a custom data pipeline on AWS.

She writes about science. She contributed 18 articles to Astrobites and worked at the Yale Poorvu Center as a Graduate Writing Fellow, offering one-on-one writing consultations and leading workshops.

Research

AI Safety & Evaluations

Current role: Member of Technical Staff at P-1 AI, working as an AI eval research engineer specializing in LLM-based agents.

Previous: Algoverse AI Safety Fellowship — Evaluating agents on long-horizon tasks.

LLM agents are increasingly deployed to carry out complex, multi-step tasks on behalf of users. During this process, are agents able to retain their alignment training, remember the original goal, and adapt to unexpected changes in the environment?