Agentic AI evals: lessons from real life
AI products can change under your feet. Here’s what I learned about measuring whether they do what you think they should.
An eval score going down tells you something broke. It doesn’t tell you what. ProbeLLM is a new approach to automatic failure diagnosis that treats AI evaluation like an oral exam.
We use AI systems to evaluate other AI systems. But validating those judges is harder than it looks — especially when the right answer isn’t as clear as it seems.
You’ve used the cloud, but have you thought about using it for astronomy? A roundup from a panel at #AAS241.
Our planet is already blanketed by space debris. As small commercial satellites rapidly multiply, will humans block themselves from space?
Reflecting on observatory trips and what I love about observing the night sky.