<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>AI on Zili Shen</title>
    <link>https://zilishen.com/tags/ai/</link>
    <description>Recent content in AI on Zili Shen</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 07 May 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://zilishen.com/tags/ai/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Agentic AI evals: lessons from real life</title>
      <link>https://zilishen.com/blog/agentic-ai-evals/</link>
      <pubDate>Thu, 07 May 2026 00:00:00 +0000</pubDate>
      <guid>https://zilishen.com/blog/agentic-ai-evals/</guid>
      <description>AI products can change under your feet. Here’s what I learned about measuring whether they do what you think they should.</description>
    </item>
    <item>
      <title>Automatic failure diagnosis</title>
      <link>https://zilishen.com/blog/probellm-failure-diagnosis/</link>
      <pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://zilishen.com/blog/probellm-failure-diagnosis/</guid>
      <description>An eval score going down tells you something broke. It doesn’t tell you what. ProbeLLM is a new approach to automatic failure diagnosis that treats AI evaluation like an oral exam.</description>
    </item>
    <item>
      <title>Grading the graders: how do we know if an AI judge is any good?</title>
      <link>https://zilishen.com/blog/llm-judge-validation/</link>
      <pubDate>Fri, 23 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://zilishen.com/blog/llm-judge-validation/</guid>
      <description>We use AI systems to evaluate other AI systems. But validating those judges is harder than it looks, especially when the right answer isn’t as clear as it seems.</description>
    </item>
  </channel>
</rss>
