Cache Me if You Can - COSW '25 Conference Talk
In AI systems, caching isn't just about storing data; it's about intelligently managing expensive computations. This talk presents advanced caching strategies that help make AI economically viable at scale. We'll explore semantic similarity caching for LLM responses, where queries similar to ones seen before retrieve cached completions instead of triggering new inference. I'll share lessons on reducing or eliminating inference costs, cutting latency, choosing models wisely, and weighing the economics of serving at scale. The talk includes a deep dive into vector similarity search for cache lookups and strategies for caching non-deterministic AI outputs.
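To give a flavor of the core idea, here is a minimal sketch of a semantic cache: queries are embedded as vectors, and a lookup returns a cached completion when the new query's embedding is close enough (by cosine similarity) to a stored one. This is an illustrative assumption of how such a cache might be structured, not the implementation discussed in the talk; the `embed` function and the `threshold` value are placeholders, and a real system would use an actual embedding model and a vector index.

```python
import numpy as np


def embed(text: str) -> np.ndarray:
    """Placeholder embedding function (assumed). A real cache would call an
    embedding model here; this hash-based stand-in only makes the sketch run."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)  # unit-normalize so dot product == cosine similarity


class SemanticCache:
    """Store (query embedding, completion) pairs; serve a cached completion
    when a new query is sufficiently similar to a previously answered one."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold            # illustrative similarity cutoff for a hit
        self.embeddings: list[np.ndarray] = []
        self.completions: list[str] = []

    def get(self, query: str) -> str | None:
        if not self.embeddings:
            return None
        q = embed(query)
        sims = np.stack(self.embeddings) @ q  # cosine similarities against all cached queries
        best = int(np.argmax(sims))
        return self.completions[best] if sims[best] >= self.threshold else None

    def put(self, query: str, completion: str) -> None:
        self.embeddings.append(embed(query))
        self.completions.append(completion)


# Usage sketch: check the cache before paying for inference, store the result after.
cache = SemanticCache(threshold=0.9)
if (hit := cache.get("What is semantic caching?")) is None:
    completion = "…call the LLM here…"
    cache.put("What is semantic caching?", completion)
```

In practice the linear scan over embeddings would be replaced by an approximate nearest-neighbor index, which is where the vector similarity search discussed in the talk comes in.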