Cache Me if You Can - COSW '25 Conference Talk

In AI systems, caching isn't just about storing data; it's about intelligently managing expensive computations. This talk presents advanced caching strategies that make AI economically viable at scale. We'll explore semantic similarity caching for LLM responses, where similar queries retrieve cached completions, and I'll share lessons on cutting or eliminating inference costs, reducing latency, choosing models wisely, and weighing the economics. The talk includes a deep dive into vector similarity search for cache lookups and strategies for caching non-deterministic AI outputs.
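
To make the semantic caching idea concrete, here is a minimal sketch, assuming an embed_fn that returns unit-normalized vectors (any sentence-embedding model will do) and a plain in-memory list as the store; a production version would back this with a vector database, but the lookup logic is the same.

    import numpy as np

    class SemanticCache:
        """Cache LLM responses keyed by query embeddings rather than exact strings."""

        def __init__(self, embed_fn, threshold=0.92):
            self.embed = embed_fn       # any function: str -> unit-norm 1-D np.ndarray
            self.threshold = threshold  # similarity cutoff; higher means stricter matching
            self.entries = []           # list of (embedding, cached response) pairs

        def get(self, query):
            """Return a cached response for a semantically close query, else None."""
            if not self.entries:
                return None
            q = self.embed(query)
            vectors = np.stack([emb for emb, _ in self.entries])
            scores = vectors @ q        # cosine similarity, since embeddings are unit-norm
            best = int(np.argmax(scores))
            return self.entries[best][1] if scores[best] >= self.threshold else None

        def put(self, query, response):
            self.entries.append((self.embed(query), response))

On a miss the caller pays for one model call and stores the result with put(); later paraphrases of the same question are served from get() instead of the API.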

Audience

ML Engineers, AI Architects, Platform Teams

Format

Technical Talk - 45 minutes

Key Takeaways

• Semantic caching for LLM responses to cut API costs by 80%
• Vector similarity search for intelligent cache retrieval
• Distributed caching patterns for model weights and embeddings
• Managing cache invalidation when models are retrained (see the sketch below)
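
As a hedged sketch of the invalidation point above: fold the model name and version into the cache key so that entries produced by a retired model simply stop matching after a retrain. The Redis client and the 24-hour TTL here are illustrative assumptions, not prescriptions from the talk; any key-value store works.

    import hashlib
    import json

    import redis  # assumed backend for the distributed cache

    r = redis.Redis(host="localhost", port=6379)

    def cache_key(model_name, model_version, prompt, params):
        """Version-aware key: bumping model_version after retraining orphans stale entries."""
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        return f"llm:{model_name}:{model_version}:{digest}"

    def cached_completion(model_name, model_version, prompt, params, call_model):
        key = cache_key(model_name, model_version, prompt, params)
        hit = r.get(key)
        if hit is not None:
            return hit.decode("utf-8")
        response = call_model(prompt, **params)  # the expensive inference call
        r.setex(key, 24 * 3600, response)        # TTL as a safety net against drift
        return response

Keys written under the previous model version age out via the TTL, so shipping a retrained model needs no explicit purge.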

Simple Yet Powerful AI Caching Strategies

Caching in AI systems is surprisingly simple and elegant - store expensive computations and reuse them intelligently. We cover semantic caching where similar prompts retrieve cached LLM responses, embedding caches that store vector representations for instant similarity search, and inference result caching for deterministic model outputs. The talk demonstrates how these straightforward patterns can reduce latency from seconds to milliseconds and cut costs by orders of magnitude, making AI features economically viable at scale.
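
To illustrate the embedding-cache piece, a minimal sketch, assuming an embed_batch function for whatever embedding model is in use; the only idea shown is that identical text never pays for a second embedding call.

    import hashlib

    class EmbeddingCache:
        """Memoize embeddings by content hash so repeated texts skip the embedding model."""

        def __init__(self, embed_batch_fn):
            self.embed_batch = embed_batch_fn  # assumed: list[str] -> list of vectors
            self.store = {}                    # sha256(text) -> vector

        @staticmethod
        def _key(text):
            return hashlib.sha256(text.encode("utf-8")).hexdigest()

        def get_many(self, texts):
            """Return embeddings for texts, computing only the ones not seen before."""
            missing = [t for t in texts if self._key(t) not in self.store]
            if missing:
                for text, vector in zip(missing, self.embed_batch(missing)):
                    self.store[self._key(text)] = vector
            return [self.store[self._key(t)] for t in texts]

The same pattern covers deterministic inference results: hash the full input and look it up before calling the model.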

The Napkin Math Inspiration

This talk started with a simple realization while building my SaaS product: inference costs would destroy my margins at scale. I did the napkin math. On a network-effects platform, users often ask similar questions. If 100 users ask variations of the same query, why pay for 100 API calls? By caching responses and matching similar queries, I could theoretically cut inference costs by 70-90%. The economics were compelling: better unit economics meant I could price competitively while maintaining healthy SaaS margins. This strategic use of caching transformed AI from a cost center into a sustainable product feature.
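
For the curious, the napkin math itself fits in a few lines. The volume, price, and hit rate below are placeholder assumptions for illustration, not figures from the talk.

    # Back-of-the-envelope cost model with illustrative, made-up numbers.
    monthly_queries = 1_000_000
    cost_per_call = 0.002         # assumed blended cost per LLM API call, in dollars
    cache_hit_rate = 0.80         # assumed share of queries that are near-duplicates

    baseline = monthly_queries * cost_per_call
    with_cache = monthly_queries * (1 - cache_hit_rate) * cost_per_call

    print(f"baseline:   ${baseline:,.0f}/month")    # $2,000/month
    print(f"with cache: ${with_cache:,.0f}/month")  # $400/month, an 80% saving

The saving tracks the hit rate directly, which is where the 70-90% figure comes from.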
