Cache Me if You Can - COSW '25 Conference Talk

In AI systems, caching isn't just about storing data; it's about intelligently managing expensive computations. This talk presents advanced caching strategies that make AI economically viable at scale. We'll explore semantic similarity caching for LLM responses, where similar queries retrieve cached completions, and I'll share lessons on cutting or eliminating inference costs, reducing latency, choosing models wisely, and weighing the economics. The talk includes a deep dive into vector similarity search for cache lookups and strategies for caching non-deterministic AI outputs.
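
To make the semantic caching idea concrete, here is a minimal sketch, assuming an embed_fn that returns unit-normalized vectors (any sentence-embedding model will do) and a plain in-memory list as the store; a production version would back this with a vector database, but the lookup logic is the same.

    import numpy as np

    class SemanticCache:
        """Cache LLM responses keyed by query embeddings rather than exact strings."""

        def __init__(self, embed_fn, threshold=0.92):
            self.embed = embed_fn       # any function: str -> unit-norm 1-D np.ndarray
            self.threshold = threshold  # similarity cutoff; higher means stricter matching
            self.entries = []           # list of (embedding, cached response) pairs

        def get(self, query):
            """Return a cached response for a semantically close query, else None."""
            if not self.entries:
                return None
            q = self.embed(query)
            vectors = np.stack([emb for emb, _ in self.entries])
            scores = vectors @ q        # cosine similarity, since embeddings are unit-norm
            best = int(np.argmax(scores))
            return self.entries[best][1] if scores[best] >= self.threshold else None

        def put(self, query, response):
            self.entries.append((self.embed(query), response))

On a miss the caller pays for one model call and stores the result with put(); later paraphrases of the same question are served from get() instead of the API.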

Audience

ML Engineers, AI Architects, Platform Teams

Format

Technical Talk - 45 minutes

Key Takeaways

• Semantic caching for LLM responses to cut API costs by 80%
• Vector similarity search for intelligent cache retrieval
• Distributed caching patterns for model weights and embeddings
• Managing cache invalidation when models are retrained (see the sketch below)
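
As a hedged sketch of the invalidation point above: fold the model name and version into the cache key so that entries produced by a retired model simply stop matching after a retrain. The Redis client and the 24-hour TTL here are illustrative assumptions, not prescriptions from the talk; any key-value store works.

    import hashlib
    import json

    import redis  # assumed backend for the distributed cache

    r = redis.Redis(host="localhost", port=6379)

    def cache_key(model_name, model_version, prompt, params):
        """Version-aware key: bumping model_version after retraining orphans stale entries."""
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        return f"llm:{model_name}:{model_version}:{digest}"

    def cached_completion(model_name, model_version, prompt, params, call_model):
        key = cache_key(model_name, model_version, prompt, params)
        hit = r.get(key)
        if hit is not None:
            return hit.decode("utf-8")
        response = call_model(prompt, **params)  # the expensive inference call
        r.setex(key, 24 * 3600, response)        # TTL as a safety net against drift
        return response

Keys written under the previous model version age out via the TTL, so shipping a retrained model needs no explicit purge.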

Simple Yet Powerful AI Caching Strategies

Caching in AI systems is surprisingly simple and elegant - store expensive computations and reuse them intelligently. We cover semantic caching where similar prompts retrieve cached LLM responses, embedding caches that store vector representations for instant similarity search, and inference result caching for deterministic model outputs. The talk demonstrates how these straightforward patterns can reduce latency from seconds to milliseconds and cut costs by orders of magnitude, making AI features economically viable at scale.
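
To illustrate the embedding-cache piece, a minimal sketch, assuming an embed_batch function for whatever embedding model is in use; the only idea shown is that identical text never pays for a second embedding call.

    import hashlib

    class EmbeddingCache:
        """Memoize embeddings by content hash so repeated texts skip the embedding model."""

        def __init__(self, embed_batch_fn):
            self.embed_batch = embed_batch_fn  # assumed: list[str] -> list of vectors
            self.store = {}                    # sha256(text) -> vector

        @staticmethod
        def _key(text):
            return hashlib.sha256(text.encode("utf-8")).hexdigest()

        def get_many(self, texts):
            """Return embeddings for texts, computing only the ones not seen before."""
            missing = [t for t in texts if self._key(t) not in self.store]
            if missing:
                for text, vector in zip(missing, self.embed_batch(missing)):
                    self.store[self._key(text)] = vector
            return [self.store[self._key(t)] for t in texts]

The same pattern covers deterministic inference results: hash the full input and look it up before calling the model.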

The Napkin Math Inspiration

This talk started with a simple realization while building my SaaS product: inference costs would destroy my margins at scale. I did the napkin math. On a network-effects platform, users often ask similar questions. If 100 users ask variations of the same query, why pay for 100 API calls? By caching responses and matching similar queries, I could theoretically cut inference costs by 70-90%. The economics were compelling: better unit economics meant I could price competitively while maintaining healthy SaaS margins. This strategic use of caching transformed AI from a cost center into a sustainable product feature.
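
For the curious, the napkin math itself fits in a few lines. The volume, price, and hit rate below are placeholder assumptions for illustration, not figures from the talk.

    # Back-of-the-envelope cost model with illustrative, made-up numbers.
    monthly_queries = 1_000_000
    cost_per_call = 0.002         # assumed blended cost per LLM API call, in dollars
    cache_hit_rate = 0.80         # assumed share of queries that are near-duplicates

    baseline = monthly_queries * cost_per_call
    with_cache = monthly_queries * (1 - cache_hit_rate) * cost_per_call

    print(f"baseline:   ${baseline:,.0f}/month")    # $2,000/month
    print(f"with cache: ${with_cache:,.0f}/month")  # $400/month, an 80% saving

The saving tracks the hit rate directly, which is where the 70-90% figure comes from.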
