Chapter 11: Performance Patterns at Scale
Running an AI coding assistant for a handful of developers differs dramatically from serving thousands of concurrent users. When an assistant processes a complex refactoring request that spawns multiple sub-agents, each analyzing a different part of the codebase, the computational demands multiply quickly. Add real-time synchronization, file system operations, and LLM inference costs, and performance becomes the make-or-break factor for production viability.
This chapter explores performance patterns that enable AI coding assistants to scale from proof-of-concept to production systems serving entire engineering organizations. We'll examine caching strategies, database optimizations, edge computing patterns, and load balancing approaches that maintain sub-second response times even under heavy load.
The Performance Challenge
AI coding assistants face unique performance constraints compared to traditional web applications:
A single user interaction might trigger:
- Multiple model inference calls (coordinators + specialized agents)
- Dozens of file system operations
- Real-time synchronization across platforms
- Tool executions that spawn processes
- Code analysis across thousands of files
- Version control operations on large repositories
Consider what happens when a user asks an AI assistant to "refactor this authentication system to use OAuth":
- Initial Analysis - The system reads dozens of files to understand the current auth implementation
- Planning - Model generates a refactoring plan, potentially coordinating multiple agents
- Execution - Multiple tools modify files, run tests, and verify changes
- Synchronization - All changes sync across environments and collaborators
- Persistence - Conversation history, file changes, and metadata save to storage
Each step has opportunities for optimization—and potential bottlenecks that can degrade the user experience.
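Before optimizing any single step, it helps to measure where the time actually goes. The sketch below is a minimal, hypothetical timing helper; the phase names, the metrics map, and the commented `planner.createPlan` call are illustrative rather than part of any specific system.

// A minimal sketch: timing each phase of a request
type PhaseName = 'analysis' | 'planning' | 'execution' | 'sync' | 'persistence';

async function timePhase<T>(
  phase: PhaseName,
  metrics: Map<PhaseName, number[]>,
  work: () => Promise<T>
): Promise<T> {
  const start = performance.now();
  try {
    return await work();
  } finally {
    // Record the duration even when the phase fails
    const samples = metrics.get(phase) ?? [];
    samples.push(performance.now() - start);
    metrics.set(phase, samples);
  }
}

// Usage: const plan = await timePhase('planning', metrics, () => planner.createPlan(request));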
Caching Strategies
The most effective performance optimization is avoiding work entirely. Multi-layered caching minimizes redundant operations:
Model Response Caching
Model inference represents the largest latency and cost factor. Intelligent caching can dramatically improve performance:
class ModelResponseCache {
private memoryCache = new Map<string, CachedResponse>();
private persistentCache: PersistentStorage;
private readonly config: CacheConfiguration;
constructor(config: CacheConfiguration) {
this.config = {
maxMemoryEntries: 1000,
ttlMs: 3600000, // 1 hour
persistHighValue: true,
...config
};
this.initializePersistentCache();
}
async get(
messages: Message[],
model: string,
temperature: number
): Promise<CachedResponse | null> {
// Generate stable cache key from request parameters
const key = this.generateCacheKey(messages, model, temperature);
// Check memory cache first (fastest)
const memoryResult = this.memoryCache.get(key);
if (memoryResult && this.isValid(memoryResult)) {
this.updateAccessMetrics(memoryResult);
return memoryResult;
}
// Check persistent cache (slower but larger)
const persistentResult = await this.persistentCache.get(key);
if (persistentResult && this.isValid(persistentResult)) {
// Promote to memory cache
this.memoryCache.set(key, persistentResult);
return persistentResult;
}
return null;
}
async set(
messages: Message[],
model: string,
temperature: number,
response: LLMResponse
): Promise<void> {
const key = this.generateCacheKey(messages, model, temperature);
const cached: CachedResponse = {
key,
messages,
model,
temperature,
response,
timestamp: Date.now(),
lastAccessed: Date.now(),
hitCount: 0
};
this.memoryCache.set(key, cached);
// Evict old entries if cache is full
if (this.memoryCache.size > this.config.maxMemoryEntries) {
this.evictLRU();
}
// Persist high-value entries
if (this.shouldPersist(cached)) {
await this.persistEntry(key, cached);
}
}
private generateCacheKey(
messages: Message[],
model: string,
temperature: number
): string {
// Only cache deterministic requests (temperature = 0)
if (temperature > 0) {
return crypto.randomUUID(); // Unique key = no caching
}
// Create stable key from messages
const messageHash = crypto
.createHash('sha256')
.update(JSON.stringify(messages))
.digest('hex');
return `${model}:${temperature}:${messageHash}`;
}
private evictLRU(): void {
// Find least recently used entry
let lruKey: string | null = null;
let lruTime = Infinity;
for (const [key, entry] of this.memoryCache) {
if (entry.lastAccessed < lruTime) {
lruTime = entry.lastAccessed;
lruKey = key;
}
}
if (lruKey) {
this.cache.delete(lruKey);
}
}
private shouldPersist(entry: CachedResponse): boolean {
// Persist frequently accessed or expensive responses
return entry.hitCount > 5 ||
entry.response.usage.totalTokens > 4000;
}
}
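To make the cache useful, the inference path has to consult it before every model call. A minimal sketch of that wiring might look like the following, assuming a `callModel` function that performs the actual provider request (it is not defined here):

// Sketch: check the cache, fall through to the provider, then populate it
async function cachedInference(
  cache: ModelResponseCache,
  messages: Message[],
  model: string,
  temperature: number
): Promise<LLMResponse> {
  const hit = await cache.get(messages, model, temperature);
  if (hit) {
    return hit.response; // Served from cache: no latency, no token cost
  }
  const response = await callModel(messages, model, temperature); // hypothetical provider call
  await cache.set(messages, model, temperature, response);
  return response;
}

Because only deterministic requests (temperature = 0) produce stable keys, nondeterministic calls effectively pass straight through to the provider.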
File System Caching
File operations are frequent and can be expensive, especially on network filesystems:
export class FileSystemCache {
private contentCache = new Map<string, FileCacheEntry>();
private statCache = new Map<string, StatCacheEntry>();
private globCache = new Map<string, GlobCacheEntry>();
// Watch for file changes to invalidate cache
private watcher = chokidar.watch([], {
persistent: true,
ignoreInitial: true
});
constructor() {
this.watcher.on('change', path => this.invalidate(path));
this.watcher.on('unlink', path => this.invalidate(path));
}
async readFile(path: string): Promise<string> {
const cached = this.contentCache.get(path);
if (cached) {
// Verify cache validity
const stats = await fs.stat(path);
if (stats.mtimeMs <= cached.mtime) {
cached.hits++;
return cached.content;
}
}
// Cache miss - read from disk
const content = await fs.readFile(path, 'utf-8');
const stats = await fs.stat(path);
this.contentCache.set(path, {
content,
mtime: stats.mtimeMs,
size: stats.size,
hits: 0
});
// Start watching this file
this.watcher.add(path);
return content;
}
async glob(pattern: string, options: GlobOptions = {}): Promise<string[]> {
const cacheKey = `${pattern}:${JSON.stringify(options)}`;
// Use cached result if recent enough
const cached = this.globCache.get(cacheKey);
if (cached && Date.now() - cached.timestamp < 5000) {
return cached.results;
}
const results = await fastGlob(pattern, options);
this.globCache.set(cacheKey, {
results,
timestamp: Date.now()
});
return results;
}
private invalidate(path: string): void {
this.contentCache.delete(path);
this.statCache.delete(path);
// Invalidate glob results that might include this file
for (const [key, entry] of this.globCache) {
if (this.mightMatch(path, key)) {
this.globCache.delete(key);
}
}
}
}
Repository Analysis Caching
Code intelligence features require analyzing repository structure, which can be computationally expensive:
export class RepositoryAnalysisCache {
private repoMapCache = new Map<string, RepoMapCache>();
private dependencyCache = new Map<string, DependencyGraph>();
async getRepoMap(
rootPath: string,
options: RepoMapOptions = {}
): Promise<RepoMap> {
const cached = this.repoMapCache.get(rootPath);
if (cached && await this.isCacheValid(cached)) {
return cached.repoMap;
}
// Generate new repo map
const repoMap = await this.generateRepoMap(rootPath, options);
// Cache with metadata
this.repoMapCache.set(rootPath, {
repoMap,
rootPath,
timestamp: Date.now(),
gitCommit: await this.getGitCommit(rootPath),
fileCount: repoMap.files.length
});
return repoMap;
}
private async isCacheValid(cache: RepoMapCache): Promise<boolean> {
// Invalidate if git commit changed
const currentCommit = await this.getGitCommit(cache.rootPath);
if (currentCommit !== cache.gitCommit) {
return false;
}
// Invalidate if too old
const age = Date.now() - cache.timestamp;
if (age > 300000) { // 5 minutes
return false;
}
// Sample a few files to check for changes
const samplesToCheck = Math.min(10, cache.fileCount);
const samples = this.selectRandomSamples(cache.repoMap.files, samplesToCheck);
for (const file of samples) {
try {
const stats = await fs.stat(file.path);
if (stats.mtimeMs > cache.timestamp) {
return false;
}
} catch {
// File deleted
return false;
}
}
return true;
}
}
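The payoff shows up in the planning step: instead of re-walking the repository for every request, context assembly can pull the cached map. A brief sketch, where the `workspaceRoot` value and the condensed-listing approach are assumptions:

// Sketch: reusing the cached repository map when building planning context
const repoCache = new RepositoryAnalysisCache();

async function buildPlanningContext(workspaceRoot: string): Promise<string> {
  const repoMap = await repoCache.getRepoMap(workspaceRoot);
  // A condensed listing of files is often enough context for planning
  return repoMap.files.map(f => f.path).join('\n');
}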
Database Optimization
Conversation storage requires careful optimization to handle millions of interactions efficiently:
Indexed Storage Schema
Efficient conversation storage uses layered database architecture with strategic indexing:
class ConversationDatabase {
private storage: DatabaseAdapter;
async initialize(): Promise<void> {
await this.storage.connect();
await this.ensureSchema();
}
private async ensureSchema(): Promise<void> {
// Conversation metadata for quick access
await this.storage.createTable('conversations', {
id: 'primary_key',
userId: 'indexed',
teamId: 'indexed',
title: 'indexed',
created: 'indexed',
lastActivity: 'indexed',
isShared: 'indexed',
version: 'indexed'
});
// Separate table for message content to optimize loading
await this.storage.createTable('messages', {
id: 'primary_key',
conversationId: 'indexed',
sequence: 'indexed',
timestamp: 'indexed',
content: 'blob',
metadata: 'json'
});
// Lightweight summary table for listings
await this.storage.createTable('conversation_summaries', {
id: 'primary_key',
title: 'indexed',
lastMessage: 'text',
messageCount: 'integer',
participants: 'json'
});
}
async getThread(id: ThreadID): Promise<Thread | null> {
// Load conversation metadata first
const thread = await this.storage.get('conversations', id);
if (!thread) return null;
// Small threads embed their messages; large threads load them separately
if (thread.messageCount > 100) {
thread.messages = await this.storage.query('messages', {
index: 'conversationId',
equals: id,
orderBy: 'sequence'
});
}
return thread;
}
async queryThreads(
query: ThreadQuery
): Promise<ThreadMeta[]> {
// The conversations table carries the indexed metadata for listings
if (query.orderBy === 'lastActivity') {
return this.storage.query('conversations', {
index: 'lastActivity',
after: query.after,
direction: 'desc',
limit: query.limit
});
}
// Otherwise fall back to a full scan with in-memory filtering
const results = await this.storage.getAll('conversations');
return this.applyFilters(results, query);
}
}
Write Batching
Frequent small writes can overwhelm storage systems. Batching improves throughput:
export class BatchedThreadWriter {
private writeQueue = new Map<ThreadID, PendingWrite>();
private flushTimer?: NodeJS.Timeout;
constructor(
private storage: ThreadStorage,
private options: BatchOptions = {}
) {
this.options = {
batchSize: 50,
flushInterval: 1000,
maxWaitTime: 5000,
...options
};
}
async write(thread: Thread): Promise<void> {
const now = Date.now();
this.writeQueue.set(thread.id, {
thread,
queuedAt: now,
priority: this.calculatePriority(thread)
});
// Schedule flush
this.scheduleFlush();
// Immediate flush for high-priority writes
if (this.shouldFlushImmediately(thread)) {
await this.flush();
}
}
private scheduleFlush(): void {
if (this.flushTimer) return;
this.flushTimer = setTimeout(() => {
this.flush().catch(error =>
logger.error('Batch flush failed:', error)
);
}, this.options.flushInterval);
}
private async flush(): Promise<void> {
if (this.writeQueue.size === 0) return;
// Clear timer
if (this.flushTimer) {
clearTimeout(this.flushTimer);
this.flushTimer = undefined;
}
// Sort by priority and age
const writes = Array.from(this.writeQueue.values())
.sort((a, b) => {
if (a.priority !== b.priority) {
return b.priority - a.priority;
}
return a.queuedAt - b.queuedAt;
});
// Process in batches
for (let i = 0; i < writes.length; i += this.options.batchSize) {
const batch = writes.slice(i, i + this.options.batchSize);
try {
await this.storage.batchWrite(
batch.map(w => w.thread)
);
// Remove from queue
batch.forEach(w => this.writeQueue.delete(w.thread.id));
} catch (error) {
logger.error('Batch write failed:', error);
// Keep in queue for retry
}
}
// Schedule next flush if items remain
if (this.writeQueue.size > 0) {
this.scheduleFlush();
}
}
private calculatePriority(thread: Thread): number {
let priority = 0;
// Active threads get higher priority
if (thread.messages.length > 0) {
const lastMessage = thread.messages[thread.messages.length - 1];
const age = Date.now() - lastMessage.timestamp;
if (age < 60000) priority += 10; // Active in last minute
}
// Shared threads need immediate sync
if (thread.meta?.shared) priority += 5;
// Larger threads are more important to persist
priority += Math.min(thread.messages.length / 10, 5);
return priority;
}
}
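In practice, every message append can route through the batched writer; because the queue is keyed by thread ID, rapid successive updates to the same thread collapse into a single write per flush. A brief sketch, where the `storage` instance is assumed:

// Sketch: coalescing per-message saves through the batched writer
const writer = new BatchedThreadWriter(storage, { flushInterval: 500 });

async function onMessageAppended(thread: Thread): Promise<void> {
  // Only the latest snapshot of the thread survives until the next flush
  await writer.write(thread);
}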
CDN and Edge Computing
Static assets and frequently accessed data benefit from edge distribution:
Asset Optimization
Amp serves static assets through a CDN with aggressive caching:
export class AssetOptimizer {
private assetManifest = new Map<string, AssetEntry>();
async optimizeAssets(buildDir: string): Promise<void> {
const assets = await this.findAssets(buildDir);
for (const asset of assets) {
// Generate content hash
const content = await fs.readFile(asset.path);
const hash = crypto
.createHash('sha256')
.update(content)
.digest('hex')
.substring(0, 8);
// Create versioned filename
const ext = path.extname(asset.path);
const base = path.basename(asset.path, ext);
const hashedName = `${base}.${hash}${ext}`;
// Optimize based on type
const optimized = await this.optimizeAsset(asset, content);
// Write optimized version
const outputPath = path.join(
buildDir,
'cdn',
hashedName
);
await fs.writeFile(outputPath, optimized.content);
// Update manifest
this.assetManifest.set(asset.originalPath, {
cdnPath: `/cdn/${hashedName}`,
size: optimized.content.length,
hash,
headers: this.getCacheHeaders(asset.type)
});
}
// Write manifest for runtime
await this.writeManifest(buildDir);
}
private getCacheHeaders(type: AssetType): Headers {
const headers = new Headers();
// Immutable for versioned assets
headers.set('Cache-Control', 'public, max-age=31536000, immutable');
// Type-specific headers
switch (type) {
case 'javascript':
headers.set('Content-Type', 'application/javascript');
break;
case 'css':
headers.set('Content-Type', 'text/css');
break;
case 'wasm':
headers.set('Content-Type', 'application/wasm');
break;
}
// Enable compression
headers.set('Content-Encoding', 'gzip');
return headers;
}
}
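At runtime, application code resolves asset URLs through the generated manifest rather than hard-coding paths, so every deploy automatically points at the newly hashed files. A small, assumed helper:

// Sketch: looking up the versioned CDN path for a source asset
function resolveAsset(
  manifest: Map<string, AssetEntry>,
  originalPath: string
): string {
  const entry = manifest.get(originalPath);
  // Fall back to the original path if the asset was never optimized
  return entry ? entry.cdnPath : originalPath;
}

// e.g. <script src="${resolveAsset(manifest, 'app/main.js')}"></script>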
Edge Function Patterns
Compute at the edge reduces latency for common operations:
export class EdgeFunctionRouter {
// Deployed to Cloudflare Workers or similar
async handleRequest(request: Request): Promise<Response> {
const url = new URL(request.url);
// Handle different edge-optimized endpoints
switch (url.pathname) {
case '/api/threads/list':
return this.handleThreadList(request);
case '/api/auth/verify':
return this.handleAuthVerification(request);
case '/api/assets/repomap':
return this.handleRepoMapRequest(request);
default:
// Pass through to origin
return fetch(request);
}
}
private async handleThreadList(
request: Request
): Promise<Response> {
const cache = caches.default;
// The edge cache keys on the request URL alone, so fold a hash of the
// caller's token into the key URL to keep responses scoped per user
const token = request.headers.get('Authorization') || '';
const digest = await crypto.subtle.digest(
'SHA-256',
new TextEncoder().encode(token)
);
const tokenHash = [...new Uint8Array(digest)]
.map(b => b.toString(16).padStart(2, '0'))
.join('');
const keyUrl = new URL(request.url);
keyUrl.searchParams.set('u', tokenHash);
const cacheKey = new Request(keyUrl.toString(), { method: 'GET' });
// Check cache
const cached = await cache.match(cacheKey);
if (cached) {
return cached;
}
// Fetch from origin
const response = await fetch(request);
// Cache successful responses
if (response.ok) {
const headers = new Headers(response.headers);
headers.set('Cache-Control', 'private, max-age=60');
const cachedResponse = new Response(response.body, {
status: response.status,
statusText: response.statusText,
headers
});
await cache.put(cacheKey, cachedResponse.clone());
return cachedResponse;
}
return response;
}
private async handleAuthVerification(
request: Request
): Promise<Response> {
const token = request.headers.get('Authorization')?.split(' ')[1];
if (!token) {
return new Response('Unauthorized', { status: 401 });
}
// Verify JWT at edge
try {
const payload = await this.verifyJWT(token);
// Add user info to request headers
const headers = new Headers(request.headers);
headers.set('X-User-Id', payload.sub);
headers.set('X-User-Email', payload.email);
// Forward to origin with verified user
return fetch(request, { headers });
} catch (error) {
return new Response('Invalid token', { status: 401 });
}
}
}
Global Thread Sync
Edge presence enables efficient global synchronization:
export class GlobalSyncCoordinator {
private regions = ['us-east', 'eu-west', 'ap-south'];
async syncThread(
thread: Thread,
originRegion: string
): Promise<void> {
// Write to origin region first
await this.writeToRegion(thread, originRegion);
// Fan out to other regions asynchronously
const otherRegions = this.regions.filter(r => r !== originRegion);
await Promise.all(
otherRegions.map(region =>
this.replicateToRegion(thread, region)
.catch(error => {
logger.error(`Failed to replicate to ${region}:`, error);
// Queue for retry
this.queueReplication(thread.id, region);
})
)
);
}
private async writeToRegion(
thread: Thread,
region: string
): Promise<void> {
const endpoint = this.getRegionalEndpoint(region);
const response = await fetch(`${endpoint}/api/threads/${thread.id}`, {
method: 'PUT',
headers: {
'Content-Type': 'application/json',
'X-Sync-Version': thread.v.toString(),
'X-Origin-Region': region
},
body: JSON.stringify(thread)
});
if (!response.ok) {
throw new Error(`Regional write failed: ${response.status}`);
}
}
async readThread(
threadId: ThreadID,
userRegion: string
): Promise<Thread | null> {
// Try local region first
const localThread = await this.readFromRegion(threadId, userRegion);
if (localThread) {
return localThread;
}
// Fall back to other regions
for (const region of this.regions) {
if (region === userRegion) continue;
try {
const thread = await this.readFromRegion(threadId, region);
if (thread) {
// Replicate to user's region for next time
this.replicateToRegion(thread, userRegion)
.catch(() => {}); // Best effort
return thread;
}
} catch {
continue;
}
}
return null;
}
}
Load Balancing Patterns
Distributing load across multiple servers requires intelligent routing:
Session Affinity
AI conversations benefit from session affinity to maximize cache hits:
export class SessionAwareLoadBalancer {
private servers: ServerPool[] = [];
private sessionMap = new Map<string, string>();
async routeRequest(
request: Request,
sessionId: string
): Promise<Response> {
// Check for existing session affinity
let targetServer = this.sessionMap.get(sessionId);
if (!targetServer || !this.isServerHealthy(targetServer)) {
// Select new server based on load
targetServer = await this.selectServer(request);
this.sessionMap.set(sessionId, targetServer);
}
// Route to selected server
return this.forwardRequest(request, targetServer);
}
private async selectServer(
request: Request
): Promise<string> {
const healthyServers = this.servers.filter(s => s.healthy);
if (healthyServers.length === 0) {
throw new Error('No healthy servers available');
}
// Consider multiple factors
const scores = await Promise.all(
healthyServers.map(async server => ({
server,
score: await this.calculateServerScore(server, request)
}))
);
// Select server with best score
scores.sort((a, b) => b.score - a.score);
return scores[0].server.id;
}
private async calculateServerScore(
server: ServerPool,
request: Request
): Promise<number> {
let score = 100;
// Current load (lower is better)
score -= server.currentConnections / server.maxConnections * 50;
// CPU usage
score -= server.cpuUsage * 30;
// Memory availability
score -= (1 - server.memoryAvailable / server.memoryTotal) * 20;
// Geographic proximity (if available)
const clientRegion = request.headers.get('CF-IPCountry');
if (clientRegion && server.region === clientRegion) {
score += 10;
}
// Specialized capabilities
if (request.url.includes('/api/code-analysis') && server.hasGPU) {
score += 15;
}
return Math.max(0, score);
}
}
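The affinity key has to come from somewhere; one plausible approach is a session cookie set at login, with a fresh ID generated for anonymous requests. A sketch in which the cookie name is an assumption:

// Sketch: deriving the affinity key from a session cookie
const balancer = new SessionAwareLoadBalancer();

async function handleIncoming(request: Request): Promise<Response> {
  const cookies = request.headers.get('Cookie') || '';
  // 'amp_session' is a hypothetical cookie name
  const sessionId = cookies.match(/amp_session=([^;]+)/)?.[1] ?? crypto.randomUUID();
  return balancer.routeRequest(request, sessionId);
}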
Queue Management
Graceful degradation under load prevents system collapse:
export class AdaptiveQueueManager {
private queues = new Map<Priority, Queue<QueueItem>>();
private processing = new Map<string, ProcessingTask>();
constructor(
private options: QueueOptions = {}
) {
this.options = {
maxConcurrent: 100,
maxQueueSize: 1000,
timeoutMs: 30000,
...options
};
// Initialize priority queues
for (const priority of ['critical', 'high', 'normal', 'low']) {
this.queues.set(priority as Priority, new Queue());
}
}
async enqueue(
task: Task,
priority: Priority = 'normal'
): Promise<TaskResult> {
// Check queue capacity
const queue = this.queues.get(priority)!;
if (queue.size >= this.options.maxQueueSize) {
// Shed load for low priority tasks
if (priority === 'low') {
throw new Error('System overloaded, please retry later');
}
// Bump up priority for important tasks
if (priority === 'normal') {
return this.enqueue(task, 'high');
}
}
// Add to queue
const promise = new Promise<TaskResult>((resolve, reject) => {
queue.enqueue({
task,
resolve,
reject,
enqueuedAt: Date.now()
});
});
// Process queue
this.processQueues();
return promise;
}
private async processQueues(): Promise<void> {
if (this.processing.size >= this.options.maxConcurrent) {
return; // At capacity
}
// Process in priority order
for (const [priority, queue] of this.queues) {
while (
queue.size > 0 &&
this.processing.size < this.options.maxConcurrent
) {
const item = queue.dequeue()!;
// Check for timeout
const waitTime = Date.now() - item.enqueuedAt;
if (waitTime > this.options.timeoutMs) {
item.reject(new Error('Task timeout in queue'));
continue;
}
// Process task
this.processTask(item);
}
}
}
private async processTask(item: QueueItem): Promise<void> {
const taskId = crypto.randomUUID();
this.processing.set(taskId, {
item,
startedAt: Date.now()
});
try {
const result = await item.task.execute();
item.resolve(result);
} catch (error) {
item.reject(error);
} finally {
this.processing.delete(taskId);
// Process more tasks
this.processQueues();
}
}
}
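Callers wrap their work in a `Task` and pick a priority; interactive requests typically run at `high` so they jump ahead of background indexing. A minimal sketch, where `modelClient` is a hypothetical provider client:

// Sketch: submitting an inference request through the adaptive queue
const queue = new AdaptiveQueueManager({ maxConcurrent: 50 });

async function runInference(messages: Message[]): Promise<TaskResult> {
  const task: Task = {
    execute: () => modelClient.complete(messages) // hypothetical provider client
  };
  return queue.enqueue(task, 'high');
}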
Resource Pooling
Expensive resources like database connections benefit from pooling:
export class ResourcePool<T> {
private available: T[] = [];
private inUse = new Map<T, PooledResource<T>>();
private waiting: ((resource: T) => void)[] = [];
constructor(
private factory: ResourceFactory<T>,
private options: PoolOptions = {}
) {
this.options = {
min: 5,
max: 20,
idleTimeoutMs: 300000,
createTimeoutMs: 5000,
...options
};
// Pre-create minimum resources
this.ensureMinimum();
}
async acquire(): Promise<PooledResource<T>> {
// Return available resource
while (this.available.length > 0) {
const resource = this.available.pop()!;
// Validate resource is still good
if (await this.factory.validate(resource)) {
const pooled = this.wrapResource(resource);
this.inUse.set(resource, pooled);
return pooled;
} else {
// Destroy invalid resource
await this.factory.destroy(resource);
}
}
// Create new resource if under max
if (this.inUse.size < this.options.max) {
const resource = await this.createResource();
const pooled = this.wrapResource(resource);
this.inUse.set(resource, pooled);
return pooled;
}
// Wait for available resource
return new Promise((resolve) => {
this.waiting.push((resource) => {
const pooled = this.wrapResource(resource);
this.inUse.set(resource, pooled);
resolve(pooled);
});
});
}
private wrapResource(resource: T): PooledResource<T> {
const pooled = {
resource,
acquiredAt: Date.now(),
release: async () => {
this.inUse.delete(resource);
// Give to waiting request
if (this.waiting.length > 0) {
const waiter = this.waiting.shift()!;
waiter(resource);
return;
}
// Return to available pool
this.available.push(resource);
// Schedule idle check
setTimeout(() => {
this.checkIdle();
}, this.options.idleTimeoutMs);
}
};
return pooled;
}
private async checkIdle(): Promise<void> {
while (
this.available.length > this.options.min &&
this.waiting.length === 0
) {
const resource = this.available.pop()!;
await this.factory.destroy(resource);
}
}
}
import { Client } from 'pg';

// Example: Database connection pool
const dbPool = new ResourcePool({
async create() {
const conn = new Client({
host: 'localhost',
database: 'amp',
// Connection options
});
await conn.connect();
return conn;
},
async validate(conn) {
try {
await conn.query('SELECT 1');
return true;
} catch {
return false;
}
},
async destroy(conn) {
await conn.end();
}
});
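The discipline that makes pooling work is releasing the resource on every code path, including errors. A small wrapper, sketched under the same assumptions as the pool above, keeps that contract in one place:

// Sketch: acquire/release around a unit of work, even when it throws
async function withConnection<T>(
  work: (conn: Client) => Promise<T>
): Promise<T> {
  const pooled = await dbPool.acquire();
  try {
    return await work(pooled.resource);
  } finally {
    await pooled.release();
  }
}

// const rows = await withConnection(conn => conn.query('SELECT id FROM conversations LIMIT 10'));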
Real-World Performance Gains
These optimization strategies compound to deliver significant performance improvements:
Latency Reduction
Before optimization:
- Conversation load: 800ms (database query + message fetch)
- Model response: 3-5 seconds
- File operations: 50-200ms per file
- Total interaction: 5-10 seconds
After optimization:
- Conversation load: 50ms (memory cache hit)
- Model response: 100ms (cached) or 2-3s (cache miss)
- File operations: 5-10ms (cached)
- Total interaction: 200ms - 3 seconds
Throughput Improvements
Single server capacity:
- Before: 10-20 concurrent users
- After: 500-1000 concurrent users
With load balancing:
- 10 servers: 5,000-10,000 concurrent users
- Horizontal scaling: Linear growth with server count
Resource Efficiency
Model usage optimization:
- 40% reduction through response caching
- 60% reduction in duplicate file reads
- 80% reduction in repository analysis
Infrastructure optimization:
- 70% reduction in database operations
- 50% reduction in bandwidth (CDN caching)
- 30% reduction in compute (edge functions)
Monitoring and Optimization
Performance requires continuous monitoring and adjustment:
export class PerformanceMonitor {
private metrics = new Map<string, MetricCollector>();
constructor(
private reporter: MetricReporter
) {
// Core metrics
this.registerMetric('thread.load.time');
this.registerMetric('llm.response.time');
this.registerMetric('cache.hit.rate');
this.registerMetric('queue.depth');
this.registerMetric('concurrent.users');
}
async trackOperation<T>(
name: string,
operation: () => Promise<T>
): Promise<T> {
const start = performance.now();
try {
const result = await operation();
this.recordMetric(name, {
duration: performance.now() - start,
success: true
});
return result;
} catch (error) {
this.recordMetric(name, {
duration: performance.now() - start,
success: false,
error: error.message
});
throw error;
}
}
private recordMetric(
name: string,
data: MetricData
): void {
const collector = this.metrics.get(name);
if (!collector) return;
collector.record(data);
// Check for anomalies
if (this.isAnomalous(name, data)) {
this.handleAnomaly(name, data);
}
}
private isAnomalous(
name: string,
data: MetricData
): boolean {
const collector = this.metrics.get(name)!;
const stats = collector.getStats();
// Detect significant deviations
if (data.duration) {
const deviation = Math.abs(data.duration - stats.mean) / stats.stdDev;
return deviation > 3; // 3 sigma rule
}
return false;
}
}
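Wiring the monitor into hot paths is mostly a matter of wrapping calls the system already makes. A sketch, assuming a `reporter` implementation and an existing `conversationDb` instance:

// Sketch: instrumenting thread loads with the performance monitor
const monitor = new PerformanceMonitor(reporter);

async function loadThread(id: ThreadID): Promise<Thread | null> {
  return monitor.trackOperation('thread.load.time', () =>
    conversationDb.getThread(id)
  );
}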
Summary
Performance at scale requires a multi-layered approach combining caching, database optimization, edge computing, and intelligent load balancing. Effective AI coding assistant architectures demonstrate how these patterns work together:
- Aggressive caching reduces redundant work at every layer
- Database optimization handles millions of conversations efficiently
- Edge distribution brings compute closer to users
- Load balancing maintains quality of service under pressure
- Resource pooling maximizes hardware utilization
- Queue management provides graceful degradation
The key insight is that AI coding assistants have unique performance characteristics—long-running operations, large context windows, and complex tool interactions—that require specialized optimization strategies. By building these patterns into the architecture from the start, systems can scale from proof-of-concept to production without major rewrites.
These performance patterns form the foundation for building AI coding assistants that can serve thousands of developers concurrently while maintaining the responsiveness that makes them useful in real development workflows.
In the next chapter, we'll explore observability and monitoring strategies for understanding and optimizing these complex systems in production.