mirror of https://github.com/go-micro/go-micro.git
Registry Cache
Cache is a library that provides a caching layer for the go-micro registry.
If you're looking for caching in your microservices, use the selector instead.
Features
- Caching: Caches registry lookups with configurable TTL
- Stale Cache Fallback: Returns stale cached data when registry is unavailable
- Singleflight Protection: Deduplicates concurrent requests for the same service
- Adaptive Throttling: Rate limits failed lookups to prevent cache penetration (new in v5)
Interface
```go
// Cache is the registry cache interface
type Cache interface {
	// embed the registry interface
	registry.Registry
	// stop the cache watcher
	Stop()
}
```
Usage
Basic Usage
```go
import (
	"github.com/micro/go-micro/registry"
	"github.com/micro/go-micro/registry/cache"
)

r := registry.NewRegistry()
cache := cache.New(r)

services, _ := cache.GetService("my.service")
```
Advanced Configuration
```go
import (
	"time"

	"github.com/micro/go-micro/registry"
	"github.com/micro/go-micro/registry/cache"
)

r := registry.NewRegistry()

// Configure cache with custom options
cache := cache.New(r,
	cache.WithTTL(2*time.Minute),                   // Cache TTL
	cache.WithMinimumRetryInterval(10*time.Second), // Throttle refresh attempts
)

services, _ := cache.GetService("my.service")
```
Adaptive Throttling
The cache implements rate limiting on ALL cache refresh attempts (not just errors) to prevent overwhelming the registry. This protects against multiple scenarios:
- Registry failures: When etcd is down/overloaded
- Rolling deployments: When all caches expire simultaneously under high QPS
- Cache expiration storms: When many services expire at once
How It Works
- Rate limiting: Refresh attempts are throttled per-service using MinimumRetryInterval (default 5s)
- Stale cache preference: If a stale cache entry exists (even if expired), it is returned instead of calling the registry
- No cache fallback: If no cache exists, ErrNotFound is returned and the caller relies on gRPC retry
- Singleflight deduplication: Concurrent requests for the same service are still deduplicated
- Recovery: Throttling is reset on a successful registry lookup
Example Scenarios
Scenario 1: Registry Failure with Stale Cache
```go
cache := cache.New(etcdRegistry, cache.WithMinimumRetryInterval(10*time.Second))

// Initial lookup populates cache
services, _ := cache.GetService("api") // → Calls etcd, caches result

// Cache expires after TTL
time.Sleep(2 * time.Minute)

// Etcd fails, but we have stale cache
services, err := cache.GetService("api") // → Returns stale cache WITHOUT calling etcd
// err == nil, services contains stale data
```
Scenario 2: Rolling Deployment Cache Storm
```go
// Scenario: all 1000 upstream pods watch the downstream service.
// The downstream does a rolling deployment - last pod updated.
// All 1000 upstream caches expire simultaneously.
// High QPS hits the system at this moment.

// First request after cache expiration
services, _ := cache.GetService("downstream") // → Calls etcd, updates lastRefreshAttempt

// Next 999 requests arrive within MinimumRetryInterval
services, _ = cache.GetService("downstream") // → Returns stale cache, NO etcd call

// Rate limiting prevents a stampede of 999 requests to etcd
```
Scenario 3: No Cache Available
```go
// First lookup when etcd is down (no cache exists yet)
_, err := cache.GetService("new-service") // → Calls etcd, fails, records attempt time
// err != nil

// Immediate retry (< 10s later, still no cache)
_, err = cache.GetService("new-service") // → Throttled, returns ErrNotFound immediately
// err == ErrNotFound

// After MinimumRetryInterval
time.Sleep(10 * time.Second)
_, err = cache.GetService("new-service") // → Allowed to retry, calls etcd again
```
This prevents cache penetration scenarios where thousands of concurrent requests hammer a failing or overloaded registry.
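The other half of the stampede protection, singleflight deduplication, can be sketched with plain sync primitives. This is hypothetical toy code under the assumptions above, not the actual implementation (real code would typically use the golang.org/x/sync/singleflight package); `group`, `do`, and `dedupe` are names invented for the sketch.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// group is a toy singleflight: concurrent callers for the same key share
// one in-flight lookup instead of each hitting the registry.
type group struct {
	mu    sync.Mutex
	calls map[string]*call
}

type call struct {
	done chan struct{}
	val  []string
}

func (g *group) do(key string, fn func() []string) []string {
	g.mu.Lock()
	if c, ok := g.calls[key]; ok {
		// A lookup for this key is already in flight: wait and share its result.
		g.mu.Unlock()
		<-c.done
		return c.val
	}
	c := &call{done: make(chan struct{})}
	g.calls[key] = c
	g.mu.Unlock()

	c.val = fn()
	close(c.done) // publish the result to all waiters

	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	return c.val
}

// dedupe fires n concurrent lookups and returns how many reached the "registry".
func dedupe(n int) int32 {
	g := group{calls: map[string]*call{}}
	var registryCalls atomic.Int32
	lookup := func() []string {
		registryCalls.Add(1)
		time.Sleep(50 * time.Millisecond) // slow registry, so callers pile up
		return []string{"10.0.0.1:8080"}
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.do("api", lookup)
		}()
	}
	wg.Wait()
	return registryCalls.Load()
}

func main() {
	fmt.Println("registry calls for 100 concurrent requests:", dedupe(100))
}
```

Combined with rate limiting, this means a burst of identical lookups costs at most one registry round trip.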