Skip to content

[TT-15379] Introduce certificate expiry monitor #7271

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 37 commits into
base: master
Choose a base branch
from

Conversation

edsonmichaque
Copy link
Contributor

@edsonmichaque edsonmichaque commented Aug 4, 2025

User description

TT-15379
Summary GW Certificate Expiring Soon Notification Event
Type Story Story
Status In Dev
Points N/A
Labels -

Description

This PR implements certificate expiry monitoring functionality for Tyk Gateway. The feature monitors TLS certificates for expiration and sends webhook notifications when certificates are approaching their expiry date.

Key Changes:

  1. Certificate Check Middleware Implementation:

    • Enhanced gateway/mw_certificate_check.go with certificate expiry monitoring logic
    • Implemented cooldown mechanisms using Redis storage
    • Added concurrent certificate processing with goroutines
    • Integrated with Tyk's event system for webhook notifications
  2. Configuration System:

    • Added CertificateExpiryMonitorConfig struct to config/config.go
    • Configurable warning threshold (default: 30 days)
    • Configurable check cooldown (default: 3600 seconds)
    • Configurable event cooldown (default: 86400 seconds)
  3. Comprehensive Test Suite:

    • Added gateway/mw_certificate_check_test.go - Unit tests
    • Added gateway/mw_certificate_check_integration_test.go - Integration tests
    • Added gateway/mw_certificate_check_benchmark_test.go - Performance benchmarks
  4. Test Data & Configuration:

    • Added 4 new test data files in config/testdata/
    • Added configuration validation tests in config/config_test.go
  5. Webhook Template:

    • Added templates/certificate_check_webhook.json for certificate expiry event formatting

Related Issue

https://tyktech.atlassian.net/browse/TT-15379

Motivation and Context

How This Has Been Tested

Unit Testing:

  • Comprehensive unit tests covering certificate validation, cooldown mechanisms, and event firing
  • Test coverage for various certificate expiry scenarios
  • Validation of configuration parsing and defaults

Integration Testing:

  • End-to-end testing of certificate expiry monitoring workflow
  • Redis integration testing for cooldown persistence
  • Event system integration testing

Performance Testing:

  • Benchmark tests for certificate processing performance
  • Concurrent certificate handling validation
  • Memory usage and CPU performance metrics

Configuration Testing:

  • Test data files covering various configuration scenarios
  • Environment variable override testing
  • Default configuration validation

Benchmarks

BenchmarkCerstructtificateCheckMW_ProcessRequest

With expiry certificate monitor disabled

Scenario N (iterations) ns/op B/op allocs/op
NoMutualTLS 155 513 281 7.616 0 0
ValidCertificate 848 334 1 346 280 7
ExpiringCertificate 888 495 1 337 280 7
MultipleCertificates 874 704 1 350 280 7

With expiry certificate monitor enabled

Scenario N (iterations) ns/op B/op allocs/op
NoMutualTLS 161 764 090 7.270 0 0
ValidCertificate 9 289 114 495 1 138 28
ExpiringCertificate 10 191 105 234 1 138 28
MultipleCertificates 4 196 307 409 3 627 85

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Refactoring or add test (improvements in base code or adds test coverage to functionality)

Checklist

  • I ensured that the documentation is up to date
  • I explained why this PR updates go.mod in detail with reasoning why it's required
  • I would like a code coverage CI quality gate exception and have explained why

PR Type

Enhancement, Tests


Description

  • Introduce certificate expiry monitoring and notification system

  • Add configuration for expiry thresholds and cooldowns

  • Implement Redis-backed cooldown for event suppression

  • Provide comprehensive unit, integration, and benchmark tests


Diagram Walkthrough

flowchart LR
  config["Config: Add certificate_expiry_monitor options"]
  mw["Middleware: Certificate expiry check & event logic"]
  redis["Redis: Cooldown state storage"]
  event["Event: CertificateExpiringSoon"]
  tests["Tests: Unit, integration, benchmark"]
  schema["Schema: Add event to OpenAPI/streams"]
  template["Template: Webhook JSON for expiry event"]

  config -- "configures" --> mw
  mw -- "uses" --> redis
  mw -- "fires" --> event
  event -- "documented in" --> schema
  mw -- "covered by" --> tests
  event -- "renders" --> template
Loading

File Walkthrough

Relevant files

@buger
Copy link
Member

buger commented Aug 4, 2025

Let's make that PR title a 💯 shall we? 💪

Your PR title and story title look slightly different. Just checking in to know if it was intentional!

Story Title GW Certificate Expiring Soon Notification Event
PR Title [TT-15379] Introduce certificate expiry monitor

Check out this guide to learn more about PR best-practices.

@buger
Copy link
Member

buger commented Aug 4, 2025

This PR is too huge for one to review 💔

Additions 2098 🙅‍♀️
Expected ⬇️ 800

Consider breaking it down into multiple small PRs.

Check out this guide to learn more about PR best-practices.

Copy link
Contributor

github-actions bot commented Aug 4, 2025

PR Reviewer Guide 🔍

(Review updated until commit 6cb53ec)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 5 🔵🔵🔵🔵🔵
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Race Condition

The cooldown logic for certificate checks and event firing uses a check-then-set pattern with Redis, which is not atomic. This can lead to race conditions where multiple goroutines may bypass the intended cooldown, especially under high concurrency. Consider reviewing the implementation and possibly extending the storage handler to support atomic operations.

func (m *CertificateCheckMW) shouldCooldown(monitorConfig config.CertificateExpiryMonitorConfig, certID string) bool {
	// NOTE: This method currently has a race condition where multiple goroutines checking the same certificate
	// could both proceed with their operations. The current storage.Handler interface doesn't provide atomic
	// operations, so we use a check-then-set pattern. In the future, consider extending the Handler interface
	// to support atomic operations (e.g., SET key value NX EX seconds) to eliminate the race condition and
	// improve performance. This would replace the current check-then-set pattern with a single atomic operation.

	checkCooldownSeconds := monitorConfig.CheckCooldownSeconds

	if checkCooldownSeconds == 0 {
		checkCooldownSeconds = config.DefaultCheckCooldownSeconds
	}

	checkCooldownKey := fmt.Sprintf("%s%s", certCheckCooldownPrefix, certID)

	// Use Redis for cooldowns
	if m.store == nil {
		// Store not initialized, skip cooldown check
		return false
	}

	_, err := m.store.GetKey(checkCooldownKey)
	if err == nil {
		log.Debugf("Certificate expiry monitor: Check cooldown active for certificate ID: %s... (cooldown: %ds)", certID[:8], checkCooldownSeconds)
		return true // Skip check due to cooldown
	}
	// Set check cooldown atomically (protected by lock)
	if err := m.store.SetKey(checkCooldownKey, "1", int64(checkCooldownSeconds)); err != nil {
		log.Warningf("Certificate expiry monitor: Failed to set check cooldown for certificate ID: %s... - %v", certID[:8], err)
	}

	log.Debugf("Certificate expiry monitor: Check cooldown set for certificate ID: %s... (cooldown: %ds)", certID[:8], checkCooldownSeconds)

	return false // Don't skip check
}

// shouldFireExpiryEvent checks if an event should be fired based on cooldown
func (m *CertificateCheckMW) shouldFireExpiryEvent(certID string, monitorConfig config.CertificateExpiryMonitorConfig) bool {
	// NOTE: This method currently has a race condition where multiple goroutines could both fire events
	// for the same certificate. In the future, consider using Redis atomic operations
	// (e.g., SET key value NX EX seconds) to eliminate the race condition and prevent duplicate events.
Synchronous Certificate Expiry Checks

Certificate expiration checks and Redis operations are performed synchronously in the request path, which may impact request latency, especially with many certificates or slow Redis. Consider reviewing the impact on performance and whether this should be made asynchronous.

	// Consider making this asynchronous in the future to improve request response times, especially when
	// processing large numbers of certificates or when Redis operations are slow.
	m.checkCertificatesExpiration(apiCerts)
}
Hot Reload Consistency

The warning threshold for certificate expiry is read dynamically, but cooldowns in Redis are not invalidated when configuration changes. This could lead to inconsistent behavior after a hot reload. Review if this is acceptable or if cooldown invalidation is needed.

func (m *CertificateCheckMW) isCertificateExpiringSoon(hoursUntilExpiry int) bool {
	// NOTE: Configuration is retrieved fresh on each call, which allows hot reload to take effect.
	// However, existing cooldowns in Redis are not invalidated when configuration changes.
	// This means that if WarningThresholdDays is updated via hot reload, certificates with active
	// cooldowns will continue to use the old threshold until their cooldown expires.
	// Consider implementing cooldown invalidation when configuration changes to ensure consistent behavior.
	warningThresholdDays := m.Gw.GetConfig().Security.CertificateExpiryMonitor.WarningThresholdDays

Copy link
Contributor

github-actions bot commented Aug 4, 2025

PR Code Suggestions ✨

Latest suggestions up to 6cb53ec
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Make cooldown check atomic to prevent races

The check-then-set pattern for cooldowns is not atomic and can result in race
conditions, allowing multiple checks for the same certificate. If possible, use an
atomic set-if-not-exists (SETNX) operation in the storage handler to ensure only one
check proceeds per cooldown period.

gateway/mw_certificate_check.go [227-261]

 func (m *CertificateCheckMW) shouldCooldown(monitorConfig config.CertificateExpiryMonitorConfig, certID string) bool {
-	// NOTE: This method currently has a race condition where multiple goroutines checking the same certificate
-	// could both proceed with their operations. The current storage.Handler interface doesn't provide atomic
-	// operations, so we use a check-then-set pattern. In the future, consider extending the Handler interface
-	// to support atomic operations (e.g., SET key value NX EX seconds) to eliminate the race condition and
-	// improve performance. This would replace the current check-then-set pattern with a single atomic operation.
-
 	checkCooldownSeconds := monitorConfig.CheckCooldownSeconds
 
 	if checkCooldownSeconds == 0 {
 		checkCooldownSeconds = config.DefaultCheckCooldownSeconds
 	}
 
 	checkCooldownKey := fmt.Sprintf("%s%s", certCheckCooldownPrefix, certID)
 
-	// Use Redis for cooldowns
 	if m.store == nil {
-		// Store not initialized, skip cooldown check
 		return false
 	}
 
-	_, err := m.store.GetKey(checkCooldownKey)
-	if err == nil {
+	// Try to set the key only if it does not exist (atomic operation)
+	set, err := m.store.SetKeyIfNotExists(checkCooldownKey, "1", int64(checkCooldownSeconds))
+	if err != nil {
+		log.Warningf("Certificate expiry monitor: Failed to set check cooldown for certificate ID: %s... - %v", certID[:8], err)
+		return false
+	}
+	if !set {
 		log.Debugf("Certificate expiry monitor: Check cooldown active for certificate ID: %s... (cooldown: %ds)", certID[:8], checkCooldownSeconds)
-		return true // Skip check due to cooldown
-	}
-	// Set check cooldown atomically (protected by lock)
-	if err := m.store.SetKey(checkCooldownKey, "1", int64(checkCooldownSeconds)); err != nil {
-		log.Warningf("Certificate expiry monitor: Failed to set check cooldown for certificate ID: %s... - %v", certID[:8], err)
+		return true
 	}
 
 	log.Debugf("Certificate expiry monitor: Check cooldown set for certificate ID: %s... (cooldown: %ds)", certID[:8], checkCooldownSeconds)
-
-	return false // Don't skip check
+	return false
 }
Suggestion importance[1-10]: 8

__

Why: This suggestion addresses a real concurrency bug: the current check-then-set pattern is not atomic and can lead to race conditions, allowing multiple checks for the same certificate. Making the operation atomic is important for correctness in concurrent environments, and the improved code accurately reflects the required change.

Medium
Make event cooldown atomic to avoid duplicates

The event cooldown logic is not atomic and may allow duplicate events in concurrent
scenarios. Use an atomic set-if-not-exists (SETNX) operation to ensure only one
event fires per cooldown window.

gateway/mw_certificate_check.go [264-298]

 func (m *CertificateCheckMW) shouldFireExpiryEvent(certID string, monitorConfig config.CertificateExpiryMonitorConfig) bool {
-	// NOTE: This method currently has a race condition where multiple goroutines could both fire events
-	// for the same certificate. In the future, consider using Redis atomic operations
-	// (e.g., SET key value NX EX seconds) to eliminate the race condition and prevent duplicate events.
-	// This would replace the current check-then-set pattern with a single atomic operation.
-
 	eventCooldownSeconds := monitorConfig.EventCooldownSeconds
 
 	if eventCooldownSeconds == 0 {
 		eventCooldownSeconds = config.DefaultEventCooldownSeconds
 	}
 
-	// Use Redis for cooldowns
 	if m.store == nil {
-		// Store not initialized, allow event to fire
 		return true
 	}
 
 	cooldownKey := fmt.Sprintf("%s%s", certExpiryCooldownPrefix, certID)
 
-	_, err := m.store.GetKey(cooldownKey)
-	if err == nil {
+	// Try to set the key only if it does not exist (atomic operation)
+	set, err := m.store.SetKeyIfNotExists(cooldownKey, "1", int64(eventCooldownSeconds))
+	if err != nil {
+		log.Warningf("Certificate expiry monitor: Failed to set event cooldown for certificate ID: %s... - %v", certID[:8], err)
+		return true
+	}
+	if !set {
 		log.Debugf("Certificate expiry monitor: Event cooldown active for certificate ID: %s... (cooldown: %ds)", certID[:8], eventCooldownSeconds)
 		return false
 	}
 
-	// Set cooldown atomically (protected by lock)
-	if err := m.store.SetKey(cooldownKey, "1", int64(eventCooldownSeconds)); err != nil {
-		log.Warningf("Certificate expiry monitor: Failed to set event cooldown for certificate ID: %s... - %v", certID[:8], err)
-	}
-
 	log.Debugf("Certificate expiry monitor: Event cooldown set for certificate ID: %s... (cooldown: %ds)", certID[:8], eventCooldownSeconds)
-
 	return true
 }
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies and fixes a concurrency issue where duplicate events could be fired due to non-atomic cooldown logic. Using an atomic set-if-not-exists operation is a significant improvement for correctness in concurrent scenarios, and the improved code matches the intent.

Medium
Limit concurrent certificate check goroutines

Spawning a goroutine for each certificate without limiting concurrency can exhaust
system resources or cause performance issues if the certificate list is large.
Implement a worker pool or limit the number of concurrent goroutines to prevent
resource exhaustion.

gateway/mw_certificate_check.go [81-116]

 func (m *CertificateCheckMW) checkCertificatesExpiration(certificates []*tls.Certificate) {
 	if m.Gw == nil {
 		log.Warning("Certificate expiry monitor: Gateway not initialized, skipping certificate checks")
 		return
 	}
 
 	monitorConfig := m.Gw.GetConfig().Security.CertificateExpiryMonitor
 	now := time.Now()
 
 	log.Debug("[CertificateCheckMW] Starting expiration check for ", len(certificates), " certificates with warning threshold of ", monitorConfig.WarningThresholdDays, " days")
 
-	// NOTE: In the future, consider using a worker pool pattern instead of spawning a goroutine per certificate.
-	// This would help limit resource usage with large numbers of certificates by:
-	// 1. Using a fixed number of worker goroutines
-	// 2. Processing certificates through a buffered channel
-	// 3. Providing better backpressure handling
-	// 4. Allowing for more controlled error handling and retries
+	const maxWorkers = 10
+	certChan := make(chan *tls.Certificate)
 	var wg sync.WaitGroup
 
-	for i, cert := range certificates {
-		if cert == nil {
-			continue
-		}
-
+	// Start worker pool
+	for w := 0; w < maxWorkers; w++ {
 		wg.Add(1)
-
-		go func(cert *tls.Certificate, idx int) {
+		go func() {
 			defer wg.Done()
-			log.Debugf("[CertificateCheckMW] Checking certificate %d/%d", idx+1, len(certificates))
-			m.checkCertificate(cert, monitorConfig, now)
-		}(cert, i)
+			for cert := range certChan {
+				if cert == nil {
+					continue
+				}
+				m.checkCertificate(cert, monitorConfig, now)
+			}
+		}()
 	}
 
-	// Wait for all goroutines to complete
+	// Send certificates to workers
+	for _, cert := range certificates {
+		certChan <- cert
+	}
+	close(certChan)
+
 	wg.Wait()
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion to use a worker pool instead of spawning a goroutine per certificate is valid and improves resource management, especially for large certificate lists. While not critical for correctness, it enhances scalability and prevents potential resource exhaustion, making it a solid maintainability and performance improvement.

Medium

Previous suggestions

Suggestions
CategorySuggestion                                                                                                                                    Impact
Possible issue
Use safe cache key for certificate ID

Using string(cert.Leaf.Raw) as a cache key may cause collisions or unexpected
behavior, since certificate raw bytes may not be valid UTF-8 and string conversion
is not guaranteed to be unique or safe. Use a hash (e.g., SHA-1) of the raw bytes as
the cache key instead to ensure uniqueness and avoid encoding issues.

gateway/mw_certificate_check.go [211-235]

 func (m *CertificateCheckMW) generateCertificateID(cert *tls.Certificate) string {
 	if cert == nil || cert.Leaf == nil || len(cert.Leaf.Raw) == 0 {
 		return ""
 	}
 
-	// Create a cache key from the certificate's raw data
-	cacheKey := string(cert.Leaf.Raw)
+	// Use SHA-1 hash of the raw bytes as the cache key
+	hash := sha1.Sum(cert.Leaf.Raw)
+	cacheKey := hex.EncodeToString(hash[:])
 
 	// Check if we already have this certificate ID cached
 	if cachedID, ok := m.certIDCache.Load(cacheKey); ok {
 		if cachedString, ok := cachedID.(string); ok {
 			return cachedString
 		}
 		// If type assertion fails, fall through to regenerate
 	}
 
-	// Generate new hash if not cached
-	hash := sha1.Sum(cert.Leaf.Raw)
-	certID := hex.EncodeToString(hash[:])
+	certID := cacheKey
 
 	// Cache the result
 	m.certIDCache.Store(cacheKey, certID)
 
 	return certID
 }
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a subtle but important bug: using string(cert.Leaf.Raw) as a cache key is unsafe because raw certificate bytes may not be valid UTF-8 and can lead to collisions or panics. Switching to a hash of the raw bytes ensures uniqueness and safety, which is critical for correct cache behavior and thread safety.

Medium

Copy link
Contributor

github-actions bot commented Aug 4, 2025

API Changes

--- prev.txt	2025-08-06 09:54:03.641034656 +0000
+++ current.txt	2025-08-06 09:53:54.366000721 +0000
@@ -5524,6 +5524,18 @@
 CONSTANTS
 
 const (
+	// DefaultWarningThresholdDays is the default number of days before certificate expiration to start sending notifications
+	DefaultWarningThresholdDays = 30
+
+	// DefaultCheckCooldownSeconds is the default minimum time in seconds between certificate expiration checks
+	DefaultCheckCooldownSeconds = 3600 // 1 hour
+
+	// DefaultEventCooldownSeconds is the default minimum time in seconds between firing the same certificate expiration event
+	DefaultEventCooldownSeconds = 86400 // 24 hours
+)
+    Certificate monitor constants
+
+const (
 	PickFirstStrategy IPsHandleStrategy = "pick_first"
 	RandomStrategy    IPsHandleStrategy = "random"
 	NoCacheStrategy   IPsHandleStrategy = "no_cache"
@@ -5573,6 +5585,13 @@
 			AllowUnsafe: []string{},
 		},
 		PIDFileLocation: "/var/run/tyk/tyk-gateway.pid",
+		Security: SecurityConfig{
+			CertificateExpiryMonitor: CertificateExpiryMonitorConfig{
+				WarningThresholdDays: DefaultWarningThresholdDays,
+				CheckCooldownSeconds: DefaultCheckCooldownSeconds,
+				EventCooldownSeconds: DefaultEventCooldownSeconds,
+			},
+		},
 	}
 )
 var Global func() Config
@@ -5695,6 +5714,22 @@
 	KeyFile string `json:"key_file"`
 }
 
+type CertificateExpiryMonitorConfig struct {
+	// WarningThresholdDays specifies the number of days before certificate expiration to start sending notifications
+	// Default: DefaultWarningThresholdDays (30 days)
+	WarningThresholdDays int `json:"warning_threshold_days"`
+
+	// CheckCooldownSeconds specifies the minimum time in seconds between certificate expiration checks
+	// Default: DefaultCheckCooldownSeconds (3600 seconds = 1 hour)
+	CheckCooldownSeconds int `json:"check_cooldown_seconds"`
+
+	// EventCooldownSeconds specifies the minimum time in seconds between firing the same certificate expiration event
+	// Default: DefaultEventCooldownSeconds (86400 seconds = 24 hours)
+	EventCooldownSeconds int `json:"event_cooldown_seconds"`
+}
+    CertificateExpiryMonitorConfig configures the certificate expiration
+    notification feature
+
 type CertificatesConfig struct {
 	API []string `json:"apis"`
 	// Upstream is used to specify the certificates to be used in mutual TLS connections to upstream services. These are set at gateway level as a map of domain -> certificate id or path.
@@ -6720,6 +6755,9 @@
 	PinnedPublicKeys map[string]string `json:"pinned_public_keys"`
 
 	Certificates CertificatesConfig `json:"certificates"`
+
+	// CertificateExpiryMonitor configures the certificate expiration notification feature
+	CertificateExpiryMonitor CertificateExpiryMonitorConfig `json:"certificate_expiry_monitor"`
 }
 
 type ServiceDiscoveryConf struct {
@@ -8457,6 +8495,8 @@
 	EventTokenUpdated = event.TokenUpdated
 	// EventTokenDeleted is an alias maintained for backwards compatibility.
 	EventTokenDeleted = event.TokenDeleted
+	// EventCertificateExpiringSoon is an alias maintained for backwards compatibility.
+	EventCertificateExpiringSoon = event.CertificateExpiringSoon
 )
 const (
 	MsgAuthFieldMissing                        = "Authorization field missing"
@@ -9115,6 +9155,7 @@
 
 type CertificateCheckMW struct {
 	*BaseMiddleware
+	// Has unexported fields.
 }
     CertificateCheckMW is used if domain was not detected or multiple APIs bind
     on the same domain. In this case authentification check happens not on TLS
@@ -9350,6 +9391,18 @@
     HandleError is the actual error handler and will store the error details in
     analytics if analytics processing is enabled.
 
+type EventCertificateExpiringSoonMeta struct {
+	EventMetaDefault
+	CertID        string    `json:"cert_id"`
+	CertName      string    `json:"cert_name"`
+	ExpiresAt     time.Time `json:"expires_at"`
+	DaysRemaining int       `json:"days_remaining"`
+	APIID         string    `json:"api_id"`
+	OrgID         string    `json:"org_id"`
+}
+    EventCertificateExpiringSoonMeta is the metadata structure for certificate
+    expiration events
+
 type EventCurcuitBreakerMeta struct {
 	EventMetaDefault
 	Path         string

Copy link
Contributor

github-actions bot commented Aug 5, 2025

Persistent review updated to latest commit c1ba970

Copy link
Contributor

github-actions bot commented Aug 5, 2025

Persistent review updated to latest commit c1ba970

Copy link
Contributor

github-actions bot commented Aug 5, 2025

PR Code Suggestions ✨

No code suggestions found for the PR.

@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@TykTechnologies TykTechnologies deleted a comment from github-actions bot Aug 6, 2025
@edsonmichaque edsonmichaque reopened this Aug 6, 2025
Copy link
Contributor

github-actions bot commented Aug 6, 2025

Persistent review updated to latest commit 6cb53ec

Copy link
Contributor

github-actions bot commented Aug 6, 2025

📦 Impact Review Snapshot

Effort Downstream Updates Compatibility Docs TL;DR
Low ⚠️ 🟢 📖 Certificate expiry monitoring feature requires minimal downstream updates, mainly to tyk-charts for exposing new config options
## Impact Assessment

This PR introduces certificate expiry monitoring functionality to Tyk Gateway, which checks TLS certificates for upcoming expiration and sends webhook notifications. The implementation is largely self-contained within the gateway codebase, with minimal impact on downstream repositories.

The feature adds a new event type (CertificateExpiringSoon), configuration options in the SecurityConfig struct, and a webhook template, but doesn't modify API definition schemas or RPC protocols that would require significant changes in downstream repositories.

## Required Updates
  1. tyk-charts:

    • Update Helm charts to expose new certificate expiry monitor configuration options:
      • security.certificate_expiry_monitor.warning_threshold_days
      • security.certificate_expiry_monitor.check_cooldown_seconds
      • security.certificate_expiry_monitor.event_cooldown_seconds
  2. Documentation:

    • Update documentation to describe the new certificate expiry monitoring feature
    • Document the new webhook event type and its payload structure (from certificate_check_webhook.json)
    • Provide configuration examples for the new settings
## Compatibility Concerns

No backward compatibility issues identified. The feature:

  • Adds new configuration options with sensible defaults
  • Introduces a new event type that doesn't conflict with existing ones
  • Uses existing Redis storage mechanisms for cooldown state
  • Integrates with the existing event system without modifications to its core functionality

The implementation includes proper handling of default values, so existing deployments will work without explicit configuration.

## Summary & Recommendations
  • Update tyk-charts to expose the new certificate expiry monitoring configuration options
  • Consider adding documentation for the new feature, including webhook configuration examples using the provided template
  • No immediate updates needed for tyk-operator, portal, or tyk-sink

Tip: Mention me again using /dependency <request>.
Powered by Probe AI
Tyk Gateway Dependency Impact Reviewer

Copy link
Contributor

github-actions bot commented Aug 6, 2025

🛡️ Security Snapshot

Effort Risk Level Tests Compliance TL;DR
Medium 🟡 ✔️ Certificate expiry monitoring with Redis-backed cooldown has minor race conditions but is generally secure
## Security Impact Analysis

The PR introduces certificate expiry monitoring to detect and notify when TLS certificates are approaching expiration. The implementation enhances security posture by providing early warning for certificate renewals, preventing potential service disruptions and security vulnerabilities from expired certificates. The feature uses Redis for cooldown state storage and integrates with Tyk's event system for notifications.

The implementation is generally secure, with proper validation of certificates and appropriate logging. However, there are two race conditions in the cooldown logic that could lead to duplicate events or checks in high-concurrency scenarios.

## Identified Vulnerabilities

Medium Risk:

  1. Race Conditions in Cooldown Logic: The check-then-set pattern in shouldCooldown and shouldFireExpiryEvent methods is not atomic, which could lead to duplicate events or checks in high-concurrency environments. This is a correctness issue rather than a security vulnerability, but could cause notification spam.

Low Risk:

  1. Synchronous Certificate Processing: Certificate expiration checks are performed synchronously in the request path, which could impact request latency, especially with many certificates or slow Redis operations. This is noted in the code comments as a potential future improvement.

  2. Hot Reload Consistency: When configuration changes via hot reload, existing cooldowns in Redis are not invalidated, which could lead to inconsistent behavior where some certificates use old thresholds until their cooldown expires.

## Security Recommendations
  1. Implement Atomic Operations: Extend the storage handler interface to support atomic operations (e.g., SET key value NX EX seconds) to eliminate the race conditions in cooldown logic. This would replace the current check-then-set pattern with a single atomic operation.

  2. Consider Asynchronous Processing: Move certificate expiration checks to a background goroutine or worker pool to avoid blocking API requests, especially when processing large numbers of certificates.

  3. Add Cooldown Invalidation: Implement a mechanism to invalidate cooldowns in Redis when configuration changes via hot reload to ensure consistent behavior.

  4. Add Certificate Expiry Metrics: Consider adding metrics for certificate expiry status to enable monitoring and alerting beyond the webhook notifications.

## OWASP Compliance

The implementation aligns well with OWASP best practices:

  • A5:2017-Broken Access Control: The feature enhances security by monitoring certificate expiration, which helps prevent unauthorized access due to expired certificates.

  • A6:2017-Security Misconfiguration: The feature helps prevent security misconfigurations by providing early warning for certificate renewals.

  • A10:2017-Insufficient Logging & Monitoring: The implementation includes appropriate logging of certificate status and events, enhancing the monitoring capabilities.

The code does not introduce any new OWASP Top 10 vulnerabilities and actually helps mitigate potential issues related to certificate management.

## Summary
  • The certificate expiry monitoring feature enhances security by providing early warning for certificate renewals.
  • The implementation has minor race conditions in the cooldown logic that should be addressed.
  • The synchronous processing in the request path could impact performance and should be made asynchronous in the future.
  • Overall, the feature is a positive security enhancement with good test coverage and minimal security concerns.

Tip: Mention me again using /security <request>.
Powered by Probe AI
Security Impact Reviewer Prompt

Copy link
Contributor

github-actions bot commented Aug 6, 2025

🚦 Connectivity Review Snapshot

Effort Tests Security Perf TL;DR
Medium 🔒 none 🟡 Certificate expiry monitoring uses Redis for cooldown tracking with potential race conditions in high-concurrency environments
## Connectivity Assessment
  • Redis Connections: The implementation uses Redis for storing cooldown state via the standard storage.RedisCluster handler. Two key prefixes are used: cert_check_cooldown: and cert_expiry_cooldown: to track when certificates were last checked and when events were last fired. The non-atomic check-then-set pattern could lead to race conditions under high load.
  • RPC Connections: No direct impact on RPC connections. In MDCB mode, certificate checks happen locally on each gateway instance, with no additional RPC traffic generated.
  • Synchronization Mechanisms: Uses goroutines for concurrent certificate processing but lacks proper synchronization for Redis operations. The implementation spawns a goroutine per certificate without limiting concurrency, which could lead to resource exhaustion with many certificates.
## Test Coverage Validation
  • Redis Tests: Integration tests properly validate Redis cooldown persistence across certificate checks, with tests for both check cooldowns and event cooldowns.
  • RPC Tests: No specific RPC tests since the feature doesn't directly interact with RPC systems.
  • Failure Scenario Tests: Tests cover Redis connection failures and certificate validation edge cases. The benchmark tests validate performance under load, but don't specifically test high-concurrency scenarios that might trigger race conditions.
## Security & Performance Impact
  • Authentication Changes: No changes to authentication mechanisms.
  • Performance Considerations: Certificate checks are performed synchronously in the request path, which could impact request latency, especially with many certificates or slow Redis operations. The code includes comments suggesting future improvements to make this asynchronous.
  • Error Handling: Robust error handling for Redis operations, with appropriate logging of failures. However, the non-atomic Redis operations could lead to duplicate events being fired in high-concurrency environments.
## Summary & Recommendations
  • Consider implementing atomic Redis operations (e.g., SET NX EX) to prevent race conditions in cooldown checks.
  • Implement a worker pool pattern with a fixed number of goroutines to prevent resource exhaustion when processing large numbers of certificates.
  • Consider making certificate checks asynchronous to avoid impacting request latency.
  • Add more explicit tests for high-concurrency scenarios to validate the robustness of the Redis cooldown mechanism.

Tip: Mention me again using /connectivity <request>.
Powered by Probe AI
Connectivity Issues Reviewer Prompt for Tyk Gateway

Copy link
Contributor

github-actions bot commented Aug 6, 2025

🚀 Performance Snapshot

Effort Perf Risk Hot Paths Benchmarks TL;DR
Medium 🟡 ⚠️ 📊 Certificate expiry monitoring adds synchronous checks in request path with potential race conditions in cooldown logic
## Performance Impact Analysis

This PR introduces certificate expiry monitoring to the Tyk Gateway, which adds additional processing during certificate validation. The implementation performs expiry checks synchronously in the request path, which could impact request latency, especially with many certificates. The cooldown mechanism using Redis helps mitigate this by preventing redundant checks, but the check-then-set pattern introduces race conditions that could lead to duplicate processing under high concurrency.

## Critical Areas
  1. Synchronous Certificate Checks: Certificate expiry checks happen in the request path, potentially adding latency to API requests that use mutual TLS.

  2. Redis Operations: Each certificate check involves Redis operations for cooldown management, which adds network I/O overhead.

  3. Race Conditions: The cooldown implementation uses a non-atomic check-then-set pattern that could lead to race conditions where multiple goroutines process the same certificate simultaneously.

  4. Goroutine Management: The implementation spawns a goroutine per certificate without limiting concurrency, which could lead to resource exhaustion with large numbers of certificates.

## Optimization Recommendations
  1. Make Certificate Checks Asynchronous: Consider moving certificate expiry checks out of the request path into a background process to avoid impacting API request latency.

  2. Implement Atomic Redis Operations: Extend the storage handler interface to support atomic operations (e.g., SET NX EX) to eliminate race conditions in cooldown logic.

  3. Add Worker Pool for Certificate Processing: Replace the unbounded goroutine creation with a fixed-size worker pool to limit resource usage when processing many certificates.

  4. Add Configuration for Concurrency Limits: Allow operators to configure the maximum number of concurrent certificate checks to prevent resource exhaustion.

  5. Implement Cooldown Invalidation: Add a mechanism to invalidate cooldowns when configuration changes to ensure consistent behavior after hot reloads.

## Summary
  • The certificate expiry monitoring feature is well-designed with comprehensive tests, but has performance implications that should be addressed.
  • The synchronous nature of certificate checks in the request path could impact API latency.
  • Race conditions in the cooldown logic could lead to duplicate processing and events.
  • Unbounded goroutine creation could cause resource exhaustion with many certificates.
  • The implementation would benefit from making checks asynchronous, using atomic Redis operations, and implementing a worker pool pattern.

Tip: Mention me again using /performance <request>.
Powered by Probe AI
Performance Impact Reviewer Prompt

Copy link

sonarqubecloud bot commented Aug 6, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants