
[TT-9234][TT-15257] regression fixes for failing mdcb readiness check #7215


Merged
andrei-tyk merged 15 commits into master from TT-9234-regression-fixes
Jul 10, 2025

Conversation

andrei-tyk
Contributor

@andrei-tyk andrei-tyk commented Jul 9, 2025

User description

Description

This PR fixes the health check logic for RPC components when MDCB (Multi-Data Center Bridge) is operating in emergency mode, ensuring proper failover behavior during RPC connectivity issues.

Problem

When MDCB enters emergency mode due to RPC connectivity issues, the gateway was incorrectly marking RPC health check failures as critical, causing the entire gateway to report as unhealthy. This prevented proper failover operation where the gateway should continue serving requests using cached policies.

Solution

Modified the isCriticalFailure() function in gateway/health_check.go to consider RPC emergency mode status when determining if an RPC component failure is critical.
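
A minimal, self-contained sketch of the condition described above. In the gateway the check reads `gw.GetConfig().Policies.PolicySource` and `rpc.IsEmergencyMode()`; here both are passed in as plain parameters (an assumption made so the snippet runs on its own), the function name `rpcFailureIsCritical` is hypothetical, and only the RPC branch is modelled.

```go
package main

import "fmt"

// rpcFailureIsCritical models only the RPC branch of isCriticalFailure().
// policySource and emergencyMode are hypothetical parameters standing in for
// gw.GetConfig().Policies.PolicySource and rpc.IsEmergencyMode().
func rpcFailureIsCritical(component, policySource string, emergencyMode bool) bool {
	return component == "rpc" && policySource == "rpc" && !emergencyMode
}

func main() {
	// Policies come over RPC and the gateway is connected: a failure is critical.
	fmt.Println(rpcFailureIsCritical("rpc", "rpc", false))
	// Same setup, but the gateway is in emergency mode: the failure is tolerated.
	fmt.Println(rpcFailureIsCritical("rpc", "rpc", true))
	// Policies are not loaded over RPC: the RPC component is not critical.
	fmt.Println(rpcFailureIsCritical("rpc", "file", false))
}
```

The only behavioural difference from the previous logic is the second case, which used to return true.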

Related Issue

Motivation and Context

How This Has Been Tested

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Refactoring or add test (improvements in base code or adds test coverage to functionality)

Checklist

  • I ensured that the documentation is up to date
  • I explained why this PR updates go.mod in detail with reasoning why it's required
  • I would like a code coverage CI quality gate exception and have explained why

PR Type

Bug fix, Tests


Description

  • Fixes critical failure logic for RPC in emergency mode

  • Adds unit tests for RPC emergency mode scenarios

  • Updates test setup to handle emergency mode toggling

  • Ensures correct behavior for RPC health check failures


Changes diagram

flowchart LR
  A["isCriticalFailure logic"] -- "add emergency mode check" --> B["RPC component handling"]
  B -- "unit tests for emergency mode" --> C["health_check_test.go"]

Changes walkthrough 📝

Relevant files

Bug fix
gateway/health_check.go (+2/-2): Add emergency mode check to RPC critical failure logic
  • Adds emergency mode check to RPC critical failure logic
  • Ensures RPC is not critical in emergency mode

Tests
gateway/health_check_test.go (+38/-0): Add and update tests for RPC emergency mode logic
  • Adds tests for RPC critical failure in emergency mode
  • Updates test cases to toggle emergency mode
  • Imports RPC package for emergency mode control

    Member

    buger commented Jul 9, 2025

    💔 The detected issue is not in one of the allowed statuses 💔

    Detected Status: DoD Check
    Allowed Statuses: In Dev, In Code Review, Ready for Testing, In Test, In Progress, In Review ✔️

    Please ensure your Jira story is in one of the allowed statuses.

    Contributor

    github-actions bot commented Jul 9, 2025

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Logic Change

    The logic for determining if the 'rpc' component is a critical failure now depends on whether the system is in emergency mode. This change should be carefully validated to ensure it does not introduce regressions in readiness checks, especially in scenarios where emergency mode toggling is frequent.

    // Consider RPC critical only if using RPC and gw not in emergency mode
    if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" && !rpc.IsEmergencyMode() {
    	return true
    }
    Test Coverage

    New tests have been added to cover the emergency mode logic for the 'rpc' component. Reviewers should ensure these tests accurately reflect the intended behavior and handle both emergency and non-emergency scenarios robustly.

    {
    	name:      "rpc component is NOT critical when PolicySource is rpc but in emergency mode",
    	component: "rpc",
    	check: HealthCheckItem{
    		Status:        Fail,
    		ComponentType: System,
    	},
    	setupConfig: func(conf *config.Config) {
    		conf.Policies.PolicySource = "rpc"
    	},
    	setupFunc: func(t *testing.T) {
    		rpc.SetEmergencyMode(t, true)
    	},
    	expectedResult: false,
    },
    {
    	name:      "rpc component is critical when PolicySource is rpc and NOT in emergency mode",
    	component: "rpc",
    	check: HealthCheckItem{
    		Status:        Fail,
    		ComponentType: System,
    	},
    	setupConfig: func(conf *config.Config) {
    		conf.Policies.PolicySource = "rpc"
    	},
    	setupFunc: func(t *testing.T) {
    		rpc.SetEmergencyMode(t, false)
    	},
    	expectedResult: true,
    },
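
    The toggle-per-case pattern in these tests can be sketched on its own as follows. This is an illustration under stated assumptions, not the code from the PR: `emergencyMode`, `setEmergencyMode`, `rpcIsCritical`, and `runCases` are hypothetical stand-ins, and the flag is restored after each case so that one case cannot leak state into the next.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// emergencyMode is a hypothetical stand-in for the process-wide flag that the
// rpc package keeps; it is not the real variable.
var emergencyMode atomic.Bool

// setEmergencyMode sets the flag and returns a function that restores the
// previous value.
func setEmergencyMode(v bool) (restore func()) {
	prev := emergencyMode.Swap(v)
	return func() { emergencyMode.Store(prev) }
}

// rpcIsCritical mirrors the condition under test for the "rpc" component.
func rpcIsCritical(policySource string) bool {
	return policySource == "rpc" && !emergencyMode.Load()
}

type testCase struct {
	name         string
	policySource string
	emergency    bool
	want         bool
}

var cases = []testCase{
	{"critical when connected", "rpc", false, true},
	{"not critical in emergency mode", "rpc", true, false},
	{"not critical when policy source is not rpc", "file", false, false},
}

// runCases toggles the flag per case and returns the names of failing cases.
func runCases(cs []testCase) []string {
	var failed []string
	for _, tc := range cs {
		restore := setEmergencyMode(tc.emergency)
		if got := rpcIsCritical(tc.policySource); got != tc.want {
			failed = append(failed, tc.name)
		}
		restore()
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", len(runCases(cases)))
}
```

    In a real `testing.T` based test the restore step would normally be registered with `t.Cleanup`, so it runs even when a case fails early.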

    Contributor

    github-actions bot commented Jul 9, 2025

    PR Code Suggestions ✨

    No code suggestions found for the PR.

    Contributor

    github-actions bot commented Jul 9, 2025

    API Changes

    no api changes detected


    sonarqubecloud bot commented Jul 9, 2025

    Member

    buger commented Jul 9, 2025

    /performance /connectivity /dependency pls check

    Contributor

    github-actions bot commented Jul 9, 2025

    Dependency Impact Analysis for PR #7215

    Summary

    This PR makes a targeted change to how RPC component failures are evaluated in the health check system, specifically accounting for emergency mode. The changes are minimal and focused on fixing a regression in MDCB readiness checks.

    Dependency Changes

    Direct Dependencies

    • No new external dependencies added
    • Internal dependency added: The health check system now depends on the rpc package's emergency mode state via rpc.IsEmergencyMode()

    Dependency Version Changes

    • No changes to external dependency versions (no modifications to go.mod)

    Code Analysis

    Key Changes

    1. Modified RPC Critical Failure Logic:

      // Before:
      if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" {
          return true
      }
      
      // After:
      if component == "rpc" && gw.GetConfig().Policies.PolicySource == "rpc" && !rpc.IsEmergencyMode() {
          return true
      }
    2. Added Unit Tests:

      • Tests for RPC component criticality in both emergency and non-emergency modes
      • Updated test setup to toggle emergency mode appropriately

    Emergency Mode Context

    The RPC emergency mode is a fallback mechanism that allows the gateway to continue operating with cached data when the RPC connection fails. The PR ensures that when in emergency mode, RPC failures are not considered critical, which is the expected behavior.
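
    The fallback idea can be sketched in a few lines. This is not Tyk's implementation: `policyLoader`, its fields, and `newDemoLoader` are hypothetical names introduced only to show the "serve the last good copy when the remote call fails" pattern the paragraph describes.

```go
package main

import (
	"errors"
	"fmt"
)

// policyLoader is a hypothetical type; it is not part of the Tyk codebase.
type policyLoader struct {
	fetch     func() ([]string, error) // stand-in for the remote call to MDCB
	cache     []string                 // last successfully fetched policies
	emergency bool                     // true while serving from cache
}

// load returns fresh policies when the remote call succeeds and falls back to
// the cached copy (entering emergency mode) when it fails.
func (l *policyLoader) load() ([]string, error) {
	policies, err := l.fetch()
	if err == nil {
		l.cache = policies
		l.emergency = false
		return policies, nil
	}
	if l.cache == nil {
		return nil, fmt.Errorf("rpc unavailable and no cached policies: %w", err)
	}
	l.emergency = true
	return l.cache, nil
}

// newDemoLoader builds a loader whose remote call succeeds only while *up is true.
func newDemoLoader(up *bool) *policyLoader {
	return &policyLoader{fetch: func() ([]string, error) {
		if !*up {
			return nil, errors.New("rpc: connection refused")
		}
		return []string{"policy-a", "policy-b"}, nil
	}}
}

func main() {
	up := true
	l := newDemoLoader(&up)

	p, _ := l.load()
	fmt.Println(len(p), l.emergency) // fresh copy, normal mode

	up = false
	p, _ = l.load()
	fmt.Println(len(p), l.emergency) // cached copy, emergency mode
}
```

    A loader that has never fetched successfully has nothing to fall back to, which is why the sketch still returns an error in that case.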

    Impact Assessment

    Functional Impact

    • Positive: Fixes a regression where MDCB readiness checks were failing incorrectly
    • Behavior Change: RPC failures will no longer be considered critical when in emergency mode

    Architectural Impact

    • Low: The change maintains the existing architecture while fixing a specific issue
    • Coupling: Introduces a minor coupling between the health check system and RPC emergency mode state

    Risk Assessment

    • Low Risk: The change is minimal and focused on a specific condition
    • Well Tested: New unit tests cover both emergency and non-emergency mode scenarios

    Conclusion

    This PR has minimal dependency impact. It introduces a small internal dependency on the RPC package's emergency mode state but doesn't change any external dependencies. The change is well-contained and properly tested, making it a low-risk fix for the reported regression.


    Tip: Mention me again using /dependency <request>.
    Powered by Probe AI

    Contributor

    github-actions bot commented Jul 9, 2025

    Connectivity Review: MDCB Readiness Check Fix

    Overview

    This PR addresses a regression issue in the MDCB (Multi Data Center Bridge) readiness check by modifying how the RPC component's critical failure status is determined. The change ensures that when a gateway is in emergency mode, RPC failures don't trigger a critical health check failure.

    Key Changes

    • Modified isCriticalFailure() to check for emergency mode when evaluating RPC component failures
    • Added appropriate test coverage for both emergency and non-emergency mode scenarios

    Connectivity Impact Analysis

    Positive Impacts

    1. Improved Resilience: When in emergency mode (disconnected from MDCB), the gateway will no longer report a 503 Service Unavailable status, as it's functioning in its designed fallback mode.

    2. Accurate Health Reporting: The health check now correctly reflects the gateway's operational status. In emergency mode, the gateway is still "ready" to serve traffic using cached configurations.

    3. Consistent with Self-Healing Design: This change aligns with Tyk's self-healing architecture where emergency mode allows continued operation using local cache during RPC connectivity issues.
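
    How a critical-failure decision turns into the HTTP status a probe sees can be sketched as below. This is an assumption-laden illustration, not the gateway's handler: `readinessHandler` and `probe` are hypothetical helpers, and the `/ready` path is used only as a placeholder.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// readinessHandler answers 503 while hasCriticalFailure reports true and
// 200 otherwise. Both the name and the shape are hypothetical.
func readinessHandler(hasCriticalFailure func() bool) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		if hasCriticalFailure() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// probe calls the handler in-process and returns the status code it wrote.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/ready", nil))
	return rec.Code
}

func main() {
	rpcDown := true
	emergency := false

	// The RPC failure counts as critical only outside emergency mode.
	h := readinessHandler(func() bool { return rpcDown && !emergency })

	fmt.Println(probe(h)) // RPC down, not in emergency mode
	emergency = true
	fmt.Println(probe(h)) // RPC down, emergency mode active
}
```

    With the flag set, the same RPC outage no longer flips the endpoint to 503, which is the behaviour the first point describes.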

    Potential Concerns

    1. Monitoring Visibility: While this change improves resilience, operators should ensure they have proper monitoring for emergency mode activation to avoid missing persistent MDCB connectivity issues.

    2. Cache Staleness: Extended operation in emergency mode relies on cached data which could become stale. This is an inherent trade-off of the emergency mode design, not a flaw in this PR.

    Recommendation

    The change is appropriate and improves system resilience by ensuring health checks accurately reflect the gateway's operational status. The PR includes proper test coverage for both emergency and non-emergency scenarios.

    This change follows best practices for distributed systems by implementing graceful degradation and proper health reporting during connectivity issues.


    Tip: Mention me again using /connectivity <request>.
    Powered by Probe AI

    Contributor

    github-actions bot commented Jul 9, 2025

    Connectivity and Dependency Analysis: MDCB Readiness Check Fix

    Overview

    This PR fixes a regression in the MDCB (Multi Data Center Bridge) readiness check by modifying how RPC failures are evaluated in emergency mode. The change is minimal but important for proper health check behavior in distributed deployments.

    Connectivity Impact

    The PR has a positive impact on connectivity by ensuring that health checks correctly reflect the system's operational state:

    1. Fixed Readiness Check Logic: The key change modifies the isCriticalFailure function to consider RPC failures as non-critical when in emergency mode. This prevents false negatives in readiness checks when the system is properly operating in its fallback state.

    2. Emergency Mode Awareness: The health check system now properly recognizes emergency mode as a valid operational state. Before this fix, the readiness check would incorrectly report failure even when the system was functioning correctly in emergency mode.

    3. No Changes to Connection Handling: The actual RPC connection handling logic remains unchanged - the PR only affects how connection failures are interpreted by the health check system.

    Dependency Impact

    The PR has minimal dependency impact:

    1. No New Dependencies: The change doesn't introduce any new dependencies.

    2. Existing RPC Dependency: The code continues to use the existing rpc package, specifically the IsEmergencyMode() function to determine the current emergency mode state.

    3. Test Dependencies: The only new import is for testing purposes, adding github.com/TykTechnologies/tyk/rpc to the health check tests to enable emergency mode toggling during tests.

    Conclusion

    This PR represents a well-contained fix for a specific issue with MDCB readiness checks. It improves system reliability by ensuring health checks accurately reflect the system's operational state, particularly when operating in emergency mode. The changes are minimal, focused, and well-tested, with no negative impacts on connectivity or dependencies.

    The fix is particularly important for environments using Kubernetes or other orchestration systems that rely on readiness probes for service management, as it prevents unnecessary service disruptions when the gateway is operating in emergency mode.


    Tip: Mention me again using /performance <request>.
    Powered by Probe AI

    @andrei-tyk andrei-tyk merged commit a564981 into master Jul 10, 2025
    77 of 79 checks passed
    @andrei-tyk andrei-tyk deleted the TT-9234-regression-fixes branch July 10, 2025 07:19

    tykbot bot commented Jul 10, 2025

    Working on it! Note that it can take a few minutes.

    @andrei-tyk
    Contributor Author

    /release to release-5.9.0


    tykbot bot commented Jul 10, 2025

    Working on it! Note that it can take a few minutes.


    tykbot bot commented Jul 10, 2025

    Working on it! Note that it can take a few minutes.

    tykbot bot pushed a commit that referenced this pull request Jul 10, 2025
    (cherry picked from commit a564981)

    tykbot bot commented Jul 10, 2025

    Working on it! Note that it can take a few minutes.


    tykbot bot commented Jul 10, 2025

    @andrei-tyk Created merge PRs

    tykbot bot pushed a commit that referenced this pull request Jul 10, 2025
    (cherry picked from commit a564981)

    tykbot bot commented Jul 10, 2025

    Still working...


    tykbot bot commented Jul 10, 2025

    @andrei-tyk Created merge PRs


    tykbot bot commented Jul 10, 2025

    Still working...


    tykbot bot commented Jul 10, 2025

    Still working...

    tykbot bot pushed a commit that referenced this pull request Jul 10, 2025
    (cherry picked from commit a564981)

    tykbot bot commented Jul 10, 2025

    @andrei-tyk Seems like there is a conflict and it requires a manual merge.

    tykbot bot pushed a commit that referenced this pull request Jul 10, 2025
    
    (cherry picked from commit a564981)

    tykbot bot commented Jul 10, 2025

    @andrei-tyk It seems there is a conflict, and it requires a manual merge.

    buger added a commit that referenced this pull request Jul 10, 2025
    …eadiness check (#7215)
    
    [TT-9234] regression fixes for failing mdcb readiness check (#7215)
    
    andrei-tyk added a commit that referenced this pull request Jul 10, 2025
    … readiness check (#7215) (#7221)
    
    ### **User description**
    [TT-9234] regression fixes for failing mdcb readiness check (#7215)
    
    
    [TT-9234]:
    https://tyktech.atlassian.net/browse/TT-9234?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
    
    
    ___
    
    ### **PR Type**
    Bug fix, Tests
    
    
    ___
    
    ### **Description**
    - Fixes RPC critical failure logic to respect emergency mode
    
    - Adds unit tests for RPC emergency mode health check behavior
    
    - Updates test setup to toggle emergency mode for RPC
    
    - Ensures gateway health is correct during RPC failures in emergency
    mode
    
    
    ___
    
    ### **Changes diagram**
    
    ```mermaid
    flowchart LR
      A["isCriticalFailure logic"] -- "add emergency mode check" --> B["RPC component handling"]
      B -- "unit tests for emergency mode" --> C["health_check_test.go"]
    ```
    
    
    ___
    
    
    
    ### **Changes walkthrough** 📝
    <table><thead><tr><th></th><th align="left">Relevant
    files</th></tr></thead><tbody><tr><td><strong>Bug
    fix</strong></td><td><table>
    <tr>
      <td>
        <details>
    <summary><strong>health_check.go</strong><dd><code>Add emergency mode
    check to RPC critical failure logic</code>&nbsp; &nbsp; &nbsp;
    </dd></summary>
    <hr>
    
    gateway/health_check.go
    
    <li>Adds emergency mode check to RPC critical failure logic<br> <li>
    Ensures RPC is not critical when in emergency mode
    
    
    </details>
    
    
      </td>
    <td><a
    href="https://github.com/TykTechnologies/tyk/pull/7221/files#diff-978a2d1427d9209765e541618af10683944c6396df1a6fb8b5221e4f16658a6a">+2/-2</a>&nbsp;
    &nbsp; &nbsp; </td>
    
    </tr>
    </table></td></tr><tr><td><strong>Tests</strong></td><td><table>
    <tr>
      <td>
        <details>
    <summary><strong>health_check_test.go</strong><dd><code>Add and update
    tests for RPC emergency mode logic</code>&nbsp; &nbsp; &nbsp; &nbsp;
    &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary>
    <hr>
    
    gateway/health_check_test.go
    
    <li>Adds tests for RPC critical failure in emergency mode<br> <li>
    Updates test cases to toggle emergency mode<br> <li> Imports RPC package
    for emergency mode control
    
    
    </details>
    
    
      </td>
    <td><a
    href="https://github.com/TykTechnologies/tyk/pull/7221/files#diff-08e29946afc7757a9c7baaef04b1a81964640437a684ff6306d1a0c933ac3f6a">+38/-0</a>&nbsp;
    &nbsp; </td>
    
    </tr>
    </table></td></tr></tbody></table>
    
    ___
    
    > <details> <summary> Need help?</summary><li>Type <code>/help how to
    ...</code> in the comments thread for any questions about PR-Agent
    usage.</li><li>Check out the <a
    href="https://qodo-merge-docs.qodo.ai/usage-guide/">documentation</a>
    for more information.</li></details>
    
    Co-authored-by: andrei-tyk <[email protected]>
    andrei-tyk added a commit that referenced this pull request Jul 10, 2025
    …eadiness check (#7215) (#7220)
    
    ### **User description**
    [TT-9234] regression fixes for failing mdcb readiness check (#7215)
    
    
    [TT-9234]:
    https://tyktech.atlassian.net/browse/TT-9234?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
    
    
    ___
    
    ### **PR Type**
    Bug fix, Tests
    
    
    ___
    
    ### **Description**
    - Fixes RPC critical failure logic to respect emergency mode
    
    - Adds unit tests for RPC emergency mode health check scenarios
    
    - Updates test setup to toggle RPC emergency mode as needed
    
    - Ensures gateway remains healthy in RPC emergency mode
    
    
    ___
    
    ### **Changes diagram**
    
    ```mermaid
    flowchart LR
      A["isCriticalFailure logic"] -- "add emergency mode check" --> B["RPC component handling"]
      B -- "unit tests for emergency mode" --> C["health_check_test.go"]
    ```
    
    
    ___
    
    
    
    ### **Changes walkthrough** 📝
    <table><thead><tr><th></th><th align="left">Relevant
    files</th></tr></thead><tbody><tr><td><strong>Bug
    fix</strong></td><td><table>
    <tr>
      <td>
        <details>
    <summary><strong>health_check.go</strong><dd><code>Add emergency mode
    check to RPC critical failure logic</code>&nbsp; &nbsp; &nbsp;
    </dd></summary>
    <hr>
    
    gateway/health_check.go
    
    <li>Adds emergency mode check to RPC critical failure logic<br> <li>
    Ensures RPC failures are not critical in emergency mode
    
    
    </details>
    
    
      </td>
    <td><a
    href="https://github.com/TykTechnologies/tyk/pull/7220/files#diff-978a2d1427d9209765e541618af10683944c6396df1a6fb8b5221e4f16658a6a">+2/-2</a>&nbsp;
    &nbsp; &nbsp; </td>
    
    </tr>
    </table></td></tr><tr><td><strong>Tests</strong></td><td><table>
    <tr>
      <td>
        <details>
    <summary><strong>health_check_test.go</strong><dd><code>Add and update
    tests for RPC emergency mode logic</code>&nbsp; &nbsp; &nbsp; &nbsp;
    &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary>
    <hr>
    
    gateway/health_check_test.go
    
    <li>Adds unit tests for RPC emergency mode logic<br> <li> Updates test
    cases to toggle emergency mode<br> <li> Imports RPC package for
    emergency mode control
    
    
    </details>
    
    
      </td>
    <td><a
    href="https://github.com/TykTechnologies/tyk/pull/7220/files#diff-08e29946afc7757a9c7baaef04b1a81964640437a684ff6306d1a0c933ac3f6a">+38/-0</a>&nbsp;
    &nbsp; </td>
    
    </tr>
    </table></td></tr></tbody></table>
    
    ___
    
    > <details> <summary> Need help?</summary><li>Type <code>/help how to
    ...</code> in the comments thread for any questions about PR-Agent
    usage.</li><li>Check out the <a
    href="https://qodo-merge-docs.qodo.ai/usage-guide/">documentation</a>
    for more information.</li></details>
    
    Co-authored-by: andrei-tyk <[email protected]>
    buger added a commit that referenced this pull request Jul 10, 2025
    … readiness check (#7215)
    
    [TT-9234] regression fixes for failing mdcb readiness check (#7215)
    
    ### **User description**
    <!-- Provide a general summary of your changes in the Title above -->
    
    ## Description
    
    This PR fixes the health check logic for RPC components when MDCB
    (Multi-Data Center Bridge) is operating in emergency mode, ensuring
    proper failover behavior during RPC connectivity issues.
    
    ## Problem
    When MDCB enters emergency mode due to RPC connectivity issues, the
    gateway was incorrectly marking RPC health check failures as critical,
    causing the entire gateway to report as unhealthy. This prevented proper
    failover operation where the gateway should continue serving requests
    using cached policies.
    ## Solution
    Modified the isCriticalFailure() function in gateway/health_check.go to
    consider RPC emergency mode status when determining if an RPC component
    failure is critical.
    <!-- Describe your changes in detail -->
    
    ## Related Issue
    
    <!-- This project only accepts pull requests related to open issues. -->
    <!-- If suggesting a new feature or change, please discuss it in an
    issue first. -->
    <!-- If fixing a bug, there should be an issue describing it with steps
    to reproduce. -->
    <!-- OSS: Please link to the issue here. Tyk: please create/link the
    JIRA ticket. -->
    
    ## Motivation and Context
    
    <!-- Why is this change required? What problem does it solve? -->
    
    ## How This Has Been Tested
    
    <!-- Please describe in detail how you tested your changes -->
    <!-- Include details of your testing environment, and the tests -->
    <!-- you ran to see how your change affects other areas of the code,
    etc. -->
    <!-- This information is helpful for reviewers and QA. -->
    
    ## Screenshots (if appropriate)
    
    ## Types of changes
    
    <!-- What types of changes does your code introduce? Put an `x` in all
    the boxes that apply: -->
    
    - [ ] Bug fix (non-breaking change which fixes an issue)
    - [ ] New feature (non-breaking change which adds functionality)
    - [ ] Breaking change (fix or feature that would cause existing
    functionality to change)
    - [ ] Refactoring or add test (improvements in base code or adds test
    coverage to functionality)
    
    ## Checklist
    
    <!-- Go over all the following points, and put an `x` in all the boxes
    that apply -->
    <!-- If there are no documentation updates required, mark the item as
    checked. -->
    <!-- Raise up any additional concerns not covered by the checklist. -->
    
    - [ ] I ensured that the documentation is up to date
    - [ ] I explained why this PR updates go.mod in detail with reasoning
    why it's required
    - [ ] I would like a code coverage CI quality gate exception and have
    explained why
    
    
    ___
    
    ### **PR Type**
    Bug fix, Tests
    
    
    ___
    
    ### **Description**
    - Fixes critical failure logic for RPC in emergency mode
    
    - Adds unit tests for RPC emergency mode scenarios
    
    - Updates test setup to handle emergency mode toggling
    
    - Ensures correct behavior for RPC health check failures
    
    
    ___
    
    ### **Changes diagram**
    
    ```mermaid
    flowchart LR
      A["isCriticalFailure logic"] -- "add emergency mode check" --> B["RPC component handling"]
      B -- "unit tests for emergency mode" --> C["health_check_test.go"]
    ```
    
    
    ___
    
    
    
    ### **Changes walkthrough** 📝
    <table><thead><tr><th></th><th align="left">Relevant
    files</th></tr></thead><tbody><tr><td><strong>Bug
    fix</strong></td><td><table>
    <tr>
      <td>
        <details>
    <summary><strong>health_check.go</strong><dd><code>Add emergency mode
    check to RPC critical failure logic</code>&nbsp; &nbsp; &nbsp;
    </dd></summary>
    <hr>
    
    gateway/health_check.go
    
    <li>Adds emergency mode check to RPC critical failure logic<br> <li>
    Ensures RPC is not critical in emergency mode
    
    
    </details>
    
    
      </td>
    <td><a
    href="https://github.com/TykTechnologies/tyk/pull/7215/files#diff-978a2d1427d9209765e541618af10683944c6396df1a6fb8b5221e4f16658a6a">+2/-2</a>&nbsp;
    &nbsp; &nbsp; </td>
    
    </tr>
    </table></td></tr><tr><td><strong>Tests</strong></td><td><table>
    <tr>
      <td>
        <details>
    <summary><strong>health_check_test.go</strong><dd><code>Add and update
    tests for RPC emergency mode logic</code>&nbsp; &nbsp; &nbsp; &nbsp;
    &nbsp; &nbsp; &nbsp; &nbsp; </dd></summary>
    <hr>
    
    gateway/health_check_test.go
    
    <li>Adds tests for RPC critical failure in emergency mode<br> <li>
    Updates test cases to toggle emergency mode<br> <li> Imports RPC package
    for emergency mode control
    
    
    </details>
    
    
      </td>
    <td><a
    href="https://github.com/TykTechnologies/tyk/pull/7215/files#diff-08e29946afc7757a9c7baaef04b1a81964640437a684ff6306d1a0c933ac3f6a">+38/-0</a>&nbsp;
    &nbsp; </td>
    
    </tr>
    </table></td></tr></tbody></table>
    
    @TykTechnologies TykTechnologies deleted a comment from github-actions bot Jul 10, 2025
    @buger
    Member

    buger commented Jul 10, 2025

    /probe please tell me how we can write tests for this functionality. We should have an MDCB/RPC mock available, and our gateway can already be run as a separate entity. So in theory we should be able to cover the majority of fault-tolerance scenarios and simulate various activities. I want you to do deep research and come back to me with example tests, plus a list of test scenarios we can introduce to test the hell out of this part, to ensure that regressions like this will not be possible.

    Contributor

    Comprehensive Testing Strategy for MDCB Emergency Mode and Health Checks

    Based on my analysis of the code, here's a comprehensive testing strategy to ensure the MDCB readiness check functionality works correctly during emergency mode:

    1. Unit Tests for Emergency Mode Logic

    func TestEmergencyModeHealthChecks(t *testing.T) {
        // Test that RPC failures are not critical in emergency mode
        conf := config.Config{}
        conf.Policies.PolicySource = "rpc"
        gw := NewGateway(conf, nil)
        
        component := "rpc"
        check := HealthCheckItem{
            Status: Fail,
            ComponentType: System,
        }
        
        // Test normal mode - RPC failure should be critical
        rpc.SetEmergencyMode(t, false)
        assert.True(t, gw.isCriticalFailure(component, check))
        
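        // Test emergency mode - RPC failure should NOT be critical
        rpc.SetEmergencyMode(t, true)
        // Emergency mode is package-level state; reset it so later tests are unaffected
        defer rpc.SetEmergencyMode(t, false)
        assert.False(t, gw.isCriticalFailure(component, check))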
    }

    2. Integration Tests with RPC Mock

    func TestHealthCheckWithMockedRPC(t *testing.T) {
        // Setup gateway with RPC policy source
        conf := func(globalConf *config.Config) {
            globalConf.Policies.PolicySource = "rpc"
            globalConf.HealthCheck.EnableHealthChecks = true
        }
        ts := StartTest(conf)
        defer ts.Close()
        
        // Mock RPC
        dispatcher := gorpc.NewDispatcher()
        dispatcher.AddFunc("Login", func(clientAddr, userKey string) bool {
            return true
        })
        rpcMock := startRPCMock(dispatcher)
        defer stopRPCMock(rpcMock)
        
        // Test health check in normal mode
        recorder := httptest.NewRecorder()
        req := httptest.NewRequest("GET", "/tyk/health", nil)
        ts.Gw.HealthCheckHandler(recorder, req)
        assert.Equal(t, http.StatusOK, recorder.Code)
        
        // Force emergency mode and RPC failure; reset the package-level flag on exit
        rpc.SetEmergencyMode(t, true)
        defer rpc.SetEmergencyMode(t, false)
        ts.Gw.healthCheckInfo.Store(map[string]HealthCheckItem{
            "rpc": {Status: Fail, ComponentType: System},
        })
        
        // Test health check in emergency mode
        recorder = httptest.NewRecorder()
        req = httptest.NewRequest("GET", "/tyk/health", nil)
        ts.Gw.HealthCheckHandler(recorder, req)
        
        // Should still return 200 OK with warning status
        assert.Equal(t, http.StatusOK, recorder.Code)
        
        var response HealthCheckResponse
        // Fail loudly on a malformed body instead of asserting on a zero value
        assert.NoError(t, json.Unmarshal(recorder.Body.Bytes(), &response))
        assert.Equal(t, Warn, response.Status)
    }
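
    The expectation "200 OK with warning status" implies a rule for folding per-component results into one response. The sketch below illustrates one such rule under stated assumptions: any critical failure yields `fail` with HTTP 503, any other failure yields `warn` with HTTP 200, and everything else is `pass`. The `Check` type and the `overall` function are hypothetical names, not the gateway's API.

    ```go
    package main

    import (
    	"fmt"
    	"net/http"
    )

    type Status string

    const (
    	Pass Status = "pass"
    	Warn Status = "warn"
    	Fail Status = "fail"
    )

    // Check is a hypothetical, reduced view of one component's result.
    type Check struct {
    	Status   Status
    	Critical bool // outcome of a decision such as isCriticalFailure
    }

    // overall folds component checks into a status and an HTTP code.
    // Assumed rule: a critical failure wins; other failures downgrade to warn.
    func overall(checks map[string]Check) (Status, int) {
    	status := Pass
    	for _, c := range checks {
    		if c.Status != Fail {
    			continue
    		}
    		if c.Critical {
    			return Fail, http.StatusServiceUnavailable
    		}
    		status = Warn
    	}
    	return status, http.StatusOK
    }

    func main() {
    	// RPC is down, but the gateway is in emergency mode, so it is not critical.
    	s, code := overall(map[string]Check{
    		"redis": {Status: Pass},
    		"rpc":   {Status: Fail, Critical: false},
    	})
    	fmt.Println(s, code) // prints warn 200
    }
    ```

    Because a critical failure returns immediately, the result does not depend on map iteration order.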

    3. Testing Readiness Endpoint

    func TestReadinessEndpointInEmergencyMode(t *testing.T) {
        // Setup gateway with RPC policy source
        conf := func(globalConf *config.Config) {
            globalConf.Policies.PolicySource = "rpc"
            globalConf.HealthCheck.EnableHealthChecks = true
        }
        ts := StartTest(conf)
        defer ts.Close()
        
        // Mock RPC
        dispatcher := gorpc.NewDispatcher()
        rpcMock := startRPCMock(dispatcher)
        defer stopRPCMock(rpcMock)
        
        // Force emergency mode and RPC failure; reset the package-level flag on exit
        rpc.SetEmergencyMode(t, true)
        defer rpc.SetEmergencyMode(t, false)
        ts.Gw.healthCheckInfo.Store(map[string]HealthCheckItem{
            "rpc": {Status: Fail, ComponentType: System},
        })
        
        // Test readiness endpoint
        recorder := httptest.NewRecorder()
        req := httptest.NewRequest("GET", "/tyk/ready", nil)
        ts.Gw.readyCheckHandler(recorder, req)
        
        // Should return 200 OK even in emergency mode
        assert.Equal(t, http.StatusOK, recorder.Code)
    }

    4. Testing Connection Failure Scenarios

    func TestConnectionFailureToEmergencyMode(t *testing.T) {
        // Setup gateway with non-existent RPC server
        conf := func(globalConf *config.Config) {
            globalConf.Policies.PolicySource = "rpc"
            globalConf.SlaveOptions.UseRPC = true
            globalConf.SlaveOptions.ConnectionString = "localhost:9999" // Non-existent
        }
        ts := StartTest(conf)
        defer ts.Close()
        
        // Wait for connection attempt to fail
        time.Sleep(100 * time.Millisecond)
        
        // Verify emergency mode is activated
        assert.True(t, rpc.IsEmergencyMode())
        
        // Verify health check passes with warning
        recorder := httptest.NewRecorder()
        req := httptest.NewRequest("GET", "/tyk/health", nil)
        ts.Gw.HealthCheckHandler(recorder, req)
        assert.Equal(t, http.StatusOK, recorder.Code)
    }

    5. Testing Recovery from Emergency Mode

    func TestRecoveryFromEmergencyMode(t *testing.T) {
        // Start with no RPC server
        conf := func(globalConf *config.Config) {
            globalConf.Policies.PolicySource = "rpc"
            globalConf.SlaveOptions.UseRPC = true
            globalConf.SlaveOptions.ConnectionString = "localhost:9090"
        }
        ts := StartTest(conf)
        defer ts.Close()
        
        // Wait for emergency mode
        time.Sleep(100 * time.Millisecond)
        assert.True(t, rpc.IsEmergencyMode())
        
        // Start RPC server
        dispatcher := gorpc.NewDispatcher()
        dispatcher.AddFunc("Login", func(clientAddr, userKey string) bool {
            return true
        })
        server := startRPCServer(t, "localhost:9090", dispatcher)
        defer server.Stop()
        
        // Force reconnection
        rpc.Reset()
        rpc.Connect(rpc.Config{
            ConnectionString: "localhost:9090",
        }, false, map[string]interface{}{
            "Login": func(clientAddr, userKey string) bool {
                return true
            },
        }, nil, nil, nil)
        
        // Wait for recovery
        time.Sleep(500 * time.Millisecond)
        
        // Verify emergency mode is deactivated
        assert.False(t, rpc.IsEmergencyMode())
        
        // Verify health check shows full pass
        recorder := httptest.NewRecorder()
        req := httptest.NewRequest("GET", "/tyk/health", nil)
        ts.Gw.HealthCheckHandler(recorder, req)
        assert.Equal(t, http.StatusOK, recorder.Code)
    }
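
    The fixed `time.Sleep` waits in the connection-failure and recovery tests can be flaky on slow CI runners, because the state change may take longer than the sleep. A minimal sketch of a polling alternative follows; `waitFor` is a hypothetical helper, and the atomic flag merely stands in for a check such as `rpc.IsEmergencyMode()`.

    ```go
    package main

    import (
    	"fmt"
    	"sync/atomic"
    	"time"
    )

    // waitFor polls cond every interval until it returns true or the
    // timeout elapses. It reports whether cond became true in time.
    func waitFor(cond func() bool, timeout, interval time.Duration) bool {
    	deadline := time.Now().Add(timeout)
    	for {
    		if cond() {
    			return true
    		}
    		if time.Now().After(deadline) {
    			return false
    		}
    		time.Sleep(interval)
    	}
    }

    func main() {
    	// emergency stands in for the real emergency-mode flag; it flips
    	// asynchronously after 20ms, as a failed connection attempt might.
    	var emergency atomic.Bool
    	time.AfterFunc(20*time.Millisecond, func() { emergency.Store(true) })

    	ok := waitFor(emergency.Load, time.Second, 5*time.Millisecond)
    	fmt.Println(ok)
    }
    ```

    In tests that already use testify, `assert.Eventually(t, condition, waitFor, tick)` offers the same polling behaviour without a custom helper.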

    6. Testing with Kubernetes Probes

    func TestKubernetesProbes(t *testing.T) {
        // Setup gateway with RPC policy source
        conf := func(globalConf *config.Config) {
            globalConf.Policies.PolicySource = "rpc"
            globalConf.HealthCheck.EnableHealthChecks = true
        }
        ts := StartTest(conf)
        defer ts.Close()
        
        // Mock RPC
        dispatcher := gorpc.NewDispatcher()
        rpcMock := startRPCMock(dispatcher)
        defer stopRPCMock(rpcMock)
        
        // Force emergency mode and RPC failure; reset the package-level flag on exit
        rpc.SetEmergencyMode(t, true)
        defer rpc.SetEmergencyMode(t, false)
        ts.Gw.healthCheckInfo.Store(map[string]HealthCheckItem{
            "rpc": {Status: Fail, ComponentType: System},
        })
        
        // Test liveness probe - should pass
        recorder := httptest.NewRecorder()
        req := httptest.NewRequest("GET", "/tyk/health", nil)
        ts.Gw.HealthCheckHandler(recorder, req)
        assert.Equal(t, http.StatusOK, recorder.Code)
        
        // Test readiness probe - should pass
        recorder = httptest.NewRecorder()
        req = httptest.NewRequest("GET", "/tyk/ready", nil)
        ts.Gw.readyCheckHandler(recorder, req)
        assert.Equal(t, http.StatusOK, recorder.Code)
    }

    By implementing these tests, we can ensure that the gateway correctly handles RPC failures and emergency mode transitions, maintaining proper health check behavior throughout. This will prevent regressions like the one fixed in this PR, where the gateway was incorrectly marking RPC failures as critical even in emergency mode.


    Powered by Probe AI

    lghiur added a commit that referenced this pull request Jul 11, 2025
    @andrei-tyk andrei-tyk changed the title [TT-9234] regression fixes for failing mdcb readiness check [TT-9234][TT-15257] regression fixes for failing mdcb readiness check Jul 15, 2025