[TT-9234][TT-15257] regression fixes for failing mdcb readiness check #7215
- added better error handling for case when graceful shutdown timeout is 0 - correctly initialized the graceful shutdown duration
…T-9234-regression-fixes
💔 The detected issue is not in one of the allowed statuses 💔
Please ensure your jira story is in one of the allowed statuses |
PR Code Suggestions ✨No code suggestions found for the PR. |
API Changes no api changes detected |
/performance /connectivity /dependency pls check |
## Dependency Impact Analysis for PR #7215

### Summary

This PR makes a targeted change to how RPC component failures are evaluated in the health check system, specifically accounting for emergency mode. The changes are minimal and focused on fixing a regression in MDCB readiness checks.

### Dependency Changes

#### Direct Dependencies

#### Dependency Version Changes

### Code Analysis

#### Key Changes

#### Emergency Mode Context

The RPC emergency mode is a fallback mechanism that allows the gateway to continue operating with cached data when the RPC connection fails. The PR ensures that when in emergency mode, RPC failures are not considered critical, which is the expected behavior.

### Impact Assessment

#### Functional Impact

#### Architectural Impact

#### Risk Assessment

### Conclusion

This PR has minimal dependency impact. It introduces a small internal dependency on the RPC package's emergency mode state but doesn't change any external dependencies. The change is well-contained and properly tested, making it a low-risk fix for the reported regression.
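The emergency mode rule described in this analysis can be sketched as follows. This is a minimal illustration under stated assumptions, not the gateway's actual code: `emergencyMode`, `healthCheckItem`, the `usingRPC` parameter, and the component names are hypothetical stand-ins, and treating `redis` as always critical is assumed only to give the example a contrasting case.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// emergencyMode is a hypothetical stand-in for the RPC package's
// emergency mode state; the real gateway tracks this elsewhere.
var emergencyMode atomic.Bool

// healthCheckItem is a simplified, hypothetical health check result.
type healthCheckItem struct {
	Component string // e.g. "redis" or "rpc" (names assumed for illustration)
	Failed    bool
}

// isCriticalFailure reports whether a failed component should make the
// gateway report itself as unhealthy.
func isCriticalFailure(item healthCheckItem, usingRPC bool) bool {
	if !item.Failed {
		return false
	}
	switch item.Component {
	case "redis":
		// Assumed to be always critical in this sketch.
		return true
	case "rpc":
		// An RPC failure is critical only when the gateway relies on RPC
		// and has not fallen back to emergency mode (cached data).
		return usingRPC && !emergencyMode.Load()
	default:
		return false
	}
}

func main() {
	rpcDown := healthCheckItem{Component: "rpc", Failed: true}

	fmt.Println(isCriticalFailure(rpcDown, true)) // true: RPC down, no fallback

	emergencyMode.Store(true)
	fmt.Println(isCriticalFailure(rpcDown, true)) // false: serving from cache
}
```

Keeping the emergency mode lookup inside the `rpc` branch leaves the evaluation of other components unchanged, which matches the analysis's description of the change as well-contained.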
## Connectivity Review: MDCB Readiness Check Fix

### Overview

This PR addresses a regression issue in the MDCB (Multi Data Center Bridge) readiness check by modifying how the RPC component's critical failure status is determined. The change ensures that when a gateway is in emergency mode, RPC failures don't trigger a critical health check failure.

### Key Changes

### Connectivity Impact Analysis

#### Positive Impacts

#### Potential Concerns

### Recommendation

The change is appropriate and improves system resilience by ensuring health checks accurately reflect the gateway's operational status. The PR includes proper test coverage for both emergency and non-emergency scenarios. This change follows best practices for distributed systems by implementing graceful degradation and proper health reporting during connectivity issues.
## Connectivity and Dependency Analysis: MDCB Readiness Check Fix

### Overview

This PR fixes a regression in the MDCB (Multi Data Center Bridge) readiness check by modifying how RPC failures are evaluated in emergency mode. The change is minimal but important for proper health check behavior in distributed deployments.

### Connectivity Impact

The PR has a positive impact on connectivity by ensuring that health checks correctly reflect the system's operational state:

### Dependency Impact

The PR has minimal dependency impact:

### Conclusion

This PR represents a well-contained fix for a specific issue with MDCB readiness checks. It improves system reliability by ensuring health checks accurately reflect the system's operational state, particularly when operating in emergency mode. The changes are minimal, focused, and well-tested, with no negative impacts on connectivity or dependencies.

The fix is particularly important for environments using Kubernetes or other orchestration systems that rely on readiness probes for service management, as it prevents unnecessary service disruptions when the gateway is operating in emergency mode.
/release to release-5.9.0 |
### **User description**

## Description

This PR fixes the health check logic for RPC components when MDCB (Multi-Data Center Bridge) is operating in emergency mode, ensuring proper failover behavior during RPC connectivity issues.

## Problem

When MDCB enters emergency mode due to RPC connectivity issues, the gateway was incorrectly marking RPC health check failures as critical, causing the entire gateway to report as unhealthy. This prevented proper failover operation where the gateway should continue serving requests using cached policies.

## Solution

Modified the `isCriticalFailure()` function in `gateway/health_check.go` to consider RPC emergency mode status when determining if an RPC component failure is critical.

## Related Issue

## Motivation and Context

## How This Has Been Tested

## Screenshots (if appropriate)

## Types of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Refactoring or add test (improvements in base code or adds test coverage to functionality)

## Checklist

- [ ] I ensured that the documentation is up to date
- [ ] I explained why this PR updates go.mod in detail with reasoning why it's required
- [ ] I would like a code coverage CI quality gate exception and have explained why

___

### **PR Type**

Bug fix, Tests

___

### **Description**

- Fixes critical failure logic for RPC in emergency mode
- Adds unit tests for RPC emergency mode scenarios
- Updates test setup to handle emergency mode toggling
- Ensures correct behavior for RPC health check failures

___

### **Changes diagram**

```mermaid
flowchart LR
  A["isCriticalFailure logic"] -- "add emergency mode check" --> B["RPC component handling"]
  B -- "unit tests for emergency mode" --> C["health_check_test.go"]
```

___

### **Changes walkthrough** 📝

<table><thead><tr><th></th><th align="left">Relevant files</th></tr></thead><tbody>
<tr><td><strong>Bug fix</strong></td><td><table>
<tr>
<td>
<details>
<summary><strong>health_check.go</strong><dd><code>Add emergency mode check to RPC critical failure logic</code></dd></summary>
<hr>
gateway/health_check.go
<li>Adds emergency mode check to RPC critical failure logic<br>
<li>Ensures RPC is not critical in emergency mode
</details>
</td>
<td><a href="https://github.com/TykTechnologies/tyk/pull/7215/files#diff-978a2d1427d9209765e541618af10683944c6396df1a6fb8b5221e4f16658a6a">+2/-2</a></td>
</tr>
</table></td></tr>
<tr><td><strong>Tests</strong></td><td><table>
<tr>
<td>
<details>
<summary><strong>health_check_test.go</strong><dd><code>Add and update tests for RPC emergency mode logic</code></dd></summary>
<hr>
gateway/health_check_test.go
<li>Adds tests for RPC critical failure in emergency mode<br>
<li>Updates test cases to toggle emergency mode<br>
<li>Imports RPC package for emergency mode control
</details>
</td>
<td><a href="https://github.com/TykTechnologies/tyk/pull/7215/files#diff-08e29946afc7757a9c7baaef04b1a81964640437a684ff6306d1a0c933ac3f6a">+38/-0</a></td>
</tr>
</table></td></tr>
</tbody></table>

___

> <details> <summary> Need help?</summary><li>Type <code>/help how to ...</code> in the comments thread for any questions about PR-Agent usage.</li><li>Check out the <a href="https://qodo-merge-docs.qodo.ai/usage-guide/">documentation</a> for more information.</li></details>

(cherry picked from commit a564981)
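The walkthrough mentions tests that toggle emergency mode. One way such toggling might be structured is sketched below, assuming a package-level flag; `emergencyMode`, `rpcFailureIsCritical`, and `withEmergencyMode` are hypothetical names for illustration and are not taken from `gateway/health_check_test.go`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// emergencyMode is a hypothetical stand-in for the RPC package's
// emergency mode state.
var emergencyMode atomic.Bool

// rpcFailureIsCritical mirrors the rule from the description: an RPC
// failure is critical only outside emergency mode.
func rpcFailureIsCritical() bool {
	return !emergencyMode.Load()
}

// withEmergencyMode sets the flag for the duration of fn and restores
// the previous value afterwards, so cases cannot leak state into each
// other.
func withEmergencyMode(enabled bool, fn func()) {
	prev := emergencyMode.Swap(enabled)
	defer emergencyMode.Store(prev)
	fn()
}

func main() {
	cases := []struct {
		name         string
		emergency    bool
		wantCritical bool
	}{
		{name: "normal mode", emergency: false, wantCritical: true},
		{name: "emergency mode", emergency: true, wantCritical: false},
	}

	for _, tc := range cases {
		withEmergencyMode(tc.emergency, func() {
			got := rpcFailureIsCritical()
			fmt.Printf("%s: critical=%v ok=%v\n", tc.name, got, got == tc.wantCritical)
		})
	}
}
```

Restoring the previous value matters because the flag is global: a case that left emergency mode enabled would silently change the outcome of every case that runs after it.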
@andrei-tyk Created merge PRs |
@andrei-tyk Seems like there is a conflict and it requires a manual merge.
…eadiness check (#7215) [TT-9234] regression fixes for failing mdcb readiness check (#7215)
… readiness check (#7215) (#7221)

### **User description**

[TT-9234] regression fixes for failing mdcb readiness check (#7215)

[TT-9234]: https://tyktech.atlassian.net/browse/TT-9234?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

___

### **PR Type**

Bug fix, Tests

___

### **Description**

- Fixes RPC critical failure logic to respect emergency mode
- Adds unit tests for RPC emergency mode health check behavior
- Updates test setup to toggle emergency mode for RPC
- Ensures gateway health is correct during RPC failures in emergency mode

___

### **Changes diagram**

```mermaid
flowchart LR
  A["isCriticalFailure logic"] -- "add emergency mode check" --> B["RPC component handling"]
  B -- "unit tests for emergency mode" --> C["health_check_test.go"]
```

___

### **Changes walkthrough** 📝

<table><thead><tr><th></th><th align="left">Relevant files</th></tr></thead><tbody>
<tr><td><strong>Bug fix</strong></td><td><table>
<tr>
<td>
<details>
<summary><strong>health_check.go</strong><dd><code>Add emergency mode check to RPC critical failure logic</code></dd></summary>
<hr>
gateway/health_check.go
<li>Adds emergency mode check to RPC critical failure logic<br>
<li>Ensures RPC is not critical when in emergency mode
</details>
</td>
<td><a href="https://github.com/TykTechnologies/tyk/pull/7221/files#diff-978a2d1427d9209765e541618af10683944c6396df1a6fb8b5221e4f16658a6a">+2/-2</a></td>
</tr>
</table></td></tr>
<tr><td><strong>Tests</strong></td><td><table>
<tr>
<td>
<details>
<summary><strong>health_check_test.go</strong><dd><code>Add and update tests for RPC emergency mode logic</code></dd></summary>
<hr>
gateway/health_check_test.go
<li>Adds tests for RPC critical failure in emergency mode<br>
<li>Updates test cases to toggle emergency mode<br>
<li>Imports RPC package for emergency mode control
</details>
</td>
<td><a href="https://github.com/TykTechnologies/tyk/pull/7221/files#diff-08e29946afc7757a9c7baaef04b1a81964640437a684ff6306d1a0c933ac3f6a">+38/-0</a></td>
</tr>
</table></td></tr>
</tbody></table>

___

> <details> <summary> Need help?</summary><li>Type <code>/help how to ...</code> in the comments thread for any questions about PR-Agent usage.</li><li>Check out the <a href="https://qodo-merge-docs.qodo.ai/usage-guide/">documentation</a> for more information.</li></details>

Co-authored-by: andrei-tyk <[email protected]>
…eadiness check (#7215) (#7220) ### **User description** [TT-9234] regression fixes for failing mdcb readiness check (#7215) ### **User description** <!-- Provide a general summary of your changes in the Title above --> ## Description This PR fixes the health check logic for RPC components when MDCB (Multi-Data Center Bridge) is operating in emergency mode, ensuring proper failover behavior during RPC connectivity issues. ## Problem When MDCB enters emergency mode due to RPC connectivity issues, the gateway was incorrectly marking RPC health check failures as critical, causing the entire gateway to report as unhealthy. This prevented proper failover operation where the gateway should continue serving requests using cached policies. ## Solution Modified the isCriticalFailure() function in gateway/health_check.go to consider RPC emergency mode status when determining if an RPC component failure is critical. <!-- Describe your changes in detail --> ## Related Issue <!-- This project only accepts pull requests related to open issues. --> <!-- If suggesting a new feature or change, please discuss it in an issue first. --> <!-- If fixing a bug, there should be an issue describing it with steps to reproduce. --> <!-- OSS: Please link to the issue here. Tyk: please create/link the JIRA ticket. --> ## Motivation and Context <!-- Why is this change required? What problem does it solve? --> ## How This Has Been Tested <!-- Please describe in detail how you tested your changes --> <!-- Include details of your testing environment, and the tests --> <!-- you ran to see how your change affects other areas of the code, etc. --> <!-- This information is helpful for reviewers and QA. --> ## Screenshots (if appropriate) ## Types of changes <!-- What types of changes does your code introduce? 
Put an `x` in all the boxes that apply: --> - [ ] Bug fix (non-breaking change which fixes an issue) - [ ] New feature (non-breaking change which adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) - [ ] Refactoring or add test (improvements in base code or adds test coverage to functionality) ## Checklist <!-- Go over all the following points, and put an `x` in all the boxes that apply --> <!-- If there are no documentation updates required, mark the item as checked. --> <!-- Raise up any additional concerns not covered by the checklist. --> - [ ] I ensured that the documentation is up to date - [ ] I explained why this PR updates go.mod in detail with reasoning why it's required - [ ] I would like a code coverage CI quality gate exception and have explained why ___ ### **PR Type** Bug fix, Tests ___ ### **Description** - Fixes critical failure logic for RPC in emergency mode - Adds unit tests for RPC emergency mode scenarios - Updates test setup to handle emergency mode toggling - Ensures correct behavior for RPC health check failures ___ ### **Changes diagram** ```mermaid flowchart LR A["isCriticalFailure logic"] -- "add emergency mode check" --> B["RPC component handling"] B -- "unit tests for emergency mode" --> C["health_check_test.go"] ``` ___ ### **Changes walkthrough** 📝 <table><thead><tr><th></th><th align="left">Relevant files</th></tr></thead><tbody><tr><td><strong>Bug fix</strong></td><td><table> <tr> <td> <details> <summary><strong>health_check.go</strong><dd><code>Add emergency mode check to RPC critical failure logic</code> </dd></summary> <hr> gateway/health_check.go <li>Adds emergency mode check to RPC critical failure logic<br> <li> Ensures RPC is not critical in emergency mode </details> </td> <td><a href="https://github.com/TykTechnologies/tyk/pull/7215/files#diff-978a2d1427d9209765e541618af10683944c6396df1a6fb8b5221e4f16658a6a">+2/-2</a> </td> </tr> 
</table></td></tr><tr><td><strong>Tests</strong></td><td><table> <tr> <td> <details> <summary><strong>health_check_test.go</strong><dd><code>Add and update tests for RPC emergency mode logic</code> </dd></summary> <hr> gateway/health_check_test.go <li>Adds tests for RPC critical failure in emergency mode<br> <li> Updates test cases to toggle emergency mode<br> <li> Imports RPC package for emergency mode control </details> </td> <td><a href="https://github.com/TykTechnologies/tyk/pull/7215/files#diff-08e29946afc7757a9c7baaef04b1a81964640437a684ff6306d1a0c933ac3f6a">+38/-0</a> </td> </tr> </table></td></tr></tr></tbody></table> ___ > <details> <summary> Need help?</summary><li>Type <code>/help how to ...</code> in the comments thread for any questions about PR-Agent usage.</li><li>Check out the <a href="https://qodo-merge-docs.qodo.ai/usage-guide/">documentation</a> for more information.</li></details> [TT-9234]: https://tyktech.atlassian.net/browse/TT-9234?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ ___ ### **PR Type** Bug fix, Tests ___ ### **Description** - Fixes RPC critical failure logic to respect emergency mode - Adds unit tests for RPC emergency mode health check scenarios - Updates test setup to toggle RPC emergency mode as needed - Ensures gateway remains healthy in RPC emergency mode ___ ### **Changes diagram** ```mermaid flowchart LR A["isCriticalFailure logic"] -- "add emergency mode check" --> B["RPC component handling"] B -- "unit tests for emergency mode" --> C["health_check_test.go"] ``` ___ ### **Changes walkthrough** 📝 <table><thead><tr><th></th><th align="left">Relevant files</th></tr></thead><tbody><tr><td><strong>Bug fix</strong></td><td><table> <tr> <td> <details> <summary><strong>health_check.go</strong><dd><code>Add emergency mode check to RPC critical failure logic</code> </dd></summary> <hr> gateway/health_check.go <li>Adds emergency mode check to RPC critical failure logic<br> <li> Ensures 
RPC failures are not critical in emergency mode </details> </td> <td><a href="https://github.com/TykTechnologies/tyk/pull/7220/files#diff-978a2d1427d9209765e541618af10683944c6396df1a6fb8b5221e4f16658a6a">+2/-2</a> </td> </tr> </table></td></tr><tr><td><strong>Tests</strong></td><td><table> <tr> <td> <details> <summary><strong>health_check_test.go</strong><dd><code>Add and update tests for RPC emergency mode logic</code> </dd></summary> <hr> gateway/health_check_test.go <li>Adds unit tests for RPC emergency mode logic<br> <li> Updates test cases to toggle emergency mode<br> <li> Imports RPC package for emergency mode control </details> </td> <td><a href="https://github.com/TykTechnologies/tyk/pull/7220/files#diff-08e29946afc7757a9c7baaef04b1a81964640437a684ff6306d1a0c933ac3f6a">+38/-0</a> </td> </tr> </table></td></tr></tr></tbody></table> ___ > <details> <summary> Need help?</summary><li>Type <code>/help how to ...</code> in the comments thread for any questions about PR-Agent usage.</li><li>Check out the <a href="https://qodo-merge-docs.qodo.ai/usage-guide/">documentation</a> for more information.</li></details> Co-authored-by: andrei-tyk <[email protected]>
/probe pls tell me how we can write tests for this functionality. We should have an MDCB/RPC mock available, and our gateway can already be run as a separate entity. So in theory we should be able to cover the majority of fault-tolerance scenarios and simulate various activities. I want you to do deep research and come back to me with example tests, plus a list of test scenarios we can introduce to test the hell out of this part, so that regressions like this are not possible.
## Comprehensive Testing Strategy for MDCB Emergency Mode and Health Checks

Based on my analysis of the code, here's a comprehensive testing strategy to ensure the MDCB readiness check functionality works correctly during emergency mode:

### 1. Unit Tests for Emergency Mode Logic

```go
func TestEmergencyModeHealthChecks(t *testing.T) {
	// Test that RPC failures are not critical in emergency mode
	conf := config.Config{}
	conf.Policies.PolicySource = "rpc"
	gw := NewGateway(conf, nil)

	// Emergency mode is package-level state, so restore it for later tests
	t.Cleanup(func() { rpc.SetEmergencyMode(t, false) })

	component := "rpc"
	check := HealthCheckItem{
		Status:        Fail,
		ComponentType: System,
	}

	// Test normal mode - RPC failure should be critical
	rpc.SetEmergencyMode(t, false)
	assert.True(t, gw.isCriticalFailure(component, check))

	// Test emergency mode - RPC failure should NOT be critical
	rpc.SetEmergencyMode(t, true)
	assert.False(t, gw.isCriticalFailure(component, check))
}
```

### 2. Integration Tests with RPC Mock

```go
func TestHealthCheckWithMockedRPC(t *testing.T) {
	// Setup gateway with RPC policy source
	conf := func(globalConf *config.Config) {
		globalConf.Policies.PolicySource = "rpc"
		globalConf.HealthCheck.EnableHealthChecks = true
	}
	ts := StartTest(conf)
	defer ts.Close()

	// Mock RPC
	dispatcher := gorpc.NewDispatcher()
	dispatcher.AddFunc("Login", func(clientAddr, userKey string) bool {
		return true
	})
	rpcMock := startRPCMock(dispatcher)
	defer stopRPCMock(rpcMock)

	// Test health check in normal mode
	recorder := httptest.NewRecorder()
	req := httptest.NewRequest("GET", "/tyk/health", nil)
	ts.Gw.HealthCheckHandler(recorder, req)
	assert.Equal(t, http.StatusOK, recorder.Code)

	// Force emergency mode and RPC failure
	t.Cleanup(func() { rpc.SetEmergencyMode(t, false) })
	rpc.SetEmergencyMode(t, true)
	ts.Gw.healthCheckInfo.Store(map[string]HealthCheckItem{
		"rpc": {Status: Fail, ComponentType: System},
	})

	// Test health check in emergency mode
	recorder = httptest.NewRecorder()
	req = httptest.NewRequest("GET", "/tyk/health", nil)
	ts.Gw.HealthCheckHandler(recorder, req)

	// Should still return 200 OK with warning status
	assert.Equal(t, http.StatusOK, recorder.Code)
	var response HealthCheckResponse
	assert.NoError(t, json.Unmarshal(recorder.Body.Bytes(), &response))
	assert.Equal(t, Warn, response.Status)
}
```

### 3. Testing Readiness Endpoint

```go
func TestReadinessEndpointInEmergencyMode(t *testing.T) {
	// Setup gateway with RPC policy source
	conf := func(globalConf *config.Config) {
		globalConf.Policies.PolicySource = "rpc"
		globalConf.HealthCheck.EnableHealthChecks = true
	}
	ts := StartTest(conf)
	defer ts.Close()

	// Mock RPC
	dispatcher := gorpc.NewDispatcher()
	rpcMock := startRPCMock(dispatcher)
	defer stopRPCMock(rpcMock)

	// Force emergency mode and RPC failure
	t.Cleanup(func() { rpc.SetEmergencyMode(t, false) })
	rpc.SetEmergencyMode(t, true)
	ts.Gw.healthCheckInfo.Store(map[string]HealthCheckItem{
		"rpc": {Status: Fail, ComponentType: System},
	})

	// Test readiness endpoint
	recorder := httptest.NewRecorder()
	req := httptest.NewRequest("GET", "/tyk/ready", nil)
	ts.Gw.readyCheckHandler(recorder, req)

	// Should return 200 OK even in emergency mode
	assert.Equal(t, http.StatusOK, recorder.Code)
}
```

### 4. Testing Connection Failure Scenarios

```go
func TestConnectionFailureToEmergencyMode(t *testing.T) {
	// Setup gateway with non-existent RPC server
	conf := func(globalConf *config.Config) {
		globalConf.Policies.PolicySource = "rpc"
		globalConf.SlaveOptions.UseRPC = true
		globalConf.SlaveOptions.ConnectionString = "localhost:9999" // Non-existent
	}
	ts := StartTest(conf)
	defer ts.Close()

	// Wait for connection attempt to fail
	time.Sleep(100 * time.Millisecond)

	// Verify emergency mode is activated
	assert.True(t, rpc.IsEmergencyMode())

	// Verify health check passes with warning
	recorder := httptest.NewRecorder()
	req := httptest.NewRequest("GET", "/tyk/health", nil)
	ts.Gw.HealthCheckHandler(recorder, req)
	assert.Equal(t, http.StatusOK, recorder.Code)
}
```

### 5. Testing Recovery from Emergency Mode

```go
func TestRecoveryFromEmergencyMode(t *testing.T) {
	// Start with no RPC server
	conf := func(globalConf *config.Config) {
		globalConf.Policies.PolicySource = "rpc"
		globalConf.SlaveOptions.UseRPC = true
		globalConf.SlaveOptions.ConnectionString = "localhost:9090"
	}
	ts := StartTest(conf)
	defer ts.Close()

	// Wait for emergency mode
	time.Sleep(100 * time.Millisecond)
	assert.True(t, rpc.IsEmergencyMode())

	// Start RPC server
	dispatcher := gorpc.NewDispatcher()
	dispatcher.AddFunc("Login", func(clientAddr, userKey string) bool {
		return true
	})
	server := startRPCServer(t, "localhost:9090", dispatcher)
	defer server.Stop()

	// Force reconnection
	rpc.Reset()
	rpc.Connect(rpc.Config{
		ConnectionString: "localhost:9090",
	}, false, map[string]interface{}{
		"Login": func(clientAddr, userKey string) bool {
			return true
		},
	}, nil, nil, nil)

	// Wait for recovery
	time.Sleep(500 * time.Millisecond)

	// Verify emergency mode is deactivated
	assert.False(t, rpc.IsEmergencyMode())

	// Verify health check shows full pass
	recorder := httptest.NewRecorder()
	req := httptest.NewRequest("GET", "/tyk/health", nil)
	ts.Gw.HealthCheckHandler(recorder, req)
	assert.Equal(t, http.StatusOK, recorder.Code)
}
```

### 6. Testing with Kubernetes Probes

```go
func TestKubernetesProbes(t *testing.T) {
	// Setup gateway with RPC policy source
	conf := func(globalConf *config.Config) {
		globalConf.Policies.PolicySource = "rpc"
		globalConf.HealthCheck.EnableHealthChecks = true
	}
	ts := StartTest(conf)
	defer ts.Close()

	// Mock RPC
	dispatcher := gorpc.NewDispatcher()
	rpcMock := startRPCMock(dispatcher)
	defer stopRPCMock(rpcMock)

	// Force emergency mode and RPC failure
	t.Cleanup(func() { rpc.SetEmergencyMode(t, false) })
	rpc.SetEmergencyMode(t, true)
	ts.Gw.healthCheckInfo.Store(map[string]HealthCheckItem{
		"rpc": {Status: Fail, ComponentType: System},
	})

	// Test liveness probe - should pass
	recorder := httptest.NewRecorder()
	req := httptest.NewRequest("GET", "/tyk/health", nil)
	ts.Gw.HealthCheckHandler(recorder, req)
	assert.Equal(t, http.StatusOK, recorder.Code)

	// Test readiness probe - should pass
	recorder = httptest.NewRecorder()
	req = httptest.NewRequest("GET", "/tyk/ready", nil)
	ts.Gw.readyCheckHandler(recorder, req)
	assert.Equal(t, http.StatusOK, recorder.Code)
}
```

By implementing these tests, we can ensure that the gateway correctly handles RPC failures and emergency mode transitions, maintaining proper health check behavior throughout. This will prevent regressions like the one fixed in this PR, where the gateway was incorrectly marking RPC failures as critical even in emergency mode.
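The rule these scenarios exercise can also be pinned down as a small truth table, which makes it harder for a later change to flip one combination unnoticed. The following is a minimal, self-contained sketch and not the gateway's code: `rpcFailureIsCritical`, `healthItem`, `criticalCase`, and the `usesRPC` / `emergency` parameters are hypothetical stand-ins for `isCriticalFailure()`, `HealthCheckItem`, the `rpc` policy source setting, and the RPC emergency-mode flag. It assumes, following the PR description, that a failing RPC check is critical only when the gateway depends on RPC and is not in emergency mode.

```go
package main

import "fmt"

// status is a hypothetical stand-in for the gateway's health check status type.
type status string

const (
	pass status = "pass"
	fail status = "fail"
)

// healthItem is a hypothetical, reduced version of a health check result.
type healthItem struct {
	Status status
}

// rpcFailureIsCritical sketches the assumed rule for the "rpc" component:
// a failing check is critical only if the gateway relies on RPC and is
// NOT currently in emergency mode (where it serves from cached data).
func rpcFailureIsCritical(item healthItem, usesRPC, emergency bool) bool {
	if item.Status != fail {
		return false
	}
	return usesRPC && !emergency
}

// criticalCase is one row of the truth table.
type criticalCase struct {
	name      string
	item      healthItem
	usesRPC   bool
	emergency bool
	want      bool
}

// criticalCases enumerates the combinations a regression test would cover.
func criticalCases() []criticalCase {
	return []criticalCase{
		{"passing check, normal mode", healthItem{pass}, true, false, false},
		{"failing check, RPC in use, normal mode", healthItem{fail}, true, false, true},
		{"failing check, RPC in use, emergency mode", healthItem{fail}, true, true, false},
		{"failing check, RPC not in use, normal mode", healthItem{fail}, false, false, false},
		{"failing check, RPC not in use, emergency mode", healthItem{fail}, false, true, false},
	}
}

func main() {
	for _, c := range criticalCases() {
		got := rpcFailureIsCritical(c.item, c.usesRPC, c.emergency)
		fmt.Printf("%-46s critical=%v (want %v)\n", c.name, got, c.want)
	}
}
```

In the real test suite the same table could drive `t.Run` subtests that toggle emergency mode before calling `isCriticalFailure`, so each combination is reported separately when it fails.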