[extension/storage/filestorage] Fix recreate from panic #41802

Open. 8 commits to merge into base: main

briandavis-viz

Description

  • Enables recovery from the panic raised when the bbolt db is corrupted: the panic is caught and the corrupted file is renamed.
  • Changes the recreate behavior so that the file is no longer renamed on every start of the collector.

Link to tracking issue

Resolves #36840
Resolves #35899

Testing

  1. Start the collector with the file storage extension configured. This should create the database file with some content in it.
  2. Stop the collector.
  3. Manually edit or "break" the bbolt db file by adding characters in random places, so that opening it causes a panic.
  4. Start the collector.
  5. Check the logs:
{"level":"info","ts":"2025-07-31T23:26:34.218-0400","msg":"Setting up own telemetry...","resource":{"service.instance.id":"d830aeff-7993-49f7-9817-a0c96af3498d","service.name":"otelcontribcol","service.version":"0.131.0-dev"}}
{"level":"info","ts":"2025-07-31T23:26:34.242-0400","msg":"Starting otelcontribcol...","resource":{"service.instance.id":"d830aeff-7993-49f7-9817-a0c96af3498d","service.name":"otelcontribcol","service.version":"0.131.0-dev"},"Version":"0.131.0-dev","NumCPU":6}
{"level":"info","ts":"2025-07-31T23:26:34.242-0400","msg":"Starting extensions...","resource":{"service.instance.id":"d830aeff-7993-49f7-9817-a0c96af3498d","service.name":"otelcontribcol","service.version":"0.131.0-dev"}}
{"level":"info","ts":"2025-07-31T23:26:34.242-0400","msg":"Extension is starting...","resource":{"service.instance.id":"d830aeff-7993-49f7-9817-a0c96af3498d","service.name":"otelcontribcol","service.version":"0.131.0-dev"},"otelcol.component.id":"file_storage/persistent_queue_storage","otelcol.component.kind":"extension"}
{"level":"info","ts":"2025-07-31T23:26:34.242-0400","msg":"Extension started.","resource":{"service.instance.id":"d830aeff-7993-49f7-9817-a0c96af3498d","service.name":"otelcontribcol","service.version":"0.131.0-dev"},"otelcol.component.id":"file_storage/persistent_queue_storage","otelcol.component.kind":"extension"}
#### This line ####
{"level":"warn","ts":"2025-07-31T23:26:34.242-0400","msg":"Database corruption detected, recreating database file","resource":{"service.instance.id":"d830aeff-7993-49f7-9817-a0c96af3498d","service.name":"otelcontribcol","service.version":"0.131.0-dev"},"otelcol.component.id":"file_storage/persistent_queue_storage","otelcol.component.kind":"extension","file":"/data/otelcol/persistent_queue_storage/exporter_otlphttp_general_logs","panic":"assertion failed: Page expected to be: 96, but self identifies as 4909050520039286100"}
{"level":"info","ts":"2025-07-31T23:26:34.243-0400","msg":"Corrupted database file renamed","resource":{"service.instance.id":"d830aeff-7993-49f7-9817-a0c96af3498d","service.name":"otelcontribcol","service.version":"0.131.0-dev"},"otelcol.component.id":"file_storage/persistent_queue_storage","otelcol.component.kind":"extension","original":"/data/otelcol/persistent_queue_storage/exporter_otlphttp_general_logs","backup":"/data/otelcol/persistent_queue_storage/exporter_otlphttp_general_logs.backup"}
#### through this line ####
{"level":"info","ts":"2025-07-31T23:26:34.244-0400","msg":"New queue metadata key not found, attempting to load legacy format.","resource":{"service.instance.id":"d830aeff-7993-49f7-9817-a0c96af3498d","service.name":"otelcontribcol","service.version":"0.131.0-dev"},"otelcol.component.id":"otlphttp/general","otelcol.component.kind":"exporter","otelcol.signal":"logs"}
{"level":"info","ts":"2025-07-31T23:26:34.244-0400","msg":"Initializing new persistent queue","resource":{"service.instance.id":"d830aeff-7993-49f7-9817-a0c96af3498d","service.name":"otelcontribcol","service.version":"0.131.0-dev"},"otelcol.component.id":"otlphttp/general","otelcol.component.kind":"exporter","otelcol.signal":"logs"}
{"level":"info","ts":"2025-07-31T23:26:34.244-0400","msg":"Successfully migrated to consolidated metadata format","resource":{"service.instance.id":"d830aeff-7993-49f7-9817-a0c96af3498d","service.name":"otelcontribcol","service.version":"0.131.0-dev"},"otelcol.component.id":"otlphttp/general","otelcol.component.kind":"exporter","otelcol.signal":"logs"}
{"level":"info","ts":"2025-07-31T23:26:34.255-0400","msg":"Everything is ready. Begin running and processing data.","resource":{"service.instance.id":"d830aeff-7993-49f7-9817-a0c96af3498d","service.name":"otelcontribcol","service.version":"0.131.0-dev"}}
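
Step 3 of the testing procedure can also be scripted. The following is a minimal sketch, not part of this PR: `corruptFile` and `demoCorrupt` are hypothetical helper names, and flipping bytes in place (rather than inserting characters) is an assumption made to keep the example deterministic. Whether a given set of flipped bytes actually makes bbolt panic depends on which pages are hit.

```go
package main

import (
	"math/rand"
	"os"
	"path/filepath"
)

// corruptFile flips every bit of n distinct, pseudo-randomly chosen bytes
// in the file at path. The seed makes the damage reproducible.
func corruptFile(path string, seed int64, n int) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	if n > len(data) {
		n = len(data)
	}
	rng := rand.New(rand.NewSource(seed))
	// Perm yields distinct offsets, so no byte is flipped back by accident.
	for _, off := range rng.Perm(len(data))[:n] {
		data[off] ^= 0xFF
	}
	return os.WriteFile(path, data, 0o600)
}

// demoCorrupt damages a 64-byte zero-filled temp file and returns the
// number of bytes that changed.
func demoCorrupt() (int, error) {
	dir, err := os.MkdirTemp("", "corrupt-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "queue.db")
	if err := os.WriteFile(path, make([]byte, 64), 0o600); err != nil {
		return 0, err
	}
	if err := corruptFile(path, 1, 8); err != nil {
		return 0, err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	changed := 0
	for _, b := range data {
		if b != 0 {
			changed++
		}
	}
	return changed, nil
}
```

Run it only against a copy of the storage file while the collector is stopped.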

Documentation

Updated the existing documentation to describe the behavioral changes to the `recreate` option.

Changes the `Recreate` option's behavior in file storage to act as a panic recovery mechanism.

Previously, `Recreate` would unconditionally rename the database file upon startup if enabled. Now, when `Recreate` is true, the existing database file is renamed with a `.backup` suffix and a new one is created only if an attempt to open the database results in a panic, typically due to corruption.

This allows automatic recovery from corrupted database files and ensures that healthy databases are no longer recreated unnecessarily, which prevents data loss in most cases.
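
The behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `openFunc`, `safeOpen`, `openWithRecreate`, `fakeOpen`, and `demoRecovery` are hypothetical names, and `fakeOpen` only simulates a database that panics on corrupted contents (it does not call bbolt).

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// errOpenPanicked marks errors produced by recovering from a panic.
var errOpenPanicked = errors.New("database open panicked")

// openFunc is a hypothetical stand-in for whatever opens the database file.
type openFunc func(path string) error

// safeOpen runs open and converts a panic into an error instead of crashing.
func safeOpen(open openFunc, path string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%w: %v", errOpenPanicked, r)
		}
	}()
	return open(path)
}

// openWithRecreate renames the file to "<path>.backup" and retries, but only
// when recreate is true and the first attempt panicked. It returns the backup
// path, or "" when no rename happened.
func openWithRecreate(open openFunc, path string, recreate bool) (string, error) {
	err := safeOpen(open, path)
	if err == nil {
		return "", nil // healthy database: leave the file alone
	}
	if !recreate || !errors.Is(err, errOpenPanicked) {
		return "", err
	}
	backup := path + ".backup"
	if renameErr := os.Rename(path, backup); renameErr != nil {
		return "", fmt.Errorf("rename corrupted file: %w", renameErr)
	}
	return backup, safeOpen(open, path) // second attempt creates a fresh file
}

// fakeOpen simulates a database: it creates the file when it is missing and
// panics on any non-empty file (our stand-in for corruption).
func fakeOpen(path string) error {
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return os.WriteFile(path, nil, 0o600)
	}
	if err != nil {
		return err
	}
	if len(data) > 0 {
		panic("assertion failed: unexpected page contents")
	}
	return nil
}

// demoRecovery writes a "corrupted" file into a temp dir, opens it, and
// returns the base name of the backup file ("" if none was created).
func demoRecovery(recreate bool) (string, error) {
	dir, err := os.MkdirTemp("", "recreate-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "queue.db")
	if err := os.WriteFile(path, []byte("garbage"), 0o600); err != nil {
		return "", err
	}
	backup, err := openWithRecreate(fakeOpen, path, recreate)
	if backup == "" {
		return "", err
	}
	return filepath.Base(backup), err
}
```

With `recreate` set to true, `demoRecovery` ends with a `queue.db.backup` file next to a fresh `queue.db`; with `recreate` set to false, the recovered panic is returned as an error and the file is left untouched.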

briandavis-viz requested a review from a team as a code owner on August 5, 2025 20:24

linux-foundation-easycla bot commented Aug 5, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@briandavis-viz
Author

/label waiting-for-code-owners

@briandavis-viz
Author

@swiatekm @VihasMakwana

zap.Any("panic", r))

// Rename the corrupted file
backupName := absoluteName + ".backup"
Contributor

What if the backup file already exists? Can we use a snapshot of current time instead?
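
One way to read this suggestion is to append a timestamp to the backup name so that repeated recoveries do not reuse the same `.backup` path. The sketch below is only an illustration: `backupName` and `exampleBackupName` are hypothetical helpers, and the suffix layout is an assumption rather than anything decided in this PR.

```go
package main

import (
	"fmt"
	"time"
)

// backupName appends ".backup." plus a UTC timestamp with nanosecond
// resolution to path, so that two backups rarely share a name.
func backupName(path string, now time.Time) string {
	return fmt.Sprintf("%s.backup.%s", path, now.UTC().Format("20060102T150405.000000000Z"))
}

// exampleBackupName shows the result for a fixed point in time.
func exampleBackupName() string {
	ts := time.Date(2025, time.July, 31, 23, 26, 34, 0, time.UTC)
	return backupName("queue.db", ts) // "queue.db.backup.20250731T232634.000000000Z"
}
```

In real use the caller would pass `time.Now()`. A timestamped name means old backups accumulate, so some cleanup policy would be needed.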

Author

@atoulme To be completely transparent, the goal of this PR is to further enhance the existing logic introduced in 0.131.0 and to prevent data loss on restarts.

An already existing backup file is not something that was taken into consideration as part of this change; I am assuming that the code owners were okay with the previous implementation.

Successfully merging this pull request may close these issues.

  • otelcol-contrib file_storage does not recover gracefully upon potentially corrupted database
  • recover from filestorage panic on corrupted DB
3 participants