
[exporter/prometheusremotewrite] WAL is broken #41785

@MichaelThamm


Component(s)

exporter/prometheusremotewrite

What happened?

Description

According to this related PrometheusRemoteWrite (PRW) issue, the WAL has been "broken for years". This shows up when the PRW exporter writes to Prometheus's /api/v1/write API and receives a 400 Bad Request response: the collector logs "error processing WAL entries" with "Permanent error: out of order sample" and eventually "out of bounds". When the WAL config is removed from the PRW exporter:

exporters:
  prometheusremotewrite/0:
    endpoint: http://prom-0.prom-endpoints.how-to.svc.cluster.local:9090/api/v1/write
    tls:
      insecure_skip_verify: false
-   wal:
-     directory: /otelcol

Then the issue is resolved.
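For reference, the exporter block that works is the same one with only the wal section dropped:

exporters:
  prometheusremotewrite/0:
    endpoint: http://prom-0.prom-endpoints.how-to.svc.cluster.local:9090/api/v1/write
    tls:
      insecure_skip_verify: false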

Steps to Reproduce

We use Juju to deploy our infra:

TL;DR: deploy a metrics source (e.g. Alertmanager), deploy a metrics sink (e.g. Prometheus), and wire them into the otel-collector's receivers and exporters (a Juju-free sketch is included below).
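For a rough reproduction outside Juju, a minimal collector config along these lines should show the same behaviour. This is a sketch only: the scrape target and Prometheus endpoint are placeholders, and it assumes the target Prometheus accepts remote writes (e.g. started with --web.enable-remote-write-receiver).

receivers:
  prometheus:
    config:
      scrape_configs:
        # Any metrics source works; here the collector scrapes its own telemetry.
        - job_name: self-monitoring
          scrape_interval: 60s
          static_configs:
            - targets: ['0.0.0.0:8888']
exporters:
  prometheusremotewrite:
    # Placeholder endpoint; replace with the real Prometheus remote write URL.
    endpoint: http://<prometheus-host>:9090/api/v1/write
    wal:
      # Enabling the WAL is what triggers the "out of order sample" errors.
      directory: /otelcol
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]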

Expected Result

Metrics arrive in Prometheus with a working WAL.

Actual Result

Metrics arrive in Prometheus, but the otel-collector logs contain errors hinting at a broken WAL.

Collector version

0.130.1

Environment information

Environment

OS: Ubuntu 24.04.2 LTS

OpenTelemetry Collector configuration

connectors: {}
exporters:
  debug:
    verbosity: basic
  prometheusremotewrite/0:
    endpoint: http://prom-0.prom-endpoints.how-to.svc.cluster.local:9090/api/v1/write
    tls:
      insecure_skip_verify: false
    wal:
      directory: /otelcol
extensions:
  file_storage:
    directory: /otelcol
  health_check:
    endpoint: 0.0.0.0:13133
processors:
  attributes:
    actions:
      - action: upsert
        key: loki.attribute.labels
        value: container, job, filename, juju_application, juju_charm, juju_model, juju_model_uuid, juju_unit, snap_name, path
  resource:
    attributes:
      - action: insert
        key: loki.format
        value: raw
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: juju_how-to_7b30903e_otelcol_self-monitoring
          scrape_interval: 60s
          static_configs:
            - labels:
                instance: how-to_7b30903e_otelcol_otelcol/0
                juju_application: otelcol
                juju_charm: opentelemetry-collector-k8s
                juju_model: how-to
                juju_model_uuid: 7b30903e-8941-4a40-864c-0cbbf277c57f
                juju_unit: otelcol/0
              targets:
                - 0.0.0.0:8888
        - job_name: juju_how-to_7b30903e_am_prometheus_scrape
          metrics_path: /metrics
          relabel_configs:
            - regex: (.*)
              separator: _
              source_labels:
                - juju_model
                - juju_model_uuid
                - juju_application
              target_label: instance
          scheme: http
          static_configs:
            - labels:
                juju_application: am
                juju_charm: alertmanager-k8s
                juju_model: how-to
                juju_model_uuid: 7b30903e-8941-4a40-864c-0cbbf277c57f
              targets:
                - am-0.am-endpoints.how-to.svc.cluster.local:9093
          tls_config:
            insecure_skip_verify: false
service:
  extensions:
    - health_check
    - file_storage
  pipelines:
    logs:
      exporters:
        - debug
      processors:
        - resource
        - attributes
      receivers:
        - otlp
    metrics:
      exporters:
        - prometheusremotewrite/0
      receivers:
        - otlp
        - prometheus
    traces:
      exporters:
        - debug
      receivers:
        - otlp
  telemetry:
    logs:
      level: DEBUG
    metrics:
      level: normal

Log output

2025-08-05T14:11:50.231Z [otelcol] 2025-08-05T14:11:50.231Z     error   prw.wal [email protected]/wal.go:245    error processing WAL entries    {"resource": {"service.instance.id": "00ba5573-5bb4-4294-b1ca-1f84b32dbf29", "service.name": "otelcol", "service.version": "0.130.1"}, "otelcol.component.id": "prometheusremotewrite/0", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics", "error": "Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\n; Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\n; Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\n", "errorCauses": [{"error": "Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\n"}, {"error": "Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\n"}, {"error": "Permanent error: Permanent error: Permanent error: remote write returned HTTP status 400 Bad Request; err = %!w(<nil>): out of order sample\n"}]}

Additional context

No response

