GMime parsing bad-header behavior differs from some common implementations #52

@russianfool

GMime handles malformed headers sub-optimally. There's some logic in parser_step_headers to proceed past some bad lines, but as implemented, certain header combinations fool GMime into ending header parsing early.

Specifically, I want to focus on Section 7.2 of RFC 7103, because common clients (e.g. Outlook?, Evolution/Camel, Thunderbird) seem to implement approach 1 from that section, whereas GMime implements approach 2:

  7. Header Anomalies
    This section covers common syntactic and semantic anomalies found in
    a message header and presents suggested methods of mitigation.
    ...
    7.2. Non-Header Lines

    Some messages contain a line of text in the header that is not a
    valid message header field of any kind. For example:

  From: [email protected] {1}
  To: [email protected] {2}
  Subject: This is your reminder {3}
  about the football game tonight {4}
  Date: Wed, 20 Oct 2010 20:53:35 -0400 {5}

The ways of dealing with such a line, per the RFC, are:

  1. Some agents choose to separate the header of the message from the
    body only at the first empty line (that is, a CRLF immediately
    followed by another CRLF).

  2. Some agents assume this anomaly should be interpreted to mean the
    body starts at line {4}, as the end of the header is assumed by
    encountering something that is not a valid header field or folded
    portion thereof.

  3. Some agents assume this should be interpreted as an intended
    header folding as described above and thus simply append a single
    space character (ASCII 0x20) and the content of line {4} to that
    of line {3}.

  4. Some agents reject this outright as line {4} is neither a valid
    header field nor a folded continuation of a header field prior to
    an empty line.

GMime mainly does 2 (with the message/content headers heuristic), and doesn't implement 3 (a non-header line is never treated as a mis-fold and "recovered" into the previous header).
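To make the difference between approaches 1 and 2 concrete, here is a minimal sketch that classifies header lines and finds where the header ends under each approach. It is not GMime code: `line_kind`, `classify_line`, `header_end_blank_only` and `header_end_strict` are hypothetical names, and it assumes the input has already been split into NUL-terminated lines.

```c
#include <stddef.h>

/* Hypothetical line classes; these names are NOT taken from GMime. */
typedef enum {
    LINE_BLANK,        /* empty line: the header/body separator */
    LINE_FIELD,        /* "Name: value" */
    LINE_CONTINUATION, /* starts with SP or HTAB: folded portion */
    LINE_INVALID       /* anything else, e.g. line {4} in the RFC example */
} line_kind;

static line_kind classify_line(const char *line)
{
    const char *p = line;

    if (*p == '\0' || *p == '\r' || *p == '\n')
        return LINE_BLANK;
    if (*p == ' ' || *p == '\t')
        return LINE_CONTINUATION;

    /* RFC 5322 field names are printable ASCII (33..126) except ':' */
    while (*p > 32 && *p < 127 && *p != ':')
        p++;

    return (p > line && *p == ':') ? LINE_FIELD : LINE_INVALID;
}

/* Approach 1: the header ends only at the first blank line.
 * Returns the index of the first line that is not part of the header. */
static size_t header_end_blank_only(const char *const *lines, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (classify_line(lines[i]) == LINE_BLANK)
            break;
    return i;
}

/* Approach 2: the header also ends at the first invalid line. */
static size_t header_end_strict(const char *const *lines, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        line_kind k = classify_line(lines[i]);
        if (k == LINE_BLANK || k == LINE_INVALID)
            break;
    }
    return i;
}
```

On the Subject/Date lines from the RFC example, the strict variant stops right after the Subject line, so Date is treated as body text, while the blank-only variant keeps going until the empty line.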

--- Below here might not be interesting in light of Outlook not actually implementing 3 either.

I found the rspamd thread on this - I'd like to improve the behavior though, if possible, instead of abandoning GMime altogether. What kind of error recovery is in line with your goals for GMime?

Currently I've modified the logic in parser_step_headers to be a bit more tolerant, but...

  1. Pasting lines together means manipulating the appended line, which ends up in raw_value, so we'd be changing the email's representation.
  2. It's very hard to reconcile the "guess we've hit a broken mailer" jump-to-content code already in GMime with this kind of RFC 7103 approach 3 behavior. From my limited testing, Outlook doesn't care to jump to content, even for multipart parts.
  3. If we do move the re-folding outside parser_step_headers (... e.g. roll a priv_headers_to_object or whatever with enough options to support all the construct callers, and store broken lines as key-less lines in the array), then that information isn't available when deciding whether to end header parsing (and you might have to dump a bunch of fake headers and re-parse them as body).

Moving it outside seems like a better approach (just because you have more flexibility in seeking across the different lines and don't have to worry about the 4k limit), but nothing about this screams elegant; what are your thoughts? I guess header_cb should be called on the re-folded version as well?

The two most interesting cases from actual samples:

  1. Stopping header parsing too early because of badly-folded headers. Outlook 2016, Thunderbird 60, etc. chew through these bad boys just fine. This sometimes causes GMime to miss important headers (boundaries, content types, etc.).

From: [email protected]
To: [email protected]
Message-ID: [email protected]
Subject: Confirmed email
X-IBM-SpamModules-Scores: A=1; B=2;
C=3; D=4; E=5;
X-IBM-SpamModules-Versions: A=1; B=2; C=3;
D=4; E=5; F=6;
G=7; H=8;
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable

...
  2. Malformed header lines that could benefit from re-folding, e.g. as described here:

Content-Type: application/x-zip-compressed; x-unix-mode=0600;
name="7DDA4_foo_9E5D72.zip"

Although it does seem like the re-folding needs to be line-parsing-aware, since I have cases like:

Content-Type: application/x-zip-compressed; x-unix-mode=0600; name="
7DDA4_foo_9E5D72.zip"

Maybe that's overkill though; I haven't tested whether common clients end up recovering the attachment name properly here.
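A rough sketch of what "line-parsing-aware" re-folding could look like for the two Content-Type cases above: join with a single space as in RFC 7103 approach 3, unless the header value so far ends inside an unterminated quoted string, in which case join without one. This is not GMime code; `ends_inside_quotes` and `refold` are hypothetical helpers, and writing into a separate buffer (rather than touching raw_value) is just an assumption to keep the original bytes intact.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Returns true if `value` ends inside an unterminated quoted-string.
 * Hypothetical helper, not part of GMime. */
static bool ends_inside_quotes(const char *value)
{
    bool in_quotes = false;
    const char *p;

    for (p = value; *p != '\0'; p++) {
        if (*p == '\\' && in_quotes && p[1] != '\0')
            p++; /* skip the escaped character of a quoted-pair */
        else if (*p == '"')
            in_quotes = !in_quotes;
    }
    return in_quotes;
}

/* Joins a stray non-header `line` onto the preceding header `value`,
 * writing the result to `out`. A single space is inserted unless the
 * break happened inside a quoted string.
 * Returns false if `out` is too small to hold the joined value. */
static bool refold(const char *value, const char *line,
                   char *out, size_t outlen)
{
    const char *sep = ends_inside_quotes(value) ? "" : " ";
    int n = snprintf(out, outlen, "%s%s%s", value, sep, line);

    return n >= 0 && (size_t) n < outlen;
}
```

With this, both samples above produce the same joined value, so a parameter parser run afterwards would see name="7DDA4_foo_9E5D72.zip" in either case. Whether that matches what Outlook or Thunderbird do is untested.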
