GMime parsing bad-header behavior differs from some common implementations

GMime behaves sub-optimally when faced with malformed headers. There's some logic in `parser_step_headers` to proceed past certain bad lines, but as implemented, there are header combinations that fool GMime into stopping header parsing early.

Specifically, I want to focus on [RFC 7103, section 7.2](https://tools.ietf.org/html/rfc7103#section-7.2), because common clients (e.g. Outlook?, Evolution/Camel, Thunderbird) seem to implement approach 1 from that section, whereas GMime implements approach 2:
> 7.  Header Anomalies
>     This section covers common syntactic and semantic anomalies found in
>     a message header and presents suggested methods of mitigation.
> ...
> 7.2.  Non-Header Lines
>
>   Some messages contain a line of text in the header that is not a
>   valid message header field of any kind.  For example:
>
>       From: user@example.com {1}
>       To: userpal@example.net {2}
>       Subject: This is your reminder {3}
>       about the football game tonight {4}
>       Date: Wed, 20 Oct 2010 20:53:35 -0400 {5}

The RFC then describes these ways of handling such a line:
>    1.  Some agents choose to separate the header of the message from the
>      body only at the first empty line (that is, a CRLF immediately
>      followed by another CRLF).
>
>   2.  Some agents assume this anomaly should be interpreted to mean the
>       body starts at line {4}, as the end of the header is assumed by
>       encountering something that is not a valid header field or folded
>       portion thereof.
>
>   3.  Some agents assume this should be interpreted as an intended
>       header folding as described above and thus simply append a single
>       space character (ASCII 0x20) and the content of line {4} to that
>       of line {3}.
>
>   4.  Some agents reject this outright as line {4} is neither a valid
>       header field nor a folded continuation of a header field prior to
>       an empty line.

GMime mainly implements approach 2 (with the message/content headers heuristic) and doesn't implement approach 3: a non-header line is never treated as a mis-fold and "recovered".
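
To make the difference between approaches 1 and 2 concrete, here is a minimal Python sketch (not GMime code). It finds where the header ends for the RFC's example message. The names `FIELD_RE`, `is_header_line`, and `header_end` are hypothetical, and the field-name pattern is a simplified reading of RFC 5322; GMime's message/content headers heuristic is not modelled.

```python
import re

# Simplified field line: a name of printable ASCII without a colon, then a colon.
FIELD_RE = re.compile(r"^[\x21-\x39\x3b-\x7e]+[ \t]*:")

def is_header_line(line):
    """True for a field line or a folded continuation (leading whitespace)."""
    return bool(FIELD_RE.match(line)) or line[:1] in (" ", "\t")

def header_end(lines, approach):
    """Return the index of the first line that is not part of the header."""
    for i, line in enumerate(lines):
        if line == "":
            return i  # both approaches stop at the first empty line
        if approach == 2 and not is_header_line(line):
            return i  # approach 2: the body starts at the first non-header line
    return len(lines)

message = [
    "From: user@example.com",
    "To: userpal@example.net",
    "Subject: This is your reminder",
    "about the football game tonight",
    "Date: Wed, 20 Oct 2010 20:53:35 -0400",
    "",
    "Body text",
]

print(header_end(message, 1))  # 5: Date is still part of the header
print(header_end(message, 2))  # 3: Date ends up in the body
```

Under approach 2 every field after the bad line is lost to the body, which is the failure mode this issue is about.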

--- Below here might not be interesting in light of Outlook not actually implementing 3 either.

I found the rspamd thread on this - I'd like to improve the behavior though, if possible, instead of abandoning GMime altogether. What kind of error recovery is in line with your goals for GMime?

Currently I've modified the logic in `parser_step_headers` to be a bit more tolerant, but...
1. Pasting lines together means manipulating the appended line, which ends up in `raw_value`, so we're changing the email's raw representation.
2. It's very hard to reconcile the "guess we've hit a broken mailer" jump-to-content code already in GMime with this kind of RFC 7103 approach 3 behavior. From my limited testing, Outlook doesn't bother jumping to content, even for multipart parts.
3. If we do move the recovery outside of `parser_step_headers` (e.g. roll a `priv_headers_to_object` or whatever with enough options to support all the construct callers, and store broken lines as key-less lines in the array), then that information isn't available when deciding whether to end header parsing (and you might have to dump a bunch of fake headers and re-parse them as body).

Moving it outside seems like a better approach (just because you have more flexibility in seeking across the different lines and don't have to worry about the 4k limit), but nothing about this screams elegant; what are your thoughts? I guess `header_cb` should be called on the re-folded version as well?
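
As a rough illustration of the "move it outside" idea, here is a two-pass Python sketch (not GMime code, and not a proposal for the actual C structure). Pass 1 stores broken lines as key-less entries; pass 2 re-folds them into the preceding field while leaving the original lines untouched. `Entry`, `collect`, `refold`, and `value` are hypothetical names, and stopping only at the first empty line is an assumption that sidesteps the jump-to-content question from point 2.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Simplified field line: group 1 is the name, group 2 the rest after the colon.
FIELD_RE = re.compile(r"^([\x21-\x39\x3b-\x7e]+)[ \t]*:(.*)$")

@dataclass
class Entry:
    name: Optional[str]  # None marks a key-less (broken) line
    raw: List[str]       # original lines, kept as received

def collect(lines):
    """Pass 1: record every line up to the first empty line."""
    entries = []
    for line in lines:
        if line == "":
            break
        m = FIELD_RE.match(line)
        if line[0] in " \t" and entries:
            entries[-1].raw.append(line)         # proper folded continuation
        elif m:
            entries.append(Entry(m.group(1), [line]))
        else:
            entries.append(Entry(None, [line]))  # key-less line
    return entries

def refold(entries):
    """Pass 2: attach key-less lines to the preceding field (approach 3)."""
    out = []
    for e in entries:
        if e.name is None and out:
            out[-1] = Entry(out[-1].name, out[-1].raw + e.raw)
        else:
            out.append(Entry(e.name, list(e.raw)))
    return out

def value(entry):
    """Unfolded value, computed from the raw lines without altering them."""
    parts = list(entry.raw)
    if entry.name is not None:
        parts[0] = FIELD_RE.match(parts[0]).group(2)
    return " ".join(p.strip() for p in parts)

lines = [
    "From: user@example.com",
    "To: userpal@example.net",
    "Subject: This is your reminder",
    "about the football game tonight",
    "Date: Wed, 20 Oct 2010 20:53:35 -0400",
    "",
    "Body text",
]
entries = collect(lines)
folded = refold(entries)
print([e.name for e in entries])  # ['From', 'To', 'Subject', None, 'Date']
print(value(folded[2]))           # This is your reminder about the football game tonight
```

Because `refold` builds new entries, the pass-1 list still holds the lines exactly as received, so a caller could fall back to treating the key-less lines as body if it decides the header really ended there.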

#### The two most interesting cases from actual samples:
1. Stopping header parsing too early because of badly-folded headers. Outlook 2016, Thunderbird 60, etc. chew through these bad boys just fine. This sometimes causes GMime to miss important headers (boundaries, content types, etc.).

> From: a.a@gmail.com
> To: b.b@gmail.com
> Message-ID: <12345@google.com>
> Subject: Confirmed email
> X-IBM-SpamModules-Scores: A=1; B=2;
> C=3; D=4; E=5;
> X-IBM-SpamModules-Versions:  A=1; B=2; C=3;
> D=4; E=5; F=6;
> G=7; H=8;
> Content-Type: text/html; charset=utf-8
> Content-Transfer-Encoding: quoted-printable
>
> <html><head> ...

2. Malformed header lines that could benefit from re-folding, e.g. as described [here](https://github.com/jhillyerd/enmime/issues/2):
> Content-Type: application/x-zip-compressed; x-unix-mode=0600;
> name="7DDA4_foo_9E5D72.zip"

Although it does seem like the re-folding needs to be line-parsing-aware, since I have cases like:

> Content-Type: application/x-zip-compressed; x-unix-mode=0600; name="
> 7DDA4_foo_9E5D72.zip"

Maybe that's overkill though; I haven't tested whether common clients end up recovering the attachment name properly here.
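
One possible reading of "line-parsing-aware" re-folding, sketched in Python (not GMime code): join with a single space as in RFC 7103 approach 3, unless the first line ends inside an unterminated quoted string, in which case join with nothing. `in_open_quote` and `join_misfolded` are hypothetical names, and the quote tracking is a simplification that ignores comments and encoded words.

```python
def in_open_quote(text):
    """True if text ends inside an unterminated double-quoted string."""
    open_quote = False
    escaped = False
    for ch in text:
        if escaped:
            escaped = False            # character after a backslash is literal
        elif ch == "\\" and open_quote:
            escaped = True             # backslash escapes only inside quotes
        elif ch == '"':
            open_quote = not open_quote
    return open_quote

def join_misfolded(first, continuation):
    """Join a field line with a mis-folded continuation line."""
    sep = "" if in_open_quote(first) else " "
    return first + sep + continuation

# The two Content-Type samples from case 2.
a = "Content-Type: application/x-zip-compressed; x-unix-mode=0600;"
b = 'name="7DDA4_foo_9E5D72.zip"'
c = 'Content-Type: application/x-zip-compressed; x-unix-mode=0600; name="'
d = '7DDA4_foo_9E5D72.zip"'

print(join_misfolded(a, b))
print(join_misfolded(a, b) == join_misfolded(c, d))  # True
```

With this rule both samples recover the same `name` parameter; a plain "append a space" join would instead put a leading space inside the quoted filename for the second sample.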
