VCL failure undefined behavior #3825

@dridi

Description

Consider the following test case:

varnishtest "backend fetch despite http_max_hdr overflow"

server s1 {
	rxreq
	expect req.http.hdr20 == set
	expect req.http.hdr32 == <undef>
	txresp
} -start

varnish v1 -cliok "param.set http_max_hdr 32"
varnish v1 -vcl+backend {
	sub vcl_backend_fetch {
		set bereq.http.hdr01 = "set";
		set bereq.http.hdr02 = "set";
		set bereq.http.hdr03 = "set";
		set bereq.http.hdr04 = "set";
		set bereq.http.hdr05 = "set";
		set bereq.http.hdr06 = "set";
		set bereq.http.hdr07 = "set";
		set bereq.http.hdr08 = "set";
		set bereq.http.hdr09 = "set";
		set bereq.http.hdr10 = "set";
		set bereq.http.hdr11 = "set";
		set bereq.http.hdr12 = "set";
		set bereq.http.hdr13 = "set";
		set bereq.http.hdr14 = "set";
		set bereq.http.hdr15 = "set";
		set bereq.http.hdr16 = "set";
		set bereq.http.hdr17 = "set";
		set bereq.http.hdr18 = "set";
		set bereq.http.hdr19 = "set";
		set bereq.http.hdr20 = "set";
		set bereq.http.hdr21 = "set";
		set bereq.http.hdr22 = "set";
		set bereq.http.hdr23 = "set";
		set bereq.http.hdr24 = "set";
		set bereq.http.hdr25 = "set";
		set bereq.http.hdr26 = "set";
		set bereq.http.hdr27 = "set";
		set bereq.http.hdr28 = "set";
		set bereq.http.hdr29 = "set";
		set bereq.http.hdr30 = "set";
		set bereq.http.hdr31 = "set";
		set bereq.http.hdr32 = "set";
	}
} -start

client c1 {
	txreq
	rxresp
	expect resp.status == 200
} -run

It passes, but arguably it shouldn't. Even worse, the problem is caught and logged, but not acted upon, and the log points us in the wrong direction:

**** v1    vsl|       1002 BereqHeader     b hdr23: set
**** v1    vsl|       1002 LostHeader      b hdr24: set
**** v1    vsl|       1002 Error           b out of workspace (bo)
**** v1    vsl|       1002 LostHeader      b hdr25: set
**** v1    vsl|       1002 Error           b out of workspace (bo)
**** v1    vsl|       1002 LostHeader      b hdr26: set
**** v1    vsl|       1002 Error           b out of workspace (bo)
**** v1    vsl|       1002 LostHeader      b hdr27: set
**** v1    vsl|       1002 Error           b out of workspace (bo)
**** v1    vsl|       1002 LostHeader      b hdr28: set
**** v1    vsl|       1002 Error           b out of workspace (bo)
**** v1    vsl|       1002 LostHeader      b hdr29: set
**** v1    vsl|       1002 Error           b out of workspace (bo)
**** v1    vsl|       1002 LostHeader      b hdr30: set
**** v1    vsl|       1002 Error           b out of workspace (bo)
**** v1    vsl|       1002 LostHeader      b hdr31: set
**** v1    vsl|       1002 Error           b out of workspace (bo)
**** v1    vsl|       1002 LostHeader      b hdr32: set
**** v1    vsl|       1002 Error           b out of workspace (bo)
**** v1    vsl|       1002 VCL_return      b fetch

We started looking into this at v-s after a case where things were failing (I'll cover this aspect below), but bumping the workspace didn't solve anything. Of course that is to be expected when the limit being exceeded is not the workspace but http_max_hdr. It would only be fair to have an error message like "exceeded the http_max_hdr parameter" instead.
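
To make that concrete, the distinct message could be emitted right where the header slot limit is detected. A minimal sketch, assuming the current shape of http_SetHeader() in cache_http.c (names and signature from memory, not a reviewed patch):

void
http_SetHeader(struct http *to, const char *header)
{
	CHECK_OBJ_NOTNULL(to, HTTP_MAGIC);
	if (to->nhd >= to->shd) {
		/* We ran out of header slots, not workspace: say so. */
		VSLb(to->vsl, SLT_LostHeader, "%s", header);
		VSLb(to->vsl, SLT_Error, "exceeded http_max_hdr");
		WS_MarkOverflow(to->ws);	/* keep the existing failure signal */
		return;
	}
	http_SetH(to, to->nhd++, header);
}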

Something equivalent on the client side would fail, because before resp delivery there is a workspace overflow safety net that seems to be missing on the backend side.

Leaving the http_max_hdr overflow aside, even this test case passes today:

varnishtest "backend fetch despite workspace overflow"

server s1 {
        rxreq
        txresp
} -start

varnish v1 -vcl+backend {
        import vtc;

        sub vcl_backend_fetch {
                vtc.workspace_overflow(backend);
        }
} -start

client c1 {
        txreq
        rxresp
        expect resp.status == 200
} -run

Considering that HTTP is an application protocol, and that headers act as variable placeholders in VCL, losing any header could result in a misbehaving application. Waiting until near the end of a task to fail it is already too late. For VCL execution we have the ability to VRT_fail() the task, and every single VCL statement is followed by a VRT failure check; we should do the same for header failures.
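
For reference, this is the mechanism I mean. A schematic sketch, not the literal core or generated code:

/*
 * Code holding a VRT_CTX calls VRT_fail(), which logs a VCL_Error record
 * and flags the task as failed; the C code generated by VCC checks that
 * flag after every VCL statement and unwinds, so the task ends up in the
 * built-in failure path (a 503 on the backend side). The message below is
 * only an example.
 */
VRT_fail(ctx, "exceeded http_max_hdr (%u)", cache_param->http_max_hdr);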

The problem with HTTP headers in the cache process is that the header-handling code logs errors directly to the VSL buffer and (ab)uses the workspace to signal a failure. We can't (and probably shouldn't) trigger a VRT failure there, because we don't have a stable VRT_CTX for the duration of the task. Instead, I suggest a subtle change:

--- bin/varnishd/cache/cache.h
+++ bin/varnishd/cache/cache.h
@@ -155,7 +155,7 @@ struct http {
        uint16_t                nhd;            /* Next free hd */
 
        enum VSL_tag_e          logtag;         /* Must be SLT_*Method */
-       struct vsl_log          *vsl;
+       struct worker           *wrk;
 
        struct ws               *ws;
        uint16_t                status;

With this, the VSL buffer would be one indirection away for actual logging, and for failures we could imagine a WRK_Fail() function that logs an error and fails the current task. That introduces new questions, for example a task may disembark its worker and re-embark on a new one.
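
To be clear, WRK_Fail() does not exist today; this is only a sketch of what I imagine, with a hypothetical task_fail flag on the worker:

#include <stdarg.h>

/* Hypothetical: fail the ongoing task from code that has no stable VRT_CTX,
 * such as the header-handling functions once they hold a worker pointer. */
void
WRK_Fail(struct worker *wrk, const char *fmt, ...)
{
	va_list ap;

	CHECK_OBJ_NOTNULL(wrk, WORKER_MAGIC);
	va_start(ap, fmt);
	VSLbv(wrk->vsl, SLT_Error, fmt, ap);	/* log on the task's VSL buffer */
	va_end(ap);
	wrk->task_fail = 1;	/* hypothetical flag, checked by the FSMs */
}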

This would make HTTP header failures converge towards a task failure, so they would be caught in VCL just like any other failure, after every single VCL statement. Error conditions should generally converge towards failing the task, instead of requiring compounding safety nets for every type of error condition at some arbitrary point in the task.

For errors happening in core code, I suggest we add a task failure check in both the req and fetch FSM loops.
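
Schematically, on the fetch side that check could live in the step loop of vbf_fetch_thread() in cache_fetch.c, and similarly in the client request FSM loop. A sketch reusing the hypothetical task_fail flag from above, with vbf_step() standing in for the existing step dispatch:

	while (stp != F_STP_DONE) {
		CHECK_OBJ_NOTNULL(bo, BUSYOBJ_MAGIC);
		/* Converge on the existing failure step between steps. */
		if (wrk->task_fail && stp != F_STP_FAIL)
			stp = F_STP_FAIL;
		else
			stp = vbf_step(wrk, bo, stp);	/* placeholder for the step switch */
	}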
