Combination of edge cases triggers ~200ms delay in TTFB

I recently spun up a private instance of WPT using the ami-2a1cb142 EC2 image. In general I’ve been really happy with it.

Now I’m tracking an unusual edge case. Here are the factors:

  • Chrome on Windows ( the EC2 instance mentioned above )
  • Nginx w/ SPDY enabled, and gzip compression ( level 5 or 6 is what I’ve tested with; the relevant config is sketched after this list )
  • A file of 1,602 bytes or smaller ( the exact size might vary depending on how well it compresses ) that is the first resource requested in a SPDY connection
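
For reference, here is roughly what the server side of this looks like in Nginx. It’s a trimmed-down sketch rather than the exact config from my site; the server name, certificate paths, and document root are placeholders:

  # Minimal sketch of the test server block ( names and paths are placeholders )
  server {
      listen 443 ssl spdy;
      server_name spdy-test.example.org;

      ssl_certificate     /etc/nginx/ssl/spdy-test.crt;
      ssl_certificate_key /etc/nginx/ssl/spdy-test.key;

      root /var/www/spdy-test;

      gzip on;
      gzip_comp_level 6;    # level 5 or 6 in my tests; text/html is compressed by default
  }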

To simplify testing this I’ve reproduced the conditions on my private site with these two files:

These contain mostly random characters; the only difference between the two is a single character ( test-fast.html is 1,603 bytes, test-slow.html is 1,602 bytes ).

The test-fast.html request is very quick and works the way I’d expect. The test-slow.html request has a ~200ms delay in the time to first byte ( TTFB ).

If these resources are just part of a page and the initial file requested on the SPDY connection is larger, then this ~200ms delay never happens. It also doesn’t happen for smaller files that are requested later in the connection.

I’ve provided HAR files for test results from my private instance:

The only difference in the two tests is the resource requested, and the only difference between those is one byte.

The delay doesn’t happen when testing with IE 10 - https://josephscott.org/dev/spdy/josephscott.org.141023_JS_DJ-IE10.har

So far I’ve only been able to recreate this delay using the WPT EC2 instance. But I’ve got one other data point that muddies the picture: I found a small file on cloudflare.com to test against - https://www.cloudflare.com/static/javascripts/live/fp-c71f8d1e6cd0e89831d25176cca548d0/page/index.js?v=CB6-2013-08-01-1 - and it doesn’t have the ~200ms delay.

This leads me to believe that the wptdriver isn’t the only variable, but at the same time I’m only able to produce this delay on my EC2 instance.

In addition to sites at work and my personal site, there are other sites where this happens too:

You can also see this when running tests from webpagetest.org:

It isn’t an exact comparison to my private instance, but you can still see the TTFB delay in there.

I haven’t been able to reproduce this on my Mac, an older Windows 7 desktop, or a co-worker’s Windows 8 system. I can’t even reproduce this on separate non-WPT EC2 images:

  • Windows_Server-2012-R2_RTM-English-64Bit-Base-2014.09.10 (ami-904be6f8)
  • Windows_Server-2008-R2_SP1-English-64Bit-Base-2014.09.10 (ami-dc65c8b4)

Based on the data that I have so far it looks like this could be a bug on the Nginx side ( since CloudFlare doesn’t seem to have this issue ), but not being able to reproduce it on other Windows clients still leaves me with a small chance that this is WPT related ( or one of the components, like dummynet ).

Has anyone else seen this issue? If not, can others reproduce it with their WPT installs? And the big piece, can anyone reproduce this on a Windows system that has never had WPT installed?

Thank you.

If you’re on twitter I recommend pinging @igrigorik and pointing him to the write-up to see if he has any thoughts.

If it happened with all browsers my guess would have been something related to the SSL window size. 200ms is exactly the time for Windows delayed-acks though so something more bizarre may be going on.

Thanks, I’ll do that.

The delayed ack issue was one of the ideas that came up in discussions at work. Since this doesn’t happen with Chrome over plain HTTPS with SPDY turned off I think we can rule that out.

The only other thing that comes to mind is Nginx not setting TCP_NODELAY on the SPDY connections for some reason, so that Nagle kicks in. Still, I’d expect to see it in the cert negotiation in that case too.

BTW, the times in Chrome are what Chrome reports, while IE and Firefox timings are measured directly (though since WPT doesn’t decode SPDY yet you can’t see the raw wire-requests in Firefox). It could also be something in Chrome’s measurements, but I’d expect to see that elsewhere too.

A tcpdump might help though there’s going to be a bit of guessing to figure out which packets are the data packets.

All of the above look the same. Running with “native” (no shaping):

It looks like the good old 200ms delayed ack problem on Windows.

Here is the same slow test, over plain HTTPS ( SPDY turned off ):

If you compare that with the “fast” test with SPDY:

The fast SPDY test had a TTFB of 38ms. The slow plain HTTPS test had a TTFB of 31ms. For our purposes I’ll call this basically the same. Now compare those to the slow test with SPDY turned on that you linked to:

The TTFB jumps to 237ms.

If this is the standard Windows ~200ms delay, then it only happens with SPDY turned on.

Hmm, looks like Nginx’s tcp_nodelay only kicks in when a socket goes into keep-alive (so subsequent responses don’t get buffered): Module ngx_http_core_module

Admittedly it’s been a long time since I was solving Nagle issues, and it was in a completely different space, but I thought all web servers basically disabled it across the board.

Do you have any tcp_nopush or sendfile settings configured on the server? If it is using sendfile and tcp_nopush is enabled I could see it kicking in.

I’ve had these turned on in Nginx for all tests against my personal site ( shown in context just after this list ):

  • sendfile on;
  • tcp_nopush on;
  • tcp_nodelay on;
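
In the config those sit together in the http block, roughly like this ( everything else omitted ):

  http {
      # ( other directives omitted )
      sendfile    on;
      tcp_nopush  on;     # only has an effect when sendfile is in use
      tcp_nodelay on;     # also the Nginx default
  }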

None of those options were changed between plain HTTPS and SPDY tests.

Hmm, doesn’t seem like it’s a TCP_NODELAY thing. In the SPDY tcpdump that Ilya shared, the server packets have the PSH bit set on small packets which is usually a good sign that it’s doing nodelay. :-/

I’d recommend turning off the tcp_nopush setting. It defaults to off and push is what avoids Nagle interacting with the 200ms delayed acks. tcp_nodelay defaults to on which is where you want it as well.
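
In config terms that would be something like ( a sketch of the relevant directives, not a full config ):

  # in the http block
  sendfile    on;
  tcp_nopush  off;    # the default; avoids TCP_NOPUSH / TCP_CORK buffering
  tcp_nodelay on;     # the default, and where you want it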

That would still be an odd interaction with SPDY.

Turned it off, ran another test - http://www.webpagetest.org/result/141023_4C_12EZ/ - still getting the ~200ms delay.

I’ve turned it back on, to preserve the original testing conditions.

I’ve been running more tests and I think I’ve got another useful data point.

A couple of weeks ago I was experimenting with ‘gzip_static on;’ in Nginx. I noticed that when that was turned on, and there was a pre-compressed file available, the HTTP responses from Nginx looked more like those for a traditional static file ( which makes sense, since at that point it really is one ).

Specifically, look at the ‘Response’ section of Request #1 for the test that @igrigorik ran earlier in this thread - http://www.webpagetest.org/result/141023_KN_6fcd65a19900d95376fc023b977e6a18/1/details/

What is missing from that response is the “Content-Length:” header. I didn’t think much of this originally. I figured that when Nginx was doing compression on the fly it simply didn’t do the extra work to include that value in the response.

Then I remembered that with “gzip_static on;” it would include the Content-Length in the response. So I combined that with this experiment, turning gzip_static on and creating a pre-compressed test-slow.html.gz for Nginx to use ( gzip -6 test-slow.html ). A quick call with curl confirmed that with all of this in place “Content-Length: 1019” was included in the response.
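
The Nginx side of that experiment boils down to the following sketch ( the document root is a placeholder, and the rest of the server block is unchanged from the earlier sketch ):

  # test-slow.html.gz was created next to test-slow.html with: gzip -6 test-slow.html
  server {
      # listen / ssl / server_name as before
      root /var/www/spdy-test;    # placeholder path

      gzip_static on;    # serve the pre-compressed .gz, which keeps the Content-Length header
  }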

Now for the moment of truth, I ran a new WPT test: http://www.webpagetest.org/result/141023_Z6_13WH/

The TTFB is back down to 35ms again!

So perhaps the issue really is with Nginx and the intersection of SPDY and compression.
[hr]
I should have looked closer: when compressing on the fly, Nginx uses “Transfer-Encoding: chunked”.
[hr]
This does a good job of explaining the situation with Nginx, HTTP compression, and content length - GZIP disables Content-Length header, and thus breaks caching. · Issue #240 · metacpan/metacpan-api · GitHub
[hr]
I think I’ve figured this out. Doing more tests to confirm.

More edge cases are turning up as I do more tests. I’ve found options to deal with the situation where a single file is requested over SPDY. Unfortunately those options don’t appear to work when multiple requests come across a SPDY connection, even though it is still just the first one that experiences the delay.

Still digging.

I’ve been running ( LOTS ) more tests against our private instance to narrow this down more.

After trying several different options I settled on adding “gzip_min_length 1603;” to the Nginx config. I was able to get a better TTFB for both of my initial tests:

I’ve got another test that requests two resources. That ran into the same ~200ms TTFB delay as well, even with “gzip_min_length 1603;” set. The second request was not delayed, even though it was also for a very small file.

In the two-request case the delay was triggered when the first request was for a file smaller than 1,009 bytes:

I tried shrinking small.html by a few bytes and increasing small.css by the same amount. Still triggered the delay: http://www.webpagetest.org/result/141027_NN_d0dcaeba5c073a939030594b181a2fb8/

The first request still appears to be the key here. And the gzip_min_length 1603 setting isn’t sufficient on its own, because files below 1,009 bytes also triggered the delay.

At this point it looks like we could avoid the delay by doing the following:

  • Add “gzip_min_length 1603;” to the Nginx config ( sketched after this list )
  • Make sure all files are 1,009 bytes or more
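
For the first item the change is a single directive alongside the existing gzip settings ( a sketch; the compression level matches what I’ve been testing with ):

  # in the existing http / server block
  gzip on;
  gzip_comp_level 6;
  gzip_min_length 1603;    # don’t compress responses shorter than 1,603 bytes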

At best these are hacky workarounds. And I’m not even 100% sure whether 1,603 and 1,009 are consistent points, or just artifacts of my particular test setup.

Good news!

A patch was provided by the folks at Nginx, which I’ve tested and confirmed removes the ~200ms delay that was happening under these specific conditions. This morning that patch was committed to the Nginx repo - Changeset 5923:2c10db908b8c for src – nginx

Very cool. I expect that will actually happen a lot in the wild under all sorts of different conditions, so it’s probably quite a win for nginx perf. In your tests it was pretty clear because it was showing up in the TTFB, but from the looks of the patch it might commonly just show up as an extra 200ms delivering the last bit of data, which would be less obvious but still pretty nasty.

The latest release of Nginx, 1.7.8 ( released 02 Dec 2014 ), includes this fix - http://nginx.org/en/CHANGES - “Feature: now the “tcp_nodelay” directive works with SPDY connections.”

I’ve written up a summary of the background and fix at https://josephscott.org/archives/2014/12/nginx-1-7-8-fixes-200ms-delay-with-spdy/