WPT claiming very slow speeds, seemingly inaccurate?

I’ve been working on improving the speed of a retail site for a while now, and have made significant, objective gains in a variety of ways. These improvements have been backed up by most automated test results and by real-world experience. Even the speeds reported by WPT have improved a little; however, they are still much worse than anything I can recreate or corroborate externally.

http://www.webpagetest.org/result/170117_DV_73ceb8dd508b7fc0a9bd0026d5e821c2/

It shows an average total load time of 8-9 seconds, and the start render time is nearly 3 seconds! That’d be terrible, if it were actually the case. But even on my phone the site loads faster than that. Other testers, such as Pingdom, also report much faster load times.

1.53s? That is actually so fast it’s suspect in the opposite direction. Though I’m not sure what their “Load Time” actually measures; it may not be fully loaded, perhaps just visually complete and interactable? On the other hand, the waterfall does show all the resources loading within the 1.5s time frame, so maybe it is complete. Either way, it harshly conflicts with WPT, since 1.53s is barely half of even the start-render time reported here. And first-byte time, which I’d think would be pretty consistent, is just 150ms there compared to 450ms here.

In another thread here from 2014 (found via Google), I read that WPT can simulate slower connections for more realistic results (which I already knew). The person asking a similar question in that thread had been on a cable-equivalent line, which explained the difference; the answer was to use “Native” as the connection setting on a test. However, my test above was supposedly on 20 Mbps-down fiber. Furthermore, I tried a few tests on Native and the results got worse. Here is one example:

http://www.webpagetest.org/result/170117_GQ_4428c1263bee46272a405cb0ce717379/

What’s going on here? Almost 4 seconds for start-render, 14 seconds for full load? That’s absurd and contradicted by all other sources.
Is there some bug on our side that confuses WPT? Something with the CDN that gives it low priority? I’m about out of ideas.

And of course, the obligatory: “What improvements would you recommend?” Though that is less important. Note that the images can’t be compressed further, despite what WPT thinks; we already have slight visible JPEG artifacting as-is, and as a visuals-focused retailer, image quality is critical. I’m not sure about that F on caching either. Some of it is tracking tags that simply can’t be cached, but it also seems to flag most page content for being cached for only ~24 hours. That doesn’t seem right, right?

We’re on the Demandware platfor-er, sorry, ‘Salesforce Commerce Cloud’ platform, if that is of any relevance.

Any and all advice appreciated, thanks.

Looking at the test, it looks like it was CPU-constrained, not bandwidth-constrained. If you run the tests on the “Dulles Thinkpad” location you can largely eliminate the CPU constraints, but those machines are pushing the really high end of what most users will have.

Here is what it looks like running on the thinkpad with no traffic-shaping: https://www.webpagetest.org/result/170123_Q6_6580382dfbfbf6ca472e2c6243a164dc/

I can’t speak to how Pingdom runs its tests, but more often than not the biggest difference comes down to connection profiles. In this case the UA claims to be Chrome 39, so it could be using an old version of Chrome running on Linux on a server somewhere.

When you test manually, are you wiping out your cache first?

With close to 50 separate JS and CSS files all loading in the head, I can guarantee that just about every first-time visitor is going to have a really slow experience. Step #1 would be to merge a bunch of those together so there are just a couple of each (site-wide and page-specific is a common way to split them).
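Concatenation itself is trivial; as a rough illustration (file names here are made up, and a real build would use your platform’s bundling or a tool like webpack), it amounts to something like:

```ts
// bundle.ts: naive concatenation sketch. Order matters, so list files in
// dependency order; the file names are placeholders.
import { readFileSync, writeFileSync } from "fs";

const cssFiles = ["reset.css", "layout.css", "product.css"]; // site-wide + page-specific
const jsFiles = ["utils.js", "carousel.js", "checkout.js"];

// The trailing ";\n" guards against scripts that don't end with a semicolon.
writeFileSync("site.css", cssFiles.map(f => readFileSync(f, "utf8")).join("\n"));
writeFileSync("site.js", jsFiles.map(f => readFileSync(f, "utf8") + ";\n").join(""));
```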

Here are the tests run on real phones with high-speed wifi: https://www.webpagetest.org/result/170123_K7_c78790769488789385880485c8c9ae86/

and with a more realistic 3G connection (some requests were timing out): https://www.webpagetest.org/result/170123_QA_1e580749b6ed37b3ad947762ef319ae1/

At the end of the day, if you want to see the performance your actual users are getting, you need to install something like Google Analytics or SOASTA mPulse, which report timings from real visitors; you can then check those against your testing (though they only report for users who stay long enough for the beacons to fire).
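Under the hood those RUM tools boil down to a beacon along these lines, reading the Navigation Timing API once the page has loaded (the collection endpoint here is made up; GA and mPulse use their own):

```ts
// Minimal RUM beacon sketch using the Navigation Timing API.
window.addEventListener("load", () => {
  // Defer one tick so loadEventEnd has been filled in.
  setTimeout(() => {
    const t = performance.timing;
    const metrics = {
      ttfb: t.responseStart - t.navigationStart,
      domContentLoaded: t.domContentLoadedEventEnd - t.navigationStart,
      load: t.loadEventEnd - t.navigationStart,
    };
    navigator.sendBeacon("/rum-collect", JSON.stringify(metrics)); // hypothetical endpoint
  }, 0);
});
```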

I’m certainly biased towards WPT, but from everything I’ve seen its numbers look more representative, and you should be yelling at Demandware about the application performance, because it looks like all the slow bits come from how their platform publishes the pages.

Thanks! +rep’d you (like a week ago, but I only now have a follow-up).

I’ll use the Thinkpad location in the future. Those numbers are better (i.e. closer to my experience), though I still don’t see anything that could explain the Pingdom numbers. That doesn’t matter much, though; I use them mostly for the better waterfall UI and the lists of combinable CSS and JS. WPT has a lot more depth and data in general (and is open source / isn’t trying to sell anything, which is always a big plus for trustworthiness).

I do clear the cache, though I recently learned the Network tab in Chrome’s DevTools has a convenient ‘Disable cache’ checkbox for this kind of testing. It also gives exact times for DOMContentLoaded, Load, and Finish; Load and Finish are apparently equivalent to Document Complete and Fully Loaded in WPT.

We actually have GA. I didn’t know it had speed stats, lol. Looking at them, though, they are all over the place: broken down by browser+version, the fully-loaded average times range from, for example, ~3.5s on Chrome 55 to ~16s on Chrome 56… Wtf?

I’ve combined a massive number of CSS and JS files since your post. Of the many speed adjustments I’d been working on, I had that as a lower priority, but you led me to think otherwise. Local CSS files dropped from 15 to 3, and JS from 22 to 5. However, the remaining bulk of the JS files come from external tracking tags that I can’t do anything about, sadly.

I’ve now run more tests with the same parameters:

That is the same setup as your first test link, but significantly slower on the average run. The requests and page size are significantly improved: around 60 fewer total requests. This doesn’t make sense at face value.

The one thing I notice that might explain this, though, is that in both sets of tests there seem to be two buckets of speeds. The older test had times of 3.0, 3.5, and 6.5 seconds, so buckets of ~3.2 and ~6.5. The new test had times of 5.0, 4.8, 2.8, 4.7, and 2.7, which is buckets of ~2.7 and ~4.8. Could it be that these two buckets are what is truly comparable? Suppose some script or include occasionally hangs for quite a while, creating a significantly different load time depending on whether it fires. Then the old test got lucky with only 1 of 3 runs hanging while this new one got 3 of 5 (I say lucky because I ran another 5-run test and got 3/5 again), so the averages are comparing a fast ‘before’ bucket with a slow ‘after’ bucket. Compared bucket-to-bucket, 2.7 beats 3.2 and 4.8 beats 6.5: reasonable improvement, eh?
Does that seem like a plausible scenario? If so, do you know of any way I might isolate the delay? I’m not seeing anything obvious in the waterfall; run 1 has a 1s hang on the Criteo JS, but runs 2 and 4 (the other slow ones) loaded it normally…
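For concreteness, here’s the bucket arithmetic I’m describing as a throwaway sketch (the 4-second cutoff is just eyeballed from where the clusters split):

```ts
// Split the run times (seconds, from the tests above) into fast/slow buckets
// and compare like with like.
const oldRuns = [3.0, 3.5, 6.5];
const newRuns = [5.0, 4.8, 2.8, 4.7, 2.7];

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
const split = (xs: number[], cutoff: number) => ({
  fast: xs.filter(x => x < cutoff),
  slow: xs.filter(x => x >= cutoff),
});

const before = split(oldRuns, 4);
const after = split(newRuns, 4);
console.log(mean(before.fast), mean(before.slow)); // ~3.25 and 6.5
console.log(mean(after.fast), mean(after.slow));   // ~2.75 and ~4.83
```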

I’ve run another test, this time a set of 9. It continues to show the ‘2 buckets’ of speeds, though I was luckier this time, apparently, as I got only 3/9 slower times.

https://www.webpagetest.org/result/170208_AZ_HJEY/

Six tests came in very near 2.5 seconds, which I’d be pretty happy with considering it was more like 8s a couple of months ago. But three tests have times nearer 5 seconds, which is still pretty garbage. I still don’t see any common cause separating the buckets, though.

I notice the DOM Content Loaded purple bar is significantly wider on the slow runs, though not enough to account for the whole difference: ~0.77 seconds in the bad cases compared to ~0.26 in the good. I don’t know what, if anything, that could mean. I don’t even understand how it has a width; I thought DOMContentLoaded was a loading milestone that happened at a specific point…?
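My best guess from poking at the Navigation Timing API: the event exposes both a start and an end, so maybe the bar’s width is the time spent running DOMContentLoaded handlers rather than a single instant. If so, this in the console should show the same span:

```ts
// Time spent in DOMContentLoaded handlers, per the Navigation Timing API.
// Whether WPT's purple band is drawing exactly this is my assumption.
const t = performance.timing;
console.log("DCL handler time:", t.domContentLoadedEventEnd - t.domContentLoadedEventStart, "ms");
```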