wptdriver 298 broke private instances?

About 4 hours ago, new agent instances started failing to run Chrome tests. From what I’ve been told, the agents auto-update their code? Is it possible that there was a recent update that caused the symptoms we’re seeing?

The .har file looks like this:

[font=Courier]
{
  "log": {
    "version": "1.1",
    "creator": {
      "name": "WebPagetest",
      "version": "2.19"
    },
    "pages": [
      {
        "startedDateTime": "1970-01-01T00:00:00.000+00:00",
        "title": "Run 1, First View for ",
        "id": "page_1_0",
        "pageTimings": {
          "onLoad": 0,
          "onContentLoad": -1,
          "_startRender": 0
        },
        "_run": 1,
        "_cached": 0,
        "_step": 1,
        "_loadTime": 0,
        "_TTFB": 0,
        "_render": 0,
        "_fullyLoaded": 0,
        "_docTime": 0,
        "_domTime": 0,
        "_aft": 0,
        "_titleTime": 0,
        "_loadEventStart": 0,
        "_loadEventEnd": 0,
        "_domContentLoadedEventStart": 0,
        "_domContentLoadedEventEnd": 0,
        "_domLoading": 0,
        "_domInteractive": 0,
        "_lastVisualChange": 0,
        "_server_rtt": 0,
        "_firstPaint": 0
      },
      {
        "startedDateTime": "1970-01-01T00:00:00.000+00:00",
        "title": "Run 1, Repeat View for ",
        "id": "page_1_1",
        "pageTimings": {
          "onLoad": 0,
          "onContentLoad": -1,
          "_startRender": 0
        },
        "_run": 1,
        "_cached": 1,
        "_step": 1,
        "_loadTime": 0,
        "_TTFB": 0,
        "_render": 0,
        "_fullyLoaded": 0,
        "_docTime": 0,
        "_domTime": 0,
        "_aft": 0,
        "_titleTime": 0,
        "_loadEventStart": 0,
        "_loadEventEnd": 0,
        "_domContentLoadedEventStart": 0,
        "_domContentLoadedEventEnd": 0,
        "_domLoading": 0,
        "_domInteractive": 0,
        "_lastVisualChange": 0,
        "_server_rtt": 0,
        "_firstPaint": 0
      }
    ],
    "browser": {
      "name": null,
      "version": null
    },
    "entries": []
  }
}
[/font]
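
For anyone who wants to spot this failure automatically, here is a minimal sketch that flags a HAR like the one above: no entries and a zero load time on every page. The function name `looks_like_failed_run` is made up for this example and is not part of WebPagetest; the field names (`entries`, `pages`, `_loadTime`) are the ones visible in the HAR above.

```python
import json


def looks_like_failed_run(har: dict) -> bool:
    """Return True when a HAR has no requests and every page reports a zero load time."""
    log = har.get("log", {})
    pages = log.get("pages", [])
    entries = log.get("entries", [])
    all_zero = all(page.get("_loadTime", 0) == 0 for page in pages)
    return not entries and all_zero


# Trimmed-down version of the HAR posted above (illustrative, not the full file).
sample = json.loads("""
{"log": {"pages": [{"id": "page_1_0", "_loadTime": 0},
                   {"id": "page_1_1", "_loadTime": 0}],
         "browser": {"name": null, "version": null},
         "entries": []}}
""")

print(looks_like_failed_run(sample))
```

A healthy run would have at least one item in `entries` and a non-zero `_loadTime`, so the check returns False for it.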

We switched our default over to IE and that appears to be working fine.

Another thing we noticed was that older agents did not display the problem.

Any chance that the changes in WPO-Foundation/webpagetest@40fffb5 ("Fixed some cases where the forces www.webpagetest.org -> agent.webpag…") might be the cause? Perhaps an uninitialized variable? (Forgive me, I’m not at all familiar with the codebase.)

Thanks!

Some questions to help narrow down what might be causing it:

1 - Is it just Chrome tests that are failing?
2 - Is the server running the EC2 AMI or is it your own install?
3 - If your own install, is it configured to auto-update the server code or the agents?
4 - Are the test agents the EC2 AMIs or your own?
5 - What agent version is displayed on your /getTesters.php page?

There was an agent change ~5 hours ago, but it only affected how the agents communicate with the server, specifically rewriting www.webpagetest.org to agent.webpagetest.org, which shouldn’t have any impact on anything but the public instance: WPO-Foundation/webpagetest@40fffb5 ("Fixed some cases where the forces www.webpagetest.org -> agent.webpag…").

There was also a 297 update earlier today but that was confined to the Firefox extension (it is now signed and required for Firefox 48).

The public server was migrated to HTTPS + HTTP/2 in the last few hours as well but that shouldn’t impact a private instance. The only interaction is that they pull agent updates from the public server but it wouldn’t impact tests.

There is also a Chrome update rolling out right now (refresh of 52) that might be causing issues but nothing that I have seen.

Argh, I think I got to the root cause and it should be fixed now, though you may need to reboot your EC2 agents (or they should automatically recover in the next hour or so).

I have some IP-based rate limiting on the software installs to keep bad agents stuck in download loops from burning through bandwidth like crazy. The move of the main domain to Fastly routed all traffic through a few edge IPs, so all of a sudden all of the installer checks were getting blocked.

I just disabled the rate limiting, and I’ll be very careful if and when I put it back in place to keep this from happening again.
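
To illustrate the failure mode, here is a minimal sketch of a per-IP sliding-window limiter. It is not the actual rate limiter on the public server; the class name, the limit of 3 per hour, and the IP addresses are all assumptions chosen for the example. It shows why the same number of installer checks passes when each agent arrives from its own address but is mostly blocked once a CDN makes them all appear to come from one edge address.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter: at most `limit` requests per `window` seconds per key."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def count_allowed(source_ips, limit=3, window=3600.0):
    """Feed one request per second through a fresh limiter and count how many pass."""
    limiter = PerIpRateLimiter(limit, window)
    return sum(limiter.allow(ip, now=float(i)) for i, ip in enumerate(source_ips))


# Ten agents, each checking once from its own (made-up) address.
direct = [f"10.0.0.{n}" for n in range(10)]
# The same ten checks arriving through a single (made-up) CDN edge address.
via_cdn = ["198.51.100.7"] * 10

print(count_allowed(direct))   # 10
print(count_allowed(via_cdn))  # 3
```

If a limiter like this is ever put behind a CDN, it would need to key on the original client address (typically passed along in a request header by the CDN) rather than on the connecting IP.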

Thanks, Patrick.

I have failing Chrome and Firefox tests on my private instance with client AMI ID ami-4a84a220 in region us-east-1. It started when I stopped all my client instances in us-east-1; since they came back up, only IE and Safari tests work.

I’ve attached a screenshot of the test result, which states: browser failed to load

Has anyone seen this, or does anyone know how to fix it? I have not made any changes to the config files.

Thanks

What version of wptdriver are the agents running (should be visible in /getTesters.php)? The current release is 2.19.0.307
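
When comparing the version shown on /getTesters.php against the current release, compare the dotted parts as numbers rather than as strings. A minimal sketch follows; the helper names are made up for this example, and the default of 2.19.0.307 is simply the release mentioned above.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '2.19.0.307' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def is_outdated(agent_version: str, current: str = "2.19.0.307") -> bool:
    """Return True when the agent version is older than the current release."""
    return parse_version(agent_version) < parse_version(current)


print(is_outdated("2.19.0.298"))  # True
print(is_outdated("2.19.0.307"))  # False
```

Comparing tuples of integers avoids the string-comparison trap where "2.19.0.98" would sort after "2.19.0.307".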

The earlier AMI should have updated the browsers as well but in case it didn’t, Firefox 48 requires a relatively recent release. Not sure what might be going on with Chrome but that depends on how old the agent is.

Hi Patrick,

The version for us-east-1 and all other regions, according to /getTesters.php, is 2.19.0.307.

I’m looking for more of a log file. Ideally I’d like to remote into the client machine, though we have some firewall rules that need to be changed to allow that.

Are there any other versions I can check? I’m on the latest patch in the /update directory too.

Thanks!

Only other thing that comes to mind is that both Firefox and Chrome need to get installed by wptdriver at instance start-up and they pull the installers down from my CDN. Looking at the CDN stats it looks like the bandwidth and request counts are normal but it’s possible there was a network issue or something that caused it to fail.

Can you reboot the instance to see if that solves it? That should cause it to try the install again (though it really should try every hour regardless and should try repeatedly at startup but that may be a bug).
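
As a rough illustration of "try repeatedly at startup", here is a generic retry loop with exponential backoff. This is not how wptdriver is implemented; `install_with_retries`, its parameters, and the fake installer are hypothetical and only show the retry pattern.

```python
import time
from typing import Callable


def install_with_retries(
    install: Callable[[], bool],
    attempts: int = 5,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `install` until it returns True, doubling the wait between attempts."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        if install():
            return True
        if attempt < attempts:
            sleep(delay)
            delay *= 2
    return False


# Fake installer that fails twice (e.g. a blocked download) and then succeeds.
outcomes = iter([False, False, True])
waits = []
ok = install_with_retries(lambda: next(outcomes), sleep=waits.append)

print(ok)     # True
print(waits)  # [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; in real use the default `time.sleep` applies.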

I’ve been fighting similar issues over the last few days. Some agents never start up correctly. The bad instances show about 1 GB of additional free disk space, which suggests the software failed to install correctly. Is there a known problem with browser installations?

Patrick,

I can confirm via RDP that the Chrome and Firefox installs are missing on the broken instances. This causes roughly half of the instances to fail. Any idea how to fix the downloads on startup?

Thanks!