Unrealistic performance at beginning of agent start (EC2)

Hey Patrick -

I’m using the EC2 autoscale feature to keep costs down (since I only run tests a few hours a day), and I’ve noticed an unwelcome trend - it seems that the first few tests that run on any agent after it boots are extremely inaccurate. Load times can be as much as 4x slower, and if you look at the waterfall it appears the browser simply halts loading the page for several seconds.

Here’s an example of the same url, run first at the beginning of the queue, and again at the end of the queue (only a few minutes apart, all settings and content the same). I have confirmed this is consistent behavior, happening every day for weeks since I set it up, and on different urls too. If you want more examples I have plenty.

I’m guessing that the agent is doing something in the background after it starts up the browser. Could it be that the EC2 images are so old that they’re churning trying to update themselves to the latest code and browser?

Any suggestions other than keeping the agents always running, or throwing away the first X minutes of testing?

Thanks so much,
Shane

What size instances?

The OS should not auto-update and the browsers are installed when the instance starts up (and testing doesn’t start until they have finished installing).

That said, it’s clear that SOMETHING is eating the CPU time. Any chance you can enable tcpdump capture which might help identify if maybe Chrome is downloading something it shouldn’t be.

Thanks Patrick -

I’m using c4.large instances (I’m the nerd that ran a bunch of tests on instance sizes, so I’m sure we have enough power here :wink: )

‘Enabling tcpdump capture’ as in logging into the machine and monitoring wireshark while it runs? or is there an easier setting I’m not aware of…

There’s a per-test setting… Advanced tab of advanced settings, in the middle: “Capture network packet trace (tcpdump)”

The tcpdump will show up as a download to the left of the waterfall.

Awesome, glad I checked. Updated my API calls to enable that and will post back in a few after we have some data. Thanks, and have a great weekend!

Hey Patrick -

I have a few test runs that have the network packet trace data. Here’s one for reference.

I opened it up in Wireshark, and saw a huge gap in time between 1.9 and 9 seconds, where’s it seems to be doing nothing. Here’s a screenshot, am I reading this correctly?

If this verifies that nothing is being downloaded in the background, what else could the agent be doing?

Thanks,
Shane

I wonder if there is some screwy IPv6 stuff going on: https://technet.microsoft.com/en-us/library/cc722030(v=ws.10).aspx

If you’re feeling adventerous you can connect to a launched instance (administrator pw is 2dialit) go into the interface settings and make sure IPv6 is disabled, create an image from the instance and see if that does anything (and/or set the reg keys from the above article).

I finally have a dummynet replacement that works on Server 2012 R2 or later and should work in EC2 so right after Thanksgiving I should be able to build new AMIs with Server 2016 and IPv6 disabled and see if that helps.

It’s really bizarre that I don’t see the issue and neither does SpeedCurve and we both launch and destroy hundreds of instances weekly :frowning:

Is it always at the same point in the waterfall and always for the same page? I wonder if there is something that is only IPv6 reachable that is causing it to try to bring up a tunnel.

Tried to login to an agent while it was running and wasn’t able to. Was using Remote Desktop Connection to the public IP address shown in my running instance list (of my EC2 account)…verified that it was actually up through the getTesters.php screen, yet still couldn’t connect. Was launched as part of a ‘default’ security group which had all ports open (side note - I realize this is a huge security risk, how do I specify the security group of the autoscale instances?). What am I missing? I know I’ve been able to connect years ago (but those were manually started).

It is not always at the same point within a page, and it is not always the same page - but does seem to happen only on the first 2-3 tests that are run after an instance starts up. Here’s a test of Google.com, which was 3rd in line in the queue, showing a page load of 5s, when we know that isn’t the case. The waterfalls of all affects tests seem to show periods of simple inactivity, not waiting in a particular phase.

I am equally puzzled why no one else has noticed this. I’m using an EC2 image for the wpt server, so there shouldn’t be anything out of the ordinary. Are you and SpeedCurve using autoscale to launch your new weekly instances?

The default security group should generally be no access from anywhere (all ports blocked) AFAIK.

Both SpeedCurve and my testing use separate scripts to do the scaling “manually” (before starting a batch of tests and killing them when done) but at least my code looks just like the auto-scaling code.

Alright - was able to recreate the issue while logged into the agent and I’m pretty sure I know what’s going on, yay!

For the record, the default security group had all ports open, but was restricted to connections from those in the same security group. Once I changed ‘source’ to my IP address, I could connect. So glad to say that’s calmed my security concerns.

Here’s a timeline of what happened after I saw the agent auto-launch due to a new test, in minutes:
0:09 - was finally able to connect via remote desktop. Immediately saw an error in the WPT driver window that said “Problem loading settings, trying again…”
0:11 - Installing Chrome
0:14 - Installing Firefox
0:18 - Installing Flash
0:20 - Installing Python and running python script
0:22 - Started first WPT test

During the first test, it was obvious the box was doing something other than just running the test.
Here’s a screenshot of a blank page during the test, that lasted for almost 30 seconds. A list of processes in the background.
Next here’s a screenshot of the test complete, still CPU and network usage, although it is processing the test.
This screenshot shows a few minutes past the test completion, yet still some disk and network activity
Finally here you can see things calm, as they should be.
I decided to run a second test after this calm activity, and you can see the normal expected resource usage of a test.

To prove that the issue happened during my spying, here’s the first test, and here’s the second. Same URL, minutes apart, huge performance difference.

Do you see what I see?

I see a “SoftwareUpdate.exe” from Apple and a “Setup.exe” from Google Chrome during the first test. Actually there’s also a “GoogleUpdate.exe”. I wasn’t able to capture that they’re utilizing too much CPU, but definitely shows they’re eating up network (and something’s obviously eating up disk even while it waits for work after the first run).

Could it be that even though Chrome and Safari have just been installed they’re still downloading (or checking for) updates?

How do you interpret this?

It’s possible. Safari won’t have any updates because the browser is ancient and hasn’t been supported in years. I should be able to add logic to wptdriver to automatically kill it.

For Chrome, we install 54 but it may not be the absolute latest sub-version so it’s entirely possible that it is doing patch updates in the background. Since we automatically install the desired Chrome build automatically ourselves I should be able to disable automatic updates for Chrome as well.

I’ll have an updated wptdriver later today that hopefully addresses both.

I just released 334 which (hopefully) turns off Chrome updates for EC2 instances and kills Apple’s SoftwareUpdate.exe. I need to do some testing to make sure the registry keys work as advertised.

Hey Patrick -

Was waiting to collect some solid data, and I can tell you that it doesn’t appear to have fixed the issue. I’ll login to the agents again shortly to see which update processes are still running…

Thanks (hope you had a nice thxgiving!)

Don’t know how much it will help but I just rolled out 336. The agent has always had logic to wait for the CPU to go idle at the start of a test (to wait for the browser to finish initializing/etc). The logic could be skewed by multi-core systems though since it looked at the overall CPU utilization so I tweaked it to adjust for the number of cores.

I also added some logic at the agent startup time to wait up to 10 minutes for the machine to go idle before starting. With EC2 this means the old agent on the iimage will install the browsers, update to the new agent and then the new agent will start and wait for the machine to go idle.

It still won’t help if the update goes idle for a while, runs for longer than 30 seconds after starting the browser or starts in the middle of a test.

Separately, Chrome 55 should be coming out any day now in which case the initial browser install will be the latest and not have any available updates (at least until a post-release update is pushed)

Hey Patrick -

Thanks again for your work on this. I logged in today to watch it run some tests (since even after the 336 release I didn’t see any difference), and I’m seeing the agent install all the software like before, then wait for idle CPU, then it kicks me off of remote desktop (I’m assuming it’s restarting?). I’ll log back in, and I see WPT Driver running, waiting for idle CPU, then I’m logged off again. It’s done this 5 times in a row already…

I also noticed that my WPT server hasn’t completed any work in the last 36 hours even though there is stuff in the queue and agents are starting via ec2 autoscale.

Any idea what is going on?

Looks like something may have gone sideways with the 341 agent push: https://github.com/WPO-Foundation/webpagetest/issues/777

I just pushed 342 so hopefully that fixes whatever went wrong.

Thanks for the quick response - that fix does seem to be working now, at least the agents are staying up and running tests now.
[hr]
Ok, with that fixed I logged in to watch the tests being run, and I’m still seeing GoogleUpdate.exe running while a test is running, causing this false slow performance.

Here’s the test: http://wpt.machmetrics.com/result/161206_MG_c8d6ef752199508df08e42c2c3f1f9f3/
Here’s a screenshot showing update processes during test: http://i.imgur.com/zcnv3M8.png
And again, here’s the same url minutes later showing you the problem is not the site: http://wpt.machmetrics.com/result/161207_4A_7d1e42890d00f05615a47bf5484f2437/

Any other remaining things to try? If not, I may either abandon autoscale and manage my own ami’s with updates off, or pad the beginning of every queue with 10 or so tests that get thrown away…

For those curious, I’ve worked around this by padding the beginning of every queue with some ‘warm up’ tests that I ignore. Although I hate that this is wasting resources/time, I realize this is a tricky problem to solve.

Let me know if anyone else comes up with some better alternatives!

Thanks