Long waiting time on "processing result" - whats happening there?

Hello,

in the last weeks I’ve setup a new server that should replace our very outdated wpt/osm instance. This new server contains an instance of webpagetest (newest private instance, running on Debian Stretch) as well as two agents, capsuled in Windows 7 kvms. On top is a docker instance of open speed monitor that controls and receives results from the wpt.

Basically everything runs fine and while monitoring the whole system(s) through nagios I can see that the load is very well balanced, means there is no system (neither Linux host nor kvms) that gets an overload or out of ressources.

However, there is one thing I can’t explain to myself: I noticed that (even though the new kvms perform significantly better in terms of reaction time than the very old wpt system where the agents ran on the same windows system where the private instance runs) it takes about 30 minutes longer for an agent to complete all the tests (about 40) than on the old system. When looking at the agents I saw that this is the result of a comparably long phase of the wpt-driver saying “Processing Result” after a browser is finished. But in this time, the agent idles or at least it seems so - no noticably cpu or memory consumption, no heavy network… It stays there for about 40 seconds to a few minutes before starting the next job.

So my question is: What exactly is happening at this stage? Communication with the private instance? On which site does the work happen here? I weren’t able do find anything in the docs or any option to eneble more logging.

I hope somebody can help me out here.

Greetings

Should be this logic here: https://github.com/WPO-Foundation/webpagetest/blob/master/agent/wptdriver/webpagetest.cc#L1228

Most of that would likely be in processing any Chrome trace files which should show a lot of activity (may depend on I/O performance of the VM).

There should be a test_timing.log that gets uploaded with each test result which would have better granular details.

One possible cause is if python is installed so it can run the trace parsing but ujson isn’t so it is using a slow json parser it could take longer than needed. If the slow bit is in the trace parsing try “pip install ujson” and see if it helps.

Hi pmeenan,

thank you for your answer. Today I was able to invest some time with your hints on this - but unfortunately no luck:

  • I wasn’t able to locate a file named test_timing.log (also grepped for this everywehere, just to be sure). I can only find test.log in the results folder, which shows (this is already surprising to me) that the post processing doesn’t waste time. See log at the end of this post, according to the log post processing is finished immediately and this is the last line of the log.
  • Anyway, I tried your suggestion because I did not have the ujson module installed (but python was), so I did install and test it. Module works without errors, but no change in processing time - still get “Test complete - processing result” in between tests.

Any chance the more granular log has to be activated anywhere or so?

2017/07/24 12:00:00 - Test Created
2017/07/24 12:02:26 - Starting test (initiated by tester WPTAGENT02-192.168.100.115)
2017/07/24 12:03:45 - Test Run Complete. Run: 1, Cached: 0, Done: , Tester: WPTAGENT02-192.168.100.115
2017/07/24 12:04:25 - Test Run Complete. Run: 1, Cached: 1, Done: , Tester: WPTAGENT02-192.168.100.115
2017/07/24 12:04:25 - Test Run Complete. Run: 1, Cached: 1, Done: 1, Tester: WPTAGENT02-192.168.100.115
2017/07/24 12:04:25 - Test Complete
2017/07/24 12:04:25 - Test post-processing complete

That’s the view from the server side and doesn’t include the agent-side processing.

Could be the logs aren’t included because it isn’t in the version of wptdriver that you have on the agents. Can you try updating to the latest? Instructions are here: https://sites.google.com/a/webpagetest.org/docs/private-instances#TOC-Updating-Test-Agents

The easiest way is to add the server-side setting and just have it update automatically but that may take an hour to kick in (you can visit /cron/hourly.php to trigger it).

Thank you, I wasn’t aware of this option and thought agent updates are only provided through new instance releases (though I now saw this is even mentioned at the github release notes…).

Agent was at build 348 (the one from the 3.0 instance archive), now updated to latest from cdn (387).

I’ll wait for the next run and will check for the log.

Update:

Now I got the logs, esp. test_timing.log. Unfortunately, it doesn’t strike me - I guess the values are milliseconds (would make sense when viewing at ‘Run Test’ and ‘Measure Step’). That said, it looks OK to me? All other values are in a relatively low range then, nothing that hints on that >=30s delay on most tests.

Example:

Extension Start=1510
Extension Blank=95
Launch Browser=7029
Wait For Idle=5398
Measure Step=24601
Process Requests=94
Run Optimization Checks=6
Process Video=441
Save Result=473
Run Test=32934
Upload Images=1

The python processing if it were to happen (which it doesn’t look like it is because there is no trace processing time) would be included in the “Upload Images” time so the problem isn’t there.

The UI doesn’t update to another message until it has actually uploaded the result, waited for the browser to exit and moved on to requesting the next test (and some of those are after the upload so can’t be included in the timing log).

The only bits that make sense for possibly taking a long time are:

1 - workdone.php call to the server with the result, if the server is extremely slow to process the result or for some reason hangs the request

2 - The wait for the browser to gracefully exit waits for 10 seconds before force-killing the process.

I may be able to add more status messages to help track down which of those is causing the issue but it may be better to set up a debug build which has a lot more logging (through requires running dbgview to capture the messages).

Hmmh, would have been too easy I guess…

Browser exits properly after every test, at least I confirmed by watching the taskmgr - the window closes as soon as wptdriver switches to “test complete - processing result” and there remain no browser processes, they disappear immediately (only using FF actually).

I guess the least time-consuming step is switching to nginx just for a test. Should be pretty easy, after I saw your nginx.conf for including. This way I can see if this is somehow webserver-related, because I still don’t see any heavy load on the server side. Actually I’m using latest Apache on Debian Stretch.

I’ll post an update this week if the webserver-switch made a difference when I managed to test it.

Just an (very late, but I didn’t found a ‘real’ solution to this, so I didn’t remember this thread enough) final update to this: The switch to nginx did not result in huge improvements, but imho at least a significant amount. However, I finally decided to expand the agent infrastructure (e.g. setting up another agent and optimizing which agent should do which tests). Now I’m down to about 20 minutes overall for the complete test set, which is absolutely OK for me.