WPT private/docker setup - agents fail to finish tests

Hi all -

I recently updated a private WPT setup (one server and ~20 agents) to the latest docker images, and I’m seeing very unusual behavior. When I submit tests with multiple runs (both “first view only” and “first & repeat view”), the tests take an extremely long time or time out, and they only seem to report the last run. The test log below is an example:

2021/04/12 10:19:04 - Test Created
2021/04/12 10:19:45 - Extracting 723215 byte uploaded file '/tmp/phpQMAgYZ' to './results/21/04/12/FS/8c1c912e87d81e592a17db08c1f3dae0'
2021/04/12 10:19:45 - Test Run Complete. Run: 3, Cached: 0, Done: 1, Tester: wptagent001-10.x.x.x
2021/04/12 10:19:45 - 1 of 3 tests complete
2021/04/12 10:19:45 - Done Processing. Run: 3, Cached: 0, Done: 1, Tester: wptagent001-10.x.x.x

Notice that the agent has completed Run 3, but there is no output for Runs 1 and 2. At this stage the test is in “being tested” status and will stay there until it times out …
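For anyone who wants to check this pattern across many tests, here is a minimal sketch that reads a test log like the one above and prints the run numbers that never got a "Test Run Complete" entry. The function name `missing_runs` and the saved-log filename in the usage note are made up for illustration; the parsing only assumes the log line format shown in this post.

```bash
#!/bin/bash
# Hypothetical helper (not part of WPT): print the run numbers that have
# no "Test Run Complete" entry in a test log read from stdin.
# Usage: missing_runs TOTAL_RUNS < saved_test_log
missing_runs() {
  local total=$1 seen run out=""
  # Collect the run numbers that the log reports as complete.
  seen=$(grep 'Test Run Complete' | sed -n 's/.*Run: \([0-9][0-9]*\),.*/\1/p')
  for ((run = 1; run <= total; run++)); do
    # A run is missing when its number is not among the completed ones.
    if ! grep -qx "$run" <<< "$seen"; then
      out+="${out:+ }$run"
    fi
  done
  echo "$out"
}
```

Saving the log above to a file (say `saved_test_log`, an assumed name) and running `missing_runs 3 < saved_test_log` should list runs 1 and 2 as missing.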

The end result is this:

I have tried different things like:

  • running the agents “bare” on CentOS 7
  • using older docker images (tried 20.01 and 20.05)
  • toggling various parameters like --shaper none, --xvfb, --dockerized

… but all to no avail.

Here are some of the things I found in the agent logs:

From a Chrome test:

chrome: no process found
[87:87:0412/100005.574484:ERROR:browser_dm_token_storage_linux.cc(94)] Error: /etc/machine-id contains 0 characters (32 were expected).
[87:109:0412/100016.923430:ERROR:bus.cc(393)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory

DevTools listening on ws://127.0.0.1:9222/devtools/browser/d951c911-7f99-46f5-8799-135f140810fd
[87:121:0412/100017.080992:ERROR:bus.cc(393)] Failed to connect to the bus: Could not parse server address: Unknown address type (examples of valid types are "tcp" and on UNIX "unix")
[118:118:0412/100017.211604:ERROR:vaapi_wrapper.cc(573)] Could not get a valid VA display

[87:98:0412/100050.119091:ERROR:zygote_communication_linux.cc(276)] Failed to send GetTerminationStatus message to zygote
[0412/100050.115783:ERROR:nacl_helper_linux.cc(307)] NaCl helper process running without a sandbox!
Most likely you need to configure your SUID sandbox correctly

[87:87:0412/100050.129556:ERROR:zygote_communication_linux.cc(276)] Failed to send GetTerminationStatus message to zygote

chrome: no process found
10:00:50.605 - Uploading result
/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py:1020: InsecureRequestWarning: Unverified HTTPS request is being made to host 'license.webpagetest.org'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,

From a Firefox test:

ffmpeg: no process found
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
Terminated
firefox: no process found
firefox-trunk: no process found

Current system setup:

Docker version 20.10.5, build 55c4c88
OS: CentOS Linux release 7.9.2009 (Core)
Server install page shows green in all the important places
Server has 16GB of ram and plenty of free disk space, no swapping going on
Agents have 4GB and ~50GB of free disk, 15% CPU util, no swapping
Agents are started with --dockerized agent option and --shm-size=1g and --cap-add=NET_ADMIN docker flags

I don’t know where else I can look - does anybody have any ideas?

Thank you so much!!


Using my other account - just pinging here to see if anybody has any ideas about ^^^
Thanks!


@pmeenan @tkadlec - sorry for the mention, but I was hoping you could help raise awareness for this thread. I’ve been stuck on this for a while and don’t know what I’m doing wrong or what else I can try or look for :weary: Thank you so much!


Marcel, have you tried increasing the verbosity of the agent logs to the maximum with the EXTRA_ARGS environment variable? May be worth a try if you haven’t:

-e "EXTRA_ARGS=-vvvv"

Hey @josebolos - thanks for the reply!
I have made some progress, and it looks like it may have been an issue with the server. I decided to build a new server (CentOS 8 this time instead of 7), which is now running the docker image webpagetest/server:release, and it has not shown any issues yet - I hope this fixes it … :crossed_fingers:

1 Like

To finish this thread - my WebPageTest system is running pretty well now. Besides rebuilding the server, I also had to run the agents with --shm-size=2g because I was getting a lot of “cannot allocate memory - error 12” issues! FYI: the agents have 4GB of RAM.

The agents are currently running with the following commands:

$ sudo modprobe ifb numifbs=1
$ sudo docker run -d -e SERVER_URL="http://wpt_server/work/" -e LOCATION="batch_agents" -e KEY="our_key" -e EXTRA_ARGS="-vvv --name $HOSTNAME --dockerized" --cap-add=NET_ADMIN --name wptagent --shm-size=2g --restart always --init webpagetest/agent:release
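To confirm that the --shm-size setting actually took effect inside a running agent, one option is to look at /dev/shm from within the container. The sketch below is only an illustration: `shm_report` is a made-up helper name, and it assumes `df` is available in the agent image. It reads `df -Pk /dev/shm` output on stdin and prints the size in MiB and the percentage in use.

```bash
#!/bin/bash
# Hypothetical helper: summarize `df -Pk /dev/shm` output read from stdin.
# -P gives one line per filesystem, -k reports sizes in 1024-byte blocks.
shm_report() {
  awk 'NR == 2 {
    gsub("%", "", $5)                       # strip the % sign from Capacity
    printf "size_mib=%d used_pct=%d\n", $2 / 1024, $5
  }'
}

# Example against the container name used in this post:
#   sudo docker exec wptagent df -Pk /dev/shm | shm_report
```

With --shm-size=2g the reported size should come out at about 2048 MiB; a usage figure close to 100% during tests would be consistent with the allocation errors mentioned above.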

I also added a cron job to restart the agents every hour (/etc/cron.d/wpt-restart):

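#!/bin/bash
# Current minute of the hour, zero-padded (00-59)
M=$(date +%M)
# Only restart the agent in the quiet part of the hour (minutes 51-57);
# 10# forces base 10 so that 08 and 09 are not parsed as octal
if (( 10#$M > 50 && 10#$M < 58 ))
then
  docker restart wptagent
fi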
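One thing to watch when reusing this: files under /etc/cron.d are read as crontab entries (schedule, user, command) rather than executed as scripts, so the check usually lives in a separate script that a cron.d line points at. Below is a minimal sketch of that split. The script path /usr/local/bin/wpt-restart.sh, the function name `in_quiet_window`, and the five-minute schedule are all assumptions for illustration, not details from the setup described here.

```bash
#!/bin/bash
# Hypothetical script, e.g. saved as /usr/local/bin/wpt-restart.sh
# Assumed /etc/cron.d/wpt-restart entry that calls it every 5 minutes:
#   */5 * * * * root /usr/local/bin/wpt-restart.sh

# Succeed when the given minute (00-59) is in the quiet window used in
# this post: strictly greater than 50 and strictly less than 58.
in_quiet_window() {
  local minute=$((10#$1))   # force base 10 so "08" and "09" are not octal
  (( minute > 50 && minute < 58 ))
}

# At the end of the real script, the restart would be gated like this:
#   in_quiet_window "$(date +%M)" && docker restart wptagent
```

With the assumed */5 schedule the only firing minute inside the window is :55, so the agent restarts once per hour.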

So far, so good.

Maybe somebody can use this info in their own environment! Happy performance testing!
