Hey Patrick, hope you had a nice New Year.
I’ve set up a private instance and am still fine-tuning some of the settings, but something I’ve seen repeatedly is that the agents start out performing great, and then after a while they take much longer to run tests. Sometimes they start to lag after a few hours; other times it takes days. Test times reported by the agents are about 2x the initial values, and the throughput of each agent drops by about half. Have you ever seen this before?
The agents are running on Amazon EC2, using wpt-ie11-20140912 (ami-561cb13e). I started with small instances but have most recently been using medium ones, so I don’t think instance size is the cause. I saw the same problem on another private instance I was playing with several months ago, but this is a completely new instance and it’s happening again.
I’ve remote desktop’d into the agents to see if there’s something obvious, but I don’t see a zombie process or anything weird. I can see from getTesters.php that CPU usage goes up from the usual 40% to about 80%, but I can’t pinpoint why. Can you think of anything I could check on the agents to see what’s causing this performance degradation?
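For what it’s worth, one way to narrow this down would be to snapshot per-process CPU time on an agent while it’s healthy and again once it’s slow, then diff the two. Below is a rough sketch of that idea, not something I’ve run on these agents. It assumes Python is installed on the agent and that `tasklist /V /FO CSV /NH` prints CPU Time as the eighth column in H:MM:SS form; the function names (`parse_tasklist`, `cpu_deltas`, `report`) are just made up for illustration.

```python
import csv
import io
import subprocess
import time


def parse_cpu_time(text):
    """Convert an H:MM:SS CPU Time field to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_tasklist(csv_text):
    """Map PID -> (image name, CPU seconds) from tasklist CSV output.

    Assumes the verbose column order: Image Name, PID, Session Name,
    Session#, Mem Usage, Status, User Name, CPU Time, Window Title.
    """
    snapshot = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 8:
            continue  # skip blank or malformed lines
        try:
            snapshot[int(row[1])] = (row[0], parse_cpu_time(row[7]))
        except ValueError:
            continue  # skip rows whose PID or CPU Time does not parse
    return snapshot


def cpu_deltas(before, after):
    """Rank processes by CPU seconds used between two snapshots, busiest first."""
    deltas = []
    for pid, (name, cpu_after) in after.items():
        # A PID that is new (or reused by another image) counts all of its CPU time.
        same_process = pid in before and before[pid][0] == name
        cpu_before = before[pid][1] if same_process else 0
        deltas.append((cpu_after - cpu_before, name, pid))
    return sorted(deltas, reverse=True)


def take_snapshot():
    """Run tasklist on the agent (Windows only) and parse its output."""
    output = subprocess.check_output(["tasklist", "/V", "/FO", "CSV", "/NH"])
    return parse_tasklist(output.decode("utf-8", errors="replace"))


def report(interval_seconds=60, top=10):
    """Print the processes that used the most CPU over the interval."""
    first = take_snapshot()
    time.sleep(interval_seconds)
    second = take_snapshot()
    for used, name, pid in cpu_deltas(first, second)[:top]:
        print("{:6d}s  {} (pid {})".format(used, name, pid))
```

Calling `report()` on a healthy agent and again on a degraded one, then comparing the two lists, should at least show whether the extra CPU is going to the browser, the agent software, or something unrelated.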
I’ve attached an image showing the performance over time. It never seems to recover once it starts to choke. The yellow line is doc load time. The orange line is TTFB (so nothing weird is going on with the network). You can see the jump halfway through.
[attachment=464]
If I restart the agents, everything works great again… for a limited time, until it happens again.
Any suggestions on debugging this?