Agent performance degrading over time

Hey Patrick, hope you had a nice New Year :slight_smile:

I’ve set up a private instance and am still fine-tuning some of the settings, but something I’ve seen repeatedly is that the agents start out performing great, and then after a while they take much longer to run tests. Sometimes they start to lag after a few hours; other times it’s been days. The timings reported from the agents end up about 2x the initial values, and each agent’s throughput drops by about half. Have you ever seen this before?

The agents are running on Amazon EC2, using wpt-ie11-20140912 (ami-561cb13e). I started with small agents but most recently have been using medium instance sizes, so I don’t think the size is causing this. I had seen this problem on another private instance I was playing with several months ago, but this is a completely new instance and it’s happening again.

I’ve remote-desktopped into the agents to see if there’s anything obvious, but I don’t see a zombie process or anything weird. I can see from getTesters.php that the CPU usage climbs from its normal ~40% to ~80%, but I can’t pinpoint why. Can you think of anything I could check on the agents to see what’s causing this performance degradation?
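In case it helps, this is roughly the kind of sampler I’m planning to leave running on an agent the next time it happens, so I can see which process the extra CPU is going to. (Just a sketch; it assumes Python and the psutil package are installed on the box, which they aren’t by default, and the log path and interval are placeholders.)

```python
# cpu_sampler.py - periodically log total CPU usage and the busiest processes.
# Sketch only: assumes Python and psutil are installed on the agent;
# LOG_PATH and INTERVAL_SECONDS are placeholders.
import time
import psutil

LOG_PATH = "cpu_samples.csv"
INTERVAL_SECONDS = 60

def top_processes(n=3):
    """Return the n processes using the most CPU since the previous call."""
    procs = []
    for p in psutil.process_iter(["name"]):
        try:
            procs.append((p.cpu_percent(interval=None), p.info["name"]))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    procs.sort(reverse=True)
    return procs[:n]

with open(LOG_PATH, "a") as log:
    top_processes()  # prime the per-process counters (first call returns 0.0)
    while True:
        total = psutil.cpu_percent(interval=1)  # overall CPU over a 1s sample
        busiest = "; ".join("%s=%.1f%%" % (name, cpu) for cpu, name in top_processes())
        log.write("%s,%.1f,%s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), total, busiest))
        log.flush()
        time.sleep(INTERVAL_SECONDS)
```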

I’ve attached an image showing the performance over time. It never seems to recover once it starts to choke. The yellow line is document load time; the orange line is TTFB (so nothing weird is going on with the network). You can see the jump about halfway through.
[attachment=464]

If I restart the agents, everything works great again…for a while, until the slowdown comes back.

Any suggestions on debugging this?

I haven’t seen that happen on any of the VMs or physical machines that I run; the CPU utilization stays consistent, so I don’t think the increased utilization is coming from the software itself.

I haven’t looked at EC2 instance performance over time, though, so something might be going on there. I do recommend medium because small instances are way over-taxed. Are you using m1 or m3? It could be EC2 choking on I/O, or it could be doing some crazy throttling.

It might be worth running SunSpider or some other CPU-centric test in both the good and bad states to see if it is an external resource problem (i.e., crappy sharing on EC2). Do you also see it across regions?
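Something as simple as this could stand in for SunSpider if it’s easier to run on the agent; it’s pure Python, so the only thing it measures is raw CPU. If a run takes roughly twice as long in the bad state as in the good state, the slowdown is coming from the instance, not the test software. (Just a sketch; the iteration and run counts are arbitrary.)

```python
# cpu_burn.py - tiny CPU-bound benchmark to compare the "good" and "bad" states.
# Rough stand-in for SunSpider: run it a few times in each state and compare.
import time

def burn(iterations=5000000):
    """Do a fixed amount of pure-CPU work and return the elapsed seconds."""
    start = time.time()
    total = 0
    for i in range(iterations):
        total += (i * i) % 7
    return time.time() - start

if __name__ == "__main__":
    runs = [burn() for _ in range(5)]
    print("runs:", ", ".join("%.2fs" % r for r in runs))
    print("best: %.2fs" % min(runs))
```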

Thanks for the quick reply!

I was using t2.medium only, from Virginia, but I’ll try a couple of m3.medium instances from California and report back in a bit…

Oh, t2 is burstable/variable performance. You need to use m1 or m3, which provide consistent performance. That would totally explain it.
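If you want to confirm it before switching, the CPUCreditBalance metric in CloudWatch should be sitting near zero right around when the slowdown kicks in. Something like this would pull it (rough sketch, assuming boto3 is installed and AWS credentials are configured; the instance ID and region are placeholders):

```python
# check_cpu_credits.py - pull the CPUCreditBalance metric for a t2 instance.
# Sketch only: assumes boto3 is installed and AWS credentials are configured;
# INSTANCE_ID and the region are placeholders for your own values.
import datetime
import boto3

INSTANCE_ID = "i-xxxxxxxx"

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.datetime.utcnow()

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=now - datetime.timedelta(hours=24),
    EndTime=now,
    Period=3600,                 # one datapoint per hour
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], "%.1f credits" % point["Average"])
```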

You, sir, are a genius. And me, a moron :slight_smile:

Thanks for seeing that. Totally makes sense. Will let you know if it doesn’t improve things. THANK YOU!