There were changes made back on 12/18-19 that updated the AWS SDK (which fixed support for Frankfurt) and improved scale-down behavior. Previously, only instances that had been idle for the last hour were turned off, but any submitted jobs would still be spread across all of the instances, keeping them all active. The change marks instances as offline based on the scale factor, which lets them go idle and be terminated.
Termination is only checked at hourly increments from when an instance started, so instances won't go away right away (since EC2 is billed hourly).
Is there any chance that individual jobs are getting submitted and processed in between batches?
Looking at the /getTesters.php page should tell you how long it has been since the last work for a given agent. Are those all showing long times (and are all of the live instances showing up)?
The instance isn’t currently shared, and the test history suggests no jobs are getting submitted between batches.
/getTesters.php shows the one and only agent that is up at the moment, and that it has been idle 186 minutes…
Is the AWS SDK change significant? Do we need to update anything, given that we're on the old AMI and the WPT code is auto-updated?
I’ll take a look and see if something broke with going all the way back down to zero instances. The code is running on the public WPT as well but I never have it drop below 1 so I may have missed something.
The AWS SDK change should be pulled in automatically since the AMI updates itself (the SDK is bundled in with the WPT code).
I just tried out the same AMI with the same options; it spun up a test machine to run the tests I needed and then shut it down at the end of the hour (pretty much as expected), so I'm at a bit of a loss.
Can you check the AWS IAM permissions for the key you assigned to make sure that it has permission to terminate instances?
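For reference, an IAM policy along these lines should cover the calls involved. The exact set of actions WPT needs isn't spelled out in this thread, so treat this as a starting point rather than a definitive policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "ec2:CreateTags"
      ],
      "Resource": "*"
    }
  ]
}
```

A key that can launch and describe instances but lacks `ec2:TerminateInstances` would produce exactly the symptom described: instances spin up fine but never get cleaned up.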
OK, what I'm seeing is a discrepancy between AWS and WPT.
My AWS agent fired up and then shut down after a period of non-use, but both getLocations and getTesters report it as still being present - getTesters reports that it hasn't checked in for 40 minutes at this point.
I seem to remember that when an agent hasn't checked in for a while it does eventually get removed, but I can't remember whether I've just made that up, or what the timeout is if not!
Oh, yes. getTesters.php will keep agents in the list for up to 60 minutes, but the list is only purged when an agent actually polls for work from that location, so the last tester will show forever, just with a REALLY long time since its last check.
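The retention behavior described above boils down to something like the following. This is a pure-logic sketch with hypothetical names; the actual getTesters.php internals may differ.

```python
import time

MAX_IDLE_SECONDS = 60 * 60  # agents stay listed for up to 60 minutes

def purge_stale_agents(agents, now=None):
    """Drop agents that haven't checked in within the retention window.

    Per the thread, WPT only runs this purge when an agent polls for
    work at the location, so the last remaining tester is never purged --
    it just shows an ever-growing time since its last check-in.
    """
    now = now if now is not None else time.time()
    return [a for a in agents if now - a["last_checkin"] <= MAX_IDLE_SECONDS]
```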
If it’s causing grief I can see about having both of those only show active agents and have a separate flag that includes offline agents.
That's definitely not the problem I was having - the AWS console confirmed the instances weren't being terminated.
Last night's runs appear to have terminated correctly, however, so I'll keep tabs on when this next happens.
This is still a problem. This morning, I have one agent instance which is not stopped, and isn’t listed in /getTesters.php?f=html.
Looks like it failed to be stopped - from error.log.20150125:
00:00:02 - EC2:Launching EC2 instance. Region: us-east-1, AMI: ami-XXXXX, error: The instance ID 'i-YYYYY' does not exist
00:00:05 - EC2:Listing running EC2 instances: The instance ID 'i-YYYYY' does not exist
In fact, the instance DOES exist, and it looks like the call to terminate the instance intermittently fails. This is a repeated pattern that has happened in the past (there are similar entries in the error logs).
Perhaps a retry mechanism for terminating instances?
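A retry along the lines suggested could look like this. It's a generic sketch (the function name is illustrative, and boto/boto3 also have their own built-in retry configuration that may be the better fix):

```python
import time

def call_with_retries(fn, attempts=3, delay_seconds=5, sleep=time.sleep):
    """Call fn(), retrying on failure with a fixed delay between attempts.

    Intended for flaky AWS API calls such as TerminateInstances; the
    last exception is re-raised if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay_seconds)
```

With boto3 that might be used as `call_with_retries(lambda: ec2.terminate_instances(InstanceIds=[instance_id]))`, so a transient "instance ID does not exist" error doesn't leave the instance running for good.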
I'll double-check the logic, but it should always pull the full list of running instances, and if it sees one that shouldn't be running and is tagged with the WPT tags, it should try to terminate it every time it checks (every 5 minutes, though terminations are only done close to hourly increments of run time).
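That sweep amounts to something like the following. This is a pure-logic sketch; the field names and function name are illustrative, not WPT's actual data structures:

```python
def find_orphaned_instances(running, expected_ids, wpt_tag="WebPagetest Agent"):
    """Return IDs of instances that carry the WPT Name tag but that the
    scaler no longer expects to be running.

    Note the tag filter: an instance that ends up with an empty Name tag
    would escape this sweep entirely and keep running.
    """
    return [
        inst["id"]
        for inst in running
        if inst.get("tags", {}).get("Name") == wpt_tag
        and inst["id"] not in expected_ids
    ]
```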
I've been travelling for around 3 days, and this morning I have 4 agent instances which are not terminated and not listed in /getTesters.php. All have the log error shown above.
I've raised this as an issue now, since it is certainly happening regularly on my private instance: https://github.com/WPO-Foundation/webpagetest/issues/397
I also have, from time to time, an instance which isn't terminated.
As I can see in the AWS console, they are not tagged like the other instances with "Name=WebPagetest Agent" but have empty names.
In the log, these instances always have startup messages like:
09:05:10 - Instance i-5959c1bd started: m3.medium ami ami-d0c76fa7 in eu-west-1 for eu-west-1,eu-west-1_IE11 with user data: wpt_server=xx.xx.xx.xx wpt_loc=eu-west-1,eu-west-1_IE11 wpt_key=xxxxx
09:05:10 - Error: Launching EC2 instance. Region: eu-west-1, AMI: ami-d0c76fa7, error: The instance ID 'i-5959c1bd' does not exist