AWS Private instance: Agents not being stopped/terminated

Have a private instance on AWS using AMI ami-24199c4c, which has been working for some time. However, recently the agent instances don’t seem to be terminated when the work queue is empty.

The user data for the server instance:

[code]ec2_key=XXX
ec2_secret=XXX
api_key=XXX
headless=0

EC2.ScaleFactor=50
EC2.us-east-1.min=0
EC2.us-east-1.max=2[/code]

Have there been recent changes that might affect this behaviour?

There were changes made back on 12/18-19 when the AWS SDK was updated (which fixed using the Frankfurt region) and the scale-down behaviour was improved. It used to only turn off instances that had been idle for the last hour, but if any jobs were submitted they would still be spread across all instances, which kept every instance active. The change marks instances as offline based on the scale factor, which lets them go idle and be terminated.
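To make that behaviour concrete, here’s a minimal sketch in Python of the idea as described (the actual WPT server code is PHP, and the names here are made up for illustration):

[code]import math

def agents_to_keep(queued_tests, scale_factor, min_agents, max_agents):
    # One agent stays online per scale_factor queued tests,
    # clamped to the configured per-region min/max.
    needed = math.ceil(queued_tests / scale_factor) if queued_tests else 0
    return max(min_agents, min(needed, max_agents))

def mark_surplus_offline(agents, queued_tests, scale_factor, min_agents, max_agents):
    # Surplus agents are flagged offline so they stop receiving work,
    # go idle, and become candidates for termination.
    keep = agents_to_keep(queued_tests, scale_factor, min_agents, max_agents)
    for agent in agents[keep:]:
        agent["offline"] = True
[/code]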

Termination is only checked at hourly increments from when the instances started, so they won’t go away right away (since EC2 is billed hourly).
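Roughly speaking, the check looks like this sketch (Python for illustration only; the 15-minute window is an assumed value, not the actual one WPT uses):

[code]import time

HOUR = 3600
WINDOW = 15 * 60  # assumed: only terminate near the end of a billed hour

def near_hourly_boundary(launch_time, now=None):
    # An idle instance is only a termination candidate when it is close
    # to completing another billable hour since launch.
    now = time.time() if now is None else now
    seconds_into_hour = (now - launch_time) % HOUR
    return seconds_into_hour > HOUR - WINDOW
[/code]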

Is there any chance that individual jobs are getting submitted and processed in between batches?

Looking at the /getTesters.php page should tell you how long it has been since the last work for a given agent. Are those all showing long times (and are all of the live instances showing up)?

The instance isn’t currently shared, and the test history suggests no jobs are getting submitted between batches.
/getTesters.php shows the one and only agent that is up at the moment, and that it has been idle 186 minutes…

Is the change of AWS SDK significant? Do we need to update it in some way, given that we’re on the old AMI and the WPT code is auto-updated?

I’ll take a look and see if something broke with going all the way back down to zero instances. The code is running on the public WPT as well but I never have it drop below 1 so I may have missed something.

The AWS SDK change should be pulled in automatically since the AMI updates itself (the SDK is bundled in with the WPT code).

Taking a look now to see what I can find.

btw, if any errors happen they should get logged to /var/www/webpagetest/www/log/error.log.

You may need to ‘sudo su’ to be able to see the actual log but there may be hints in there.

I just tried out the same AMI with the same options and it spun up a test machine to run the tests I needed, then shut it down at the end of the hour (pretty much as expected), so I’m at a bit of a loss.

Can you check the AWS IAM permissions for the key you assigned to make sure that it has permission to terminate instances?

Key has full EC2 permissions:

{ "Version": "2012-10-17", "Statement": [ { "Action": "ec2:*", "Effect": "Allow", "Resource": "*" }, { "Effect": "Allow", "Action": "elasticloadbalancing:*", "Resource": "*" }, { "Effect": "Allow", "Action": "cloudwatch:*", "Resource": "*" }, { "Effect": "Allow", "Action": "autoscaling:*", "Resource": "*" } ] }

[quote=“pmeenan, post:6, topic:9165”]btw, if any errors happen they should get logged to /var/www/webpagetest/www/log/error.log.
[/quote]

The error log has only one line:

01:00:02 - Error launching EC2 instance. Region: us-east-1, AMI: ami-561cb13e, error: The instance ID 'i-eba59b15' does not exist

…and that is because I terminated the agent (because it was hanging around idle). So no clue there.

I don’t know if this is significant, but we have a bunch of benchmarks running overnight.

OK, what I’m seeing is a discrepancy between AWS and WPT:

My AWS agent fired up and then shut down after a period of non-use, but both getLocations and getTesters report it as still being present; getTesters reports it hasn’t checked in for 40 minutes at this point.

I seem to remember that when an agent hasn’t checked in for a while it does eventually get removed, but I can’t remember whether I’ve just made that up, or what the timeout is if not!

Oh, yes. getTesters.php will keep agents in the list for up to 60 minutes, but the list is only purged when an agent actually polls for work from that location, so the last tester will show forever, just with a REALLY long time since last check.
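For illustration, the purge behaviour amounts to something like this (a Python sketch of the idea, not the actual getTesters.php code; the names are invented):

[code]import time

MAX_AGE = 60 * 60  # agents stay listed for up to 60 minutes

def on_agent_poll(location_agents, agent_id, now=None):
    # Purging only runs when some agent polls this location, so once the
    # last agent stops polling, nothing ever removes it from the list.
    now = time.time() if now is None else now
    location_agents[agent_id] = now
    stale = [a for a, t in location_agents.items() if now - t > MAX_AGE]
    for a in stale:
        del location_agents[a]
[/code]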

If it’s causing grief I can see about having both of those only show active agents and have a separate flag that includes offline agents.

That’s definitely not the problem I was having; the AWS console confirmed the instances weren’t being terminated.
Last night’s runs appear to have terminated correctly, however, so I’ll keep an eye on it and report next time I see it happen.

I think this is the full set of permissions needed; there may be one or two extra in there.

{ "Version": "2012-10-17", "Statement": [ { "Sid": "Stmt1420829356000", "Effect": "Allow", "Action": [ "ec2:CreateTags", "ec2:DescribeRegions", "ec2:DescribeVolumes", "ec2:DeleteVolume", "ec2:DescribeInstances", "ec2:RunInstances", "ec2:StartInstances", "ec2:StopInstances", "ec2:TerminateInstances" ], "Resource": [ "*" ] } ] }

I already have

{ "Action": "ec2:*", "Effect": "Allow", "Resource": "*" },
…which allows everything on EC2.

This is still a problem. This morning I have one agent instance which is not stopped and isn’t listed in /getTesters.php?f=html.
It looks like it failed to be stopped - from error.log.20150125:

00:00:02 - EC2:Launching EC2 instance. Region: us-east-1, AMI: ami-XXXXX, error: The instance ID 'i-YYYYY' does not exist
00:00:05 - EC2:Listing running EC2 instances: The instance ID 'i-YYYYY' does not exist
In fact, the instance DOES exist, and it looks like the call to terminate the instance intermittently fails. This is a repeated pattern which has happened in the past (there are similar entries in older error logs).
Perhaps a retry mechanism for terminating instances?
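Something along these lines, sketched with Python/boto3 purely for illustration (WPT itself uses the bundled PHP AWS SDK, so this is just the shape of the retry, not a patch):

[code]import time
import boto3
from botocore.exceptions import ClientError

def terminate_with_retry(instance_id, region, attempts=3, delay=10):
    # Retry TerminateInstances a few times with a growing delay instead
    # of giving up after one transient API failure.
    ec2 = boto3.client("ec2", region_name=region)
    for attempt in range(attempts):
        try:
            ec2.terminate_instances(InstanceIds=[instance_id])
            return True
        except ClientError as err:
            print("attempt %d failed: %s" % (attempt + 1, err))
            time.sleep(delay * (attempt + 1))
    return False
[/code]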

I’ll double-check the logic, but it should always pull the full list of running instances, and if it sees one that shouldn’t be running and is tagged with the WPT tags it should try to terminate it every time it checks (every 5 minutes, though the terminations are only done close to hourly increments of the run time).
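The intended reconciliation pass is roughly this (a hedged Python/boto3 sketch; the tag filter and helper names are assumptions, not the actual WPT PHP):

[code]import boto3

def reconcile(region, wanted_ids, near_hour_boundary):
    # Every check: list running WPT-tagged instances and terminate any
    # that should not be running, but only near an hourly boundary.
    ec2 = boto3.client("ec2", region_name=region)
    pages = ec2.get_paginator("describe_instances").paginate(Filters=[
        {"Name": "tag:Name", "Values": ["WebPagetest Agent"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                iid = inst["InstanceId"]
                if iid not in wanted_ids and near_hour_boundary(inst):
                    ec2.terminate_instances(InstanceIds=[iid])
[/code]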

Same deal this morning:

00:00:05 - EC2:Launching EC2 instance. Region: us-east-1, AMI: ami-561cb13e, error: The instance ID 'i-ZZZZZ' does not exist

i-ZZZZZ is still running 8 hours later.

I’ve been travelling for around 3 days, and this morning I have 4 agent instances which are not terminated and not listed in /getTesters.php. All have the log error shown above.
I’ve raised this as an issue now, since it is happening with some regularity on my private instance: AWS private instances intermittently fail to terminate agents when idle · Issue #397 · WPO-Foundation/webpagetest · GitHub

I updated the bug and have a theory with what might be causing it and a fix in the works.

Hi,

I also have, from time to time, an instance which isn’t terminated.

As far as I can see in the AWS console, they are not tagged like the other instances with “Name=WebPagetest Agent” but have empty names.
In the log, these instances always have startup messages like:

09:05:10 - Instance i-5959c1bd started: m3.medium ami ami-d0c76fa7 in eu-west-1 for eu-west-1,eu-west-1_IE11 with user data: wpt_server=xx.xx.xx.xx wpt_loc=eu-west-1,eu-west-1_IE11 wpt_key=xxxxx
09:05:10 - Error: Launching EC2 instance. Region: eu-west-1, AMI: ami-d0c76fa7, error: The instance ID 'i-5959c1bd' does not exist

Best,

Reiner