archiving basics

it is fairly apparent that we need to start archiving test data on our private instance - another process that consumes test data (logstash) is beginning to fall over now that /results has grown so big.

i wonder if there is an overview of the archive function, or is it just a matter of pointing to the right things in the settings.ini file, and then running the cli/archive.php program every so often?

are these the only config settings required?


;directory to archive test files (must include trailing slash)
archive_dir=/data/archive/

; archiving to s3 (using the s3 protocol, not necessarily just s3)
;archive_s3_server=s3.amazonaws.com
;archive_s3_key=
;archive_s3_secret=
;archive_s3_bucket=
;archive_s3_url=http://s3.amazonaws.com/

;Number of days to keep tests locally before archiving
archive_days=2

not too sure of the impact of data being retrieved back into the /results directory on request, but that problem is small compared to logstash falling over frequently.

thanks for any assistance. i’m not a deep-dive programmer-type but can make my way around the install and s3.

thanks again.

I don’t have a doc but you basically have it. You don’t need both the s3 and the archive_dir settings - just one or the other, depending on whether you want to archive to S3 or to somewhere on the filesystem (I archive to a mount point for a NAS, for example).
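
If it helps, here is a rough Python sketch for sanity-checking which archive destination a settings file enables. It assumes the settings sit under a [settings] section header and treats any non-empty archive_s3_* value as "S3 is configured". Both are assumptions for illustration, not something read out of archive.php, and archive_targets is a made-up helper name.

```python
import configparser

# Sample text modelled on the settings quoted above; the [settings]
# header is an assumption about how the file is laid out.
SAMPLE = """\
[settings]
;directory to archive test files (must include trailing slash)
archive_dir=/data/archive/
;archive_s3_server=s3.amazonaws.com
;archive_s3_bucket=
archive_days=2
"""


def archive_targets(ini_text, section="settings"):
    """Return the archive destinations that look enabled in ini_text.

    Hypothetical helper: "enabled" here just means the value is
    present and non-empty, which may not match archive.php exactly.
    """
    # Disable interpolation so a literal '%' in a value cannot raise.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    opts = parser[section]

    targets = []
    if opts.get("archive_dir", "").strip():
        targets.append("filesystem")
    # Lines starting with ';' are comments, so commented-out S3
    # settings never show up as keys here.
    if any(k.startswith("archive_s3_") and v.strip() for k, v in opts.items()):
        targets.append("s3")
    return targets


if __name__ == "__main__":
    print(archive_targets(SAMPLE))
```

If this ever reports both destinations, that is the "one or the other" case worth double-checking by hand.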

Unless you are accessing all of the tests frequently, odds are that hardly anything will be pulled back - even on the public instance I’m usually only using ~100GB max of the main storage. The archive.php script will also prune the restored tests once they haven’t been accessed for archive_days days.

thanks so much for the quick response.

i would like some clarification on the feature that ‘retrieves’ from the archive the waterfall data as it is requested.

does it simply unzip it into its original location in the /results tree?

that could be an issue since we have logstash monitoring that entire tree and indexing any new entries as they show up. indexing an entry that it has already indexed may be a problem.

to get around that, we are considering using a small and date-restricted rsync archive of /results, and pointing logstash to that directory, not webpagetest’s /results.

but for that to work with a simple timestamp find, i’d need the data newly retrieved from the archive to retain the timestamps it had when it was originally archived. do you know if the original dates are preserved when tests are transparently retrieved from the archive?

thanks again for your time.

Yes, it unzips it to the original location. I believe the original timestamps are kept, but the testinfo.ini file gets touched whenever the test is accessed (that is what the archiving logic uses), so there may still be issues.
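
For the timestamp-based selection, one way to sidestep the touched testinfo.ini is to leave that file out of the age check. A minimal Python sketch, assuming file modification times are what the selection keys on; recently_modified and its parameters are hypothetical names, not part of WebPageTest or logstash:

```python
import os
import time


def recently_modified(root, max_age_days, skip_names=("testinfo.ini",), now=None):
    """Yield files under root modified within the last max_age_days.

    Files named in skip_names are ignored, so a testinfo.ini that was
    touched on access does not make an old test look new.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in skip_names:
                continue
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) >= cutoff:
                yield path


if __name__ == "__main__":
    # Example: list files changed in the last 2 days under a results tree.
    for path in recently_modified("/path/to/results", max_age_days=2):
        print(path)
```

This only helps if the restored files really do keep their original modification times, which is worth confirming on one restored test before relying on it.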