honor robots.txt

jared · March 15, 2011, 3:56am

Hey guys,

I’m working on httparchive, and one of the current bugs is to make it honor robots.txt, which is really an upstream bug with wpt.

I’m not familiar with the wpt codebase, but I’d be happy to try to contribute, if altering the spidering call to respect robots.txt seems relatively straightforward to someone familiar with the codebase. (I’m guessing the actual spidering is executed via a php curl extension call?)

If supporting robots.txt would be tough, I’ll just handle the spidering step in httparchive for now.

Thanks!

Jared Hirsch

pmeenan · March 15, 2011, 1:03pm

Sorry, I’m a little confused because wpt doesn’t spider. It only knows how to load individual pages as they are requested. I wouldn’t want to have wpt read robots.txt as if it were a bot because a lot of pages would be untestable. I’d expect you’d want to put the robots.txt logic wherever the spidering is being done.

If you’re talking about the project I think you are, last time I checked it didn’t spider either; it worked off of a list of pages from various “top X” lists. If the spidering is a new capability being added then that’s probably where the logic belongs (though there are things wpt can do to help with just a little work, for example dumping a list of links as part of the data returned about a page).
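
As a rough illustration of keeping the robots.txt check on the spidering side, here is a minimal sketch that filters a URL list before any page is submitted for testing. It is not httparchive or wpt code. The function name `filter_allowed`, the `fetch_robots` callback, and the `example-crawler` user agent are all hypothetical, and Python's standard `urllib.robotparser` stands in for whatever the crawler would really use.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def filter_allowed(urls, fetch_robots, user_agent="example-crawler"):
    """Return the URLs that robots.txt permits for user_agent.

    fetch_robots(origin) is a caller-supplied (hypothetical) function that
    returns the robots.txt body for an origin such as "http://example.com",
    or an empty string when the site has no robots.txt.
    """
    parsers = {}  # one parsed robots.txt per origin, so each is fetched once
    allowed = []
    for url in urls:
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        if origin not in parsers:
            parser = RobotFileParser()
            parser.parse(fetch_robots(origin).splitlines())
            parsers[origin] = parser
        if parsers[origin].can_fetch(user_agent, url):
            allowed.append(url)
    return allowed
```

Only the URLs that come back from this filter would be handed on to be loaded, so the page loader itself never needs to know about robots.txt.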

Thanks,

-Pat

jared · March 15, 2011, 7:44pm

Hey Pat,

No worries, I’m the one who’s confused. I thought WPT was traversing the sites on the lists. I can definitely do the spidering in httparchive.

I’ll follow up w/Steve. Thanks!

Jared
