honor robots.txt

Hey guys,

I’m working on httparchive, and one of the current bugs is to make it honor robots.txt, which is really an upstream bug in wpt.

I’m not familiar with the wpt codebase, but I’d be happy to try to contribute if altering the spidering call to respect robots.txt seems relatively straightforward to someone who knows the code. (I’m guessing the actual spidering is executed via a PHP cURL extension call?)

If supporting robots.txt would be tough, I’ll just handle the spidering step in httparchive for now.


Jared Hirsch

Sorry, I’m a little confused because wpt doesn’t spider. It only knows how to load individual pages as they are requested. I wouldn’t want to have wpt read robots.txt as if it were a bot because a lot of pages would be untestable. I’d expect you’d want to put the robots.txt logic wherever the spidering is being done.
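
As a rough sketch of what that check could look like on the spidering side (this is not wpt or httparchive code; the user-agent string, sample rules, and URLs are made up for illustration), Python's standard `urllib.robotparser` can filter a list of candidate URLs before they are ever submitted for testing:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent name for the crawler; not taken from any real project.
USER_AGENT = "example-crawler"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file that was already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_allowed(parser: RobotFileParser, urls, user_agent: str = USER_AGENT):
    """Keep only the URLs that the robots.txt rules allow for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Made-up robots.txt content and URLs, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = build_parser(SAMPLE_ROBOTS)
candidates = [
    "https://example.com/",
    "https://example.com/private/report.html",
]
allowed = filter_allowed(parser, candidates)
print(allowed)
```

Only the URLs left in `allowed` would then be handed to the page tester, so pages requested directly by a person stay testable.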

If you’re talking about the project I think you are, last time I checked it didn’t spider either, it worked off of a list of pages from various “top X” lists. If the spidering is a new capability being added then that’s probably where the logic belongs (though there are things wpt can do to help with just a little work - for example, dumping a list of links as part of the data returned about a page).
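
To make the "list of links" idea concrete, here is a minimal sketch of pulling the links out of a page's HTML with Python's standard library. It is only an illustration of the kind of data a spider would consume, not how wpt does or would do it; `LinkCollector` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links so the spider gets absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str):
    """Return the absolute URLs of all links found in the given HTML."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up page content, for illustration only.
sample_html = '<a href="/about">About</a> <a href="https://other.example/x">X</a>'
links = extract_links(sample_html, "https://example.com/")
print(links)
```

A spider could feed a list like this through its own robots.txt check before queueing the next pages.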



Hey Pat,

No worries, I’m the one who’s confused. I thought WPT was traversing the sites on the lists. I can definitely do the spidering in httparchive.

I’ll follow up with Steve. Thanks!