News Internet Archive announces broader crawler scope

Discussion in 'Article Discussion' started by Gareth Halfacree, 24 Apr 2017.

  1. Gareth Halfacree

    Gareth Halfacree WIIGII! Staff Administrator Super Moderator Moderator

    Joined:
    4 Dec 2007
    Posts:
    12,148
    Likes Received:
    1,673
  2. jb0

    jb0 Active Member

    Joined:
    8 Apr 2012
    Posts:
    450
    Likes Received:
    55
    Webmasters never HAD control. Robots.txt was never an enforceable access-control mechanism, merely a polite request. It worked only because most search engines agreed to abide by those requests. Apparently, some of the Chinese crawlers use robots.txt exactly backwards, as a map of which parts of the site to crawl FIRST. Which is, you know, the obvious first thing to do once people start acting like a polite request is a real access control mechanism.

    Most of the controversy comes from people who simply don't understand the difference and think the Internet Archive is somehow hacking every server on Earth to bypass the robots.txt firewall. It isn't even setting a bad precedent, since it's far from the first major bot to ignore robots.txt (or use it as a sitemap).

    ...

    Tangentially, have they ever explained why they honour the CURRENT robots.txt file when serving previously-stored content? It always seemed to me that they should honour the robots.txt that was in effect at the time the site was saved, if anything.
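    To make the point above concrete, here's a minimal sketch using Python's standard `urllib.robotparser` (the site, bot name, and rules are hypothetical). The check is entirely client-side: a crawler has to parse the file and choose to obey it, and nothing technically stops one that doesn't.

    ```python
    # Sketch: robots.txt is advisory, not an access control. The crawler
    # itself decides whether to consult the parsed rules before fetching.
    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt disallowing /private/ for all user agents.
    ROBOTS_TXT = """\
    User-agent: *
    Disallow: /private/
    """

    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())

    # A polite crawler asks before fetching...
    print(parser.can_fetch("MyBot", "https://example.com/private/page"))  # False
    print(parser.can_fetch("MyBot", "https://example.com/public/page"))   # True

    # ...but an impolite one can ignore the answer, or even treat the
    # Disallow list as a map of "interesting" URLs to crawl first.
    ```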
     
  3. Gareth Halfacree

    Gareth Halfacree WIIGII! Staff Administrator Super Moderator Moderator

    I think it started life as an "if you didn't want this archived, we'll politely take it down" mechanism, and it has since been superseded by the proper "you archived something I didn't want you to archive, please take it down" email address and/or DMCA notifications.
     
  4. mi1ez

    mi1ez Active Member

    Joined:
    11 Jun 2009
    Posts:
    1,428
    Likes Received:
    17
    I'd have thought this was essential functionality. Wouldn't most robots.txt files prevent crawlers from hitting scripts, styles, etc.? And wouldn't those files be pretty useful when rendering the pages back at a later date?
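    As an illustration of the point above (the directory names are hypothetical, not from any particular site), a common robots.txt pattern blocks exactly the asset paths an archive would need to re-render a saved page:

    ```
    # Hypothetical robots.txt -- paths are illustrative only
    User-agent: *
    Disallow: /css/      # stylesheets: of no use to a search index,
    Disallow: /js/       # scripts:     but essential to faithfully
    Disallow: /images/   # media:       re-render an archived copy
    ```

    An archiver that honoured these rules retroactively would serve pages stripped of their styling, scripts, and images.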
     
  5. Wwhat

    Wwhat Member

    Joined:
    2 Oct 2005
    Posts:
    263
    Likes Received:
    1
    Ever since the big companies and the politicians found out about the archive and started forcing it to remove all kinds of stuff, it really hasn't been the same.
     