News Internet Archive announces broader crawler scope

Discussion in 'Article Discussion' started by Gareth Halfacree, 24 Apr 2017.

  1. Gareth Halfacree

    Gareth Halfacree WIIGII! Staff Administrator Super Moderator Moderator

    Joined:
    4 Dec 2007
    Posts:
    12,148
    Likes Received:
    1,673
  2. jb0

    jb0 Active Member

    Joined:
    8 Apr 2012
    Posts:
    450
    Likes Received:
    55
    Webmasters never HAD control. Robots.txt was never an enforceable access-control mechanism, merely a polite request. It worked only because most search engines agreed to abide by those requests. Apparently, some of the Chinese crawlers use robots.txt exactly backwards, as a map of which parts of the site to crawl FIRST. Which is, you know, the obvious first thing to do once people start acting like a polite request is a real access control mechanism.

    Most of the controversy comes from people who simply don't understand the difference and think the Internet Archive is somehow hacking every server on Earth to bypass the robots.txt firewall. It isn't even setting a bad precedent, since it's far from the first major bot to ignore robots.txt (or use it as a sitemap).

    ...

    Tangentially, have they ever explained why they honour the CURRENT robots.txt file when serving previously-stored content? It always seemed to me that they should honour the robots.txt that was in effect at the time the site was saved, if anything.
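    To make the point above concrete, here's a minimal sketch using Python's standard `urllib.robotparser` (the site, bot name, and rules are hypothetical). The check is entirely client-side: a crawler has to parse the file and choose to obey it, and nothing technically stops one that doesn't.

    ```python
    # Sketch: robots.txt is advisory, not an access control. The crawler
    # itself decides whether to consult the parsed rules before fetching.
    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt disallowing /private/ for all user agents.
    ROBOTS_TXT = """\
    User-agent: *
    Disallow: /private/
    """

    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())

    # A polite crawler asks before fetching...
    print(parser.can_fetch("MyBot", "https://example.com/private/page"))  # False
    print(parser.can_fetch("MyBot", "https://example.com/public/page"))   # True

    # ...but an impolite one can ignore the answer, or even treat the
    # Disallow list as a map of "interesting" URLs to crawl first.
    ```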
     
  3. Gareth Halfacree

    Gareth Halfacree WIIGII! Staff Administrator Super Moderator Moderator

    I think it started life as an "if you didn't want this archived, we'll politely take it down" mechanism, and it has since been superseded by the proper "you archived something I didn't want you to archive, please take it down" email address and/or DMCA notifications.
     
  4. mi1ez

    mi1ez Active Member

    Joined:
    11 Jun 2009
    Posts:
    1,428
    Likes Received:
    17
    I'd have thought this was essential functionality. Wouldn't most robots.txt files prevent crawlers from hitting scripts, styles, etc.? And wouldn't those files be pretty useful when rendering the pages back at a later date?
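    As an illustration of the point above (the directory names are hypothetical, not from any particular site), a common robots.txt pattern blocks exactly the asset paths an archive would need to re-render a saved page:

    ```
    # Hypothetical robots.txt -- paths are illustrative only
    User-agent: *
    Disallow: /css/      # stylesheets: of no use to a search index,
    Disallow: /js/       # scripts:     but essential to faithfully
    Disallow: /images/   # media:       re-render an archived copy
    ```

    An archiver that honoured these rules retroactively would serve pages stripped of their styling, scripts, and images.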
     
  5. Wwhat

    Wwhat Member

    Joined:
    2 Oct 2005
    Posts:
    263
    Likes Received:
    1
    Ever since the big companies and the politicians found out about the archive and started forcing it to remove all kinds of stuff, it really hasn't been the same.
     