Thursday, December 29, 2016

Robots.txt Files and Archiving .gov and .mil Websites

Robots.txt Files and Archiving .gov and .mil Websites. Alexis Rossi. Internet Archive Blogs. December 17, 2016.
     The Internet Archive collects webpages "from over 6,000 government domains, over 200,000 hosts, and feeds from around 10,000 official federal social media accounts". Does it ignore robots.txt files? Historically, sometimes yes and sometimes no, but the robots.txt file is less useful than it was and is becoming less so over time, particularly for web archiving efforts. Many sites do not actively maintain their robots.txt files, or they increasingly block crawlers with other technological measures. The post describes the robots.txt file as belonging to a different era. The best way for webmasters to exclude their sites is to contact the Internet Archive and specify the exclusion parameters.

"Our end-of-term crawls of .gov and .mil websites in 2008, 2012, and 2016 have ignored exclusion directives in robots.txt in order to get more complete snapshots. Other crawls done by the Internet Archive and other entities have had different policies." The archived sites are available in the beta Wayback Machine. The Internet Archive has had little feedback at all on these crawls. "Overall, we hope to capture government and military websites well, and hope to keep this valuable information available to users in the future."
