Wednesday, February 01, 2017

Why Aren't We Doing More With Our Web Archives?

Why Aren't We Doing More With Our Web Archives? Kalev Leetaru. Forbes. January 13, 2017.
     The post looks at the many projects that have been launched to archive and preserve the digital world; the best known is the Internet Archive, "which has been crawling and preserving the open web for more than two decades" and has preserved more than 510 billion distinct URLs from over 361 million websites. The author asks: "With such an incredible repository of global society’s web evolution, why don’t we see more applications of this unimaginable resource?"

Possible reasons why there isn't a more vibrant and active research and software development community around web archives include:
  • Economics plays a role
  • The complex nature of web archives
  • The Internet Archive's collection is over 15 petabytes, which makes it difficult to manipulate
  • Few tools can work with the archive, particularly for indexing
Last year the Internet Archive announced its first efforts at a keyword search capability. Search tools of this kind are needed to make the Archive’s holdings more accessible to researchers and data miners.

"At the end of the day, web archives are our only record capturing the evolution of human society from the physical to the virtual domains. The Internet Archive in particular represents one of the greatest archives ever created of this immense transition in human existence and with the right tools and a greater focus on non-traditional avenues, perhaps we can launch a whole new world of research into how humans evolved into a digital existence."