I just wanted to call some attention to this, since it may pertain to the searchability of a lot of newspaper Web sites. (Disclosure: My employer, the Newspaper Association of America, is a member/supporter of ACAP.)
ACAP version 1.0, a technology-based answer to the controversy over which mainstream media content search engines may index, launched at the end of November. The people behind ACAP are asking newspaper and other media Web sites to install and implement it.
ACAP stands for Automated Content Access Protocol. Here's the quick, easy-to-understand background from the Nov. 30 Online Publishing Update:
Currently, most (but not all) search engines respect the instructions in a document called “robots.txt” that tells search engines not to crawl certain Web pages or site sections. A group of publishers unveiled a proposal [Nov. 29] called the “Automated Content Access Protocol” which would require all search engines to respect search instructions and restrictions.
“If accepted by search engines, publishers say they would be willing to make more of their copyright-protected materials available online. But Web surfers also could find sites disappear from search engines more quickly, or find smaller versions of images called thumbnails missing if sites ban such presentations,” The Associated Press reported.
Many Web sites already have robots.txt files that tell search engines what to crawl and what to ignore. There is some controversy over robots.txt files, though: because the file is publicly readable and lists the very paths a site wants kept out of search results, some Web developers see it as a road map for hackers. According to the ACAP FAQ section:
We recognise that robots.txt is a well established method for communication between content owners and crawler operators. However, robots.txt is not sophisticated enough for today's content and publishing models. Robots.txt, in its current form as implemented by most search engine operators, provides only a simple choice between allowing and disallowing access. These simple choices are inconsistently interpreted. A number of proprietary extensions have been implemented by several of the major search engines, but not all search engines recognise all or even any of these extensions. ACAP provides a standard mechanism for expressing conditional access which is what is now required.
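To make that "simple choice between allowing and disallowing access" concrete, here is a minimal Python sketch that uses the standard library's urllib.robotparser module. The robots.txt contents, the www.example.com host and the "ExampleBot" crawler name are all made up for illustration; they do not come from any real newspaper site or from ACAP itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary newspaper site (an assumption,
# not taken from any real publisher).
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
Disallow: /subscribers/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# For each URL, a well-behaved crawler gets a plain yes-or-no answer.
for path in ("/news/today.html", "/archive/2006/story.html"):
    url = "http://www.example.com" + path
    allowed = parser.can_fetch("ExampleBot", url)  # "ExampleBot" is hypothetical
    print(path, "->", "may crawl" if allowed else "must skip")
```

With these made-up rules, the news page is reported as crawlable and the archive page is not. Notice that the answer is only ever yes or no: nothing in this basic format says how long a page may stay in an index or whether a thumbnail may be shown, which is the kind of conditional permission the FAQ says ACAP is meant to express.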
According to a recent press release, ACAP "will allow publishers, broadcasters and indeed any other publisher of content on the network to express their individual access and use policies in a language that search engine robots and similar automated tools can read and understand."
The primary drivers of ACAP are the World Association of Newspapers (WAN), the European Publishers Council (EPC) and the International Publishers Association (IPA). That press release also lists the pilot project participants (who tested the system for a year starting in late 2006) and current members, which now include The Associated Press, Reuters and NAA.