Recently, people have discussed a number of sites that let search engine spiders crawl pages that unregistered visitors are not allowed to see. These are some of the discussions:
June 2006 (about the Experts-Exchange site)
http://forums.searchenginewatch.com/showthread.php?t=11974
June 2006 (about the New York Times site)
http://forums.searchenginewatch.com/showthread.php?t=12191
December 2006 (general discussion lower down in the thread)
http://www.mattcutts.com/blog/communication-in-other-languages/
December 2006 (about the WebMasterWorld site)
http://blog.outer-court.com/archive/2006-12-13-n85.html
What happens is that the pages are listed in the search engine results, but when people click on the listings, they are redirected to a login/register page, instead of receiving the page itself.
Some people want to think that what the sites do is spam - specifically cloaking. They want it to be spam because they don’t like it, but it isn’t spam, as I demonstrated here.
The real issue is that the sites allow search engine spiders to crawl those pages, for the sole purpose of having them listed in the search results, so that they will attract people, but they don’t allow unregistered people who click on the listings to see the pages without first registering. In other words, they are specifically using the engines to gain members. It’s understandable that many people would object to being denied direct access to pages that are listed in the search results.
That’s the real issue. Is it right for search engines to be used in that way? Is it right to intentionally have the search engines list pages that are denied to people unless they register? If people would discuss that, instead of clouding the issue by wrongly trying to make out that it’s spam, they may even persuade the engines to do something about it.
Personally, I prefer the pages to be listed, so that I know the information they contain is there, and I can choose to view it by registering if I want to. But I also think that it’s an abuse of the engines, and one that they shouldn’t allow. So I have a foot in both camps - I like to know that the information is there, but I also think that search engines should not list pages that not everyone can reach directly by clicking the link in the results.
If the engines wanted to do something about it, the problem they have is that, without doing something out of the ordinary, they can’t programmatically differentiate between URLs that anyone can reach and those that people can’t reach without being logged into a site. If a site gives the engines access to pages that not everyone is allowed to see, the only way the engines can programmatically know about it is to request the pages with stealth spiders (spiders that crawl from IP addresses the site doesn’t recognise as belonging to an engine), and they would need to do it for every page in the index. There are problems in doing that. For instance, how would they know whether a ‘different’ returned page is due to registered-only access or to something else? It might be that the page has simply been changed.
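To make the difficulty concrete, here is a minimal Python sketch of the comparison a stealth spider would have to make. It is only an illustration, not how any engine works: the function name looks_gated, the login URL hints and the 0.5 similarity threshold are all my own assumptions. It takes the page as the normal spider received it and the page as an anonymous request received it, and it can only be confident when the anonymous request ended up on a login or register URL.

```python
from difflib import SequenceMatcher

# Assumed URL fragments that suggest a login/register page (illustrative only).
LOGIN_HINTS = ("login", "signin", "register")


def looks_gated(crawler_view, anonymous_view, threshold=0.5):
    """Compare two fetches of the same listing.

    Each view is a (final_url, body_text) pair: the first as served to the
    known spider, the second as served to an anonymous 'stealth' request.
    Returns "gated", "open" or "unclear".
    """
    crawler_url, crawler_body = crawler_view
    anon_url, anon_body = anonymous_view

    # A redirect to something that looks like a login page is the clear case.
    redirected = anon_url != crawler_url
    if redirected and any(hint in anon_url.lower() for hint in LOGIN_HINTS):
        return "gated"

    # Otherwise all we have is how similar the two bodies are.
    similarity = SequenceMatcher(None, crawler_body, anon_body).ratio()
    if similarity < threshold:
        # Different content, but the page may simply have been changed
        # between the two requests, so no verdict is possible.
        return "unclear"
    return "open"
```

The "unclear" result is the weak point: a page that was edited between the two requests looks the same to this check as a page that is being withheld.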
Maybe they could write a sophisticated programme that could do a reasonable job of stealth spidering, but, since the sites aren’t in breach of any guidelines, and it is only a small problem, if they see it as a problem at all, I doubt that they would spend their time trying to deal with it programmatically when they have much bigger problems, such as spam, to deal with.
I think, if they are going to do anything at all, it will have to be done by hand, and since some of the sites that use the technique are big brand sites that the engines need in the index (New York Times and other newspapers, etc.), I can’t see anything being done about it in the near future.
Update:
In November, a Googler wrote that Google is close to making an announcement about the issue, and that it will satisfy those who want Google’s spider to crawl and index pages that people must register to view. This is what he wrote:
“On a happier note, my colleagues and I are working on an arrangement which I think you’ll be pleased with… balancing many Webmasters’ interest in requiring community membership or signin to content-rich pages while still showing content in Google’s search results. Stay tuned (we’ll make an announcement in the Webmaster Central blog)”
My guess is that they are coming up with a system where a site can make an arrangement with Google, to inform them which pages are behind closed doors, so that Google can mark them as such in the search results; e.g. “subscription required” and “registration required”. It could be in the form of a new meta tag to be added to each page that requires a subscription or registration to view.
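Purely to illustrate that guess, here is a short Python sketch of how such a tag might be read and turned into a label for the results page. No such tag has been announced: the tag name "access", its two content values and the function result_label are all made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical tag, invented for illustration:
#   <meta name="access" content="registration">
LABELS = {
    "subscription": "subscription required",
    "registration": "registration required",
}


class AccessMetaParser(HTMLParser):
    """Remembers the content of the (hypothetical) access meta tag."""

    def __init__(self):
        super().__init__()
        self.access = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "access":
            self.access = (attributes.get("content") or "").strip().lower()


def result_label(html):
    """Return the label to show beside a listing, or None for an open page."""
    parser = AccessMetaParser()
    parser.feed(html)
    return LABELS.get(parser.access)
```

A page carrying the made-up tag with content "registration" would be labelled “registration required”, and a page without the tag would be listed as normal.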