SEO Forum : Google forum and seo forums for the open discussion of all seo techniques and methods

The Topic Sensitive PageRank and Florida theory: comments : Seo Forum

PhilC
Site Admin


Joined: 21 Nov 2002
Posts: 13052

Posted: Sun Jan 11, 2004 4:19 pm
Post subject: The Topic Sensitive PageRank and Florida theory: comments

I've been giving some thought to Dan Thies' Topic Sensitive PageRank (TSPR) theory and comparing it to the recognized Florida effects.

What is Topic Sensitive PageRank (TSPR)?

It is a value of a page's importance for a particular topic based on the linkages between on-topic pages. Normal PageRank is a value of a page's overall importance based on the linkages between all pages, regardless of topic or anything else. For Topic Sensitive PageRank, it is necessary to pre-determine a range of specific topics and, for each page in the index, to pre-calculate a Topic Sensitive PageRank for each topic.

The nature of PageRank means that it cannot be calculated on-the-fly at the time of the search because when a search is made, every on-topic page in the index needs to be considered for the rankings. If the TSPRs were not pre-calculated, Google wouldn't know which pages were on-topic and it would need to find out by using a normal algorithm, and then the TSPRs would need to be calculated for all the returned pages, or perhaps a sub-set of them. That's much too time consuming.

So TSPRs need to be pre-calculated and each page in the index would have a very large number of PageRanks - the normal PageRank, and a PageRank for each supported topic. That's a lot of additional storage space, but I guess that's not a problem.
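To make the pre-calculation concrete, here is a minimal sketch of one common formulation, in which the only difference between normal PageRank and a topic-sensitive one is where the "random jump" is allowed to land. The four-page web, the topic sets and the function names are made up for illustration; nothing here is claimed to be Google's actual code.

```python
def pagerank(links, teleport, damping=0.85, iterations=50):
    """Power-iteration PageRank over `links` (page -> list of pages it links to).

    `teleport` maps each page to its share of the random-jump probability.
    A uniform teleport gives ordinary PageRank; a teleport restricted to
    on-topic pages gives a topic-sensitive score.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is redistributed via the teleport.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {
            p: (1.0 - damping + damping * dangling) * teleport.get(p, 0.0)
            for p in pages
        }
        # Each page passes a damped, equal share of its rank to every page it links to.
        for p in pages:
            for q in links[p]:
                new_rank[q] += damping * rank[p] / len(links[p])
        rank = new_rank
    return rank


def topic_teleport(pages, topic_pages):
    """Teleport vector that jumps only to pages tagged with the topic."""
    return {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages}


# Hypothetical four-page web with two hand-picked topic sets.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
uniform = {p: 1.0 / len(links) for p in links}
topics = {"arts": {"a"}, "business": {"d"}}

normal_pr = pagerank(links, uniform)
# One full PageRank run per topic, all done offline before any search is made.
tspr = {t: pagerank(links, topic_teleport(links, s)) for t, s in topics.items()}
```

Each topic costs one complete PageRank run, and every page ends up with one stored score per topic in addition to its normal PageRank.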

How Topic Sensitive PageRank fits in with Florida

The biggest way that it fits in is that, like the expert system theory, TSPR cannot produce a set of results for every search query. It can only produce a set of results for its pre-defined topics. We know that the results for different searchterms are produced by different algorithms - the Florida algo and the old algo. Topic Sensitive PageRank appears to match that particular Florida effect.

Another way that Dan Thies fits TSPR with the Florida effects is that Florida results are returned for more general queries but not for more specific queries, but I'm not so sure that that's likely to happen with TSPR. When Topic Sensitive PageRank was first conceived, programmed and tested, it used the DMOZ top level categories as topics (Arts, Business, Computers, etc). Those one-word topics are as 'general' as it is possible to be, and it isn't difficult to produce a TSPR for every page in the index for each topic, or to produce a highly relevant set of results for each one-word topic.

But when an additional search word is added to a one-word topic, e.g. "uk holidays", then although it is still a very general searchterm, it would require a great many pre-defined topics, and TSPRs for each page in the index (including TSPR0s), to cover all the possible 2-word topics based on the word "holidays". That's just for a single one-word topic, but there are hundreds or even thousands of those one-worders.

I used the "uk holidays" example because I know that it produces Florida results. The 'holidays' based 2-worders may be limited so that every town and city in the world isn't included in the topics database but, even so, when you add up all the 2-worders for every one-word topic, there are still thousands upon thousands of 2-word search terms that would need to be pre-defined topics for TSPR to be a useful search algorithm.
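The scale of the problem is easy to put into rough numbers. The sketch below assumes one 4-byte score per page per topic and an index of three billion pages; both figures are assumptions chosen only to show how the storage grows with the number of topics.

```python
def tspr_storage_bytes(num_pages, num_topics, bytes_per_score=4):
    """Bytes needed for one score per page per topic, plus the normal PageRank."""
    return num_pages * (num_topics + 1) * bytes_per_score


PAGES = 3_000_000_000  # assumed index size, for illustration only

for topic_count in (16, 1_000, 100_000):
    gib = tspr_storage_bytes(PAGES, topic_count) / 2**30
    print(f"{topic_count:>7} topics -> {gib:,.0f} GiB")
```

The storage grows linearly with the number of topics, and so does the number of full PageRank runs needed to fill it.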

Dan suggested that each word in the searchterm might be in the topics database and that, for a given search query, a "distance" between them can be calculated and acted upon. I don't like this at all because it would mean that topics are created on-the-fly and, if that is so, where are the Topic Sensitive PageRanks for them? As was mentioned earlier, they can't be calculated on-the-fly. Either a TSPR for each page exists in advance for a specific topic, or it doesn't. If it doesn't exist then it can't be calculated at the point of searching, so there is no need to measure distances between words in a topics database.

He also suggested that Google might have combined their CIRCA technology (acquired when they purchased Applied Semantics) with TSPR. The idea is that the CIRCA technology is capable of deciding what a person is searching for from the words typed into the search box. The technology then selects a suitable topic and produces a set of results for it. Again, I don't like this because it would require topics to be created on-the-fly, or a topic to be often selected that is merely a close match for the searchterm.

and finally...

I'm not saying that Dan's theory is wrong. I'm saying that I'm inclined to think that it is wrong because there are parts of it that don't seem to add up. If I am correct in my assertion that PageRanks (topic sensitive or not) cannot be calculated on-the-fly, then it would require a very large database of pre-defined topics and, for each of those topics, a Topic Sensitive PageRank for every page in Google's index.

It takes Google around a week to calculate the normal PageRank for every page in the index. There just isn't the time for that to be done for the many thousands of necessary topics. Yes, for each topic, the calculations would be done with a comparatively small number of pages, albeit several million in some cases, and yes, the number of iterations could be reduced for the TSPR calculations, but why do that when it means ending up with inaccurate PageRanks? Besides, it requires a certain number of iterations to begin to get close to the final figures, so the number of them can't be reduced too much. I really don't believe that the computing time could be reduced sufficiently to make it possible to calculate Topic Sensitive PageRank on-the-fly at the point of search.
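The trade-off between the number of iterations and accuracy can be seen on a toy graph. This sketch assumes a made-up four-page web in which every page has at least one outlink, and measures how far the figures after n iterations are from the figures after 100 iterations.

```python
def pagerank_history(links, damping=0.85, iterations=100):
    """Return the PageRank estimate after each iteration (uniform random jump).

    Assumes every page has at least one outlink.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    history = []
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for p in pages:
            for q in links[p]:
                new_rank[q] += damping * rank[p] / len(links[p])
        rank = new_rank
        history.append(rank)
    return history


def l1_error(estimate, reference):
    """Sum of absolute differences between two sets of scores."""
    return sum(abs(estimate[p] - reference[p]) for p in reference)


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
history = pagerank_history(links)
final = history[-1]

# Error of the estimate after n iterations, relative to the 100-iteration figures.
errors = {n: l1_error(history[n - 1], final) for n in (1, 5, 10, 20)}
for n, err in errors.items():
    print(f"after {n:>2} iterations: error {err:.6f}")
```

The error shrinks with every pass, so stopping early gives approximate figures, not meaningless ones; whether an approximation is good enough depends on what the figures are used for.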

_________________
PhilC
Hidden Text
Search Engine Optimization articles and tools :: PageRank explained


Last edited by PhilC on Tue Jan 13, 2004 12:22 pm; edited 1 time in total
rustybrick
Member


Joined: 11 Jan 2004
Posts: 7
Location: New York, USA

Posted: Sun Jan 11, 2004 6:09 pm

Well, I think the assumption is that the CIRCA technology provides a mechanism for applying topic sensitivity to PageRank.

No one will argue it requires a very smart algorithm to accomplish this, but maybe it was finally achieved.

I do not know enough about the details of the algorithm but Teoma has done it, so why can't Google?
PhilC
Site Admin


Joined: 21 Nov 2002
Posts: 13052

Posted: Sun Jan 11, 2004 6:18 pm

I don't know how Teoma works, but TSPR is ordinary PageRank, which requires a reasonable number of iterations to get any meaningful figures. The "topic" part is just that it is calculated using far fewer pages than the normal PageRank.

The original idea of TSPR is to pre-calculate the values for each topic. Dan's idea is to do it on-the-fly and, in terms of time, I just don't think it can be done on-the-fly.

For it not to be done on-the-fly requires pre-set topics. CIRCA could decide on a suitable topic from the words provided in the searchterm, but Dan seemed to be suggesting something different - that topics are created from the searchterm words on-the-fly, based on the "distance" between them in the topic words database.

If topics are created on-the-fly, then Topic Sensitive PageRanks must also be calculated on-the-fly, and I don't believe that can happen - it takes too long.

rustybrick
Member


Joined: 11 Jan 2004
Posts: 7
Location: New York, USA

Posted: Sun Jan 11, 2004 11:13 pm

Interesting...
Mel
Site Admin


Joined: 03 Sep 2003
Posts: 9060

Posted: Mon Jan 12, 2004 11:24 am

I was under the impression that CIRCA was used to understand the topic of the page not the search, and that would seem to fit in with the way it is used for Adsense:

Quote:
Applied Semantics' products are based on its patented CIRCA technology, which understands, organizes, and extracts knowledge from websites and information repositories in a way that mimics human thought and enables more effective information retrieval

_________________
Expert SEO Services - Buy Cheap Used Cars
PhilC
Site Admin


Joined: 21 Nov 2002
Posts: 13052

Posted: Mon Jan 12, 2004 12:02 pm

If I've understood Dan's article correctly, he suggests that it is being used to understand the searchterm's topic, but the 'new algo' explanation is very limited in the article. It's more of a small overview.

Dan Thies wrote:
What CIRCA allows Applied Semantics (and Google) to do is to identify concepts related to specific words and phrases.

DanThies
Member


Joined: 13 Jan 2004
Posts: 35

Posted: Tue Jan 13, 2004 3:00 am

Phil:

It's entirely possible that I am completely, 100% wrong. :D I am certain that I am at least partially wrong. I would have to be, since I am speculating. I am also happy to have this conversation with someone who really understands PageRank in the first place.

Let me try to clarify things a bit. This report was created as a mini-update for my book's readers, so a lot of detail is left out.

For the sake of argument, let's say that Google is capable of calculating a set of topic-sensitive PageRank (TSPR) scores. Maybe it's 2 topics, maybe it's 16, 100, whatever. For each topic, they'd need to have a TSPR for each page, as Phil has pointed out. Maybe they'd only calculate TSPR for pages above a certain threshold in PageRank.

Topics are not search terms.

The original paper on TSPR (the link is in my report) describes the use of 16 topics, representing the top-level categories of DMOZ. You could do any search query, and bias the results by one of the 16 topics. You didn't have to search for the word "Arts" to use TSPR, you could use the TSPR for "Arts" to slant the results of any search toward "Arts."

But users don't say "I'm searching for these words using this topic." This was noted in my report. If they can't come up with a topic (or topics) to use, they can't use TSPR. So without some other mechanism, you would only be able to use TSPR when someone indicates the topic - for example, by "searching the web" from a directory page.

That's where Applied Semantics / CIRCA could come into play.

There's at least one example in my report, and I don't want to type it in again, but they *could* use CIRCA to determine how your search phrase is related to some topic or set of topics. They can also tell you how closely related, represented with a numeric value - a distance between your search phrase and a topic for which they have calculated TSPR.

The greater the 'semantic distance' (why not coin an extra term) between your search phrase and a topic, the less impact TSPR would have on the results, and the more influence that the generic PageRank would have.
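As a rough sketch of that idea: the linear blend below is only an assumption, since no formula is given anywhere in the report or this thread, and the function name and the 0-to-1 distance scale are invented for the example.

```python
def blended_score(generic_pr, topic_pr, distance):
    """Blend topic-sensitive and generic PageRank by semantic distance.

    `distance` is assumed to run from 0.0 (search phrase is squarely on the
    topic) to 1.0 (search phrase is unrelated to the topic).
    """
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance must be between 0 and 1")
    topic_weight = 1.0 - distance
    return topic_weight * topic_pr + (1.0 - topic_weight) * generic_pr


# A page that is strong within the topic but weak in generic PageRank.
for distance in (0.0, 0.5, 1.0):
    print(distance, blended_score(generic_pr=0.2, topic_pr=0.8, distance=distance))
```

At distance 0 the topic score decides everything, at distance 1 the result is plain PageRank, and anything in between is a mix of the two.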

That's the theory, in an even smaller nutshell, with more detail, I hope.

A few more quick hits to address specific questions:
- CIRCA could also be at play in determining the topics of pages for calculating TSPR. As someone pointed out, they're already doing this with Adsense.
- Taher Haveliwala is also one of the founders of Kaltix, the company formed by the people who had figured out how to calculate PageRank really fast, which was acquired by Google approximately 18 seconds after it was founded.
- To make TSPR work, you wouldn't need an exact PageRank for the topics, a fast approximation would do, since it's not the only value used in returning results, but is instead used to bias the results.
- For fun, look at nearly identical content on different web sites, that display Adsense. Depending on the type of site it's published on, you can get very different types of ads. An easy source of 'nearly identical content' is articles (like those I publish), which are frequently published on numerous web sites.
PhilC
Site Admin


Joined: 21 Nov 2002
Posts: 13052

Posted: Tue Jan 13, 2004 6:41 pm

Hi Dan,

I'm glad you stopped by because I would have liked to discuss it with you, but I'm banned from the Highrankings forum where I think you are an administrator ( http://www.webworkshop.net/seoforum/viewtopic.php?t=129 ). I even tried to find an email address on your site, but I couldn't find one with your name in it.

I still think that your TSPR idea is one of the two theories that has a chance of being right - the other being the "expert system" that I put forward here. In fact, I'd added a bit at the bottom of my article about your idea ( http://www.webworkshop.net/florida-update.html#latest ). Like you, I also pointed out flaws in the other common theories. So we are looking at Florida with very similar minds - that Google really does use two different algorithms depending on the searchterms, and that the early explanations, "seo filter", etc. were flawed. So on with the current discussion...

I've re-read Taher H. Haveliwala's paper and found that I'd misunderstood it. I'd assumed that each page in Google's index had to have a pre-computed TSPR value for each supported topic. But Taher H. Haveliwala wrote in his paper:-

Quote:
An approach for enhancing rankings by generating a PageRank vector for each possible query term was recently proposed ... with favorable results. However, the approach requires considerable processing time and storage, and is not easily extended to make use of user and query context.

It sounds like I was mistaken, because he's writing off the idea, but later in the paper he wrote:-

Quote:
In our approach to topic-sensitive PageRank, we precompute the importance scores offline, as with ordinary PageRank. However, we compute multiple importance scores for each page; we compute a set of scores of the importance of a page with respect to various topics.

Now it appears that I was correct. Confused, aren't I? :? I'm pretty sure that I'm suffering from the fact that I'm not a mathematician - specifically, I don't understand the mathematical use of the word "vector", as in "topic-sensitive PageRank vector". According to definitions found at Google, it has a number of meanings, including "A quantity having both magnitude and direction, e.g. displacement, velocity, acceleration and force". I rather fancy that that is the meaning in Taher's paper.

He continues...

Quote:
At query time, these importance scores are combined based on the topics of the query to form a composite PageRank score for those pages matching the query. This score can be used in conjunction with other IR-based scoring schemes to produce a final rank for the result pages with respect to the query.

and elsewhere...

Quote:
....we assume a user with a specific information need issues a query to our search engine in the conventional way, by entering a query into a search box. In this scenario, we determine the topics most closely associated with the query, and use the appropriate topic-sensitive PageRank vectors for ranking the documents satisfying the query. This ensures that the ``importance'' scores reflect a preference for the link structure of pages that have some bearing on the query.
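The query-time step described in those two quotes can be sketched as follows. The composite score is the sum, over topics, of the probability that the query belongs to the topic multiplied by the page's pre-computed score for that topic, with the topic probabilities estimated from the query words. Here a "PageRank vector" is simply a table of scores, one per page. The topics, word probabilities and scores are invented for the example.

```python
import math


def topic_probabilities(query_terms, term_probs, priors):
    """Estimate P(topic | query) as prior * product of P(term | topic), normalised."""
    scores = {}
    for topic, prior in priors.items():
        # Terms never seen for a topic get a tiny probability instead of zero.
        likelihood = math.prod(term_probs[topic].get(t, 1e-6) for t in query_terms)
        scores[topic] = prior * likelihood
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}


def composite_score(page, topic_weights, tspr):
    """Weighted sum of the page's pre-computed topic-sensitive scores."""
    return sum(w * tspr[topic].get(page, 0.0) for topic, w in topic_weights.items())


# Hypothetical pre-computed data: two topics, two pages.
priors = {"arts": 0.5, "business": 0.5}
term_probs = {
    "arts": {"gallery": 0.02, "estate": 0.001},
    "business": {"gallery": 0.001, "estate": 0.02},
}
tspr = {
    "arts": {"p1": 0.6, "p2": 0.1},
    "business": {"p1": 0.1, "p2": 0.7},
}

weights = topic_probabilities(["estate"], term_probs, priors)
ranked = sorted(tspr["arts"], key=lambda p: composite_score(p, weights, tspr), reverse=True)
```

Nothing is iterated at query time: the expensive per-topic figures already exist, and the search only weights and adds them.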

I was correct after all. However, I was mistaken that a TSPR is required for each page for every possible topic that is to be supported. In fact, the pre-defined topics only need to be of a more general nature because they are only used to 'bias' the results. "Bias" is a word that both Taher and you used.

The overall effect is to bias ("reflect a preference") the importance of each page in the results set towards the relevant topic(s) for the searchterm and its context. Now I think I'm getting somewhere :)

So onto your paper:-

Well, now that I have a better (but imperfect) understanding of the way that TSPR works, I don't see any immediately obvious flaws in your theory. It accounts for what I see as the single most important change since Florida - that results sets are compiled in different ways depending on the searchterm, and that the Florida serps are not a 'standard' results set to which one or more of the various suggested filters have been applied. I think that, like the 'expert system' theory ;), the TSPR theory stands a good chance of being correct.

I, Brian
Advanced Member


Joined: 06 Dec 2003
Posts: 366
Location: Yorkshire, UK

Posted: Tue Jan 13, 2004 9:19 pm

One of the fascinating things here is that TSPR would have remarkably similar results to the Hilltop theory that I have certainly favoured so far.

However, what I would like to see Dan address is why .edu, .gov and directory sites have such elevated rankings in the affected Florida results - as these are implicitly symptoms of a Hilltop system, and are even singled out in the Hilltop paper. In what way would you see TSPR cause these sorts of sites to suddenly rank abnormally high? And, in what way would TSPR differ most significantly from a Hilltop dominated algo?

As for themes and context - I do not at all doubt that this should be a part of SEO now.

_________________
SEO resources
DanThies
Member


Joined: 13 Jan 2004
Posts: 35

Posted: Wed Jan 14, 2004 12:03 am

How'd you get run off from the HR forums, Phil? I didn't know about that. I thought we only had one individual who was actually banished.

You, Brian: (did I get that right?)

Hilltop is similar to Topic-Sensitive PageRank, but I rejected the idea of Hilltop a few days into this thing, for several reasons. I don't think you'd see all these resource & directory type pages ranking as well with Hilltop, for example.

Hilltop is five years old. If it were anything more than an interesting idea, someone would have implemented it by now. For those who remember how crazy Altavista's results looked a couple years ago, is it possible that was an attempt to implement Hilltop?

The main reason, though, is that they'd be throwing PageRank out the window with Hilltop. If they were going to do that, there wouldn't have been much reason for them to create and acquire Kaltix last year, just a few short months before the big change.

PageRank is good, TSPR fixes its flaws. The problem is getting the topics right, getting the semantics right, and getting the balance right.

I'm sure the "onmousedown" code in Google's SERPs has already been discussed in these forums. Google is tracking clicks now, so they are getting live user feedback constantly.

They've said there are more factors to be added to the algorithm, and this must surely be one of them.... one more reason not to click on your competitor's organic listing, if you're super paranoid.

As far as why .edu, .gov, and directory pages are showing up more often (can anyone validate that this is true?), show me a specific SERP and let's walk through it.

There are two kinds of linking relationships on the web - natural and artificial. EDU and GOV sites, as well as directories, are going to have a huge advantage in natural linking from relevant pages. EDU and GOV web sites also tend to be extremely well structured, with related content clustered together, which is also an advantage.

It takes a long time to walk through web graphs, but it's very enlightening, especially after you've been at it for 18-20 hours. Try it - around the 19th hour of squinting at web graphs, a strange calm will overcome you, and you can actually SEE THE WEB.

Random additional note: With very generic search terms, Google seems to have a much better mix of the possible meanings than before. The "real estate" SERPs used to be dominated by residential real estate agents, now you see commercial real estate, real estate investing, real estate law, training, etc. etc. mixed into a lot of them.
PhilC
Site Admin


Joined: 21 Nov 2002
Posts: 13052

Posted: Wed Jan 14, 2004 12:30 am

DanThies wrote:
Hilltop is similar to Topic-Sensitive PageRank, but I rejected the idea of Hilltop a few days into this thing, for several reasons. I don't think you'd see all these resource & directory type pages ranking as well with Hilltop, for example.

Oddly enough, one of my reasons for liking the 'expert system' idea is because expert pages would tend to link more to resource pages than to commercial pages, which is what people have been seeing in the serps.

That wouldn't be true of directory pages, though, but I haven't seen a preponderance of directory pages around the top of the serps. I see many website pages listed that are in Google's directory, but that's different. How did Google include those pages before Florida? Did they work out the serps and then add the Directory description for each page that, coincidentally, was in their directory, or did they arbitrarily add some of them according to algorithms? I don't think we know the answer to that. But it's perfectly feasible that Google simply adds the directory description to each page that is selected for the serps and, coincidentally, is in the directory. I see no reason why that doesn't happen with the Florida results.

Personally, I've never suggested that it's Hilltop. I've said all along that it looks like an 'expert-based system', or words to that effect, that may have been developed from Hilltop. Why wouldn't an expert system have been implemented back when Hilltop was devised? Because that was very soon after Google was launched and, at that time, they were doing very well. It's only in more recent times that the relevancy of the serps provided by one or two other engines has caught up, or almost caught up, and Google needed to do something drastic to move ahead again.

So I still think that an expert system could account for all the reported Florida effects, and stands a good chance of being correct.

DanThies
Member


Joined: 13 Jan 2004
Posts: 35

Posted: Wed Jan 14, 2004 12:51 am

PhilC wrote:
So I still think that an expert system could account for all the reported Florida effects, and stands a good chance of being correct.

I agree. Whatever they're doing, it has to fit in seamlessly with what they were doing before. It doesn't have to be Topic-Sensitive PageRank. It could be something simpler, or something more complicated.
PhilC
Site Admin


Joined: 21 Nov 2002
Posts: 13052

Posted: Wed Jan 14, 2004 12:58 am

I thought a little more...

Directory pages (that's pages in a directory, and not pages that are linked to from a directory) are very likely to be selected as 'expert pages'. In fact, they are ideal. I see very many pages that are linked to from Google's directory around the top of the serps. And, if they are in Google's directory, they are also in DMOZ and many smaller sites' directories. That could amount to quite a few 'expert' pages linking to them, and could account for why there are so many up at the top.

Just more food for thought :)

If we ever find out what Florida really is, we'll probably all have been way off the mark 8) But I still think that an expert system and TSPR are the two most likely candidates that have surfaced so far.

rustybrick
Member


Joined: 11 Jan 2004
Posts: 7
Location: New York, USA

Posted: Wed Jan 14, 2004 1:45 pm

Question,

What are the core differences in bullet format between both your theories (TSPR and Expert Theory)?

I know the expert theory is based on hilltop but in essence they seem the same. I am going to re-read the Hilltop report but I hope to get a quick answer here.

Also, if I may (and I know I've said this a hundred times), how does Teoma's Subject-Specific Popularity differ from these theories? (I wrote a summary of it after being upset with Google's results in mid-December, only to realize now that Google is moving in that direction.)

This is an excellent thread.
PhilC
Site Admin


Joined: 21 Nov 2002
Posts: 13052

Posted: Wed Jan 14, 2004 2:57 pm

An expert system compiles the results set from the pages that are linked to from on-topic 'expert' pages. The expert pages are contained in a database. TSPR compiles the results set in the (or a) 'normal' way, but includes a bias towards one or more relevant, pre-defined topics.

I think that's the only core difference.
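A toy contrast of the two, purely as a sketch: the data structures, the vote counting and the 50/50 bias are all assumptions made for illustration, and neither function is claimed to be how Hilltop, an expert system or Google actually works.

```python
def expert_results(topic, expert_pages, links):
    """Expert-style: the results set is built only from pages that on-topic
    expert pages link to, ordered by how many experts link to each one."""
    votes = {}
    for expert in expert_pages.get(topic, []):
        for target in links.get(expert, []):
            votes[target] = votes.get(target, 0) + 1
    return sorted(votes, key=lambda page: (-votes[page], page))


def tspr_results(matching_pages, base_score, topic_score, bias=0.5):
    """TSPR-style: start from the normal results set and bias each page's
    score towards a pre-defined topic."""
    def score(page):
        return (1.0 - bias) * base_score[page] + bias * topic_score.get(page, 0.0)
    return sorted(matching_pages, key=lambda page: (-score(page), page))


# Hypothetical data for a "holidays" search.
expert_pages = {"holidays": ["dir1", "dir2"]}
links = {"dir1": ["x", "y"], "dir2": ["y"]}
from_experts = expert_results("holidays", expert_pages, links)

base = {"x": 0.5, "y": 0.2, "z": 0.3}
holiday_tspr = {"y": 0.1, "z": 0.9}
from_tspr = tspr_results(["x", "y", "z"], base, holiday_tspr)
```

In the first function a page that no expert links to can never appear; in the second every matching page is still a candidate, and the topic only changes the order.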

Last edited by PhilC on Thu Jan 15, 2004 2:36 am; edited 4 times in total
All times are GMT