"Google PageRank
History
N (logarithmic scale)
●
Start of WW at Cern 1990 Growth of WWW approx. exponential Searching content starts to get time consuming Search engines invented
–
●
●
●
year
1990 Archie (FTP) 1993 ALIWEB WebCrawler 1995 Dec. AltaVista
N = number of log file entries at Cern's Webserver – Graphics from www.w3.org (Berners Lee 1996)
–
History
●
Early Search Engines
– – –
Full text search Index with key words pointing to content Results mostly listed according to number of hits of search words in content Quantity is presented before quality Other criteria used as well (most “citied” pages count more) Additional ordering according to links structure of the web (PageRank)
– –
●
2001 Google
–
PageRank
●
Published in Paper of Sergey Brin and Lawrence Page (Stanford, 1998)
PR(A) := (1-d) + d ( PR(T1)/C(T1) + ... PR(Tn)/C(Tn) )
d = 0.85, a damping factor T2
C(Ti) := Number of Anchors inT1 C(T1) = 1, C(T2) = 3
Tn Two simple examples ? ?
T1 A
PageRank
PR(A) := (1-d) + d ( PR(T1)/C(T1) + ... PR(Tn)/C(Tn) ) = (1-d) = 0.15 C(B) := PR(B) / 1
PR(B) := (1-d) + d ( PR(B) )
PR(B) := (1-d) + d ( PR(B) ) = 1, with PR(B) = 1 as start value PR(B) := (1-d) + d * 0.5, mit PR(B) = 0.5 start value = 0.15 + 0.475 = 0.625 PR(B) := (1-d) + d * 0.625 = 0.15 + 0.85 * 0.625 = 0.68125 value iteratively increases to 1
Two simple examples 0.15 A B 1
PageRank
●
Observations
–
Start values seems not to matter (must be between 0-1) Iterative calculation results in final PageRank Pages with no outgoing links loose PageRank (dangling links) Average PageRank is 1 (if no dangling links exists) Adding links does not increase the PageRank
– –
– –
PageRank
●
PR models a random surfer
– – –
Starts at a random page Follows link In dead end: continue at random page
d = 0.85, probability that the surfer is bored by the page and randomly follows a link T2
T1 A
Tn ? ?
PageRank
●
PR of all pages / number of all pages is 1 Dead ends (dangling links) ignored by Google's page rank PageRank “exported” to dangling links usually is lost
●
●
?
?
?
1
1
PageRank
Lets consider all Web pages T1 to Tn of the world PR(Ti) := (1-d) + d ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) ) Applying all n equalities leads to a system of equalities with n unknown values (the PageRank) There is an exact solution
PR(T1) := (1-d) + d ( PR(T2) + 0.5 * PR(T3) ) PR(T2) := (1-d) + d ( 0.5 * PR(T3) ) PR(T3) := (1-d) + d ( PR(T1) ) T1 T2 T3
PageRank
●
Simple (inefficient) implementation based on breath-first-search
– – –
Initialize all pages with PR 0.15 Start with a list of one start node while ( list is not empty )
● ●
T := first (list) foreach (anchor a in T)
– –
update PageRank in a with PR(T) / n, where n is number of outgoing links in T If PageRank of a changes, then add a to the end of list
PageRank
●
PageRank visible in Google toolbar Non linear scale, probably logarithmic (base 6-7) 11 Units: 0, 1, 2, 3, 4, ... , 10
0 - 0.9 0.9 - 5.4 5.4 - 32.4 194.4 - 1 166.4
●
●
32.4 - 194.4
PageRank
●
Homepage with Google toolbar PageRank of 2 Tree like navigational structure Only navigational links between the top level nodes No outgoing links at top level, a few at the leafs Content, content, content: real content, no fakes Do not copy content from other Web sites: create new content (5-15 KBytes per page) Approx. 20 pages needed Create at least one page per day Think about the content before creating the Website Put keywords in title, header, bold face, emphasize, anchors, and outgoing links
●
●
●
●
●
●
●
●
●
..."
|
You need to upgrade your Flash Player , or try to enable javascript in order see this document properly.
|
|