ĐẠI HỌC QUỐC GIA HÀ NỘI
TRƯỜNG ĐẠI HỌC KHOA HỌC TỰ NHIÊN
Nguyễn Hoài Nam
WWW AND THE PAGERANK-RELATED
PROBLEMS
LUẬN VĂN THẠC SĨ KHOA HỌC
Hà Nội - 2006
ĐẠI HỌC QUỐC GIA HÀ NỘI
TRƯỜNG ĐẠI HỌC KHOA HỌC TỰ NHIÊN
Nguyễn Hoài Nam
WWW AND THE PAGERANK-RELATED
PROBLEMS
Chuyên ngành: Đảm bảo toán học cho máy tính và hệ thống tính toán
Mã số: 1.01.07
LUẬN VĂN THẠC SĨ KHOA HỌC
NGƯỜI HƯỚNG DẪN KHOA HỌC:
PGS. TS. HÀ QUANG THỤY
Hà Nội - 2006
HANOI NATIONAL UNIVERSITY
UNIVERSITY OF SCIENCE
Nguyen Hoai Nam
WWW AND THE PAGERANK-RELATED
PROBLEMS
Major: Mathematical assurances for computers and computing systems
Code: 1.01.07
MASTER THESIS
THESIS SUPERVISOR:
ASSOC. PROF. HA QUANG THUY
Hanoi - 2006
Intentionally left blank for your note
Contents
List of Figures ii
List of Tables iii
Introduction iv
Acknowledgement vi
Abstract ix
List of Glossaries xi
1 Objects’ ranks and applications to WWW 1
1.1 Rank of objects . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Rank for objects different from web page. . . . . . . . . . 2
1.1.2 Linkage usage in search engine . . . . . . . . . . . . . 4
1.2 Basis of PageRank . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Mathematical basis . . . . . . . . . . . . . . . . . . . 11
1.2.2 Practical issues. . . . . . . . . . . . . . . . . . . . . 13
Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 19
2 Some PageRank-related problems 20
2.1 Accelerating problems . . . . . . . . . . . . . . . . . . . . . 20
2.1.1 Related works . . . . . . . . . . . . . . . . . . . . . 21
2.1.2 Exploiting block structure of the Web . . . . . . . . . . . 22
2.2 Connected-component PageRank approach . . . . . . . . . . . 30
2.2.1 Initial ideas. . . . . . . . . . . . . . . . . . . . . . . 30
2.2.2 Mathematical basis of CCP . . . . . . . . . . . . . . . 32
2.2.3 On practical side of CCP. . . . . . . . . . . . . . . . . 35
2.3 Spam and spam detection . . . . . . . . . . . . . . . . . . . 37
2.3.1 Introduction to Web spam . . . . . . . . . . . . . . . . 37
2.3.2 Spam techniques . . . . . . . . . . . . . . . . . . . . 38
Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 42
3 Implementations and experimental results 43
3.1 Search engine Nutch . . . . . . . . . . . . . . . . . . . . . 43
3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.1 Infrastructure and data sets . . . . . . . . . . . . . . . 48
3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . 48
Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 58
Conclusions and future works 59
References 61
List of Figures
1.1 A small directed Web graph with 7 nodes . . . . . . . . . . . . . . . 13
1.2 Ranks of page in SmallWeb graph with α = .9 . . . . . . . . . . . . . 16
1.3 Figures exemplifying results with different α of SmallSet . . . . . . . 17
1.4 Graph of iterations needed with α ∈ [.5; .9] of SmallSet . . . . . . . . 18
2.1 Convergence rates for standard PageRank vs. BlockRank . . . . . . 29
2.2 Unarranged matrix and arranged matrix . . . . . . . . . . . . . . . . 31
2.3 Boosting techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1 Nutch packages dependencies . . . . . . . . . . . . . . . . . . . . . 46
3.2 Matrices from set 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Ranks from PageRank vs. CCP for set 1 . . . . . . . . . . . . . . . . 50
3.4 Ranks and differences corresponding to each block for set 1 . . . . 51
3.5 Time of two methods with different decay values for set 1 . . . . . . 51
3.6 No. of iterations of two methods with different decay values for set 1 52
3.7 Matrices from set 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.8 Ranks from PageRank vs. CCP for set 2 . . . . . . . . . . . . . . . . 53
3.9 Ranks and differences corresponding to each block for set 2 . . . . 53
3.10 Time of two methods with different decay values for set 2 . . . . . . 54
3.11 No. of iterations of two methods with different decay values for set 2 54
3.12 Matrices from set 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.13 Ranks from PageRank vs. CCP for set 3 . . . . . . . . . . . . . . . . 55
3.14 Ranks and differences corresponding to each block for set 3 . . . . 56
3.15 Time of two methods with different decay values for set 3 . . . . . . 56
3.16 No. of iterations of two methods with different decay values for set 3 57
3.17 Mean case for 3 sets with different decay values . . . . . . . . . . . 57
List of Tables
1.1 Iterations needed for each value of decay-factor α of SmallSet . . . 17
2.1 The closeness of local PageRank vector to global PageRank vector 25
3.1 Each set with corresponding sources . . . . . . . . . . . . . . . . . . 49
3.2 Each set with corresponding number of pages and links . . . . . . . 49
Introduction
Human history has witnessed many great improvements, but nothing is comparable to the growth of networks in general and of the WWW in particular.
During the last decades, databases have grown exponentially in large stores and companies. In the old days, seekers faced many difficulties in finding enough data to meet their needs. The picture has changed, and now the reverse problem arises daily: how to find the relevant data in the information which is constantly accumulated. The need therefore arises for better approaches that are able to handle complex models in a reasonable amount of time, because the old models in statistics, machine learning and traditional data analysis fail to cope with this level of complexity. This field is called data mining, with an important branch, Web mining.
The key inspiration of Web mining is the WWW's growth. There is no doubt that the WWW is the most valuable source of information, but it is not easy to use if you do not know the way. Several studies have tried to estimate the size of the Web, and most of them agree that it contains billions of pages. Additionally, the changing rate of the Web is even more dramatic [4]. According to [4], the size of the Web has doubled in less than two years, and this growth rate is projected to continue for the next two years. Although an enormous amount of information is available on the Web, you can hardly find the appropriate information without some support. For that reason, you need a search engine to make your work easier.
We can consider search engines roughly as places that store information and return what we need after some operations. Everything returned by a search engine is sorted in descending order of relevance, meaning that the entries which appear first are expected to be more relevant. Even if that seems natural, many efforts were made to make this step work well. Many methods deal with this problem, approaching it in various ways. In this problem, we have to determine a page's importance score in order to place it in the sorted list of pages [16, 6]. The most famous approach is PageRank™ from Google [1]. This thesis focuses on PageRank and on modifications that speed up the computing process.
With the aim of understanding network phenomena and, especially, PageRank computation, this thesis is organized as follows:
• Chapter 1: Presents the rank concept and PageRank. This chapter is the background on which the following chapters build.
• Chapter 2: Deals with some PageRank-related problems, such as the accelerating problem and spam. A new method, which we have published [19], is also proposed in this chapter (section 2.2).
• Chapter 3: Concentrates on the practical side of PageRank with implementations and experiments. Conclusions and future works are also mentioned in this chapter.
Acknowledgement
My first words are to acknowledge all the people who have helped and contributed to the work in this thesis. It is obvious that I could not finish this thesis alone: it would have been an impossible task to work on everything myself, from deciding the orientation to doing research, implementing, and many more things. I tried my best, and if your name is not listed, please trust that my gratitude is no less than for those who are listed below.
In the foremost position of my gratitude stands Assoc. Prof. Ha Quang Thuy, for his supervision, advice and guidance from the very first day I became acquainted with science. His endless enthusiasm for science has inspired my research, and thanks to it this thesis could be realized. Above all, and most needed, he always tried his utmost to encourage and support me in various ways. Once again, I would like to send my warmest thanks to him.
Many thanks go to Assoc. Prof. Hoang Chi Thanh, Head of the Department of Informatics where I am working, for his help and advice in both work and life. Without his help, my thesis could not be what it is now.
It is a pleasure for me to express my grateful thanks to Prof. Pham Ky Anh for his generous disposition. I was allowed to use many facilities of the Center for High Performance Computing, where he is the Director, to do research, and these facilities helped form a big and important part of my thesis.
To Doan Duy Hai, Department of Computational and Applied Mathematics: it is a great joy for me to have such a wonderful friend and colleague. Maybe he does not know how much I am indebted to him for his kind help. I thank him for his valuable advice in scientific discussions and for the precious time he spent helping me. His precise ideas were an unforgettable assistance to my scientific work. Furthermore, with his superb skill in LaTeX, I had help from an expert to drive the troubles away.
The working environment is a key factor in forming one's way and attitude of working. I am very lucky to work in a good environment with wise and scholarly colleagues, and I have benefited from academic courses and professional discussions with these colleagues, including my teachers and friends. I thank Nguyen Trung Kien, whose unconditional help with implementing, running and storing data in my experiments made things easier for me.
I have had a nice collaboration with the KDD seminar group of the Faculty of Information Technology, College of Technology. I was fortunate to have several opportunities to work and discuss science with those intellectuals. I especially thank Nguyen Thu Trang for her help and the time she spent programming with me.
This work was supported in part by the National Project "Developing con-
tent filter systems to support management and implementation public secu-
rity - ensure policy" and the MoST-203906 Project "Information Extraction
Models for discovering entities and semantic relations from Vietnamese Web
pages".
I cannot wait any longer to wholeheartedly acknowledge my sweetheart. She is really an angel whose dedication, love, intelligence and confidence in me truly carried me through the hard work.
I am extremely lucky to have been born into my family. My father, the first person who laid the basis of my learning character, taught me the joy of intellectual recreation even when I was a child. My mother, the best mother in the world, with her unselfish sacrifice, brought me up with gentle love. And without the spiritual support of my younger brother, I might have failed many times.
My last words are to thank everybody, because you deserve it for your importance to me.
Hanoi, October 25th, 2006
Nguyen Hoai Nam
Abstract
In this thesis, we study the WWW from the point of view of social networks. There are many ways to approach the WWW, but we chose that of social networks because it gives us many advantages in doing research.
Whenever we hear of the WWW, it refers us to the Internet, the biggest network in the world. Networks are all around us, all the time: from the biochemistry of our cells to the web of friendships across the planet, from the circuitry of modern electronics to chains of historical events. A network is the result of the forces that shaped it. Thus the principles of network formation can be, to some extent, deciphered from the network itself. All such information comprises the structure of the network.
Moreover, the Internet has been improving rapidly, from a small network for experimental purposes to a world-wide network. The plentiful content of the World-Wide Web is useful to millions, but information is only available if you know where to find it. This thesis centres on some aspects of network-related problems:
• How can we find the information we need? The answer is the appearance of search engines.
• How is the returned information sorted? The key idea is to use a ranking approach, and the most famous approach, PageRank™, is carefully studied in this thesis.
• What problems arise from the answers to the two previous questions? These problems cover storage capacity, computational complexity, accelerating methods, programming skills, and many more.
To check whether the theory for improving the existing method holds, we implemented it with data sets downloaded from the Internet. Experimental figures show that our method is relatively good and really promising.
List of Glossaries
Glossary name Description
CCP Connected-component PageRank
IR Information Retrieval
PR PageRank
URL Uniform Resource Locator
WWW World Wide Web
CHAPTER 1
Objects’ ranks and applications to WWW
The WWW is a very famous entity that you meet every day, at any time and in any circumstance. It gradually prevails in every aspect of our lives, and it is definitely useful. Because of the endlessly enormous information which the WWW can provide, we must have a scheme to sort things in order to determine which are the most suitable. This chapter will guide you through one of the most topical problems addressed.
1.1 Rank of objects
Many datasets of interest today are best described as a linked collection of
interrelated objects. These may represent homogeneous networks, in which
there is a single object type and a single link type, or richer, heterogeneous networks,
in which there may be multiple object and link types (and possibly other se-
mantic information). Examples of homogeneous networks include single mode
social networks, such as people connected by friendship links, or the WWW,
a collection of linked web pages. Examples of heterogeneous networks include
those in medical domains describing patients, diseases, treatments and con-
tacts, or in bibliographic domains describing publications, authors, and venues.
These examples may include sets of scientific papers: as we all know, we must credit others' work if we use it in our own paper, and such a citation can be considered a link between our paper and the credited work. Link mining refers to data mining techniques that explicitly consider these links
addressed link mining tasks include object ranking, group detection, collective
classification, link prediction and subgraph discovery. While network analysis
has been studied in depth in particular areas such as social network analy-
sis, hypertext mining, and web analysis, only recently has there been a cross-
fertilization of ideas among these different communities. This is an exciting,
rapidly expanding area.
In this section, examples of ranking objects in many kinds of datasets are examined to show that an approach to sorting relevant results is required. The section then turns to the need for a ranking algorithm for the WWW, one of the most important communities in our contemporary world.
1.1.1 Rank for objects different from web page
Progress in digital data acquisition and storage technology has resulted in
the growth of huge databases. This has occurred in all areas of human en-
deavor, from the mundane (such as supermarket transaction data, credit card
usage records, telephone call details, and government statistics) to the more
exotic (such as images of astronomical bodies, molecular databases, and med-
ical records). Little wonder, then, that interest has grown in the possibility
of tapping these data, of extracting from them information that might be of
value to the owner of the database. The discipline concerned with this task
has become known as data mining.
Data mining, with its aim of finding knowledge in a "big mountain" of information, needs a way to sort the things it extracts. We will study some examples to see that a ranking algorithm is a "must-have" part of every mining system.
Example 1.1. Consider an online computer store that accepts queries on the Configuration and Price attributes of its goods. An agent could treat query conditions as if they were regular Boolean conditions and then return the matching items. This way, the agent or any client can determine an acceptable radius of configuration and price around the preferred products. However, suppose we are browsing a very big store: there could be too many matching items, making the user's task difficult. Additionally, many items that are slightly weaker in specification but have very good prices within the searching range would be missed.
Thus, we should think of a new ranking system which ranks results according to how close they are to the given Configuration and whether they have the best or a relatively inexpensive Price.
In addition, we can briefly propose a method to rank items, as follows.
Example 1.2. Suppose the two criteria Configuration and Price are used, and each criterion is assigned a weight as in 0.8c + 0.2p, where c ∈ [0, 1] shows how close the result is to the target configuration and p ∈ [0, 1] indicates how close the price is to the target price. Looking at the formula, a computer with a more relevant configuration is ranked higher in the results. If another store wanted to weigh configuration and price equally, the formula would be 0.5c + 0.5p.
Suppose that a customer looks for a computer with a specific configuration and a target price of 500 US$. In the store, there is only one computer with that specific configuration (c = 1, quite good) but a high price (i.e. p = 0.2). All the remaining computers have a near configuration, c = 0.8, and a moderate price, p = 0.5. Note that, with the same or a near configuration, computers can differ in price depending on the manufacturer.
Applying the definition above, we compute the relevance score of each computer type. The computer that suits the configuration best has a score of 0.8 × 1 + 0.2 × 0.2 = 0.84, whereas the other computers have scores of 0.8 × 0.8 + 0.2 × 0.5 = 0.74. Therefore, in this store, the computer with the best matching configuration is ranked first and the others get lower ranks. What happens if we consider a store which assigns the same weight to both configuration and price? Let us redo the computation for these computers. The computer that suits the configuration best has a score of 0.5 × 1 + 0.5 × 0.2 = 0.6, whereas the other computers have scores of 0.5 × 0.8 + 0.5 × 0.5 = 0.65. Therefore, in this store, the answer is different: computers with a different configuration but a good price are the first choice because they have higher scores, and the computer with the exact configuration but a high price receives a worse vote from the store's algorithm.
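The two weighting schemes of Example 1.2 are easy to check mechanically. The sketch below (the `relevance` helper is illustrative, not part of the thesis) reproduces the scores of both stores:

```python
def relevance(c, p, w_config, w_price):
    """Weighted relevance of an item, where c, p in [0, 1] measure how
    close the item is to the target configuration and target price."""
    return w_config * c + w_price * p

# Store that favours configuration (weights 0.8 / 0.2):
exact_hi_price = relevance(1.0, 0.2, 0.8, 0.2)   # exact configuration, high price: 0.84
near_mid_price = relevance(0.8, 0.5, 0.8, 0.2)   # near configuration, moderate price: 0.74

# Store that weighs configuration and price equally (0.5 / 0.5):
exact_hi_price_eq = relevance(1.0, 0.2, 0.5, 0.5)  # 0.60
near_mid_price_eq = relevance(0.8, 0.5, 0.5, 0.5)  # 0.65 -- the order flips
```

With the first weighting the exact-configuration computer wins; with equal weights the moderately priced computers overtake it, exactly as in the example.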
The problem arising in Example 1.2 is not the only one. We will face another problem, which is described in Example 1.3.
Example 1.3. Suppose that there is a service which combines results from many online stores to provide information to customers. This service has to query many stores at once and then combine the results into a single ranked list. Nevertheless, what happens if this service uses an algorithm different from the stores', as in the case described in Example 1.2? This is an open question.
It is hard to determine which system is right, which is better, and which we should follow in the previous examples. Moreover, not only computers but also many other kinds of entities around us need to be ranked. That leads us to a big job: ranking objects. This is an open question that the thesis' author intends to address in the future.
In the previous examples, computers with different configurations and prices have no interrelated information. What about objects that do have connections between them? Is it easier to mine and extract useful information from them? In the next section, we consider web pages, special objects which have links, and see how the links between web pages benefit us.
1.1.2 Linkage usage in search engine
Links, or more generically relationships, among data instances are ubiquitous.
These links often exhibit patterns that can indicate properties of the data in-
stances such as the importance, rank, or category of the object. In some cases,
not all links will be observed; therefore, we may be interested in predicting
the existence of links between instances. In other domains, where the links
are evolving over time, our goal may be to predict whether a link will exist
in the future, given the previously observed links. By considering links, more complex patterns arise as well. This leads to other challenges focused on discovering substructures, such as communities, groups, or common subgraphs.
Traditional data mining algorithms such as association rule mining, mar-
ket basket analysis, and cluster analysis commonly attempt to find patterns
in a dataset characterized by a collection of independent instances of a single
relation. One can think of this process as learning a model for the node at-
tributes of a homogeneous graph while ignoring the links between the nodes.
A key emerging challenge for data mining is tackling the problem of mining
richly structured, heterogeneous datasets. These kinds of datasets are best de-
scribed as networks or graphs. The domains often consist of a variety of object
types; the objects can be linked in a variety of ways. Thus, the graph may have
different node and edge types. Naively applying traditional statistical infer-
ence procedures, which assume that instances are independent, can lead to
inappropriate conclusions about the data. Care must be taken that potential
correlations due to links are handled appropriately. In fact, object linkage is
knowledge that should be exploited. This information can be used to improve
the predictive accuracy of the learned models: attributes of linked objects are
often correlated, and links are more likely to exist between objects that have
some commonality. In addition, the graph structure itself may be an impor-
tant element to include in the model. Structural properties such as degree and
connectivity can be important indicators. Link mining is a newly emerging re-
search area that is at the intersection of the work in link analysis, hypertext
and web mining, relational learning and inductive logic programming, and
graph mining. We use the term link mining to put a special emphasis on the
links, moving them up to first-class citizens in the data analysis endeavor.
Now we will look at link mining from the point of view of the main inspiration, the WWW. There is no doubt that the Web is really gigantic, and dealing with it is a really challenging task. Paper [4] has pointed out that the size and the content of the Web change second by second at a dramatic growth rate. The size of the WWW is estimated to have doubled in less than two years, and this growth rate is projected to continue for the next two years. As Google [1] has announced, they have a repository of over 8 billion pages indexed, and even this number does not reflect all the pages existing on the WWW.
Aside from newly created pages, the existing pages are continuously updated and, consequently, the contents of the Web change incessantly. For example, in a study of over half a million pages, [4] also shows that about 23% of pages changed daily. In the .com domain, 40% of the pages changed daily, and in about 10 days half of the pages were gone (i.e., their URLs were no longer valid).
In addition to size and rapid change, the interlinked nature of the Web
sets it apart from many other collections. Several studies aim to understand
how the Web’s linkage is structured and how that structure can be modeled.
One recent study, for example, suggests that the link structure of the Web is
somewhat like a "bow-tie" [7]. That is, about 28% of the pages constitute a strongly connected core (the center of the bow tie). About 22% form one of the tie's loops: these are pages that can be reached from the core but not vice versa.
The other loop consists of 22% of the pages that can reach the core, but cannot
be reached from it. (The remaining nodes can neither reach the core nor can
be reached from the core.)
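The bow-tie decomposition can be sketched with two plain reachability searches: pages reachable from the core form the core plus one loop, and pages that can reach the core form the core plus the other loop. The toy graph below is made up for illustration; it is not data from the study in [7].

```python
from collections import defaultdict

def reachable(adj, starts):
    """All nodes reachable from `starts` by following edges in `adj`."""
    seen, stack = set(starts), list(starts)
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def bow_tie(edges, core):
    """Split nodes into IN (can reach the core), OUT (reachable from the
    core) and the disconnected rest, relative to a given connected core."""
    fwd, rev, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
        nodes.update((u, v))
    from_core = reachable(fwd, core)   # core + OUT loop
    to_core = reachable(rev, core)     # core + IN loop
    return to_core - from_core, from_core - to_core, nodes - from_core - to_core

# Toy web graph: a <-> b is the core, c links into it, d is linked from it,
# and e -> f is disconnected from the core.
edges = [("a", "b"), ("b", "a"), ("c", "a"), ("b", "d"), ("e", "f")]
in_part, out_part, rest = bow_tie(edges, {"a", "b"})
# in_part == {"c"}, out_part == {"d"}, rest == {"e", "f"}
```

On the real Web graph the same reachability idea applies, although at that scale it is run with external-memory or distributed traversals rather than an in-memory stack.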
With its own complex structure, the WWW has created many difficulties for people aiming to understand it and, furthermore, to utilize it and find useful information in its enormous warehouse of information resources. As stated many times in this thesis, this is the main reason why search engines appeared.
Example 1.4. For many of us, if we want to search for scientific papers, besides non-free sites, one of the first websites that comes to mind is Citeseer. This website has a huge collection of scientific papers, especially in Information Technology. It collects papers from many sources and stores them in its database. If you want to search for any paper, just go to its URL, type what you want and look at the results returned. This is quite an easy task for us. But what about people who are new to using the Internet? It may be impossible for them to know the URL needed to find what they want. Moreover, Citeseer is limited to a small number of available papers, and scientific papers are not the only thing that people want to look up on the WWW.
What is shown in Example 1.4 addresses one of the smallest difficulties that Internet users face in finding what they want on the Internet, although information is available everywhere. Search engines may be the best solution for us. When a user types a query with some terms, a search engine looks up its database to find the relevant web pages and returns results at the user's terminal. In general, a search engine accumulates web pages in its database. Then, based on some criteria, it computes each web page's importance score to decide the place that web page should take. The importance score is a combination of the two following components:
• a relevance score between the query and the content of the web page; this score can be combined from many coefficients. These criteria may include:
– term frequencies: the number of appearances of the keyword's terms in a web page's content. For example, if your query contains "science", then a page with more occurrences of the word "science" would be more relevant to your query.
– term proximities: this criterion defines how close your query is to a web page's content. If your query has the word "health", then a hospital's homepage would likely be more appropriate compared to a supermarket's homepage.
– term positions: if the terms of your keyword appear in certain positions in a web page's content, the page is expected to be suitable. These positions are: in the title, at the top of the page, and so on. This is quite easy to understand: a title or the words at the top of a page are highly condensed information extracted from the many sentences below them. Thus, a web page with those characteristics would relate well to the query.
– term characteristics: in a web page, words can be formatted in many ways: bold, italics, capitalized, and so on, and each format has its own meaning. If a word is in bold or italics, it may be emphasized so that people pay more attention to it. And if your query has a word that coincides with an emphasized word in a web page, your query would receive a higher relevance there. We will not detail how this score is computed.
– category information: we divide things into categories, and not all categories have the same importance. This criterion takes advantage of that fact: if a page in a higher-priority category contains one or more words of your keyword, it gets a higher score from category information.
– popularity information: there are world-widely known pages, and there are pages that no one knows, too. So if a more popular page has words that match a query's words, it will be ranked higher according to the popularity information gained. This factor is determined by the number of users reaching a web page, how active that web page is, and so on.
• a rank score, which can be referred to as the page's own value and can be computed offline [16]. What is a page's own value, and why can it be computed offline? It means that every page has an adhering value which is independent of the queries issued. The scores above can only be computed when a query is issued, but the rank score is an exception. Let us consider the following example to understand the previous criteria more clearly.
Example 1.5. We assume that a user issues a query with the single word "TOEFL", and there are two web pages, namely α and β, which can satisfy this query. For α, the term frequency of the word "TOEFL" is 9, and the number is 7 for web page β. Therefore, α has the better term-frequency score. Regarding term proximities, α is education-biased and β is in the field of engineering. As we all know, TOEFL is a test from ETS to check English proficiency, so α, with its focus on education, is the better candidate. The better candidate in this example receives 6.8 and the other one receives 5.4. The third criterion works out as follows: α has no TOEFL word in its title and β has one TOEFL in its title. Clearly, in this case, β receives the better score from this criterion: β gets 6.9 and α gets 4.2 for that reason. The term characteristics are a little harder to compute: we have to check whether our keywords lie in bold, italics or H1[1] formats, etc., to determine the scores for this factor. Suppose that the final values of this score for the two web pages are 6.0 for α and 7.0 for β. Category information is a relative score: we have to define which categories are more important among hundreds of categories. We do believe that there are thousands of ideas people can think of to defend their choices of categories as better.
[1] In terms of web pages, which are mainly derived from HTML, H1 corresponds to Heading 1, which defines the level of a portion of text (H1 is the highest level).
That means people will sort categories' importance quite differently. For that reason, in this example, we choose the same category-information score of 5.0 for both α and β. Last but not least, the popularity-information score is one that we should treat with care. An enterprise's homepage, such as Phillips', is definitely more important than an individual's blog[2], but Phillips' homepage may not be as widely known as the most famous blog, or as famous as a video-sharing site that allows you to post your self-edited videos. So this is an important part of the combination that settles the importance of a page. Coming back to the problem, we give α a popularity score of 2.2 and β 2.6.
After gathering all the needed information about the two web pages, we feed it into a formula whose result evaluates the pages' importance. As in Example 1.2, we can weigh the different scores with different coefficients, and how we choose these coefficients can significantly change the result. In this example, the weight of term frequencies is 0.1, the weight of term proximities is 0.13, the term-position factor is set to 0.17, term characteristics are weighed by 0.1, and the weights for category information and popularity information are 0.1 and 0.15, respectively. The final weight, which is assigned to the rank score, is 0.25.
The formula is given as

IS = wfre×fre + wpro×pro + wpo×po + wcha×cha + wcat×cat + wpop×pop + wrank×rank

in which IS is the importance score, w denotes a weight, and fre, pro, po, cha, cat, pop, rank are term frequencies, term proximities, term positions, term characteristics, category information, popularity and rank score, respectively. With those parameters, we compute IS for each of α and β. Consider two cases of rank for α and β to demonstrate how the importance scores of pages change.
Case 1: rankα = 4 and rankβ = 6

ISα = 0.1×9 + 0.13×6.8 + 0.17×6.9 + 0.1×6.0 + 0.1×5.0 + 0.15×2.2 + 0.25×4 = 5.387
ISβ = 0.1×7 + 0.13×5.4 + 0.17×4.2 + 0.1×7.0 + 0.1×5.0 + 0.15×2.6 + 0.25×6 = 5.206
2a.k.a. web log – a kind of online diary
Obviously, page α will be placed above page β.

Case 2: rankα = 4 and rankβ = 7

ISα = 5.387
ISβ = 0.1×7 + 0.13×5.4 + 0.17×4.2 + 0.1×7.0 + 0.1×5.0 + 0.15×2.6 + 0.25×7 = 5.456
In this case, the final IS of β changes so that β takes a better place than α in the results returned to the user.
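As a sketch, the whole computation can be scripted; the per-criterion scores below are the ones that reproduce the printed totals, and all variable names are ours:

```python
# Weighted-sum importance score of Example 1.5 (illustrative names and values).
WEIGHTS = {"fre": 0.10, "pro": 0.13, "po": 0.17,
           "cha": 0.10, "cat": 0.10, "pop": 0.15, "rank": 0.25}

def importance_score(scores, weights=WEIGHTS):
    """IS = sum over all criteria of weight * criterion score."""
    return sum(weights[k] * v for k, v in scores.items())

alpha = {"fre": 9, "pro": 6.8, "po": 6.9, "cha": 6.0, "cat": 5.0, "pop": 2.2}
beta  = {"fre": 7, "pro": 5.4, "po": 4.2, "cha": 7.0, "cat": 5.0, "pop": 2.6}

# Case 1: rank 4 for alpha, rank 6 for beta
is_alpha = importance_score({**alpha, "rank": 4})   # 5.387
is_beta  = importance_score({**beta,  "rank": 6})   # 5.206
# Case 2: raising beta's rank to 7 flips the ordering
is_beta2 = importance_score({**beta,  "rank": 7})   # 5.456
```

Note how the single rank term, weighted 0.25, is enough to reorder the two pages.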
As can be seen in Example 1.5, the importance score of a page can change, giving the page a higher place, if the page has a good rank. But rank is not the only factor that can change the importance score of a page; there are others. Although those other factors can easily be tweaked to get a higher IS, some restrictions make such modifications harder. For example, you can bias your web page by using many words related to a specific subject, but it may then be considered spam3.
We have detailed how the criteria, except rank, are computed and used in determining a page's importance score. As a result, questions arise: how is the rank of a page calculated, and how many ways are there to compute the ranks of pages? The answer is as stated before: the linkage structure can be greatly useful. How the linkage structure is used will be detailed in the next section, with the most famous algorithm, PageRankTM from the search engine Google [1].
1.2 Basis of PageRank
There are two main reasons why traditional Information Retrieval (IR) techniques may not be effective enough in ranking query results. First, the Web is very large, with great variation in the amount, quality and types of information presented in Web pages. Thus, many pages that contain the search terms may be of poor quality or not relevant. Second, many Web pages are not sufficiently self-descriptive, so IR techniques that examine the contents of a page alone may not work well. An often-cited example to illustrate this issue is
3This is the subject of the next chapter
the search for "search engines". The homepages of most of the principal search engines do not contain the text "search engine". Moreover, Web pages are frequently manipulated by adding misleading terms so they are ranked higher by a search engine (spamming). Thus, techniques that base their decisions on the content of pages alone are easy to manipulate. The link structure of the Web contains important implied information, and can help in filtering or ranking Web pages. In particular, a link from page A to page B can be considered a recommendation of page B by the author of A. Some new algorithms have been proposed that exploit this link structure - not only for keyword searching, but also for other tasks like automatically building a Yahoo-like hierarchy, or identifying communities on the Web. The qualitative performance of these algorithms is generally better than that of the IR algorithms, since they make use of more information than just the contents of the pages. While it is indeed possible to influence the link structure of the Web locally, it is quite hard to do so at a global level. Therefore, link analysis algorithms that work at a global level are relatively robust against spamming.
1.2.1 Mathematical basis
In [16], Page et al. introduced an approach, named PageRank, which tries to capture the notion of the importance of a page. For instance, the Google homepage is intuitively more important than the homepage of a normal university. This difference is reflected in the number of other pages that point to these two pages; that is, more pages point to the Google homepage than to the university's homepage. PageRank bases its ranking on the idea that if there is a link from web page A to web page B, the importance of page A will influence the importance of page B. That means a page receives more importance if Yahoo points to it than if some unknown page points to it. Citation ranking, in contrast, does not distinguish between these two cases. Note that the definition of PageRank is recursive: the importance of a page both depends on and influences the importance of other pages.
Page and Brin [16] use the linkage structure of the Web to build a stochastic irreducible Markov chain with a transition probability matrix. We will first present the simple definition of PageRank and discuss some of its computational aspects; more details on the computational aspects are mentioned in the
following chapter. Suppose that we have a collection of web pages in a database and that these pages are numbered from 1 to n. For an arbitrary page u, let BI(u) be the collection of pages in which every page points (links) to u, let Nu be the number of links going out from u, and let πu be the rank of page u.
The rank πu of page u in the PageRank method is computed as follows:

πu = Σ_{i∈BI(u)} πi / Ni        (1.1)
Expressing this in graph terms, we define the graph G = (V,E), in which V is the collection of web pages whose ranks are to be computed (pages numbered from 1 to n) and E is the collection of edges: E = {(i, j) | there is a link from i to j}.
In its most trivial form, PageRank assumes that the web graph G is strongly connected. Thus, we can construct the adjacency matrix representing G as follows:
P = (pij)n×n        (1.2)

in which

pij = 1/Ni if there is a link from i to j, and 0 otherwise.        (1.3)
Viewed in probability terms, P is the transition matrix whose element pij is the probability of moving from state i (page i) to state j (page j) in one time step; we assume that, starting from any node (web page), it is equally likely to follow any of the outgoing links to arrive at another node. Under the strong-connectivity assumption, matrix P is stochastic and irreducible. Rewriting (1.1) in matrix form for every page, we have:

π = πP        (1.4)

in which π = (π1, π2, . . . , πn) is a 1 × n vector whose elements are the ranks of the pages. Thus, from (1.4), we come back to the problem of finding the eigenvector of matrix P corresponding to the eigenvalue λ = 1, or we can say that π is the stationary vector of the chain.
1.2.2 Practical issues
The first practical issue we meet come from the Web linkage structure. Sim-
ple PageRank is well defined only if the link graph is strongly connected that
means there is, at least, a way from a page to other ones. However, the Web
Figure 1.1: A small directed Web graph with 7 nodes
is far from strongly connected. In particular, two related problems arise on the real Web: rank sinks and rank leaks. Any strongly (internally) connected cluster of pages within the Web graph from which no links point outwards forms a rank sink. An individual page that does not have any outlinks constitutes a rank leak. Although, technically, a rank leak is a special case of a rank sink, a rank leak causes a different kind of problem. In the case of a rank sink, nodes not in the sink receive a zero rank, which means we cannot distinguish the importance of such nodes.
Example 1.6. In Figure 1.1, node 6 is a sink, and if we remove the link from node 5 to node 3, we make nodes 4, 5 and 6 a sink: the group sinks because there is no outlink from it to the outer world. Node 6 is even worse: it is a representative rank leak, because from it we can count no outlink. In this case, any rank reaching a rank leak is lost forever, since there is no way out. However, node 0 is the worst, because there is no inlink to it and no outlink from it.
Page et al. suggested overcoming this problem in two steps. First, they remove all the leak nodes with out-degree 0 (the out-degree of a page is the number of its outlinks). This is equivalent to adding a value of 1/n to all entries of every row corresponding to a rank sink node, because that row would otherwise contain all 0s. Matrix P turns into matrix P′. Second, to ensure that a stationary vector exists,
they define a parameter called the decay factor. The final modified matrix is given as follows:

P̃ = αP′ + ((1 − α)/n) J        (1.5)

in which α is the "decay factor" (often chosen as 0.85) and J is the matrix of all 1s.
Then, instead of finding the eigenvector of matrix P, we find the eigenvector of matrix P̃, given by the equation:

π = πP̃        (1.6)
We also have that the sum of all elements of π equals 1:

Σ_{i=1}^{n} πi = 1        (1.7)
The second practical issue we face is finding an efficient way to compute the PR vector. As indicated, the problem of computing the ranks of pages becomes the problem of finding the principal eigenvector π of the transition matrix. One of the simplest methods used to determine eigenvectors of a matrix is called power iteration. The power method can yield the PageRank result after about 50 to 100 iterations, the exact number depending on our choice of the decay factor α [16, 13, 12, 19, 14]. The PageRank iterative method is given below:
Algorithm 1.1 Iterative method for original PageRank computation
1: function PAGERANK(G) . G is the Web graph
2: tmp ← any random vector . tmp is often assigned (1/n)1×n
3: ε ← any threshold value . value to determine if π converges
4: π ← tmp × P
5: while ||π − tmp||L > ε do . norm L can be L1 or L2
6: tmp ← π
7: π ← tmp × P
8: end while
9: return π . the final PR vector
10: end function
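The iteration above can be rendered directly in Python. This is a sketch: the 3-page graph below is our own toy example, and the matrix is assumed stochastic, irreducible and aperiodic so that the iteration converges:

```python
def pagerank_power(P, eps=1e-12):
    """Power iteration: repeat pi <- pi * P until the L1 change drops below eps."""
    n = len(P)
    pi = [1.0 / n] * n                       # common choice for the start vector
    while True:
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(new[j] - pi[j]) for j in range(n)) <= eps:
            return new
        pi = new

# Toy graph: page 0 -> 1, 2; page 1 -> 2; page 2 -> 0 (row-stochastic P)
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
pi = pagerank_power(P)       # stationary vector: (0.4, 0.2, 0.4)
```

For this toy chain the fixed point can be checked by hand from π = πP: π1 = π3, π2 = π1/2, and the three entries sum to 1.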
Example 1.7. Suppose we have 7 web pages in our repository, with the graph as shown in Figure 1.1. Notice that we added two rank leak nodes, 0 and 6, to illustrate all cases in the real Web.
We then form the matrix of the 6 nodes of this small web graph (excluding node 0) for an easier demonstration. We name this 6-node graph SmallSet for easy reference.
P =
      p1    p2    p3    p4    p5    p6
p1     0    1/2   1/2    0     0     0
p2    1/3    0    1/3    0     0    1/3
p3     0    1/2    0    1/2    0     0
p4     0     0     0     0     1     0
p5     0     0    1/3   1/3    0    1/3
p6     0     0     0     0     0     0
Row 6 contains all 0s because there is no outlink from the 6th page, and as a result P is not the adjacency matrix of a strongly connected graph. So we modify matrix P by adding 1/6 to all entries of the 6th row. We get:
P′ =
      p1    p2    p3    p4    p5    p6
p1     0    1/2   1/2    0     0     0
p2    1/3    0    1/3    0     0    1/3
p3     0    1/2    0    1/2    0     0
p4     0     0     0     0     1     0
p5     0     0    1/3   1/3    0    1/3
p6    1/6   1/6   1/6   1/6   1/6   1/6
After the addition, P′ is stochastic, but the addition alone cannot guarantee that the stationary vector exists. The final step, to make the matrix irreducible, is to alter P′ into P̃ as in (1.5). We choose α = 9/10:
P̃ = (9/10) P′ + (1/60) J =
       p1     p2     p3     p4     p5     p6
p1    1/60   7/15   7/15   1/60   1/60   1/60
p2   19/60   1/60  19/60   1/60   1/60  19/60
p3    1/60   7/15   1/60   7/15   1/60   1/60
p4    1/60   1/60   1/60   1/60  11/12   1/60
p5    1/60   1/60  19/60  19/60   1/60  19/60
p6    1/6    1/6    1/6    1/6    1/6    1/6
Our matrix has now become stochastic and irreducible. Therefore, every node can directly move to any other one, and the stationary probability distribution over the Markov chain can be found. With matrix P modified in this way, we can obtain the principal eigenvector of P̃:

πT = (0.1973 0.2637 0.2066 0.1584 0.1574 0.0167)
A visual illustration of this example can be found in Figure 1.2.
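As a sketch, the whole construction of this example can be reproduced in Python. The link lists below transcribe the SmallSet rows of P, with pages p1...p6 mapped to indices 0...5; exact decimals obtained this way may differ slightly from the printed vector:

```python
# Build P' and P~ for SmallSet (alpha = 9/10), then run power iteration.
n, alpha = 6, 0.9
out = {0: [1, 2], 1: [0, 2, 5], 2: [1, 3], 3: [4], 4: [2, 3, 5], 5: []}

# P with rows normalized by out-degree; the all-zero row of p6 gets 1/n
# in every entry, turning P into P'.
Pp = [[1.0 / len(out[i]) if j in out[i] else 0.0 for j in range(n)]
      if out[i] else [1.0 / n] * n
      for i in range(n)]

# P~ = alpha * P' + ((1 - alpha)/n) * J, as in (1.5)
Pt = [[alpha * Pp[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]

pi = [1.0 / n] * n
for _ in range(300):                         # power iteration, as in (1.6)
    pi = [sum(pi[i] * Pt[i][j] for i in range(n)) for j in range(n)]
```

Every row of P̃ sums to 1, so the iterate always remains a probability vector, and for example P̃[p1][p2] = 0.9 × 1/2 + 1/60 = 7/15 as in the matrix above.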
Figure 1.2: Ranks of pages in the SmallSet graph with α = .9
In Figure 1.3, we also test different values of α to confirm that each page's rank can change significantly when α is modified. As can be seen in the figure, α = 0.9 really helps us draw a distinction between pages, because it yields quite different rank values. Results become more neutral as the value of α decreases. Medium-ranked pages do not change much in rank, but the highest- and lowest-ranked ones change quite clearly. More specifically, in Figure 1.3(a)

π = (0.1973 0.2637 0.2066 0.1584 0.1574 0.0167)

and in Figure 1.3(i)

π = (0.1681 0.2183 0.1809 0.1720 0.1773 0.0833)

so π2 goes down from 0.2637 to 0.2183 and π6 goes up from 0.0167 to 0.0833: the former shrinks to roughly 0.83 of its value while the latter grows roughly five-fold. This can lead to different search results if α is not carefully chosen. For better system performance we can choose a small value of α, but the result may be less useful. For a usable result we should choose α near 1, but the cost may increase sharply, as in Table 1.1.
Figure 1.3: Figures exemplifying the PR vector with (a) α = .9; (b) α = .85; (c) α = .8; (d) α = .75; (e) α = .7; (f) α = .65; (g) α = .6; (h) α = .55; (i) α = .5
Table 1.1 shows the number of iterations needed to compute the PR vector corresponding to each value of α.

Value of α  | .9  .85  .8  .75  .7  .65  .6  .55  .5
Iterations  | 18  16   14  12   12  10   10   8    8

Table 1.1: Iterations needed for each value of decay-factor α of SmallSet
The number of iterations needed for α = .9 is double that needed for α = .5. As Figure 1.4 shows, for the largest value of α chosen the cost is remarkable, but, referring back to Figure 1.3, the result is relatively better because we can discriminate between pages' ranks. For the smallest value of α, the number of iterations is smaller and the time taken is therefore much lower, but the result may not be really good.
Figure 1.4: Graph of iterations needed with α ∈ [.5; .9] of SmallSet
Algorithm 1.2 presents the PR computation as used in practice.
Algorithm 1.2 Iterative method for practical PageRank computation
1: function PRACTICALPAGERANK(G) . G is the Web graph
2: α ← any decay value . α is often chosen as .85
3: tmp ← any random vector . tmp is often assigned (1/n)1×n
4: ε ← any threshold value . value to determine if π converges
5: for indexRow ← 1 to n do
6: remove rank leak and rank sink nodes
7: end for
8: construct P̃
9: π ← tmp × P̃
10: while ||π − tmp||L > ε do . makes π converge to the principal eigenvector
11: tmp ← π
12: π ← tmp × P̃
13: end while
14: return π . the final PR vector
15: end function
In fact, all rank sink and rank leak nodes are removed by connecting them to other nodes before constructing matrix P̃. After that, we apply the same process as in Algorithm 1.1 to get the final result. Algorithm 1.2 also closes this chapter.
Conclusion of chapter
In this chapter, we have gone through a key idea for imposing an order on the objects of our world: objects in a database should be ranked, scientific papers should be ranked, and web pages must be ranked. The most important thing this brings us is that we can put everything in order to mine it for knowledge more easily. We also covered PageRank, a method used for ranking web pages, including its basis and the problems that arise in practical usage. In the next chapter, we will examine issues surrounding the accelerating and spamming problems.
CHAPTER 2
Some PageRank-related problems
PageRank is a famous phenomenon, a trademark and a tool from which Google benefits greatly. Because it is famous, many scientists pay attention to it, and as a result there are many ways to tune PageRank. Some accelerating methods for PageRank will be shown in this chapter as proof that PageRank is one of the biggest current affairs in Internet-related fields. Furthermore, because a page can easily exploit PageRank to get a higher rank, we must also deal with a problem called spam4.
2.1 Accelerating problems
PageRank computation is a big problem involving many fields of study, such as storage capacity and computational complexity. In this section, we will discuss one of the major problems arising with the computation of PageRank, the so-called accelerating problem.
4This kind of spam is different from mail spam
2.1.1 Related works
The Web is becoming a massive repository for delivering information. Recent studies [8] show that most users access the Web by means of a search engine. Nielsen/NetRatings, one of the leading Internet and digital media audience information and analysis services, reports that there were an estimated 151 million active Internet users in the US in January 2004. Of these, 76% used a search engine at least once during the month, and the average time spent searching was about 40 minutes. Therefore, the easiest way to access something on the Internet is to insert some keywords into a favorite search engine and to jump to the selected Web site returned as the answer. For these reasons, we have witnessed growing interest and large investment efforts to improve the quality of Web ranking algorithms. Besides, the research community has devoted increased attention to reducing the computation time needed by Web ranking algorithms. Indeed, particular attention has been devoted to improving PageRank [6, 16], the well-known ranking algorithm used by Google. The core of PageRank exploits an iterative weight assignment of ranks to the Web pages, until a fixed point is reached. This fixed point turns out to be the computation of the (dominant) eigenpairs of a matrix derived from the Web Graph itself. Brin and Page originally suggested computing this pair using the well-known Power method, and they also gave a nice interpretation of PageRank in terms of Markov Chains.
The most recent research on PageRank aims to address at least two different needs. The first is the desire to reduce the time spent weighting the nodes of a Web Graph containing many billions of pages, which requires several days of computation. Note that Fetterly et al. [9] showed that 35% of the Web changed during the course of their study (eleven downloads over a period of 11 weeks). More recently, Cho et al. [14] showed that in a week more than 25% of links are changed and 5% of "new content" is created. They conclude that "This result indicates that search engines need to update link based ranking metrics (such as PageRank) very often... a week-old ranking may not reflect the current ranking of the pages very well." The second need is to assign many PageRank values to each Web page, as a result of PageRank's personalization, which was recently presented by Google as a beta service [1].
Previous approaches to accelerating PageRank followed three different directions. The first approach tries to fit the graph structure into main memory by using compression techniques for the Web Graph [5]. The second approach computes the PageRank efficiently in external memory, using a sequence of scan operations on the Web graph. The last direction exploits efficient numerical methods to reduce the computation time. These kinds of numerical techniques are the most promising, and we have seen many intriguing results in the last few years. Arasu et al. [3] observe that computing PageRank is equivalent to solving a linear system and propose exploiting iterative methods such as the Gauss-Seidel algorithm. Kamvar et al. [12] suggest accelerating computation by using extrapolation methods, which periodically subtract off estimates of non-principal eigenvectors from the current iterate of the power method. The reported speedup is about 50-300%, and it is influenced by the particular parameter settings used for computing the PageRank. Kamvar et al. [14] note that the Web Graph's matrix, permuted by sorting the URLs lexicographically, assumes an approximate block structure, since most of the links are intra-domain. They suggest computing several "local" PageRank vectors separately, one for each block, and using the concatenation of these vectors as the starting point for the computation of a global web rank. In [13], the authors suggest avoiding the re-computation of the (sub-)components of the PageRank vector as soon as they arrive at a fixed point. This results in a speedup of about 17%.
In the following part, we will investigate a method motivated by the cost-reducing direction mentioned above. It exploits the structure of the Web graph and arranges web pages into blocks if they come from the same host, to reduce computational cost. By introducing BlockRank [14], we also build a nice bridge to the method proposed in Section 3.2.
2.1.2 Exploiting block structure of the Web
Overview of BlockRank Algorithm
The block structure of the web suggests a fast algorithm for computing
PageRank, wherein a “local PageRank vector” is computed for each host, giving
the relative importance of pages within a host. These local PageRank vectors
can then be used to form an approximation to the global PageRank vector that is used as a starting vector for the standard PageRank computation. This is the basic idea behind the BlockRank algorithm, which we summarize and describe in this section. The algorithm proceeds as follows:
Algorithm 2.1 General steps of BlockRank algorithm
1: function BLOCKRANK(G) . G is the Web graph
2: Split the web into blocks by domain
3: Compute the Local PageRanks for each block
4: Estimate the relative importance, BlockRank, for each block
5: Form an approximate Global PageRank vector z for web pages
6: Use z as a starting vector for standard PageRank
7: end function
Before describing the steps in detail below, we should become familiar with some notation used in this section. We will use lower-case indices (i.e. i, j) to represent indices of individual web pages, and upper-case indices (i.e. I, J) to represent indices of blocks. We use the shorthand notation i ∈ I to denote that page i ∈ block I. The number of elements in block J is denoted nJ. The graph of a given block J is given by the nJ × nJ submatrix GJJ of the matrix G.
Computing Local PageRanks
In this section, we describe the computation of a "local PageRank vector" for each block in the web. One may wonder how a block is created, and that question is answered in this part.
The authors of [14] took all the hyperlinks in their FULL-LARGEWEB data set and counted how many of these links are "intra-host" links (links from a page to another page in the same host) and how many are "inter-host" links (links from a page to a page in a different host). They showed that 79.1% of the links in the dataset are intra-host links and 20.9% are inter-host links. These intra-host connectivity statistics are not far from the earlier results of Bharat et al. [2]. They also investigated the number of links that are intra-domain links and the number that are inter-domain links, finding that a larger majority of links are intra-domain links (83.9%). These results make sense intuitively.
Take as an example the domain cs.stanford.edu. Most of the links in cs.stanford.edu are links around the cs.stanford.edu site (such as cs.stanford.edu/admissions or cs.stanford.edu/research). Furthermore, almost all non-navigational links are links to other Stanford hosts, such as www.stanford.edu, library.stanford.edu or www-cs-students.stanford.edu. One might expect that this structure exists even at lower levels of the web hierarchy. For example, one might expect that pages under cs.stanford.edu/admissions/ are highly interconnected, and very loosely connected with pages in other sublevels, leading to a nested block structure. This type of nested block structure can be naturally exposed by sorting the link graph to construct a link matrix in the following way. The URLs are sorted lexicographically, except that, as the sort key, the components of the domain are reversed. For instance, the sort key for the URL www.stanford.edu/home/students would be edu.stanford.www/home/students. The URLs are then assigned sequential identifiers when constructing the link matrix. A link matrix contains as its (i, j)th entry a 1 if there is a link from i to j, and 0 otherwise. This has the desired property that URLs are grouped in turn by top-level domain, domain, hostname, and finally path. The subgraph for pages in stanford.edu appears as a sub-block of the full link matrix. In turn, the subgraph for pages in www-db.stanford.edu appears as a nested sub-block.
Since most blocks have very few links in and out of the block compared to the number of links within the block, it seems plausible that the relative rankings of most of the pages within a block are determined by the intra-block links. We define the local PageRank vector lJ of a block J (GJJ) to be the result of the PageRank algorithm applied only on block J, as if block J represented the entire web and the links to pages in other blocks did not exist. That is:

lJ = PAGERANK(GJJ)

in which PAGERANK is exactly as written in Algorithm 1.1.
Local PageRank accuracies
To investigate how well these local PageRank vectors approximate the relative magnitudes of the true PageRank vectors within a given host, Kamvar et al. ran the following experiments. They computed the local PageRank vectors lJ of
each host in STANFORD/BERKELEY. They also computed the global PageRank vector π for STANFORD/BERKELEY using the standard PageRank algorithm. After the global PageRank vector computation was completed, they compared the local PageRank scores of the pages within a given host to the global PageRank scores of the pages in the same host. Specifically, they took the elements corresponding to the pages in host J of the global PageRank vector π, and formed the vector gJ from these elements. This newly-created vector gJ is then normalized so that its elements sum to 1, in order to compare it to the local PageRank vector lJ, which also has an L1 norm of 1.
Specifically,

gJ = π_{j∈J} / ||π_{j∈J}||_1
In Table 2.1, the absolute error, ||lJ−gJ ||1, is on average 0.2383 for the hosts
in STANFORD/BERKELEY.
Error measure Average
||lJ − gJ ||1 0.2383
KDist(lJ , gJ) 0.0571
Table 2.1: The closeness of local PageRank vector to global PageRank vector
Furthermore, they conclude that the rank scores induced by the local PageRank are close to the scores obtained from the global PageRank. To compare the orderings, we measure the average "distance" between the local PageRank vectors lJ and the global PageRank segments gJ. The KDist distance measure, based on Kendall's-τ rank correlation and used for comparing induced rank orders, is defined as follows.
Consider two partially ordered lists of URLs, τ1 and τ2, each of length m. Let U be the union of the URLs in τ1 and τ2. If δ1 is U − τ1, then let τ′1 be the extension of τ1, where τ′1 contains δ1 appearing after all the URLs in τ1. We extend τ2 analogously to yield τ′2. KDist is then defined as:

KDist(τ1, τ2) = |{(u, v) : τ′1, τ′2 disagree on order of (u, v), u ≠ v}| / (|U| × (|U| − 1))

In other words, KDist(τ1, τ2) is the probability that τ′1 and τ′2 disagree on the relative ordering of a randomly selected pair of distinct nodes (u, v) ∈ U × U.
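A direct implementation of KDist for the case compared here, where both lists contain the same elements, can be sketched as follows (the O(n²) pair scan is ours; it follows the definition literally rather than being optimized):

```python
def kdist(t1, t2):
    """Fraction of ordered pairs (u, v), u != v, on which the two rankings
    disagree; for lists over the same elements this equals Kendall's tau
    distance."""
    pos1 = {u: r for r, u in enumerate(t1)}
    pos2 = {u: r for r, u in enumerate(t2)}
    U = list(pos1)
    disagree = 0
    for u in U:
        for v in U:
            if u != v and ((pos1[u] < pos1[v]) != (pos2[u] < pos2[v])):
                disagree += 1
    return disagree / (len(U) * (len(U) - 1))

# Swapping one adjacent pair in a 3-element list flips exactly one pair
d = kdist(["a", "b", "c"], ["a", "c", "b"])   # 1/3
```

Since each unordered pair is counted twice in both the numerator and the denominator, this normalized count coincides with the usual Kendall's-τ distance over unordered pairs.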
In the current work, we only compare lists containing the same sets of elements, so that KDist is identical to Kendall's τ distance. The average distance KDist(lJ, gJ) is 0.0571 for the hosts in STANFORD/BERKELEY. Notice that this distance is low. This observation means that the ordering induced by the local PageRank is close to being correct, and thus suggests that the majority of the L1 error in the comparison of local and global PageRanks comes from the miscalibration of a few pages on each host. Indeed, the miscalibration may be among important pages; as discussed next, this miscalibration is corrected by the final step of the algorithm. Furthermore, the relative rankings of pages on different hosts are unknown at this point. For these reasons, the authors of [14] do not suggest using local PageRank for ranking pages, but only as a tool for computing the global PageRank more efficiently. The local PageRank results compare favorably to the global PageRank obtained from the standard PageRank algorithm, and they suggest using the local PageRank vector as a start vector for computing the global PageRank vector.
It should be noted that errors of this pattern (where the majority of the L1 error comes from the miscalibration of a few pages) are fixed easily, since once these miscalibrated pages are fixed (by, for example, a few iterations of global PageRank), the rest of the pages fall into place. Errors that are more random take longer to fix.
Local PageRank convergence rates
Another interesting question to investigate is how quickly the local PageRank scores converge. In Figure 4 of [14], the authors show a histogram of the number of iterations it takes for the local PageRank scores to converge to an L1 residual < 10^-1, for each host in their data set, in which dangling nodes are removed and which contains roughly 70M pages with over 600M edges. Notice that most hosts converge to this residual in fewer than 12 iterations. Interestingly, there is no correlation between the convergence rate of a host and the host's size. Rather, the convergence rate is primarily dependent on the extent of the nested block structure within the host. That is, hosts with strong nested blocks are likely to converge slowly, since they represent slow-mixing Markov chains, while hosts with a more random connection pattern converge faster, since they represent fast-mixing Markov chains. This observation suggests that one could make the local PageRank computations even faster by choosing the blocks wisely: if a host has a strong nested block structure, use the directories within that host as the blocks. However, this is not a crucial issue, because, as shown in [14], the local PageRank computations can be performed in a distributed fashion, in parallel with the crawl. Therefore, reducing the cost of the local PageRank computations is not a bottleneck for computing PageRank with this scheme, as the local computations can be pipelined with the crawl.
Estimating the relative importance of each block
After calculating the ranks of all pages in each host, the algorithm turns to computing each block's rank, called its BlockRank. Assume there are k blocks in the web. The idea used to compute BlockRank is similar to the computation of a page's rank. In [14], the block graph B is created first, where each vertex corresponds to a block in the web graph. An edge between two pages in the web is represented as an edge between the corresponding blocks (or a self-edge, if both pages are in the same block). The edge weights are determined as follows: the weight of an edge between blocks I and J is defined to be the sum of the edge weights from pages in I to pages in J in the web graph, weighted by the local PageRanks of the linking pages in block I. That is, if P is the web graph and li is the local PageRank of page i in block I, then the weight of edge BIJ is given by:

BIJ = Σ_{i∈I, j∈J} Pij · li
In matrix notation: define the local PageRank matrix L to be the n × k matrix whose columns are the local PageRank vectors l_J:

L = \begin{pmatrix} l_1 & 0 & \cdots & 0 \\ 0 & l_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & l_k \end{pmatrix}

Define the matrix S to be the n × k matrix that has the same structure as L, but whose nonzero entries are all replaced by 1. Then the block matrix B is the k × k matrix:

B = L^T P S
Notice that B is a transition matrix in which the element B_{IJ} represents the transition probability from block I to block J. That is:

B_{IJ} = p(J | I)

Once we have the k × k transition matrix B, we may use the standard PageRank algorithm on this reduced matrix to compute the BlockRank vector b. In this notation, each block is treated as a web page and the PageRank algorithm is applied to blocks just as it is to web pages.
Using relative importance of blocks for computing PageRank
At this point, there are two sets of rankings: one contains the relative ranks of the web pages within each block, and the other contains the relative ranks of the blocks in the web graph. Within each block J, we have the local PageRanks l_J of the pages in the block. We also have the BlockRank vector b, whose elements b_J are the BlockRanks of the blocks J, measuring their relative importance. We may now approximate the global PageRank of a page j ∈ J as its local PageRank l_j, weighted by the BlockRank b_J of the block in which it resides. That is,

\pi^{(0)}_j = l_j \times b_J

In matrix notation:

\pi^{(0)} = L \times b

And \pi^{(0)} is used as the start vector for the regular PageRank algorithm.
Conclusion of the approach
More specifically, the approach is presented in Algorithm 2.2.
Algorithm 2.2 Specific version of BlockRank
1: procedure BLOCKRANK(G) . G is the whole Web graph
2: Sort the web graph lexicographically . to induce intra-host blocks
3: for all blocks J do . compute the local PageRank vector of block J
4: l_J = PageRank(J)
5: end for
6: B = L^T G S . B is a matrix
7: b = PageRank(B) . B is understood as a graph of blocks
8: \pi^{(0)} = L \times b
9: \pi = PageRank(G) . \pi^{(0)} is used as the start vector
10: return \pi
11: end procedure
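Under the assumption of a small toy graph with two blocks, Algorithm 2.2 can be sketched as follows (pure Python; `power_iter` and `block_rank` are hypothetical helper names, not the actual code of [14], and the local vectors are renormalized because links leaving a block leak mass from its sub-matrix):

```python
def power_iter(M, alpha=0.85, tol=1e-8, max_iter=200):
    # x <- alpha * x M + (1 - alpha)/n for a (sub)stochastic row matrix M
    n = len(M)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        new = [(1.0 - alpha) / n] * n
        for i in range(n):
            xi = alpha * x[i]
            for j in range(n):
                new[j] += xi * M[i][j]
        if sum(abs(a - b) for a, b in zip(new, x)) < tol:
            return new
        x = new
    return x

def block_rank(P, blocks, alpha=0.85):
    n, k = len(P), len(blocks)
    # Step 1: local PageRank vector l_J for each block (sub-matrix of P)
    local = []
    for J in blocks:
        sub = [[P[i][j] for j in J] for i in J]
        l = power_iter(sub, alpha)
        tot = sum(l)  # links leaving the block leak mass from the sub-matrix,
        local.append([v / tot for v in l])  # so renormalize to sum to 1
    # Steps 2-3: L (n x k, local PageRanks on the block diagonal), S (same
    # sparsity with entries replaced by 1), then B = L^T P S (k x k)
    L = [[0.0] * k for _ in range(n)]
    S = [[0.0] * k for _ in range(n)]
    for J, (idx, l) in enumerate(zip(blocks, local)):
        for pos, i in enumerate(idx):
            L[i][J] = l[pos]
            S[i][J] = 1.0
    B = [[sum(L[i][I] * P[i][j] * S[j][J] for i in range(n) for j in range(n))
          for J in range(k)] for I in range(k)]
    # Step 4: BlockRank vector b, then start vector pi0 = L b
    b = power_iter(B, alpha)
    return [sum(L[i][J] * b[J] for J in range(k)) for i in range(n)]

# Toy web graph: pages 0-1 form one block, pages 2-3 another, weakly coupled.
P = [[0.0, 1.0, 0.0, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.5, 0.5, 0.0]]
pi0 = block_rank(P, blocks=[[0, 1], [2, 3]])
```

On the toy graph, `pi0` is a probability vector that can then seed the standard PageRank iteration on the full matrix P.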
This approach has shown that the hyperlink graph of the web has a nested block structure, something that had not yet been thoroughly investigated in studies of the web. This structure is exploited to compute PageRank in a fast manner using an algorithm called BlockRank.

[Figure 2.1: Convergence rates for standard PageRank (solid line) vs. BlockRank (dotted line). The x-axis is the number of iterations; the y-axis is the log of the L1 residual.]

As shown in Figure 2.1, BlockRank speeds up PageRank computations by factors of 2 and
higher, depending on the particular scenario. There are a number of areas
for future work: finding the “best” blocks for BlockRank by splitting up what
would be slow-mixing blocks with internal nested block structure; using the
block structure for hyperlink-based algorithms other than web search, such as
in clustering or classification; and exploring more fully the topics of updates
and personalized PageRank.
2.2 Connected-component PageRank approach
In this section, we discuss the advantages of using connected components and how the connected-component concept is applied to the computation of PageRank, and we prove some of the work on the theory side. We also propose an approach which utilizes the connected-component idea from graph theory. The proposal is called CCP, short for "Connected Component PageRank".
2.2.1 Initial ideas
While studying the Markov model [17], we pay special attention to the fact that the states of a Markov chain can be divided into classes: states which communicate (are mutually reachable) are grouped into one class. This concept is similar to the concept of connected components in graph theory. Moreover, as mentioned, the web graph is represented using an adjacency matrix. This gives us a chance to apply the connected-component concept to PageRank computation.
The proposed algorithm is given as follows.
Algorithm 2.3 General steps of CCP algorithm
1: function CCP(G) . G is the Web graph
2: Split the web into blocks by connected component
3: Compute the local PageRanks for each block
4: Form an approximate global PageRank vector \pi for web pages by combining the eigenvectors of the blocks
5: end function
Figure 2.2 illustrates the idea of CCP.
[Figure 2.2: (a) unarranged adjacency matrix; (b) arranged adjacency matrix.]
Applying connected components to PageRank computation, we gain the following advantages.

1. Regarding computational complexity: our method follows the trend of reducing the cost needed for computing PageRank. We already know that the time needed to multiply a vector with a matrix is O(n^2), where n is the size of the common dimension. For that reason, if we compute PageRank by constructing the adjacency matrix of the whole web, we face a hard problem when n reaches billions. Using connected components in computing PageRank lets us reduce the time needed. Here are some fundamental details about how we use connected components in the PageRank problem. Assume that our web graph has k connected components, so we have k small adjacency matrices (which we call blocks) obtained from the matrix P. Let n_max be the largest number of pages in any block. For every block, the time needed to compute its PageRank vector is at most O(n_max^2), which is much smaller than for the whole web graph. Over all blocks, the complexity cannot exceed k × O(n_max^2), which is smaller than O(n^2), where n is the size of the graph. Generally, on the theory side, our method is more efficient than trivial PageRank.
2. Regarding convenience in finding connected components: in [18] we can find many effective algorithms for finding the connected components of a graph. The time needed to determine the connected components is O(m + n), where m is the number of edges and n is the number of vertices. This additional time is not remarkable compared to the main computation.
3. Regarding combination with other accelerating methods: we can obtain even better performance if we run our method in hybrid mode with other accelerating methods. In each block, we can use MAP or an extrapolation method to reduce the computational time for that block, so the total time saved is remarkable.

4. In terms of programming techniques, [14] refers to a programming approach which splits the web-graph matrix into small parts for computing in parallel. Using our method, PageRank computation can be parallelized easily: in the simplest case, we can assign each block to one processor, for which things are very natural. There certainly exist many ways to parallelize the PageRank work, and CCP is a very promising one.
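As a back-of-the-envelope check of the complexity claim in point 1, with sizes chosen purely for illustration:

```python
# Cost model from the text: one vector-matrix multiply is O(n^2).
n = 1_000_000          # pages in the whole graph (hypothetical)
k = 100                # connected components (hypothetical)
n_max = 20_000         # size of the largest component (hypothetical)

whole_graph_ops = n ** 2       # one iteration on the full matrix
ccp_ops = k * n_max ** 2       # upper bound k * O(n_max^2) over all blocks

speedup = whole_graph_ops / ccp_ops
```

With these illustrative sizes, the per-iteration bound for CCP is 25 times smaller than for the full matrix; the actual gain depends entirely on how the component sizes are distributed.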
Notice that these advantages have not yet been ensured by a theoretical proof or by practical experiments.
If we split the web-graph matrix into smaller blocks corresponding to connected components, we will have to answer these questions:

1. Is the CCP method correct?
2. How do we use the method effectively?

We will concentrate on more descriptive information about the CCP method in the following sections, 2.2.2 and 2.2.3.
2.2.2 Mathematical basis of CCP
How do we use connected components in computing the PageRank vector? Firstly, we recall the PageRank computation as stated in Chapter 1, but with a slight modification. Suppose that we have a web graph G with corresponding adjacency matrix P. We do not remove the leak (dangling) nodes; instead, we define

\tilde{P} = \alpha P + \frac{1-\alpha}{n} J \quad (2.1)
And the principal eigenvector of \tilde{P} is computed as follows:

\pi = \pi \tilde{P} \quad (2.2)
Secondly, let k be the number of connected components in G, and let P_i be the adjacency matrix of size n_i \times n_i corresponding to the i-th component, with \sum_{i=1}^{k} n_i = n. Thus, we can re-arrange P as follows:

P = \begin{pmatrix} P_1 & 0 & \cdots & 0 \\ 0 & P_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P_k \end{pmatrix} \quad (2.3)
We define the matrices:

\tilde{P}_i = \alpha P_i + \frac{1-\alpha}{n_i} J_i, \quad i = \overline{1,k}, \text{ where } J_i \text{ is of size } n_i \times n_i \quad (2.4)

For each block P_i, i = \overline{1,k}, the formula for the principal eigenvector is:

\pi_i = \pi_i \tilde{P}_i \quad (2.5)
Proposition 2.1. From (2.1), (2.2), (2.3), (2.4) and (2.5),

\tilde{\pi} = \left( \frac{n_1}{n}\pi_1, \frac{n_2}{n}\pi_2, \ldots, \frac{n_k}{n}\pi_k \right) \quad (2.6)

is an eigenvector of the matrix \tilde{P}.
Proof. To prove that \tilde{\pi} is an eigenvector of the matrix \tilde{P}, we have to prove that \tilde{\pi} = \left( \frac{n_1}{n}\pi_1, \frac{n_2}{n}\pi_2, \ldots, \frac{n_k}{n}\pi_k \right) satisfies equation (2.2). Substituting \tilde{\pi} into equation (2.2), we have:

\tilde{\pi} = \tilde{\pi}\tilde{P} = \tilde{\pi}\left[ \alpha P + \frac{1-\alpha}{n} J \right] = \tilde{\pi}\left( \alpha \begin{pmatrix} P_1 & 0 & \cdots & 0 \\ 0 & P_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P_k \end{pmatrix} + \frac{1-\alpha}{n} J \right) \quad (2.7)
\Leftrightarrow \left( \frac{n_1}{n}\pi_1, \frac{n_2}{n}\pi_2, \ldots, \frac{n_k}{n}\pi_k \right) = \left( \frac{n_1}{n}\pi_1, \frac{n_2}{n}\pi_2, \ldots, \frac{n_k}{n}\pi_k \right) \times

\begin{pmatrix} \alpha P_1 + \frac{1-\alpha}{n}J_{11} & \frac{1-\alpha}{n}J_{12} & \cdots & \frac{1-\alpha}{n}J_{1k} \\ \frac{1-\alpha}{n}J_{21} & \alpha P_2 + \frac{1-\alpha}{n}J_{22} & \cdots & \frac{1-\alpha}{n}J_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1-\alpha}{n}J_{k1} & \frac{1-\alpha}{n}J_{k2} & \cdots & \alpha P_k + \frac{1-\alpha}{n}J_{kk} \end{pmatrix} \quad (2.8)

in which J_{ij} is the n_i \times n_j matrix of all 1s.
Multiplying out the right side of equation (2.8) and considering the first component, we have:

\frac{n_1}{n}\pi_1 = \frac{n_1}{n}\pi_1\left( \alpha P_1 + \frac{1-\alpha}{n}J_{11} \right) + \frac{n_2}{n}\pi_2 \frac{1-\alpha}{n}J_{21} + \cdots + \frac{n_k}{n}\pi_k \frac{1-\alpha}{n}J_{k1} \quad (2.9)
For every block P_i, the PageRank \pi_i of that block satisfies equation (2.5), which means:

\pi_i = \pi_i \tilde{P}_i = \pi_i \left( \alpha P_i + \frac{1-\alpha}{n_i} J_i \right) \quad (2.10)

\Leftrightarrow \frac{n_i}{n}\pi_i = \frac{n_i}{n}\pi_i \left( \alpha P_i + \frac{1-\alpha}{n_i} J_i \right) \quad (2.11)
Letting i = 1, we get:

\frac{n_1}{n}\pi_1 = \frac{n_1}{n}\pi_1\left( \alpha P_1 + \frac{1-\alpha}{n_1}J_1 \right) \quad (2.12)
From (2.9) and (2.12) we have:

\frac{n_1}{n}\pi_1\left( \alpha P_1 + \frac{1-\alpha}{n_1}J_1 \right) = \frac{n_1}{n}\pi_1\left( \alpha P_1 + \frac{1-\alpha}{n}J_{11} \right) + \frac{n_2}{n}\pi_2 \frac{1-\alpha}{n}J_{21} + \cdots + \frac{n_k}{n}\pi_k \frac{1-\alpha}{n}J_{k1} \quad (2.13)
Since J_1 = J_{11}, we have

(2.13) \Leftrightarrow \pi_1 J_{11} = \frac{n_1}{n}\pi_1 J_{11} + \frac{n_2}{n}\pi_2 J_{21} + \cdots + \frac{n_k}{n}\pi_k J_{k1} \quad (2.14)

\Leftrightarrow 1_{1 \times n_1} = 1_{1 \times n_1} \times \sum_{i=1}^{k} \frac{n_i}{n} \quad (2.15)

in which 1_{1 \times n_1} is a row vector of all 1s (each \pi_i sums to 1, so \pi_i J_{i1} = 1_{1 \times n_1}).
Note that, by definition, \sum_{i=1}^{k} n_i = n, so (2.15) always holds. Applying the same process to each component i, i = \overline{2,k}, we get the expected result. Consequently, the proposition is true.
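Proposition 2.1 can also be checked numerically on a small example (a sketch in pure Python, assuming two row-stochastic component matrices and α = 0.85; `local_pr` is an illustrative helper implementing the damped iteration of eq. (2.5)):

```python
def local_pr(Pi, alpha=0.85, iters=500):
    # pi = pi (alpha*P_i + (1-alpha)/n_i * J_i) for a row-stochastic P_i, eq. (2.5)
    ni = len(Pi)
    pi = [1.0 / ni] * ni
    for _ in range(iters):
        pi = [(1.0 - alpha) / ni + alpha * sum(pi[i] * Pi[i][j] for i in range(ni))
              for j in range(ni)]
    return pi

# Two connected components with row-stochastic adjacency matrices.
P1 = [[0.0, 1.0], [1.0, 0.0]]
P2 = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
alpha, n = 0.85, 5
pi1, pi2 = local_pr(P1), local_pr(P2)

# Candidate eigenvector from eq. (2.6): (n_1/n * pi_1, n_2/n * pi_2)
pt = [2 / 5 * v for v in pi1] + [3 / 5 * v for v in pi2]

# Apply P~ = alpha*P + (1-alpha)/n * J, with P block-diagonal in P1, P2
P = [row + [0.0] * 3 for row in P1] + [[0.0] * 2 + row for row in P2]
image = [(1.0 - alpha) / n * sum(pt) + alpha * sum(pt[i] * P[i][j] for i in range(n))
         for j in range(n)]
```

The image of π̃ under P̃ coincides with π̃ itself, as the proposition asserts.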
With the proposition proved, the first question is answered: CCP is correct. In the next section, we briefly suggest a way of using CCP effectively.
2.2.3 On practical side of CCP
The CCP method can be divided into three main steps:

1. First step: extract the connected components from the whole web graph and generate the corresponding adjacency matrices;
2. Second step: compute each block's eigenvector to get its PageRank vector;
3. Third step: combine the final PageRank vector as shown in Proposition 2.1.
A more detailed version of CCP can be found in Algorithm 2.4.

Algorithm 2.4 Connected-component PageRank algorithm
1: function CCP(G) . G is the whole Web graph
2: Construct matrix P . P is the adjacency matrix of G
3: Determine the connected components P_i in G . i = 1, k
4: for all P_i do
5: compute P_i's eigenvector \pi_i . P_i is the i-th connected component
6: compute (n_i/n) \pi_i . n_i \times n_i is the size of block P_i
7: end for
8: \tilde{\pi}_i \leftarrow (n_i/n) \pi_i . \tilde{\pi} = \left( \frac{n_1}{n}\pi_1, \frac{n_2}{n}\pi_2, \ldots, \frac{n_k}{n}\pi_k \right)
9: return \tilde{\pi}
10: end function
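A compact sketch of Algorithm 2.4 in pure Python (illustrative names and toy graph; components are found by a BFS over the undirected version of the graph, which runs in the O(m + n) time mentioned earlier, and each block is ranked by the damped iteration of eq. (2.5)):

```python
from collections import deque

def connected_components(adj):
    """BFS over the undirected version of adj (dict node -> list of out-links).
    Every node must appear as a key. Runs in O(m + n)."""
    undirected = {u: set() for u in adj}
    for u, outs in adj.items():
        for v in outs:
            undirected[u].add(v)
            undirected[v].add(u)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, q = [], deque([start])
        seen.add(start)
        while q:
            u = q.popleft()
            comp.append(u)
            for v in undirected[u]:
                if v not in seen:
                    seen.add(v)
                    q.append(v)
        comps.append(comp)
    return comps

def ccp(adj, alpha=0.85, iters=200):
    n = len(adj)
    ranks = {}
    for comp in connected_components(adj):
        ni = len(comp)
        pi = {u: 1.0 / ni for u in comp}
        for _ in range(iters):  # pi = pi (alpha*P_i + (1-alpha)/n_i * J_i)
            new = {u: (1.0 - alpha) / ni for u in comp}
            for u in comp:
                outs = [v for v in adj[u] if v in new]
                share = alpha * pi[u] / (len(outs) if outs else ni)
                for v in (outs if outs else comp):  # dangling: spread uniformly
                    new[v] += share
            pi = new
        for u in comp:          # weight by n_i / n as in Proposition 2.1
            ranks[u] = ni / n * pi[u]
    return ranks

# Two components: {a, b, c} (a 3-cycle) and {d, e} (a 2-cycle)
web = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["e"], "e": ["d"]}
r = ccp(web)
```

Each component can be ranked independently (or on a separate processor, as in Advantage 3), and the weighted concatenation sums to 1.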
Note that we introduced some advantages in the previous section, but all of them are either obvious or induced from other approaches. Listed here are the major advantages of the proposal over the standard PageRank algorithm.
Advantage 1 The major speedup that this algorithm can obtain comes from one of the main research directions for PageRank: how to fit the PageRank problem into memory. All of the blocks in our crawl are small enough that each block graph fits in main memory, and the vector of ranks for the active block largely fits in the CPU cache. As the full graph does not fit entirely in main memory, the local PageRank iterations (over each block) thus require less disk I/O than the global computations. Indeed, if the crawling process were specially designed for CCP and its output fed to the algorithm, the result would be even better.
Advantage 2 In CCP algorithm, the local PageRank vectors for many blocks
will converge quickly; thus the computations of those blocks may be ter-
minated after only a few iterations. This increases the effectiveness of the
local PageRank computation by allowing it to expend more computation
on slowly converging blocks, and less computation on faster converging
blocks. We will discuss more on this advantage in Chapter 3. In the stan-
dard PageRank algorithm, iterations operate on the whole graph; thus
the convergence bottleneck is largely due to the slowest blocks. Much
computation is wasted recomputing the PageRank of blocks whose local
computation has already converged.
Advantage 3 The local PageRank computations in the second step of CCP can be performed in a completely parallel or distributed fashion. That is, the local PageRank for each block P_i can be computed on a separate processor or computer. The only communication required is that, at the end of the second step, each computer sends its local PageRank vector to a central computer, which computes the global PageRank vector. If our graph consists of n total pages, the net communication cost is 8n bytes (if using 8-byte double-precision floating-point values). A naive parallelization of the computation that does not exploit the block structure would require a transfer of 8n bytes after each iteration, a significant penalty.
2.3 Spam and spam detection
Web spamming refers to actions intended to mislead search engines into rank-
ing some pages higher than they deserve. Recently, the amount of web spam
has increased dramatically, leading to a degradation of search results. In this section we will go through a comprehensive taxonomy of current spamming techniques and some spam-fighting techniques.
We use the term spamming (also spamdexing) to refer to any deliberate human action meant to trigger an unjustifiably favorable relevance or importance for some web page, relative to the page's true value. Commonly, the word spam is used to mark all those web objects (page content items or links) that are the result of some form of spamming. People who perform spamming are called spammers [11, 10].
2.3.1 Introduction to Web spam
As more and more people rely on the wealth of information available online,
increased exposure on the WWW may yield significant financial gains for indi-
viduals or organizations. Most frequently, search engines are the entryways to
the Web; that is why some people try to mislead search engines, so that their
pages would rank high in search results, and thus, capture user attention.
Just as with emails, we can talk about the phenomenon of spamming the
Web. The primary consequence of web spamming is that the quality of search
results decreases. For instance, a site may contain only a few lines of useful information but consist of thousands of pages, each repeating the same content and pointing to dozens of other pages. All the pages in that site were probably
created to boost the rankings of some others, and none of them seems to be
particularly useful for anyone looking for useful information. The secondary consequence of spamming is that search engine indexes are inflated with useless pages, increasing the cost of each processed query.
To provide low-cost, quality services, it is critical for search engines to ad-
dress web spam. Search engines currently fight spam with a variety of often
manual techniques, but as far as we know, they still lack a fully effective set
of tools for combating it. The very first step in combating spam is understanding it, that is, analyzing the techniques the spammers use to mislead search
engines. A proper understanding of spamming can then guide the development of appropriate countermeasures. To that end, in [10], web spamming techniques are organized into a taxonomy that can provide a framework for combating spam; [10] also gives very useful information about spam, such as the ways spam is created and an in-depth discussion of it. Furthermore, Zoltán G. et al. [10] propose a method, namely TrustRank, to combat web spam. A newly created spamming technique, blog spam, is treated well with Language Model Disagreement in [15]. One can also find details of several specific techniques on the Web itself.
2.3.2 Spam techniques
To build a taxonomy, the authors of [10] worked closely with experts at one of the major search engine companies, relying on their experience, while at the same time investigating numerous spam instances on their own. As a separate work, [11] agrees with [10] on almost all ideas. We will investigate the common parts of these papers to obtain a basic understanding of web spam.
There are two categories of techniques associated with web spam. The first category includes the boosting techniques, i.e., methods through which one seeks to achieve high relevance and/or importance for some pages. The second category includes the hiding techniques: methods that by themselves do not influence the search engine's ranking algorithms, but that are used to hide the adopted boosting techniques from the eyes of human web users.
Term spamming
[Figure 2.3: Boosting techniques]

Term spamming [10], or text spamming [11], is the practice of manipulating web pages with redundant occurrences of specific terms to gain higher ranks, or of tailoring text fields in order to make spam pages relevant to some queries. As we know, when a query is issued, its content and the contents of web pages are evaluated to produce a relevance score. The relevance score is calculated based on term frequencies, term positions, and similar features (refer to Chapter 1 for more details). Among these, term positions and term frequencies are criteria that are relatively easy to manipulate so that pages gain rank. In evaluating relevance, term positions receive particular attention; [10] calls each type of location in which a term can appear a field. The
common text fields for a page p are the document body, the title, the meta tags in the HTML header, and page p's URL. In addition, the anchor texts associated with URLs that point to p are also considered as belonging to page p (the anchor text field), since they often describe the content of p very well. The terms in p's text fields are used to determine the relevance of p with respect to a specific query (a group of query terms), often with different weights given to different fields.
The authors of [10, 11] mention the following techniques, grouped by the text field in which the spamming occurs:
• Body spam: In this case, the spam terms are included in the document
body. This spamming technique is among the simplest and most popular
ones, and it is almost as old as search engines themselves.
• Title spam: Today’s search engines usually give a higher weight to terms
that appear in the title of a document. Hence, it makes sense to include
the spam terms in the document title.
• Meta tag spam: The HTML meta tags that appear in the document header
have always been the target of spamming. Because of the heavy spam-
ming, search engines currently give low priority to these tags, or even
ignore them completely.
• Anchor text spam: Just as with the document title, search engines as-
sign higher weight to anchor text terms, as they are supposed to offer a
summary of the pointed document. Therefore, spam terms are sometimes
included in the anchor text of the HTML hyperlinks to a page. Please note
that this spamming technique is different from the previous ones, in the
sense that the spam terms are added not to the target page itself, but to the other pages that point to the target. As anchor text gets indexed for both
pages, spamming it has impact on the ranking of both the source and
target pages.
• URL spam: URLs of pages may be broken down by search engines in
order to compute relevance scores. Spammers can exploit this by creating
long URLs which contain spam terms.
We have introduced the positions that spammers exploit to achieve better ranks for their web pages. Now, considering the number of occurrences of terms, we distinguish several approaches:
• Repetition of one or a few specific words in the content of a page. If an issued query is relevant, the page will receive a higher relevance score.
• Dumping a large amount of words, so that the page becomes relevant to many different queries.
• Weaving spam terms into copied content. The copied content can be news from a reliable source into which spam terms are randomly inserted.
• Phrase stitching, the technique in which sentences or phrases from many different sources are woven together. With this method, spammers can achieve relevance to many kinds of queries, because the sentences are collected from different sources.
Link spamming
Web spam involves not only term spamming but also link spamming. Many major search engines use link-based algorithms to determine the ranks of web pages, the most famous being PageRank. Many spammers create link structures which exploit link-based algorithms to obtain better seats in search results.
Spammers can use outgoing and incoming links as tools to reach their goal. They can add a number of outgoing links to a popular page, or gather many incoming links to a specific page or a group of pages.
Outgoing links
Numerous outgoing links might be added to popular pages, in the spammers' hope of getting better hub scores for their own pages. A spammer can use the directory cloning technique to reach this goal. On the WWW, there are directory sites which contain links on many topics. For example, DMOZ is one of the most famous: we can find content on many topics and subtopics by navigating its link lists. Spammers copy this idea and create their own directory sites which resemble a popular web directory; this is called directory cloning. By this action, a big amount of outgoing links is easily generated.
Incoming links
Spammers might deploy one of the following strategies:
• Using a honey pot as a tool to attract users. A honey pot may be a page which provides free or pirated software, or useful resources in general. The honey pot also contains links to the spam page(s), and these links are hidden from users. The more people use the honey pot, the more popular it becomes, so many users may desire to link to it. When users point their links to the honey pot, they indirectly increase the ranking of the spam page(s). Note that directory sites can be used as honey pots, too.
• Infiltrating a web directory: many web directories allow people to post links in their directories, but it may happen that the editors of these directories do not strictly manage the submitted content. So spammers post links to their target pages to get a boost in both PageRank and authority scores.
• Posting links to blogs, wikis, etc. This is a hard problem for a search engine to resolve. Links can be placed directly in blog comments, wikis, and so on, or can be hidden using anchor text. With this technique, spammers make their target pages more popular.
• Exchanging links: a group of spammers may create a link structure in which pages point to each other in a circle. Both the source and target pages in this strategy rise in PageRank and popularity.
• Using expired domains: when a domain has just expired, spammers buy it. Since the domain has only recently expired, incoming links to it still exist; moreover, spammers can further populate the old domain with some tricks. Then the spammers create links from the expired domain to their target pages. Thus, the target pages take advantage of the old links to the expired domain.
• Creating spam farms: spammers nowadays have control of many sites and can create link structures that boost the ranking of target pages.

Link spamming is used more and more because the PageRank algorithm is still deployed by Google. For that reason, many approaches have appeared to deal with this problem.
Conclusion of chapter
This chapter has discussed some problems which arise with PageRank computation. The highlighted problem is the acceleration problem, which involves many fields. We can benefit from storage techniques to reduce the time taken for the PageRank computation. Besides, we can take advantage of the structure of the web to speed up the computation. Furthermore, numerical methods give us many promising approaches in which we only have to modify the programming slightly to get a nice result.
In the next chapter, experimental results are presented so that we will have a clearer picture of how PageRank and the other methods behave.
CHAPTER 3
Implementations and experimental results
Nowadays, hardly can a newly born result in any field of informatics achieve agreement if it is not accompanied by experimental figures. This thesis is no exception. In this chapter, an ordinary search engine, Nutch, is introduced. We used Nutch, with some modifications and additions to its code, to crawl data; the obtained data was then processed to obtain the experimental results.
3.1 Search engine Nutch
As we all know, building a search engine is very hard because of the enormous amount of work that needs to be done; that is why an effective search engine such as Google is so successful in the market. But every such effort tends towards making more and more profit, not academic purposes. How could researchers who want to do research on search engines make their experiments? The answer may be Nutch [2]. This section concentrates on some general information about Nutch and the work needed to utilize Nutch for our purpose.
Nutch is an open source Java implementation of a search engine. It provides
all of the tools you need to run your own search engine. But why would anyone
want to run their own search engine? After all, there is always Google. There
are at least three reasons.
1. Transparency: Nutch is open source, so anyone can see how the ranking algorithms work. With commercial search engines, the precise details of the algorithms are secret, so you can never know why a particular search result is ranked as it is. Furthermore, some search engines allow rankings to be based on payments rather than on the relevance of the site's contents. Nutch is a good fit for academic and government organizations, where the perception of fairness of rankings may be more important.
2. Understanding: we do not have the source code to Google, so Nutch is probably the best we have. It is interesting to see how a large search engine works. Nutch has been built using ideas from academia and industry: for instance, core parts of Nutch are currently being reimplemented to use the MapReduce distributed processing model, which emerged from Google Labs last year. And Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend.
3. Extensibility: do not like the way other search engines display their results? Write your own search engine using Nutch! Nutch is very flexible: it can be customized and incorporated into your application. For developers, Nutch is a great platform for adding search to heterogeneous collections of information, with the ability to customize the search interface or to extend the out-of-the-box functionality through the plug-in mechanism. For example, you can integrate it into your site to add a search capability.
Nutch installations typically operate at one of three scales: local file system,
intranet, or whole web. All three have different characteristics. For instance,
crawling a local file system is reliable compared to the other two, since network
errors do not occur and caching copies of the page content is unnecessary (and
actually a waste of disk space). Whole web crawling lies at the other extreme.
Crawling billions of pages creates a whole host of engineering problems to be
solved: which pages do we start with? How do we partition the work between
a set of crawlers? How often do we recrawl? How do we cope with broken links,
unresponsive sites, and unintelligible or duplicate content? There is another
set of challenges to solve to deliver scalable search–how do we cope with hun-
dreds of concurrent queries on such a large dataset?
This portion describes the architecture of Nutch. Nutch has a highly
modular architecture that uses plug-in APIs for media-type parsing, HTML
analysis, data retrieval protocols, and queries. The core has four major compo-
nents:
• Searcher: Given a query, it must quickly find a small relevant subset of a
corpus of documents, then present them. Finding a large relevant subset
is normally done with an inverted index of the corpus; ranking within
that set to produce the most relevant documents, which then must be
summarized for display.
• Indexer: Creates the inverted index from which the searcher extracts
results. It uses Lucene to store its indexes.
• Database: Stores the document contents for indexing and later summa-
rization by the searcher, along with information such as the link struc-
ture of the document space and the time each document was last fetched.
• Fetcher: Requests web pages, parses them, and extracts links from them. Nutch's robot has been written entirely from scratch.
Figure 3.1 outlines the relationships between elements: those that depend on each other are placed in the same box, and the elements they depend on sit in a lower layer. For example, protocol does not depend on net, because protocol is only an interface point to plug-ins that actually provide much of Nutch's functionality.
Crawling
An intranet or niche search engine might only take a single machine a few
hours to crawl, while a whole-web crawl might take many machines several
weeks or longer. A single crawling cycle consists of generating a fetch list from
the webdb, fetching those pages, parsing those for links, then updating the
webdb.

[Figure 3.1: Nutch packages dependencies]

Nutch's crawler supports both crawl-and-stop and crawl-and-stop-with-threshold (which requires feedback from scoring and specifying a floor).
It also uses a uniform refresh policy; all pages are refetched at the same inter-
val (30 days, by default) regardless of how frequently they change. The fetching
process must also respect bandwidth and other limitations of the target web-
site. However, any polite solution requires coordination before fetching; Nutch
uses the most straightforward localization of references possible: namely, mak-
ing all fetches from a particular host run on one machine.
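The host-based politeness policy just described can be illustrated with a simple hash partition (an illustrative sketch, not Nutch's actual code; the function name and machine count are assumptions):

```python
import hashlib
from urllib.parse import urlparse

def machine_for(url, num_machines):
    """Assign a URL to a crawler machine by hashing its host, so that
    every fetch for one host is handled by the same machine."""
    host = urlparse(url).netloc.lower()
    # md5 gives a hash that is stable across processes
    # (Python's built-in hash() is randomized per process)
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_machines

urls = ["http://example.org/a", "http://example.org/b", "http://other.net/x"]
assignments = [machine_for(u, 4) for u in urls]
```

Because the partition key is the host, the two `example.org` URLs always land on the same machine, which makes per-host rate limiting a purely local decision.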
Indexing text
Lucene meets the scalability requirements for text indexing in Nutch. Nutch also takes advantage of Lucene's multi-field, case-folding keyword and phrase
search in URLs, anchor text, and document text. The typical definition of Web
search does not require some index types Lucene does not address, such as
regular-expression matching (using, say, suffix trees) or document-similarity
clustering (using, say, term-vectors).
Indexing hypertext
Lucene provides an inverted-file full-text index, which suffices for indexing
text but not the additional tasks required by a web search engine. In addition
to this, Nutch implements a link database to provide efficient access to the
Web’s link graph, and a page database that stores crawled pages for indexing,
summarizing, and serving to users, as well as supporting other functions such
as crawling and link analysis. Nutch also has to add support for HTML extrac-
tion to Lucene. When a page is fetched, any embedded links are considered
for addition to the fetch list and that link’s anchor text is also stored. What
is eventually indexed by Lucene is only the text of a Web page, though: while
a high-fidelity copy is stored in the page database, font, heading, and other
structural information do not propagate to the Lucene indexing layer.
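The link-and-anchor-text extraction step can be illustrated with Python's standard html.parser module. This is a sketch of the idea, not Nutch's HTML parsing code, and it ignores nested tags inside anchors:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from a fetched page, the way a
    crawler gathers outlinks and their anchor text for a link database."""
    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor) pairs
        self._href = None    # href of the <a> tag we are inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = LinkExtractor()
parser.feed('<p>See <a href="http://example.com/a">page A</a> and '
            '<a href="/b">page B</a>.</p>')
```

Each extracted href would be a candidate for the fetch list, while the anchor text would be stored and later indexed alongside the target page.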
Removing duplicates
The Nutch dedup command eliminates duplicate documents from a set of
Lucene indices for Nutch segments, so it inherently requires access to all the
segments at once. It is a batch-mode process that must be run before serving
searches, to prevent the search from returning duplicate documents. It uses
temporary files containing, for each page, the 5-tuple

(MD5 hash, float score, int indexID, int docID, int urlLen)

Presumably the documents with the same URL were fetched
at different times, so Nutch tries to sort the records so that the ones for the
newest fetches come first. A second pass, using hash=MD5(content) and with
slightly different sorting rules, eliminates multiple URLs for the same docu-
ment from the index.
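The two-pass idea can be sketched as below. The Record fields and the fetch_time ordering here are illustrative stand-ins, not Nutch's exact 5-tuple or sort keys:

```python
from collections import namedtuple

# Simplified stand-in for a dedup record; fetch_time is assumed here
# to order duplicates so that the newest fetch wins.
Record = namedtuple("Record", "url_hash content_hash fetch_time doc_id")

def dedup(records, key):
    """Keep, for each value of `key`, only the most recently fetched record."""
    newest_first = sorted(records, key=lambda r: -r.fetch_time)
    seen, kept = set(), []
    for r in newest_first:
        k = getattr(r, key)
        if k not in seen:
            seen.add(k)
            kept.append(r)
    return kept

records = [
    Record("u1", "c1", fetch_time=10, doc_id=1),  # older fetch of same URL
    Record("u1", "c1", fetch_time=20, doc_id=2),  # newer fetch, wins pass 1
    Record("u2", "c1", fetch_time=15, doc_id=3),  # same content, другой URL
]
pass1 = dedup(records, "url_hash")       # first pass: one record per URL
pass2 = dedup(pass1, "content_hash")     # second pass: one per content hash
```

The first pass collapses repeated fetches of one URL; the second collapses distinct URLs that point to identical content, matching the hash=MD5(content) pass described above.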
Link analysis
Nutch includes a link analysis algorithm similar to PageRank. It is performed
by the DistributedAnalysisTool; even the single-machine LinkAnalysisTool merely
calls into it. Nutch has already demonstrated the ability to harness multiple
servers to compute link ranking for 100M-page subsets of
the World Wide Web. Distributed link analysis is a bulk synchronous parallel
process. At the beginning of each phase, the list of URLs whose scores must
be updated is divided up into many chunks; in the middle, many processes
produce score-edit files by finding all the links into pages in their particular
chunk. At the end, an updating phase reads the score-edit files one at a time,
merging their results into new scores for the pages in the web database.
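The phase structure described above can be sketched as follows. This is a simplified single-process simulation, not Nutch's actual DistributedAnalysisTool code; the contribution formula (score divided by outdegree, as in PageRank) and the chunking scheme are assumptions for the sketch:

```python
from collections import defaultdict

def analysis_phase(links, scores, num_chunks):
    """One bulk-synchronous phase: split pages into chunks, let each
    chunk's worker emit score edits for links into its pages, then merge."""
    pages = sorted(scores)
    chunks = [pages[i::num_chunks] for i in range(num_chunks)]

    # Middle step: each chunk's process finds all links into its pages
    # and writes (page, contribution) score edits.
    edit_files = []
    for chunk in chunks:
        chunk_set = set(chunk)
        edits = [(dst, scores[src] / len(dsts))
                 for src, dsts in links.items()
                 for dst in dsts if dst in chunk_set]
        edit_files.append(edits)

    # Updating step: read the score-edit files one at a time,
    # merging their results into new scores.
    new_scores = defaultdict(float)
    for edits in edit_files:
        for page, contrib in edits:
            new_scores[page] += contrib
    return dict(new_scores)

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = {"a": 1.0, "b": 1.0, "c": 1.0}
updated = analysis_phase(links, scores, num_chunks=2)
```

The barrier between the edit-producing step and the merging step is what makes the process bulk synchronous: no score is updated until every chunk's edits exist.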
Because of its promising features, Nutch can be used for academic purposes, as
previously indicated. The next section discusses in detail the experiments done
with the trivial PageRank computation and with CCP.
3.2 Experiments
Nowadays, almost all papers in the informatics field need experimental figures
to show the correctness of a new idea or the advantages it has over other
methods. This section covers the experiments done by the author of the thesis;
the figures show that CCP is a promising approach, reducing both the
computational time and the number of iterations needed in the PageRank
computation.
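For reference, the baseline power-iteration PageRank whose iteration count CCP aims to reduce can be sketched as below. This is a minimal sketch: the damping factor and tolerance are conventional choices, not the thesis's experimental settings, and dangling pages are simply left without outgoing contributions:

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=1000):
    """Plain power-iteration PageRank; returns the score vector and the
    number of iterations until the L1 change drops below `tol`."""
    pages = sorted(set(links) | {p for dsts in links.values() for p in dsts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for it in range(1, max_iter + 1):
        # Teleportation term plus damped contributions along links.
        new = {p: (1.0 - d) / n for p in pages}
        for src, dsts in links.items():
            share = d * rank[src] / len(dsts)
            for dst in dsts:
                new[dst] += share
        if sum(abs(new[p] - rank[p]) for p in pages) < tol:
            return new, it
        rank = new
    return rank, max_iter

links = {"a": ["b"], "b": ["c"], "c": ["a"]}
rank, iters = pagerank(links)
```

Counting `iters` for a given graph and tolerance gives exactly the quantity the experiments below compare between the trivial computation and CCP.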
3.2.1 Infrastructure and data sets
All experiments were done on the IBM Cluster 1350 at the Center for High
Performance Computing, Hanoi University of Science. The cluster consists of
8 computing nodes, each with 2 Xeon processors at 3.2GHz and 2GB of RAM.
Additionally, the total storage capacity available to the cluster is about
1TB.
Web pages were downloaded from many sources: the web sites of many universities
from all over the world, from the US, from the UK, or fr