Khóa luận Research on node ranking in peer-To-peer networks

Tài liệu Khóa luận Research on node ranking in peer-To-peer networks: ĐẠI HỌC QUỐC GIA HÀ NỘI TRƯỜNG ĐẠI HỌC CÔNG NGHỆ Hoàng Cường Research on node ranking in peer-to-peer networks KHOÁ LUẬN TỐT NGHIỆP ĐẠI HỌC HỆ CHÍNH QUY Ngành: Công nghệ thông tin HÀ NỘI - 2010 Research on Node Ranking – Peer to Peer …. Hoàng Cường Lời cảm ơn Lời đầu tiên em xin bày tỏ lòng biết ơn sâu sắc tới TS. Nguyễn Hoài Sơn, các thầy đã hướng dẫn và là nguồn cảm hứng cho quá trình nghiên cứu của em. Em xin bày tỏ lòng biết ơn tới các thầy, cô giáo trong Khoa Công nghệ thông tin - Trường Đại học Công nghệ - ĐHQGHN. Các thầy cô đã dạy bảo, chỉ dẫn chúng em và luôn tạo điều kiện tốt nhất cho chúng em học tập trong suốt quá trình học đại học đặc biệt là trong thời gian làm khoá luận tốt nghiệp. Hà Nội, ngày 22 tháng 5 năm 2010 Hoàng Cường i Research on Node Ranking – Peer to Peer …. Hoàng Cường ABSTRACT This paper defines and describes a fully distributed NODE ranking algorithm for “peer to peer” systems. The research puts forward new approa...

pdf59 trang | Chia sẻ: hunglv | Lượt xem: 1575 | Lượt tải: 0download
Bạn đang xem trước 20 trang mẫu tài liệu Khóa luận Research on node ranking in peer-To-peer networks, để tải tài liệu gốc về máy bạn click vào nút DOWNLOAD ở trên
ĐẠI HỌC QUỐC GIA HÀ NỘI TRƯỜNG ĐẠI HỌC CÔNG NGHỆ Hoàng Cường Research on node ranking in peer-to-peer networks KHOÁ LUẬN TỐT NGHIỆP ĐẠI HỌC HỆ CHÍNH QUY Ngành: Công nghệ thông tin HÀ NỘI - 2010 Research on Node Ranking – Peer to Peer …. Hoàng Cường Lời cảm ơn Lời đầu tiên em xin bày tỏ lòng biết ơn sâu sắc tới TS. Nguyễn Hoài Sơn, các thầy đã hướng dẫn và là nguồn cảm hứng cho quá trình nghiên cứu của em. Em xin bày tỏ lòng biết ơn tới các thầy, cô giáo trong Khoa Công nghệ thông tin - Trường Đại học Công nghệ - ĐHQGHN. Các thầy cô đã dạy bảo, chỉ dẫn chúng em và luôn tạo điều kiện tốt nhất cho chúng em học tập trong suốt quá trình học đại học đặc biệt là trong thời gian làm khoá luận tốt nghiệp. Hà Nội, ngày 22 tháng 5 năm 2010 Hoàng Cường i Research on Node Ranking – Peer to Peer …. Hoàng Cường ABSTRACT This paper defines and describes a fully distributed NODE ranking algorithm for “peer to peer” systems. The research puts forward new approach for ranking nodes over peer to peer. Synthesizing foundation and promoting new method which is feasible for peer to peer networks. Integration of this algorithm into P2P keyword search can produce dramatic benefit both in terms of effectiveness for users and decrease in network traffic. The incremental search algorithm provided approximately a ten-fold reduction in network traffic for two-word and three-word queries. ii Research on Node Ranking – Peer to Peer …. Hoàng Cường Chapter 1 Table of Contents Abstract.............................................................................. Error! Bookmark not defined. List images........................................................................................................................... 5 List tables............................................................................................................................. 7 Chapter 1: Peer to Peer and Ranking Problem .................................................................... 5 1.1. Peer to Peer ......................................................................................................... 5 1.1.1. Peer to Peer overview ............................................................................. 5 1.1.2. Architecture of Peer to Peer Systems .. Error! Bookmark not defined.7 1.1.3. Distributed hash tables............................................................................. 8 1.2. Ranking in Peer to Peer networks....................................................................... 9 1.2.1. Introduction............................................ Error! Bookmark not defined. 1.2.2. Ranking Roles........................................ Error! Bookmark not defined. 1.2.3. Research’s important objects ................. Error! Bookmark not defined. Chapter 2: Ranking on DHT Peer to Peer Networks......................................................... 11 2.1. Chord Protocol .................................................................................................. 11 2.2. Pagerank............................................................................................................ 12 2.2.1. Description............................................................................................. 12 2.2.2. Algorithms ............................................................................................. 13 2.3. Distributed Computing ..................................................................................... 17 2.2.1. Introduction............................................................................................ 17 2.2.2. Algorithms ............................................. Error! Bookmark not defined. 2.4 if-idf................................................................................................................... 18 Chapter 3: Building a new algorithm for ranking in chord networksError! Bookmark not define 3.1. Targets and Missions of Research .................... Error! Bookmark not defined. 3.2. Idea.................................................................... Error! Bookmark not defined. iii Research on Node Ranking – Peer to Peer …. Hoàng Cường 3.2.1. Major problems to exploit ..................... Error! Bookmark not defined. 3.2.2. Ranking Idea .......................................... Error! Bookmark not defined. Chapter 4: Ranking on Details ........................................ Error! Bookmark not defined. 4.1. Ranking algorithm ............................................ Error! Bookmark not defined. 4.2. Ranking’s features ............................................ Error! Bookmark not defined. Chapter 5: Evaluation ....................................................................................................... 50 Chapter 6: Related Work .................................................................................................. 52 Chapter 7: Contributions and future work........................................................................ 53 References ......................................................................................................................... 54 iv Research on Node Ranking – Peer to Peer …. Hoàng Cường List Images Image 1.1.1 Peer to Peer means connected together.Error! Bookmark not defined. Image 1.1.3 Distributed hash tables example.......... Error! Bookmark not defined. Image 1.2.1 System must to have the ranking engine to find the one.Error! Bookmark not defined. Image 2.1 A 16-node Chord network. example. ... Error! Bookmark not defined. Image 2.2.2: How Pagerank works........................... Error! Bookmark not defined. Image 2.3: Distributed Nodes Graph example ......... Error! Bookmark not defined. Image 3.2.1: Google almost is not exact .................. Error! Bookmark not defined. Image 3.2.2: Intersect Idea ....................................... Error! Bookmark not defined. Image 3.4: Factor Percent......................................... Error! Bookmark not defined. Image 4.1: Bandwidth is the key of ranking trusted. Error! Bookmark not defined. Image 4.1.2: Example of sub-graph semantic rank .. Error! Bookmark not defined. Fig 4: A global graph of both local nodes and external nodes Error! Bookmark not defined. Fig 5: An external local graph without a strategy .... Error! Bookmark not defined. Fig 6: An external local graph .................................. Error! Bookmark not defined. Image 4.2: Eigenvalue .............................................. Error! Bookmark not defined. Image 4.2.3: Random walk....................................... Error! Bookmark not defined. Image 4.2.4: (n+1) graph nodes................................ Error! Bookmark not defined. Image 4.2.5: Graph example - 6 nodes..................... Error! Bookmark not defined. Image 4.2.6: Multiplication result example.............. Error! Bookmark not defined. Image 4.2.7: Multiplication result example – at iterators........Error! Bookmark not defined. v Research on Node Ranking – Peer to Peer …. Hoàng Cường vi Research on Node Ranking – Peer to Peer …. Hoàng Cường List tables Table 3.2.1: The Pagerank converge and HITS converge .......Error! Bookmark not defined. Table 3.2.2: The Pagerank converge increasing to fastError! Bookmark not defined. Table 3.2.3: Pagerank convergence are not steady when Epsilon small ...........Error! Bookmark not defined. Table 3.2.4: HITS convergence ( take lots time than Pagerank)...............Error! Bookmark not defined. Table 5.1: the number of iterators which converges……………………….Error! Bookmark not defined. 7 Research on Node Ranking – Peer to Peer …. Hoàng Cường Chapter 1: Peer to Peer and Ranking Problem 1.1. Peer to Peer A peer-to-peer, commonly abbreviated to P2P, is any distributed network architecture conceive in associate that make a portion of their resources (such as processing power, disk storage or network bandwidth) directly available to other network partners, without the need for central coordination instances (such as servers or stable hosts). Peers are both suppliers and consumers of resources, in contrast to the traditional client–server model where only servers put out, and clients snack. Peer-to-peer was popularized by file sharing systems like Napster. File sharing is the practice of distributing or providing access to digitally stored information, such as computer programs, multi-media (audio, video), documents, or electronic books. It may be implemented through a variety of storage, transmission, and distribution models and common methods of file sharing incorporate manual sharing using removable media, centralized computer file server installations on computer networks, World Wide Web-based hyperlinked documents, and the use of distributed peer-to-peer networking. 1.1.1. Peer to Peer overview In its simplest form, a peer-to-peer (P2P) network is created when two or more PCs are connected and share resources without going through a separate server. A P2P network can be an ad hoc connection—a couple of computers connected via a Universal Serial Bus to transfer files. A P2P network also can be a permanent infrastructure that links a half-dozen computers in a small office over copper wires. Or a P2P network can be a network on a much grander scale in which special protocols and applications set up direct relationships among users over the Internet. The initial use of P2P networks in business followed the deployment in the early 1980s of free-standing PCs. In contrast to the mini-mainframes of the day (e.g. Fuitsu/ICL, IBM AS/400, IBM Mainframe, Unisys, … ), which used by over 16,000 8 Research on Node Ranking – Peer to Peer …. Hoàng Cường organizations, the list, or selections from this list, will be of particular interest to those companies who either supply medium sized to large scale systems or who offer products and/or services related to the use of such systems; any data is supplied with the named head of IT, as standard.. Mini- mainframes served up word processing and other applications to dumb terminals from a central computer and stored files on a central hard drive, the then- new PCs had self-contained hard drives and built-in CPUs. The smart boxes also had onboard applications, which meant they could be deployed to desktops and be useful without an umbilical cord linking them to a mainframe. Shared file and printer access within a local area network may either be based on a centralized file server or print server, sometimes denoted client–server paradigm, or on a decentralized model, denoted peer-to-peer network topology or Workgroup (computer networking). In client–server communications, a client process on the local user computer takes the initiative to start the communication, while a server process on the file server or print server remote computer passively waits for requests to start a communication session. In a peer-to-peer network, any computer can be server as well as client. It’s fantastic and efficient. In effect, every connected PC is at once a server and a client. There's no special network operating system residing on a robust machine that supports special server- side applications like directory services (specialized databases that control who has access to what). Image 1.1.1 : “Peer to Peer” means “connected together” 9 Research on Node Ranking – Peer to Peer …. Hoàng Cường In a P2P environment, Unlike client-server networks, where network information is stored on a centralized file server PC and made available to tens, hundreds, or thousands client PCs, the information stored across peer-to-peer networks is uniquely decentralized. Because peer-to-peer PCs have their own hard disk drives that are accessible by all computers, each PC acts as both a client (information requestor) and a server (information provider). In the diagram below, three peer-to- peer workstations are shown. Although not capable of handling the same amount of information continuance that a client-server network might, all three computers can communicate directly with each other and share one another's resources. 1.1.2 Architecture of P2P systems Peer-to-Peer Architecture distinguishes itself by its distribution of power and function. Rather than concentrating its power in the server, Peer-to-Peer models rely on the power and bandwidth of participants. They form ad hoc connections between nodes for sharing all kinds of information and files. Peer-to-Peer discards hierarchical notions of clients and servers (clients at the top, servers on the bottom) and replaces it with equal peer nodes that function simultaneously as clients and servers. This also discards the idea of a central server, which exists in Client-Server Architecture. There are several classifications of Peer-to-Peer networks. These include pure/hybrid and structured/unstructured Peer-to-Peer networks. Pure P2P networks merge the role of clients and servers as equals and do not provide a central server for managing the network or a central router that forwards requests to other networks. Hybrid P2P models, on the other hand, do contain a central server that stores peer information and responds to request for information stored on that server. In this configuration, peers host available resources since there no central server provides this function. Peers also make central servers aware of what resources they want to share and make those resources available to peers that request them. Also, route terminals function as used addresses and are indexed to find an absolute address. The structure of P2P networks is determined by the nature of the overlay network, which consists of all participating peers as equal nodes. Nodes in an overlay network are connected through virtual or logical links that create a path to the underlying network. Essentially, overlay networks are network built on top of other networks. Peer- to-Peer networks are considered overlay networks because they are usually built on top of the Internet. Structured P2P networks use a global protocol so that searches can be routed TO any peers/nodes BY any peers/nodes on the network. To retrieve rare files, more structured overlay links are required. The most common structured P2P network is the distributed hash table (DHT). DHTs are decentralized distributed systems that store names and values. Any participating node in the network can lookup and retrieve values. Maintenance of the DHT mapping is distributed among the nodes. The ownership of each file is assigned to a peer, but the 10 Research on Node Ranking – Peer to Peer …. Hoàng Cường addition or deletion of peers or files doesn’t cause major disruptions. This makes them very scalable. Unstructured P2P networks establish links more arbitrarily. To join, a peer only has to copy the links of an existing node and then add its own links as it develops. To find a desired file, however, the request must be flooded throughout the network. This doesn’t always necessarily return the desired results if the file being requested is rare. There is no correlation between peer and content. Also, flooding increases network traffic, slowing down responses and file sharing. The primary advantage of P2P networks is that all clients contribute their resources. These resources include computing power, bandwidth, and storage space. In traditional Client-Server models there are a fixed number of servers, so the addition of clients slows down network processing. In Peer-to-Peer models, as nodes are added, system resources increase (contributed by the added nodes) to accommodate demand. Comparisons Client-Server Architecture provides certain advantages over these other network models. For example, Client-Server models offer easier maintenance, security, and administration. For example, encapsulation makes it possible for servers to be repaired, upgraded, or replaces without clients being affected. Encapsulation is the process by which an object can hide its data and methods without revealing them to users. Also, because all data is stored on servers, the data is more secure. Servers control access and ensure that only screened clients can access and manipulate data. Again, since data is centralized on servers, updates occur on the server and are then transmitted to clients as they request services. In P2P models, updates must be applied and copied to peers in the network, which requires a lot labor and is prone to errors. However, Client-Server paradigms often suffer from network traffic congestion. This is not a problem for P2P, since network resources are in direct proportion to the number of peers in the network. Also, Client-Server paradigms lack the robustness of P2P networks. Robustness refers to a network’s ability to bounce back or continue functioning if one of the components fails. If a server fails in Client-Server models, the request cannot be completed. In P2P, a node can fail or abandon the request. Other nodes still have access to resources needed to complete the download. 11 Research on Node Ranking – Peer to Peer …. Hoàng Cường 1.1.3. Distributed hash tables Image 1.1.3: Distributed hash tables example Distributed hash tables (DHTs) are a class of decentralized distributed systems that provide a lookup service similar to a hash table: (key, value) pairs are stored in the DHT, and any participating node can efficiently retrieve the value associated with a given key. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows DHTs to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures. DHTs form an infrastructure that can be used to build peer-to-peer networks. Notable distributed networks that use DHTs include BitTorrent's distributed tracker, the Kad network, the Storm botnet, YaCy, and the Coral Content Distribution Network. 1.2. Ranking in Peer to Peer networks 1.2.1 Introduction This thesis discusses the execution distributed page ranking technology in the peer-to-peer network crown which constructs. The distribution page ranking needs, because net's size grows by the remarkable speed, and the centralized page ranking is not may promote. Open style system PageRank is proposed in the paper base in Google use traditional PageRank. 12 Research on Node Ranking – Peer to Peer …. Hoàng Cường Image 1.2.1: System must to have the ranking engine to find the one We then proposed that some distributed page ranking algorithm, proves their convergence partially, and discusses some interesting products they. The indirect transmission in this article was introduced that reduces representence which the representence ceiling and achieves between page rankers may promote. Between the convergence time which and the band width relations consumes are also discussed. Finally, we verify certain discussions by the basis true data set's experiment. 1.2.2. Ranking Roles The determination “the importance” the link structure became based on the page ranking's homepage was searching an engine's important technology. Specially, the hit algorithm maintains each page a jack and the authority score, the authority and the jack score calculates based on the page connection relations in the hyperlinked environment. Google the use PageRank algorithm determines “the score” the homepage double counting matrix eigenvector/feature vector/proper vector. When net's size growth, it difficult and becomes difficultly, for the existing search engine can include the entire net. We need to be may promote about page quantity and user's quantity distributed search engine. Not only in a distributed search engine, the page ranking is essential in its improvement's inquiry result centralization relative, but should and the availability carries out distributed for the measurable quantity. A direct way achieves distribution the page ranking to call the hit or the PageRank algorithm to the distributed environment. But it is not to do a that trivial matter. Two hits and PageRank are the redundant algorithms. Each pack of riding instead of walking needed previously the step estimated result, the synchronous 13 Research on Node Ranking – Peer to Peer …. Hoàng Cường operation needed. However, achieves the synchronous communication, in the width disseminates in the distributed environment is difficult. Moreover, must consider carefully the page divides into with the representence ceiling, when carries out the distributed page ranking. The coordinated cover network which constructs was won the prestige to take recently from has organized, toughness, a large-scale distribution system's construction platform. In this article, we try to carry out the effective page ranking in the peer-to-peer network crown which constructs. We first propose according to the google PageRank some distributed page ranking algorithm, and proposes about theirs some interesting products and the result. Is more important than because of the representence ceiling CPU and in the distributed page ranking's memory usage, our then discussion about relaxes the representence ceiling's strategic page to divide into with the idea. Through this execution, our paper makes the following contribution: • Through the use true data set, we provide two kind of distributed page ranking algorithm, proves their convergence partially, and verifies their characteristic. • We recognize main the point in dispute and question and the distribution page ranking concern in the P2P network crown which constructs. Chapter 2: Foundation of Ranking on DHT Peer to Peer Networks 2.1. Chord Protocol Using the Chord lookup protocol, node keys are arranged in a circle. The circle cannot have more than 2m nodes. The circle can have IDs/keys ranging from 0 to 2m − 1. IDs and keys are assigned an m-bit identifier using consistent hashing. The SHA-1 algorithm is the base hashing function for consistent hashing. Consistent hashing is integral to the robustness and performance of Chord because both keys and IDs (IP addresses) are uniformly distributed and in the same identifier space. Consistent hashing is also necessary to let nodes join and leave the network without disruption. Each node has a successor and a predecessor. The successor to a node (or key) is the next node (key) in the identifier circle in a clockwise direction. The predecessor is counter-clockwise. If there is a node for each possible ID, the successor of node 2 is node 3, and the predecessor of node 1 is node 0; however, normally there are holes in the sequence. For example, the successor of node 153 may be node 167 (and nodes 14 Research on Node Ranking – Peer to Peer …. Hoàng Cường from 154 to 166 will not exist); in this case, the predecessor of node 167 will be node 153. Since the successor (or predecessor) node may disappear from the network (because of failure or departure), each node records a whole segment of the circle adjacent to it, i.e. the r nodes preceding it and the r nodes following it. This list results a high possibility that a node is able to correctly locate its successor or predecessor, even if the network in question suffers from a high failure rate. Image 2.1: A 16-node Chord network example The Chord protocol is one solution for connecting the peers of a P2P network. Chord consistently maps a key onto a node. Both keys and nodes are assigned an m-bit identifier. For nodes, this identifier is a hash of the node's IP address. For keys, this identifier is a hash of a keyword, such as a file name. It is not uncommon to use the words "nodes" and "keys" to refer to these identifiers, rather than actual nodes or keys. There are many other algorithms in use by P2P, but this is a simple and common approach. 2.2. Pagerank 15 Research on Node Ranking – Peer to Peer …. Hoàng Cường 2.2.1 Description PageRank is the key link parsing algorithm, names by the Google Co-founder Lary Page, uses from assigns a digit as extra as the hyperlinked set of each element document, for example world wide network, with "Goal Google Internet search engine; measuring" It in set relative importance. Perhaps the algorithm is utilized in the individual all collection mutual quotation and about. It assigns to all specific element E digital weight also calls E PageRank, and indicated by PR (E). 2.2.2 Algorithm PageRank will be the possibility distribution which will use in is symbolizeed possibly willfully clicks in the link person will arrive all special data. PageRank may calculate for all size document collection. In several research papers are divided evenly by the supposition release between collection all documents in the computational process at the beginning. The PageRank computation requests several passes, calls " iterations" To reflects the theory real value strictly through the adjustment approximate PageRank collection value. The possibility is expressed takes in 0 and a 1. scope value. A 0.5 possibility are expressed together takes " 50% chance" Something occurrence. Therefore, there 0.5 method PageRank will be 50% opportunity click the human who will link willfully in one will be directed to and 0.5 PageRank this article. Simplified algorithm 16 Research on Node Ranking – Peer to Peer …. Hoàng Cường Image 2.2.2: How PageRank Works Supposition four homepage microcosms: A, B, C and D. The PageRank initial approximation will be divided evenly between these four documents. Therefore, each document 0.25 will start from estimate PageRank. By the PageRank original shape original value is completely 1. This means that all data the sum total is the page total in the net. The PageRank newest edition (will see also the following convention) the supposition in 0 and 1. between possibility distributions. Therefore here will use together the simple possibility distribution original value 0.25. If page B, C and D each link only, they every one will discuss 0.25 PageRank to A. All PageRank PR () will gather therefore in this simplification's system to A, because all links will aim at A. This is 0.75. Supposition page B has a link to call C, and calls A, but page D has the link to all three data. In the link vote's value in the page all links outward is divided. Therefore, page B calls A for quite a 0.125 value's vote and quite a 0.125 value vote calls C. D' Only 1/3; s PageRank is A' Counting; s PageRank (about 0.083). In other words, PageRank linked outward by one discussed with document' Is equal; s has links outward the L() normalization quantity division PageRank score. (supposition, a concrete URL link each document only counting). In the general case, the PageRank value for any page u can be expressed as: , i.e. the PageRank value for a page u is dependent on the PageRank values for each page v out of the set Bu (this set contains all data linking to page u), divided by the number L(v) of links from page v. Damping factor The PageRank theory will maintain at the link clicks on the surf rider who will fictionalize to stop willfully clicking finally. Possibility, in all step, the human will 17 Research on Node Ranking – Peer to Peer …. Hoàng Cường extend will be damping factor various research has tested different damping factor D., but the usual supposition, the damping factor will be established nearby 0.85. The damping factor from 1 is subtracted (, and in algorithm some variations, result divides (N) by document the quantity in collection), and this deadline then increases the PageRank score sum total product which and follows on somebody's heels to the damping factor. Namely So any page's PageRank is derived in large part from the PageRanks of other data. The damping factor adjusts the derived value downward. The original paper, however, gave the following formula, which has led to some confusion: Then any page' s PageRank is obtaining majority of from other page of PageRanks. The damping factor downward adjustment obtains value. Original text, however, has given the following convention, has caused some confusions: Between them the difference is in the first convention sum total PageRank value to one, but obtains in second convention each PageRank is multiplied by N, and the sum total becomes N. In page and Brin' A statement; s paper " All PageRanks sum total is one" and supports the above convention by other Google employee's request the first distortion. Each time it crawls the net, and reconstructs its index, Google evaluates the PageRank score. When Google increases the document quantity in its collection, PageRank initial approximation for all document reduction. The convention use obtains tastelessly, in several clicks and switch after random page a random surf rider’s model. The page PageRank value reflection random surfrider will land in that page through the click in the link opportunity. May understand that takes the condition is the page, and the transition is equally all possible and is between the page link Markov chain. If the page link to other data, it has not become the water trough, and terminates the random surfing the process. However, the explanation is quite simple. If the random surf rider arrives at the water trough page, it picks another URL stochastically, and continues again the surfing. When calculates PageRank, the page has not linked outward the supposition and in collection other data of connections. Therefore their PageRank score is divided evenly 18 Research on Node Ranking – Peer to Peer …. Hoàng Cường 19 in other data. In other words, is fair with is not water trough's page, these random transitions increase to net's all knots, when remaining possible usual d = 0.85, estimated that uses their browser' from a frequency common surfrider; s bookmark characteristic. Therefore, the equality is as follows: where p1,p2,...,pN are the data under consideration, M(pi) is the set of data that link to pi, L(pj) is the number of outbound links on page pj, and N is the total number of data. jacency matrix. This makes PageRank a particularly elegant metric: the eigenvector is The PageRank values are the entries of the dominant eigenvector of the modified ad where R is the solution of the equation where the adjacency function is 0 if page pj does not link to pi, and normalized such that, for each i , i.e. the elements of each column sum up to 1 (for more details see the computation section below). This is a variant of the eigenvector centrality measure sed commonly in network analysis. the PageRank eigenvector are fast to approximate (only a few iterations are needed). u Because of the large eigengap of the modified adjacency matrix above, the values of Research on Node Ranking – Peer to Peer …. Hoàng Cường As a result of the Markov theory, may display page PageRank be the possibility is in that page after many clicks. This accidentally equals t − 1 t is expectation of) the place request's click (or jumps willfully quantity obtains from the page returns to itself. The major object is it favors a older page, because is new, the very good first page, will not even have many links, only if it will be an existing stand (is a stand part crowded wrap page which will connect, for example Wikipedia). The Google table of contents (itself derivative opening table of contents project) allows the user to look in the category the PageRank sorting result. The Google table of contents is PageRank determined directly the demonstration order Google provides only service. In Google' the s other search service (e.g. its main net search) PageRank uses in considering the relevance in search result demonstration dozens of data. Several strategies proposed that accelerates PageRank the computation. Operated PageRank various strategies to arrange to use diligently together the improvement search result ranking and decides as the currency to do to the link the advertisement. These strategies have attacked the PageRank concept reliability severely, seeks determined that which documents in fact take seriously by the net community. Google knew that the punishment the link farm which and other plans designs inflates artificially PageRank. Google starts in December, 2007 to punish effectively sells the paid text link the stand. How does Google identify the link farm and other PageRank operational tool is in Google' In; s business secret. 2.3 Distributed Computing The distributed computing is the computer science area research distributional system. Distributional system through a computer network service including many autonomous computers. The computer achieves a common goal mutually according to the order interaction. The computer program which runs in the distributional system said that a distributed program, the distribution programming writes such program' s process. And the distributed computing mentions the use distributional system explanation estimate question. In the distributed computing, the question is divided many responsibilities, the computer explains everybody. 20 Research on Node Ranking – Peer to Peer …. Hoàng Cường Image 2.3: Distributed Nodes Graph example We pass use computer’s hope automation; s many responsibilities held responsible with answer the type: We hope to ask the question, and the computer should cause the answer. In the computer science theoretically, is called the estimate question like this voluntarily. It is estimated that the question has each template including the instance is an explanation officially together. The example is the question which we asked that and the explanation is anticipates the answer to these questions. (How does the theory computer science seek needs to understand the estimate question possibly through use that the complex theory solution computer (the computability theory) and high efficiency computation). In the tradition, said the question perhaps through the use solution computer, if perhaps we design all concrete instances are correct explanation algorithm causes. Perhaps such algorithm possibly implements the computer program which runs in an general calculator: Studies from the input question instance's holiday eye, carries out some computation, and causes the explanation to adopt the product. Formalism for example random access ' perhaps the s machine or the universal Turing machine use the achievement to carry out such algorithm continuously general calculator' s abstraction model. In many computer situations, consistent and distributed computing area research similar question or execution interaction process system computer: Which estimate question how can solve in such network and the high efficiency place? However, it is not obvious in concurrent or the distributional system situation, “solves the problem is all meanings” 2.4 Computing PageRank in a distributed system 21 Research on Node Ranking – Peer to Peer …. Hoàng Cường Lectured the net graph in distribution system's recent research work to divide into messes up the website or the domain case. The net is molded takes many messes up the network server. Is divided in net's ultra link two categories, the internal cut-off link and the mutual server link. The internal server link is between the page link in the server, and these links use in calculating on each server's place PageRank intermediate vector. The mutual server link is between the page link with the different server, and they use in calculating ServerRank. ServerRank surveys the different network server's relative importance. The server which submits is being merged finally from many network server's result causes an arrangement ultra link name list. The ranking algebra proposed that deals with the ranking in the different granularity level, is utilized possibly also in gathering the place ranking and the stand ranking obtains the global ranking. Has in one disperses the system fully in the PageRank approximation work, each of the same generation is autonomous, and perhaps of the same generation mutually overlaps. Was proposing the JXP algorithm, each of the same generation calculates the place PageRank score, then meets other of the same generations and increases it gradually through the exchange information willfully about the global net graph knowledge, then recomputation in place of the same generation's PageRank score. This conference and the recomputation process is duplicated, collects the enough information until of the same generation. If of the same generation meets the sufficient number of times exchange information finally, JXP score polymerization to the real global PageRank score. Supposes is each page of out degree in global graph awareness. However, these operations are providing the approximation the focal point are the global graph, in centralized system or distribution system. 2.5. tf-idf The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf– idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. One of the simplest ranking functions is estimated by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model. Motivation Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow." A simple way to start out is by eliminating documents that do not contain all three words "the," "brown," and "cow," but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency. 22 Research on Node Ranking – Peer to Peer …. Hoàng Cường However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to use the word "the" more, without giving enough weight to the more meaningful terms "brown" and "cow". Also the term "the" is not a good keyword to distinguish relevant and non-relevant documents and terms like "brown" and "cow" that occur rarely are good keywords to distinguish relevant documents from the non-relevant documents. Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely. Mathematical details The term count in the given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document) to give a measure of the importance of the term ti within the particular document dj. Thus we have the term frequency, defined as follows. where ni,j is the number of occurrences of the considered term (ti) in document dj, and the denominator is the sum of number of occurrences of all terms in document dj. The inverse document frequency is a measure of the general importance of the term (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient). with • | D | : total number of documents in the corpus • : number of documents where the term ti appears (that is ). If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to use Then A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. The tf-idf value for a term will always be greater than or equal to zero. 23 Research on Node Ranking – Peer to Peer …. Hoàng Cường Example Consider a document containing 100 words wherein the word cow appears 3 times. Following the previously defined formulas, the term frequency (TF) for cow is then 0.03 (3 / 100). Now, assume we have 10 million documents and cow appears in one thousand of these. Then, the inverse document frequency is calculated as log(10 000 000 / 1 000) = 4. The TF-IDF score is the product of these quantities: 0.03 × 4 = 0.12. Chapter 3: Building a new algorithm for ranking nodes in chord networks 3.1 Targets and Missions of Research The P2P deployment is the building distributed search network. Proposed that the system support discovers with retrieval all results, but lacks the essential information to arrange them. User, however, is mainly to most is related and is not all most possible results to be interested. The use random sampling, we expand the well-known information retrieval ranking algorithm class they may apply like this in this distributed establishment. How do we analyze the ceiling our method, and the quota our system scale along with the document, the system size, the inquiry document to ties the mapping correctly (uniform to non-uniform) and the type increases the digit (rarely to universal deadline). Our analysis and the imitation indicated that a) these extensions are the high efficiency, and possibly calls with a ceiling to the large-scale system, and b) uses the result accuracy which and the centralized implementation the distributed ranking obtains is comparable. 3.2 Idea: 24 Research on Node Ranking – Peer to Peer …. Hoàng Cường Table 3.2.1: The Pagerank converge and HITS converge When I write a small code, I see that’s the converges of two algorithms: HITS- Pagerank. HITS’s better converge more than Pagerank. It’s can be see as the experiment’s image.. experiment test with Graph 1000 nodes. The blue line is the converge ( iterators ) of Pagerank The red line is the converge ( iterators ) of HITS Table 3.2.2: The Pagerank converge increasing to fast 25 Research on Node Ranking – Peer to Peer …. Hoàng Cường I also see that’s the converges of the algorithm: Pagerank. It’s can be see as the experiment’s image that’s the time to calculate converge of Pagerank algorithm. It goes to much faster. It can’t be done in peer to peer network ( the fast increase of time is more than the fast increase of graph’s size a lot ) The When I write a small code, I see that’s the converges of two algorithms: HITS-Pagerank. HITS’s better converge more than Pagerank. It’s can be see as the experiment’s image.. The Graph with 1000 nodes. The blue line is the converge ( iterators ) of Pagerank The red line is the converge ( iterators ) of HITS Image taken from the experimental results, the convergence of the Pagerank algorithm. 1000 nodes network simulation, 2000 nodes, 4000 nodes. (Execution time calculation) Light blue line is the only way to calculate the time convergence of the Pagerank algorithm is at the network node 4000. Dark blue line is the only way to calculate the time convergence of the Pagerank algorithm is at the network node 2000. The red dot is the only way to calculate the time convergence of the Pagerank algorithm is at the network node 1000. Computing time increases when the number of exponential increase network node Want to calculate all the nodes in the network takes several performance (larger network node -> lose greater efficiency (greater than performance node added) Table 3.2.3: Pagerank convergence are not steady when Epsilon small Image taken from the experimental results of the convergence Pagerank algorithm. Network Simulation 1000 node. (Execution time calculation) 26 Research on Node Ranking – Peer to Peer …. Hoàng Cường Dark blue line includes the red dot is the only way to calculate the time convergence of the Pagerank algorithm is at the network node 1000 under the different conditions randomly. With a little error Epsilon, convergence of k Pagerank is completely stable. (Change gap) 3.2.4: HITS convergence ( steady+ take lots of time more than Pagerank) Image taken from the experimental results the convergence of HITS algorithm. Network simulation is 1000 nodes. (Terms Iterator going to converge). Dark blue line includes the red dot is the only way to calculate the time convergence of HITS algorithm is at 1000 node network under the different conditions randomly. With a little error Epsilon, convergence of HITS stable than PageRank. Idea: Combined HITS and Pagerank. HITS method n nodes filter out content authentication best results) and used to calculate the Pagerank that n nodes. (n very little compared to the total number of network nodes) Advantages of this new approach: • Exact result • Accurate • Computing easier. • Easy feasible • faster Let’s going to see what’s happened to get a system which has some features like that in deeply later. Basically, let’s analyze a simple example in Google search engine: 27 Research on Node Ranking – Peer to Peer …. Hoàng Cường Image 3.2: Google almost is not exact In the results we can see that the topic results. They’re different, and no relate. And so some are exact, but some results are not exactly ( many ). 3.2 Ranking Idea: Customized semantic query answering, personalized search (Google is the first of the crucial search engines to take personalized results on a massive scale. Weighing a number of factors including but not limited to user history, bookmarks, community behavior and site click-through rate and stickiness, Google is providing results that are specific to what they believe you are searching for. Currently this service is only available to those who are logged into their Google account), focused crawlers and localized search engines frequently focus on ranking the Nodes contained within a sub-graph of the global Peer to Peer Nodes graph. The claim for these utilizations is to estimate PageRank-style scores expeditiously on the sub-graph, i.e., the ranking must manifest the global link structure of the Peer to Peer Node graph exactly ( which means calculating returns true global pagerank-scored order ) but it must do so without bearing the high overhead affiliated to global computation. We propose a framework of an exact unraveling and an analogous solution for computing ranking on a sub- graph. without running PageRank on the whole Nodes Graph. So, Approaching sub- graph and multiple sub-graphs respectively is the main discovery in this ranking research. 28 Research on Node Ranking – Peer to Peer …. Hoàng Cường Image 3.2.2: Intersect idea The Rank_local_idea algorithm is an rigorous scientific method with the supposing that the scores of “outside” Nodes are known. We analyse that the Rank_local_idea values for nodes in the sub-graph converge. Since the PageRank- style values of out nodes may not expectedly be achievable, we introduce the Pagerank_local algorithm to evaluate score results for the sub-graph. We use graph EG ( Outside Graph ) to symbolize a major graph. Both Rank_local_idea and Pagerank_local put on the set of outside nodes with an outside node graph EG and magnify the sub-graph with links to graph EG. They also modify the PageRank-style transition matrix ( The transition possibility distribution be symbolizeed by a matrix ) with respect to graph EG. We analyze the L1 distance between Rank_local_idea scores and Pagerank_local scores of the sub-graph and show that it is within a constant factor of the L1 distance of the outside nodes (e.g., the true PageRank scores and uniform scores assumed by Pagerank_local). We compare Pagerank_local and a stochastic complementation approach (SC), a current best solution for this problem, on different types of sub-graphs. Pagerank_local has similar or superior performance to SC and typically improves on the runtime performance of SC by an order of magnitude or better. We demonstrate that Pagerank_local provides a good approximation to PageRank for a variety of sub-graphs. 3.3 Idea finding a subset Many assignments have to find salient and diverse items to satisfy the user’s information need. This problem has been addressed in document summarization, information retrieval, and various data mining applications. In this paper, we present a new method that can select salient and diverse items from many information contents. We formulate the mission as a graph summarization problem: given a weighted graph, how can we find a subset of salient and diverse vertices to summarize the graph, subject to some pre-specified constraints. We present a linear programming based approximate algorithm to optimize an objective function which takes into account both salience and diversity. The method was applied to two missions: 29 Research on Node Ranking – Peer to Peer …. Hoàng Cường • multi-document summarization, which brings out salient sentences from sentence graphs; • mining symbolizeative experts in the data-mining community from co-author graphs. In comparison to the state-of-the-art graphed based methods, our method produces superior results in these applications. 3.4: Finding ranking factors Image 3.4: Factor Percent 30 Research on Node Ranking – Peer to Peer …. Hoàng Cường Chapter 4: Ranking algorithm details 4 .1: Idea The explosion of files sharing available on the Peer to Peer Networks has made the ranking of peer to peer systems an high-priced but unavoidable component. The more users try to search, the more system overloads. Since hyper-bandwidth-links from one node to another usually points to an “Authorization” or “Commendation”, bandwidth-link analysis plays a essential role in elect the “importance” of nodes in network. PageRank and HITS are two foremost approaches in this area. PageRank iteratively estimates the score of a page based on the scores of its parent Image 4.1: Bandwidth is the key of ranking trusted Data as above equation 31 Research on Node Ranking – Peer to Peer …. Hoàng Cường HITS separates the role of each node into a hub or authority. In the HITS algorithm, the first step is to retrieve the set of results to the search query. The computation is performed only on this result set, not across all Graph Nodes. Authority and hub values are defined in terms of one another in a mutual recursion. An authority value is computed as the sum of the scaled hub values that point to that page. A hub value is the sum of the scaled authority values of the pages it points to. Some implementations also consider the relevance of the linked nodes. The algorithm performs a series of iterations, each consisting of two basic steps: • Authority Update: Update each node's Authority score to be equal to the sum of the Hub Scores of each node that points to it. That is, a node is given a high authority score by being linked to by nodes that are recognized as Hubs for information. • Hub Update: Update each node's Hub Score to be equal to the sum of the Authority Scores of each node that it points to. That is, a node is given a high hub score by linking to nodes that are considered to be authorities on the subject. The hub score estimates the value of its bandwidth-links to other nodes and the authority score estimates the importance of the node. These algorithms are expensive because of the number of node data/objects involved in the computation. On 15 November 2008, The Pirate Bay announced that it had reached over 25 million unique peers. And according to the Pirate Bay statistics, as of December 2009, The Pirate Bay has over 4 million registered users, although registration is not necessary to download torrents. To make ranking feasible, and to manifest the diversity of node users' information needs, peer to peer networking applications such as semantic search ("Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed system, to generate more relevant results. Author Seth Grimes lists "11 approaches that join semantics to search", and Hildebrand et al. provide an overview that lists semantic search systems and identifies other uses of semantics in the search process.), focused crawlers (A focused crawler or topical crawler is a web crawler that attempts to download only web data that are relevant to a pre-defined topic or set of topics. Topical crawling generally assumes that only the topic is given, while focused crawling also assumes that some labeled examples of relevant and not relevant data are available. ), localized search engines, and personalized search have emerged. They all have a common objective to rank a sub-graph. The first intriguing application is a focused crawler, also called a topical crawler. A focused crawler is interested in collecting a subset of the node data that are related to a specific topic. Compared to a standard crawler which can easily get lost and waste resources, a focused crawler acquires relevant data using a Best First Search; it selects links based on their scores. In contrast to focused crawlers which are 32 Research on Node Ranking – Peer to Peer …. Hoàng Cường topic specific, a localized search engine indexes a subset of node data that are within a specific domain. The node piece retrieved by the focused crawler (or localized search engine) is a sub-graph of the global node graph. Only PageRank scores for local data in the sub-graph are of interest to users. Users submit queries to the sub-graph collected by a focused crawler and local query answers are returned to the user. The ranking on this local graph, however, should reflect the link structure of all node data. Another interesting scenario is semantic ranking. ObjectRank creates a schema graph to model the semantic connections between entity sets, e.g., authors or conferences. The semantic connections are associated with an authority transfer assignment which can be arbitrarily set by a domain expert based on her interpretation of the domain. Figure 2 shows an authority transfer schema graph for DBLP. Building on previous work on how to model contextual information for desktop search and how to implement semantically rich information exchange in social networks. Peer-Sensitive ObjectRank for ranking resources on the desktop. The algorithm takes into account different trust values for each peer, generalizing previous biasing PageRank algorithms. The ObjectRank system applies the random walk model, the effectiveness of which is proven by Google's PageRank, to keyword search in databases modeled as labeled graphs. The system ranks the database objects with respect to the user- provided keywords. The PageRank technique assigns to each page p a score based on the score of the pages pointing to p. Hence, pages pointed by many high-score pages receive a high score as well. Alternatively, the score of p is equal to the possibility that a random surfer, starting from a random page, will be at p at a specific time. For more information on PageRank see the original PageRank paper. While ObjectRank is flexible and allows the tuning of ObjectRank scores by a domain expert, it leads to computational challenges if a search engine has to consider all possible combinations of keywords and authority transfer assignments. Recent research on reformulating ObjectRank scores based on individual user feedback and a graph exploration framework for the biological node highlights the optimization challenges of query answering and ranking for the semantic Node. If we can model a sub-graph to contain the subset of data associated with the entity sets of interest to some domain expert, we can then define the ObjectRank problem as a problem of ranking a sub-graph. This problem, too, is to exploit existing PageRank scores for other regions of the graph that are not of interest to the domain expert, and whose scores may also remain largely unchanged. Figure 3 shows an example where a sub-graph is associated with an authority transfer assignment and the outside Node are beyond the focus of the domain expert. 33 Research on Node Ranking – Peer to Peer …. Hoàng Cường Another application that involves ranking a sub-graph is peer-to-peer networks. The advent of peer-to-peer(P2P) technology has further encouraged data information retrieval by jump on distributed computing power, storage, and connectivity. A distributed or decentralized system has multiple peers or servers, each of which puts in storage its own sub-graph of the Peer to Peer nodes. A user may take queries on one peer and ranked query answers that are available locally are presented to the user. The ranking relies upon the context of the query. A final scenario is a absorption of the constant change of the Peer to Peer nodes - Peer to Peer node data. The ranking of data needs to be updated frequently, especially for the sub-graph of the nodes that experiences the most change. This sub- graph can be either a set of dangling files that crawlers have not as yet crawled, referred to as the web “frontier”, or the set of data that are most affected by updates. It is desirable that any strategy to update the ranking of this sub-graph exploits existing PageRank scores for other regions of the graph which may remain largely unchanged. In response to these many motivating applications, we address the problem of computing ranking scores for a sub-graph. For ease of presentation, and to compare with existing approaches, we use the PageRank metric for explanation and experiments. However, our general approaches can be applied to estimate ObjectRank scores as well. We note that current ranking techniques (to be discussed in the next section) either bear the cost of a global computation to get an accurate ranking, or they have to solve another potentially difficult problem: to determine a relevant super-graph of web data that impact the rank of the sub-graph. Our challenge is to obtain an accurate ranking that reflects the global link structure of the Web graph and to do so without bearing the high overhead associated with a global PageRank computation or having to solve the difficult problem of identifying a relevant supergraph. We would also like to exploit pre-estimated 34 Research on Node Ranking – Peer to Peer …. Hoàng Cường PageRank scores for outside data if and when they are available and appropriate for use. We propose a framework based on an exact and an approximate solution to estimate PageRank on a sub-graph. The Rank_local_idea algorithm is an exact solution. It assumes that the PageRank scores of outside data are known. We prove that the Rank_local_idea scores for data in the sub-graph converge to the true PageRank scores. Since the PageRank scores of outside data may not typically be available, we propose the Pagerank_local algorithm to estimate PageRank scores for the sub-graph. Both Rank_local_idea and Pagerank_local symbolize the set of outside data with an outside node graph EG and extend the sub-graph with links to graph EG. They also modify the PageRank transition matrix with respect to (the links to) graph EG. The Rank_local_idea and Pagerank_local framework formalizes the problem of ranking a sub-graph. It allows us to model multiple scenarios where ranking a sub- graph is important. Rank_local_idea can be used to model scenarios where PageRank scores of the global graph are known a priori and can potentially be re-used. This includes the case where the sub-graph symbolizes the data that have been updated, or the sub-graph symbolizes the data that contain all the semantic types of interest to a domain expert for a personalized or semantic ranking such as ObjectRank. Pagerank_local can be applied in general to all these problems, when we do not know the PageRank scores of outside data. We compare our approach with the stochastic complementation (SC) approach. SC builds a super-graph by carefully examining candidate outside data and adding them into the super-graph if adding this page has a significant influence on the PageRank scores of the sub-graph. Our approach in contrast models the outside data using a node graph EG, and it can be used in situations when a super-graph cannot be obtained easily. Our approach also avoids the cost of a global computation. The Pagerank_local computation is also much cheaper than SC since SC bears the high cost of constructing the super-graph. We experimentally study the effect of size and type of the sub-graphs on the accuracy of Pagerank_local. We study several types of sub-graphs including domain specific sub-graphs, topic specific sub-graphs, and sub-graphs gathered by a Breadth First Search crawler. We compare our results with SC and two baseline ranking algorithms; one was discussed in [18], and the other is local PageRank on the sub- graph (ignoring the outside data). We show that Pagerank_local has similar or superior ranking accuracy to SC and typically its runtime performance is an order of magnitude better than SC. Pagerank_local also outperforms the two baseline algorithms on ranking accuracy. Our contributions are as follows: • We define an efficient algorithm, Rank_local_idea, to estimate PageRank scores for a sub-graph when PageRank scores of the outside data are known. The random walk defined by Rank_local_idea utilizes these scores. • We prove that the Rank_local_idea scores converge to the true PageRank scores for all local data in the sub-graph, and the Rank_local_idea score for the outside node graph EG converges to the sum of true PageRank scores for all outside data. 35 Research on Node Ranking – Peer to Peer …. Hoàng Cường • When PageRank scores of outside data are not known, we define an efficient algorithm Pagerank_local. We provide important properties of Pagerank_local scores. • We show through empirical results that the Pagerank_local ranking accuracy is similar (sometimes superior) to the best competitor SC, and it overwhelmingly outperforms the runtime efficiency of SC. 4.2 Ranking - RANK_LOCAL_IDEA APPROACH Image 4.2: Eigenvalue We officially define the Rank_local_idea algorithm to estimate PageRank scores for a local graph. Our approach is passional by research on collapsing matrices with the same eigenvector. (If x is an eigenvector of the linear transformation A with eigenvalue λ, then any scalar multiple αx is also an eigenvector of A with the same eigenvalue. Similarly if more than one eigenvector shares the same eigenvalue λ, any linear combination of these eigenvectors will itself be an eigenvector with eigenvalue λ.) Rank_local_idea performs a random walk( Often, the walk is in discrete time, and indexed by the natural numbers, as in . However, some walks take their steps at random times, and in that case the position Xt is defined for the continuum of times ) on a modified local graph called the extended local graph, where an outside node of graph EG is added to the local graph. graph EG symbolizes the set of data that are not local. 36 Research on Node Ranking – Peer to Peer …. Hoàng Cường Image 4.2.2: Random walk The transition matrix probabilities of Rank_local_idea are derived from the transition matrix of PageRank for the global graph. Rank_local_idea assumes that the PageRank scores of all outside data in graph EG are known. This assumption will be relaxed in the next section where we present an approximate solution. Consider two graphs; a global graph of size N, and a local graph of size n. The local graph is a sub- graph of the global graph. The data in the local graph are called local data while data in the global graph and that are not in the local graph are called outside data. The goal is to provide the true PageRank for the local graph without running PageRank on the global graph. List the symbols used to define our algorithms. EG : outside nodes, the artificial node symbolizeing all outside nodes. G1 A sub-graph of the graph nodes with n nodes Gg The global nodes graph with N nodes. Ge The extended local graph with n + 1 nodes The Rank_local_idea algorithm There are edges between the graph and the local node edge outside knot according to the global character. However, this kind of explanation is impossible to distinguish like sees at a link instance or between the place page and between exterior data many links in the below example: 37 Research on Node Ranking – Peer to Peer …. Hoàng Cường Let Figure 4 be a global graph. Node A,B,C, and D are local data, and node X, Y and Z in the cloud are outside data. Figure 5 provides an example of adding an outside node to symbolize the exterior data The edges added from the local data to the outside node do not need the strategy to modify the primitive PageRank transition matrix to reflect each such edge to symbolize in global graph many edges. When computing the standard Pagerank algorithm on this graph, the possibility continuance from a node is proportional to the inverse of its out-degree. node C which has 3 incoming edges from the outside data is treated similarly to node D which has only 1 incoming edge from the outside data. Intuitively, however, we should expect a higher possibility of following links from the outside data to node C. Similarly, the possibility of following links from page A to graph EG is 1=3. This too is lower than the transition possibility based on the global graph. Rank_local_idea addresses this shortcoming with the following solution: The first step is to add an outside node graph EG to the sub-graph to symbolize all outside data. The second step is to construct the extended local graph Ge, the graph EG enriched graph of size n+ 1. There is an edge from graph EG to a local page in Ge if there is an edge from an outside page to that local page. The same hold for edges out of local data. Similarly, there is an edge from graph EG to graph EG if there is an edge between outside data. The next step is to define a transition matrix A_deal_matrix and a personalization vector Pideal. Let’s consider a little about the jargon “personalization vector”. At each time step, with possibility (1 −α), a surfer visiting any node will jump to a random Web Node (rather than following an outlink). The destination of the random jump is chosen according to the possibility distribution given in v. For this reason, we refer to v as the personalization vector Example of personalization vector value in Pagerank formula: 38 Research on Node Ranking – Peer to Peer …. Hoàng Cường The details will be discussed next section. Finally, a random walk is performed on Ge. The Rank_local_idea vector Rank_ideal is defined as follows: Algorithm Rank_local_idea (Gl, Gg) 1. Add outside node graph EG to G1. Image 4.2.4: (n+1) graph nodes 2. Create edge associated with the graph graph EG and get Ge. 3. Assign values to P_ideal and A_ideal. 4. Perform a random walk on the extended local graph according to Formula. Although swarming scales well to tolerate flash crowds for popular content, it is less useful for unpopular content. Peers arriving after the initial rush might find the content unavailable and need to wait for the arrival of a seed in order to complete their downloads. The seed arrival, in turn, may take long to happen, since maintaining seeds for unpopular content entails high bandwidth and administrative costs, which runs counter to the goals of publishers that value BitTorrent as a cheap alternative to a client-server approach. A strategy adopted by many publishers which significantly increases availability of unpopular content consists of bundling multiple files in a single swarm. So it’s only way as the key of ranking system – bandwith analyze. A_deal_matrix and Pideal We define an (n+1) x (n+1) transition matrix A_deal_matrix and a length (n+1) personalization vector Pideal. Let A symbolise the N x N transition matrix for PageRank on the global graph. Entry A[i,j] has the value of the inverse out-degree of page i, if there is an edge (i; j); the value is the expectation of a random walk following this edge from i. Without lack the generality, we assume the local data to be the first close n data in A and the outside data are indexed from n + 1 to N in A. Assume that the PageRank scores for all outside data are known. The values are PageRank (node n+1, node n+2, …, node N ) = { R[n + 1];R[n + 2]; … ;R[N] } 39 Research on Node Ranking – Peer to Peer …. Hoàng Cường (graph Ge có các node từ 1-> n node, và outside graph có (n+1 -> N) node) respectively. Let A_ideal is defined as follows, based on the entries in the original PageRank transition matrix A: Next we explain the elements in A_deal_matrix. These values are as follows: 1. The n x n sub-matrix at upper left is identical to the equating factors in transition matrix A for the global graph. They symbolize the possibility of transition between edges in the local graph. 2. The n x 1 sub-matrix at upper right symbolizes the possibility continuance from a local page to the node graph EG. We note that the possibility of reaching graph EG is the sum of the possibility of reaching any outside node from the local node. For local node k, the value is 3. The 1 x n sub-matrix at lower left represents to the possibility continuance from graph EG to local data. For local node k, the value is 4. The entry at the lower right corner denotes the possibility continuance from graph EG to graph EG. The last row has entries that are each a weighted sum of probabilities summed over all outside data. The weight is determined by the PageRank score of the outside node. This is a key feature of A_deal_matrix and will be discussed next. We define A_deal_matrix formally as follows: A_deal_matrix = Q1AQ2 40 Research on Node Ranking – Peer to Peer …. Hoàng Cường where Q1 is an (n+1) x N matrix and Q2 is an N x (n+1) matrix. Let Q2 be an N x (n + 1) matrix as follows: where In is an n x n identity matrix, B is an n x 1 0-matrix, C is a (N - n) x n 0- matrix, and D is a (N -n) x 1 matrix with all 1's. The effect of AQ2 on the ranking vector is to aggregate the authority continuance from local data to all outside data, which indicates the authority goes to graph EG. Let Q1 be the following (n + 1) x N matrix: where In is an n x n identity matrix, CT is an n x (N - n) 0-matrix and BT is a 1 x n 0-matrix. The matrix of interest is E, a 1 x (N - n) matrix. It considers the PageRank scores for all outside data. Recall that Sum is the sum of PageRank scores for all outside data Then, E can be expressed as follows: Let’s take an example: 41 Research on Node Ranking – Peer to Peer …. Hoàng Cường Image 4.2.5: Graph example - 6 nodes the transition Matrix Ranking value by using Pagerank-Formula: 42 Research on Node Ranking – Peer to Peer …. Hoàng Cường Q1 is an N x ( n + 1) matrix: Assume N = 6, n = 4; In matrix Q2 Matrix 43 Research on Node Ranking – Peer to Peer …. Hoàng Cường Tương tự: Giả sử 2 node outside là node 5 và node 6. N = 6, n = 4, 2 node outside. R[5] = 0.206 R[6] = 0.2862 Sum = 0.206 + 0.2862 = 0.4922 Matrix E E = ( , = (0.418529053, 0.581470947) A_deal_matrix = Q2AQ1 = 44 Research on Node Ranking – Peer to Peer …. Hoàng Cường = [ 0. 0.5 0.5 0. 0. ] [ 0 0 0 0 0 ] [ 1/3 1/3 0 0 1/3 ] [ 0. 0 0 0. 1. ] [ 0. 0 0 0.79073547 0.20926453] Xét lại công thức : Với R[5] = 0.206 R[6] = 0.2862 A[5][5] = 0 45 Research on Node Ranking – Peer to Peer …. Hoàng Cường A[5][6] =1/2 A[6][5] = 0 A[6][6] = 0 A[5][4] = 1/2 A[6][4] = 1 Sum = 0.4922 = [ 0. 0.5 0.5 0 0. = 0+0 ] [ 0 0 0 0 0 = 0 + 0 ] [ 1/3 1/3 0 0 1/3 = 1/3+0 ] [ 0. 0 0 0 1 = 1/2 + 1/2 ] [ 0=0+0 0=0+0 0=0+0 0.79073547 0.20926453 ] 0.79073547 = (0.206*0.5+0.2826)/0.4922 (đúng) 0.20926453 = (0.206*0 + 0.206*1/2 + 0.2826*0+ 0.2826*0)/0.4922 ( đúng) Và ma trận A 46 Research on Node Ranking – Peer to Peer …. Hoàng Cường Image 4.2.6: Multiplication result example The idea of multiplying the values of entries in A with the two matrices Q1 and Q2, where Q1 derived from the ranking vector for outside data, is key to the approach of A_deal_matrix. It has the effect of distributing the possibility continuance from the outside nodes, in a manner that is proportional to the importance of each of the outside data in the original PageRank vector. Recall that the personalization vector in the original PageRank is defined as a uniform vector Instead, for Rank_local_idea we define the personalization vector Pideal according to the number of outside data and total number of data in the graph. More specifically, the i-th entry of Pideal, Pideal[i] can be expressed as follows: Summary, according to the equalization: 47 Research on Node Ranking – Peer to Peer …. Hoàng Cường Convergence of Rank_local_idea Let Rank_ideal be the final ranking vector of Rank_local_idea, where the first n elements are scores for local data and the (n+1)-th element is the score for the outside node graph EG. We show that the scores of first n elements are identical to the true PageRank scores. Theorem 1: In Rank_ideal, scores for the first n data converge to the true PageRank scores. The score for the (n + 1) th element, graph EG, converges to the sum of true PageRank scores for all outside data. Proof: Let R be the true PageRank vector such that R is the converged stationary distribution for A. Let R’ = be a vector with n + 1 entries. We also know that R = . It is obvious that R’[i] = R[i] for first n elements and We will show that R’ is the Rank_local_idea vector. We know that Next consider a left multiply with to obtain the following: Since A_deal_matrix is stochastic and Markov Chain defined by Rank_local_idea is irreducible and aperiodic, there is a unique stationary distribution for A_deal_matrix. Therefore, R0 = Rank_ideal. The Rank_local_idea algorithm addresses several applications. One is where some sub-graph of the Web graph has been updated. A second case is when the personalized authority transfer is limited to the sub-graph. In these cases, the knowledge of PageRank scores can be potentially relied on to estimate new ranking scores. 48 Research on Node Ranking – Peer to Peer …. Hoàng Cường THE PAGERANK_LOCAL ALGORITHM Unlike the previous scenario where PageRank values for outside data are known, we now consider scenarios where the PageRank scores are not known a priori. To cover this state, our framework has an approximate solution Pagerank_local. The key difference is that for Pagerank_local, the algorithm is not able to differentiate the (previously weighted) contribution of authority from each individual outside page (since these PageRank scores are unknown). Instead, Pagerank_local will consider the authority continuance from outside data assuming they are equally important. We analyze the L1 distance between Rank_local_idea scores and Pagerank_local scores of the sub-graph and reveal that it is within a constant factor of L1 distance between the true PageRank scores and uniform scores of the outside data. We will show through experiments that Pagerank_local is a good approximation. A. The Pagerank_local algorithm The Pagerank_local vector Rapprox is defined as follows Pagerank_local adopts the same personalization vector as Rank_local_idea. It however, defines its own transition matrix Matrix_A_approx. B. Matrix_A_approx definition Matrix_A_approx is an (n + 1) x (n + 1) matrix. It is defined as follows: For example( the above graph) The matrix A 49 Research on Node Ranking – Peer to Peer …. Hoàng Cường New matrix Approx: [ 0 0.5 0.5 0 0 ] [ 0 0 0 0 0 ] = [1/3 1/3 0 0 1/3 ] [ 0 0 0 0.5 1 ] [ 0 0 0 0.75 1/4] Calculating local-pagerank Choose alpha = 0.85 ( according to Pagerank ) At iterator 0: Rapprox = [1/6, 1/6, 1/6, 1/6, 1/3] Pideal = [1/6, 1/6, 1/6, 1/6, 1/3] At iterator 1: Rapprox = 0.85* *R approx + 0.25* [0.25, 0.25, 0.25, 0.25, 0.5] … Program results after 10 iterators: 50 Research on Node Ranking – Peer to Peer …. Hoàng Cường Image 4.2.7 : Multiplication result example at iterators We Can be clearly seen: Order ( node 1-> 4) And order Are the same. Matrix_A_approx is different from A_deal_matrix in the last row, since Rank_local_idea does not utilize knowledge about PageRank scores of outside data in the first n rows. For the first n entries in the last row, the value symbolizes the (average) possibility continuance accumulated from (N - n) outside data to each local page. The last entry in this n-th row of the matrix is the (average) possibility continuance from outside data to other outside data. Similar to A_deal_matrix = Q1AQ2, Matrix_A_approx can be formally defined as Matrix_A_approx = Q’ 1AQ2, where the vector E is replaced by a vector Eapprox in Q’ 51 Research on Node Ranking – Peer to Peer …. Hoàng Cường In approx, the values at the last row are as follows: 1) For the first n values, (1 <= k <= n), the possibility from graph EG to a local page k is assigned the summation of continuance from all outside data to k, divided by the number of outside data. For local page k, it is 2) For the (n+1)-th value, the possibility for the self-loop edge is determined by the total authority continuance among outside data, divided by the number of outside data. Given the global graph example in Figure 4, the probabilities assigned by Matrix_A_approx are shown in Figure 6. We provide some examples of edge weight calculation following these rules. According to rule 1, the authority continuance on edge AB, AC, CB, BD, CD, DA are the outdegree inverse. Since A points to page X, Z, the authority continuance on edge (A;graph EG) is 1/2. The authority continuance on edge (graph EG;C) The self-loop edge authority continuance will be An advantageous quality about Pagerank_local is that it is suitable to adopt precomputation for various sub-graphs. With the same global graph, Approx can be figured out easily from the difference between the local values and the global values. This is especially beneficial for applications where there are multiple sub-graphs. 52 Research on Node Ranking – Peer to Peer …. Hoàng Cường 53 Pagerank_local scores converge to a unique vector R approx. There are two reasons. First, the transition matrix is a column stochastic matrix, as the sum of each column is 1. Second, since we complement the random walk with jumps from dangling data, the Markov Chain we defined is irreducible and aperiodic. RANK satisfies the two conditions of being irreducible and aperiodic of the E rgodic Theorem for Markov chains. Next we will vestigate how close is Rapprox to Rank_ideal, which we have shown to be the true ageRank scores for local data. in P Research on Node Ranking – Peer to Peer …. Hoàng Cường 54 Chapter 5 Evaluation The results show that our approach is robust to ranking, and converge fast, feasible. I evaluate our ranking technique in a simulator using real document sets from the small program simulation – peer to peer networks. As below is some results. Table 5.1: the number of iterators which local_pagerank converges It’s Depending on the number of nodes need to rank. But is limited. Research on Node Ranking – Peer to Peer …. Hoàng Cường 55 Chapter 6. Related Work raditional centralized pproaches, and search strategies over structured P2P networks. Our paper builds on prior work on efficient lookup and storage schemes. We assume the existence of a lookup protocol provided by the underlying system. Such lookup protocols have been studied in detail both in a structured setting (e.g., Chord, Pastry, Kademlia, Viceroy and Skipnet). Providing a useful search facility has been an important area of research. Prior work in searching can broadly be classified into two categories: t a Research on Node Ranking – Peer to Peer …. Hoàng Cường 56 Chapter 7. Conclusions and Future Work h s based on asynchronous iteration • ur hile our ys oaches. We plan to investigate these approaches in our future work. 7.1 Contributions In this paper, we have presented a distributed algorithm for ranking searc results. Our solution demonstrates that distributed ranking is feasible with little network overhead. Some of contributions: • Distributed computation of Pagerank o Application in P2P systems o Application on Internet scale Very large scale asynchronous iteration computation There are several areas worthy of further investigation. Performance could potentially be improved by mechanisms such as relevance feedback and caching. O analysis could be extended to account for popularity-based replication. W s tem provides a first solution to distributed ranking, other appr Research on Node Ranking – Peer to Peer …. Hoàng Cường 57 References 1. Google. Google search engine. Sergey Brin, Lawrence Page. “The Anatomy of a Large-Scale Hypertextual 2. Web Search Engine,” In Proc. 7th Int. World Wide Web and val. R etrieval, pp. 334-342. 4. Wikipedia. Conf., 1998. 3. Amy N. Langville & Carl D. Meyer (2006) Google's PageRank Beyond: The Science of Search Engine Rankings information retrie esearch and Development in Information R

Các file đính kèm theo tài liệu này:

  • pdfKLTN_HoangCuong_06020044.pdf
Tài liệu liên quan