A web search engine is a typical distributed system on the Internet, designed to search for information on the World Wide Web. Search results are generally presented as a list and are often called hits. PageRank is a well-known web graph ranking algorithm that helps users sort hits by their importance.
PageRank assigns a numerical value to each element of a hyperlinked set of webpages, reflecting the probability that a random surfer will visit that page. The computation can be modeled as a Markov chain, which requires iterative calculation to converge. Each iteration computes a new access probability for every webpage from the values of the previous iteration. The process repeats until either the number of iterations exceeds a predefined maximum, or the Euclidean distance between the rank vectors of two successive iterations falls below a predefined threshold that controls the accuracy of the output results.
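For reference, the per-page update in each iteration typically takes the standard form PR(p) = (1 - d)/N + d * sum_{q -> p} PR(q)/L(q), where the sum runs over all pages q that link to p, d is the damping factor (commonly 0.85), N is the total number of pages, and L(q) is the number of out-links on page q.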
Sequential:
I started with a sequential algorithm that I ran on my laptop. For larger inputs, such as web graphs containing more than a million webpages, the PageRank application needs to run in parallel so that it can aggregate the computing power of multiple compute nodes. A minimal sketch of the sequential version is shown below.
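This sketch assumes an adjacency-list representation of the graph; the constants (damping factor, threshold, iteration cap) and class name are illustrative, not the exact code:

```java
import java.util.*;

public class SequentialPageRank {
    static final double DAMPING = 0.85;    // assumed damping factor
    static final double EPSILON = 1e-6;    // assumed convergence threshold
    static final int MAX_ITERATIONS = 100; // assumed iteration cap

    // graph.get(p) holds the list of pages that p links to
    public static double[] pageRank(List<List<Integer>> graph) {
        int n = graph.size();
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);        // uniform initial distribution

        for (int iter = 0; iter < MAX_ITERATIONS; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - DAMPING) / n);

            // distribute each page's rank evenly across its out-links
            for (int p = 0; p < n; p++) {
                List<Integer> links = graph.get(p);
                if (links.isEmpty()) continue; // dangling pages handled naively
                double share = DAMPING * rank[p] / links.size();
                for (int q : links) next[q] += share;
            }

            // Euclidean distance between successive rank vectors
            double dist = 0;
            for (int p = 0; p < n; p++) {
                double diff = next[p] - rank[p];
                dist += diff * diff;
            }
            rank = next;
            if (Math.sqrt(dist) < EPSILON) break; // converged early
        }
        return rank;
    }
}
```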
MPJ:
MPJ is a Java interface to the MPI API. I used it to simulate distributed PageRank across 10 machines in parallel, where the ranks were calculated remotely and aggregated at the master node.
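A minimal sketch of this scheme, using the mpiJava-style API that MPJ Express provides, is below. The toy graph, the round-robin assignment of pages to processes, and the use of Allreduce (so every process receives the summed ranks for the next iteration, rather than a reduce to the master alone) are illustrative assumptions:

```java
import mpi.MPI;

public class MpjPageRank {
    static final double D = 0.85;   // assumed damping factor

    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int me = MPI.COMM_WORLD.Rank();   // this process's id
        int np = MPI.COMM_WORLD.Size();   // total number of processes

        // toy 4-page graph for illustration: links[p] = pages that p points to
        int[][] links = { {1, 2}, {2}, {0}, {0, 2} };
        int n = links.length;

        double[] ranks = new double[n];
        java.util.Arrays.fill(ranks, 1.0 / n);   // uniform initial distribution

        for (int iter = 0; iter < 30; iter++) {
            // each process computes contributions only for its share of pages
            double[] partial = new double[n];
            for (int p = me; p < n; p += np) {
                double share = D * ranks[p] / links[p].length;
                for (int q : links[p]) partial[q] += share;
            }

            // sum everyone's partial contributions; all processes get the result
            double[] summed = new double[n];
            MPI.COMM_WORLD.Allreduce(partial, 0, summed, 0, n, MPI.DOUBLE, MPI.SUM);
            for (int p = 0; p < n; p++) ranks[p] = (1 - D) / n + summed[p];
        }

        if (me == 0) System.out.println(java.util.Arrays.toString(ranks));
        MPI.Finalize();
    }
}
```

With MPJ Express, this would be launched across 10 processes with something like `mpjrun.sh -np 10 MpjPageRank`.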
Hadoop:
First, the web graph and initial rank values are generated in the local file system, and this data is then copied to HDFS. Each iteration loads the data from HDFS, performs the rank computation on it, and writes the result back to HDFS, thereby updating the ranks in each iteration.
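A minimal sketch of the mapper and reducer for one such iteration is below, assuming each HDFS line looks like `pageId<TAB>rank<TAB>comma-separated-out-links`. The line format, class names, the LINKS/RANK tags, and the un-normalized (1 - d) + d * sum update are illustrative assumptions; the driver that submits one Job per iteration, feeding each iteration's output directory to the next, is omitted:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankIteration {

    public static class RankMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            String page = parts[0];
            double rank = Double.parseDouble(parts[1]);
            String linkStr = parts.length > 2 ? parts[2] : "";

            // pass the adjacency list through so the reducer can rebuild the line
            ctx.write(new Text(page), new Text("LINKS\t" + linkStr));

            // send each out-link an equal share of this page's current rank
            if (!linkStr.isEmpty()) {
                String[] links = linkStr.split(",");
                for (String link : links) {
                    ctx.write(new Text(link), new Text("RANK\t" + (rank / links.length)));
                }
            }
        }
    }

    public static class RankReducer
            extends Reducer<Text, Text, Text, Text> {
        private static final double D = 0.85;   // assumed damping factor

        @Override
        protected void reduce(Text page, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String links = "";
            double sum = 0;
            for (Text v : values) {
                String[] parts = v.toString().split("\t", 2);
                if (parts[0].equals("LINKS")) links = parts[1];
                else sum += Double.parseDouble(parts[1]);
            }
            // un-normalized PageRank update; output line matches the input format
            double rank = (1 - D) + D * sum;
            ctx.write(page, new Text(rank + "\t" + links));
        }
    }
}
```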
Team:
- Rohit Nair
- Abhijit Karanjkar