The PageRank algorithm

As the internet rapidly grew in the 1990s, it became increasingly difficult to find the right webpage or nugget of information. The internet was missing a homepage that could act as a portal to the rest of the web.

The simple idea

Imagine there is a hypothetical random surfer of the internet (the so-called "random surfer"). Upon visiting a page on the internet, the surfer follows one of the outgoing links on that page, chosen uniformly at random. Here, one should picture the internet as a graph where a node represents a webpage and a directed edge joins node $i$ to node $j$ if page $i$ has an outgoing link to page $j$.

Let $N$ be the number of webpages on the internet and let $P$ be the $N\times N$ stochastic matrix (nonnegative entries and columns summing to $1$) such that $P_{ij}$ is the probability of transitioning from page $j$ to page $i$. We want to find an eigenvector $\underline{x}$ of $P$ corresponding to the eigenvalue $1$, so that $$ P\underline{x} = \underline{x}. $$ To interpret the vector $\underline{x}$ as a ranking of webpages, take page $j$ to be more important than page $i$ if $x_j>x_i$. Since all the columns of $P$ sum to $1$, we know that $P^T$ has an eigenvalue of $1$ with eigenvector $(1,\ldots,1)^T$ and hence, because $P$ and $P^T$ have the same eigenvalues, a solution to $P\underline{x} = \underline{x}$ exists. The trouble is that there can be many such vectors $\underline{x}$ and the entries of $\underline{x}$ are not necessarily nonnegative. There can be many different rankings, and $x_j = 0$ is possible, leaving a webpage unranked. For example, $$ P = \begin{bmatrix} 0 & 1 & 0 & 0\\ 1 & 0 & 0 & 0\\0 & 0 & 0 & 1\\0 & 0 & 1 & 0\\ \end{bmatrix} $$ has eigenvalues $1$, $1$, $-1$, and $-1$ and leads to a nonunique ranking:

In [1]:
P = [ 0. 1. 0. 0. ; 1. 0. 0. 0.; 0. 0. 0. 1. ; 0. 0. 1. 0.]
(Λ,V) = eig(P')    # here P' = P, since this P is symmetric
V[:,3], V[:,4]     # two independent eigenvectors for the eigenvalue 1
Out[1]:
([0.707107,0.707107,0.0,0.0],[0.0,0.0,0.707107,0.707107])

We need a way to make the page rankings unique. That's the main idea behind Google's PageRank algorithm.

Google's idea

One expects that Larry Page knew the following result:

Theorem: If $P$ is a stochastic matrix with strictly positive entries, then $1$ is an eigenvalue, all other eigenvalues have modulus $<1$, $\dim {\rm ker}(P-I) = 1$, and the eigenvector with eigenvalue $1$ can be scaled to have positive entries.

Proof: Since $P$ is a stochastic matrix, $P^T\underline{e} = \underline{e}$, where $\underline{e} = (1,\ldots,1)^T$. Therefore, $1$ is an eigenvalue of $P^T$ and hence also of $P$, since $P$ and $P^T$ have the same characteristic polynomial.

Suppose that there is an eigenvalue $\lambda$ of $P$ with eigenvector $\underline{v}$ such that $|\lambda|>1$. Then, $\|P^k\underline{v}\|_1 = |\lambda|^k\|\underline{v}\|_1$ grows without bound as $k\rightarrow \infty$. On the other hand, $P^k$ is a stochastic matrix, so each of its columns sums to $1$ and hence $\|P^k\underline{v}\|_1 \leq \|\underline{v}\|_1$. This is a contradiction, and therefore all eigenvalues must have modulus $\leq 1$.

The rest of this theorem follows directly from the Perron-Frobenius theorem.
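
As a quick numerical sanity check of the theorem, here is a minimal sketch (the matrix is just a small random stochastic matrix with strictly positive entries, chosen purely for illustration): the eigenvalue of largest modulus should be $1$ and its eigenvector, suitably scaled, should have positive entries.

srand(1234)
N = 6
P = rand(N,N) + .01          # strictly positive entries
P = P./sum(P,1)              # normalize columns to sum to 1
(Λ,V) = eig(P)
k = indmax(abs(Λ))           # the eigenvalue of largest modulus (= 1)
x = real(V[:,k])
println(Λ[k], "  ", all(x/sum(x) .> 0))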

Motivated by the above theorem, one needs to modify the situation from the previous section so that $P$ is a stochastic matrix with strictly positive entries. There are many ways to do this, but Google's approach is the following (assuming there are no dangling webpages). Imagine there is a slightly different hypothetical random surfer of the internet: Upon visiting a page on the internet, the surfer either follows an outgoing link chosen at random (with probability $\alpha$) or is magically teleported to a random webpage (with probability $1-\alpha$). Now, the equation to solve becomes $$ (\alpha P + (1-\alpha)ve^T)\underline{x} = \underline{x}, $$ where $e = (1,\ldots,1)^T$ and $v$ is the teleporting vector (usually $v = (1,\ldots,1)^T/N$). The stochastic matrix $\alpha P + (1-\alpha)ve^T$ has strictly positive entries, so there is a unique solution vector $\underline{x}$ with positive entries, normalized so that $e^T\underline{x} = 1$. Now, there is a unique ranking vector. For example,

In [2]:
# The eigenvalues of a stochastic matrix with strictly positive entries ensure there is a unique page ranking:
α = .85
N = 10
P = rand(N,N)
P = P./sum(P,1)                        # random column-stochastic matrix
ev = ones(N)
abs(eigvals( α*P + (1-α)ev*ev'/N ))    # exactly one eigenvalue has modulus 1
Out[2]:
10-element Array{Float64,1}:
 1.0      
 0.152983 
 0.152983 
 0.131925 
 0.131925 
 0.105441 
 0.105441 
 0.116122 
 0.0473375
 0.0453916

How to compute the ranking vector?

Most researchers treat the problem of computing the vector $\underline{x}$ such that $$ (\alpha P + (1-\alpha)ve^T)\underline{x} = \underline{x} $$ as an eigenvalue problem. However, since the eigenvalue of $1$ is known, we can also treat it as the following linear system to solve: $$ (I - \alpha P)\underline{x} = (1-\alpha)v, $$ where we used the fact that $e^T\underline{x} = 1$. Page's suggestion is equivalent to computing the vector $\underline{x}$ via the iteration: $$ \underline{x}^{(k+1)} = \alpha P \underline{x}^{(k)} + (1-\alpha)v, \qquad \underline{x}^{(0)} = (1,\ldots,1)^T/N. $$ This iteration is the power iteration (if thought of as the eigenvalue problem $(\alpha P + (1-\alpha)ve^T)\underline{x} = \underline{x}$) and Richardson's iteration (if thought of as the linear system $(I - \alpha P)\underline{x} = (1-\alpha)v$).
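
To make the equivalence concrete, here is a minimal sketch (on a small random stochastic matrix, for illustration only) checking that a direct solve of the linear system agrees with the iteration:

srand(1234)
N = 10
α = .85
P = rand(N,N)
P = P./sum(P,1)                     # random column-stochastic matrix
v = ones(N)/N
xdirect = (eye(N) - α*P)\((1-α)*v)  # the linear-system viewpoint
x = ones(N)/N
for k = 1:200
    x = α*(P*x) + (1-α)*v           # the iteration above
end
println(norm(x - xdirect))          # should be near machine precision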

Let $0<\alpha<1$ and consider the error $\underline{x}-\underline{x}^{(k+1)}$:

$$ \underline{x} - \underline{x}^{(k+1)} = \alpha P \underline{x} + (1-\alpha)v - \left(\alpha P \underline{x}^{(k)} + (1-\alpha)v\right) = \alpha P\left(\underline{x} - \underline{x}^{(k)}\right) = \cdots = \alpha^{k+1} P^{k+1}\left(\underline{x} - \underline{x}^{(0)}\right). $$

Therefore, since the induced $1$-norm of a stochastic matrix satisfies $\|P^{k+1}\|_1 = 1$, we find $$ \|\underline{x} - \underline{x}^{(k+1)}\|_1 \leq \alpha^{k+1}\|\underline{x} - \underline{x}^{(0)}\|_1, $$ and the iteration for finding $\underline{x}$ converges geometrically as $k\rightarrow\infty$.
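
This convergence rate is easy to observe numerically. In the following minimal sketch (again with a small random stochastic matrix, for illustration), the $1$-norm of the error shrinks by a factor of at most $\alpha$ per iteration:

srand(1234)
N = 10
α = .85
P = rand(N,N)
P = P./sum(P,1)
v = ones(N)/N
xexact = (eye(N) - α*P)\((1-α)*v)
x = ones(N)/N
for k = 1:5
    errOld = norm(x - xexact, 1)
    x = α*(P*x) + (1-α)*v
    println(norm(x - xexact, 1)/errOld)   # ratios are at most α = 0.85
end

The implementation is simple too.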

In [3]:
function PageRank( P::Matrix, α::Number, v::Vector )
    # Rank pages in the web, given by P, with a teleporting parameter of α.
    n = size(P,1)
    x = ones(n)/n
    # Iterate until the error bound α^k falls below machine precision:
    maxIter = ceil(Int,log(eps(Float64))/log(α))
    for k = 1:maxIter
        x = α*(P*x) + (1-α)*v
    end
    return x
end
Out[3]:
PageRank (generic function with 1 method)
In [4]:
α = .85 
N = 10
P = rand(N,N)
P = P./sum(P,1)
v = ones(N)/N
x = PageRank(P, α, v)
Out[4]:
10-element Array{Float64,1}:
 0.0783197
 0.107107 
 0.0959074
 0.105894 
 0.0910374
 0.109549 
 0.104799 
 0.0962829
 0.07653  
 0.134574 

Dealing with dangling webpages

A dangling webpage is one without any outgoing links. If the surfer lands on such a page, then the surfer is magically teleported to another webpage on the internet. We write $$ (\alpha P + \alpha D + (1-\alpha)ve^T)\underline{x} = \underline{x}, \qquad v = (1,\ldots,1)^T/N, $$ where $D = ve_{\text{dangling}}^T$ and $e_{\text{dangling}}$ is the vector whose $j$th entry is $1$ if the $j$th webpage is dangling and $0$ otherwise. We modify our code below to deal with dangling webpages.

In [5]:
function PageRankDangling( P::SparseMatrixCSC, α::Float64, v::Array{Float64,1} )
    # Rank pages in the web, given by P, with a teleporting parameter of α. The rank-1 matrix
    # D = v*eDangling' treats dangling webpages.
    n = size(P,1)
    s = sum(P,1)
    eDangling = ( s[:] .== 0 )    # flag dangling webpages (zero columns of P)
    s[s .== 0] = 1                # avoid 0/0 when normalizing the zero columns
    P = P./s                      # normalize columns to sum to 1
    x = ones(n)/n
    # Richardson's iteration to solve (αP + αD + (1-α)ve^T)x = x:
    maxIter = ceil(Int,log(eps(Float64))/log(α))
    for k = 1:maxIter
        x = α*(P*x) + α*(dot(eDangling,x)*v) + (1-α)*v
    end
    return x
end
Out[5]:
PageRankDangling (generic function with 1 method)

Here, we do an example with 5,000 webpages, where each webpage has on average 10 outgoing links. (We need to change the implementation slightly to handle a network with a larger number of pages.)

In [6]:
srand(1234)
n = 5000
α = .85
P = sprand(Bool, n, n, 10/n) # adjacency of outgoing links
v = ones(n)/n
@time x = PageRankDangling(P, α, v)
s = sortperm( x )
println("Top-ranking webpages: ",flipdim(s[end-9:end],1))
  4.132350 seconds (225.26 k allocations: 283.244 MB, 3.50% gc time)

A small improvement

We note that we can store $P$ as $P = GS$, where $G$ is a $0$-$1$ matrix recording the links and $S$ is the diagonal matrix with $S_{jj} = 1/s_j$, where $s_j$ is the number of outgoing links on page $j$. While computing $Px$ costs ${\rm nnz}(P)$ additions and ${\rm nnz}(P)$ multiplications, computing $G(Sx)$ costs $N$ multiplications (for $Sx$) and ${\rm nnz}(P)$ additions (for $G(Sx)$).
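
This is easy to check on a small example. The following minimal sketch (with a small random $0$-$1$ matrix, for illustration only) verifies that G*(x./s), which plays the role of $G(Sx)$, matches the product of the explicitly column-normalized matrix with $x$:

srand(1234)
n = 8
G = sprand(Bool, n, n, 0.5)    # random 0-1 link matrix
s = sum(G,1)[:]                # number of outgoing links per page
s[s .== 0] = 1                 # guard the empty (dangling) columns
x = rand(n)
P = full(G)./s'                # explicitly column-normalized dense matrix
println(norm(P*x - G*(x./s))) # should be near machine precision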

In [7]:
function PageRankDangling_improvement( P::SparseMatrixCSC{Bool,Int64}, α::Float64, v::Array{Float64,1} )
    # Rank pages in the web, given by P, with a teleporting parameter of α. Dangling webpages are
    # treated as before, but P is kept as a 0-1 matrix and normalized on the fly via x./s.
    n = size(P,1)
    s = sum(P,1)[:]
    eDangling = ( s .== 0 )       # flag dangling webpages (zero columns of P)
    s[s .== 0] = 1                # guard the division below (the zero columns never touch x./s)
    x = ones(n)/n
    # Richardson's iteration to solve (αP + αD + (1-α)ve^T)x = x; the two teleporting terms
    # α*dot(eDangling,x)*v and (1-α)*v are grouped into a single coefficient of v:
    for k = 1 : ceil(Int,log(1e-6)/log(α))
       x = α*(P*(x./s)) + (α*(dot(eDangling,x)-1) + 1)*v
    end
    return x
end
Out[7]:
PageRankDangling_improvement (generic function with 1 method)
In [8]:
srand(1234)
n = 100000
α = .85
P = sprand(Bool, n, n, 10/n)   # adjacency of outgoing links
v = ones(n)/n
@time x = PageRankDangling_improvement(P, α, v)
s = sortperm( x[:] )
println("Top-ranking webpages: ",flipdim(s[end-9:end],1))
  0.565781 seconds (98.41 k allocations: 335.296 MB, 3.92% gc time)
Top-ranking webpages: [64501,13075,3469,34323,31945,7853,67009,93584,31198,97420]

The man behind the beginnings of Google's PageRank algorithm

Larry Page co-founded Google Inc. with Sergey Brin in 1998. His PhD research at Stanford was on understanding the structure of the web, and while studying it he came up with the beginnings of Google's PageRank algorithm.