# The PageRank algorithm

As the internet rapidly grew in the 1990s, it became increasingly difficult to find the right webpage or nugget of information. The internet was missing a homepage that could serve as a portal to the rest of the web.

## The simple idea

Imagine a hypothetical random surfer of the internet. Upon visiting a webpage, the surfer follows, uniformly at random, one of the outgoing links on that page. Here, one should picture the internet as a graph in which each node represents a webpage and a directed edge joins node $i$ to node $j$ if page $i$ has an outgoing link to page $j$.

Let $N$ be the number of webpages on the internet and let $P$ be the $N\times N$ stochastic matrix (nonnegative entries whose columns each sum to $1$) such that $P_{ij}$ is the probability of transitioning from page $j$ to page $i$. We want to find an eigenvector $\underline{x}$ corresponding to the eigenvalue $1$, so that $$P\underline{x} = \underline{x}.$$ To interpret the vector $\underline{x}$ as a ranking of webpages, take page $j$ to be more important than page $i$ if $x_j>x_i$. Since every column of $P$ sums to $1$, we know that $P^T$ has eigenvalue $1$ with eigenvector $(1,\ldots,1)^T/N$; since $P$ and $P^T$ have the same eigenvalues, a solution to $P\underline{x} = \underline{x}$ exists. The trouble is that there can be many such vectors $\underline{x}$, and the entries of $\underline{x}$ are not necessarily nonnegative: there can be many different rankings, and $x_j = 0$ is possible, so that a webpage is left unranked. For example, $$P = \begin{bmatrix} 0 & 1 & 0 & 0\\ 1 & 0 & 0 & 0\\0 & 0 & 0 & 1\\0 & 0 & 1 & 0\\ \end{bmatrix}$$ has eigenvalues $1$, $1$, $-1$, and $-1$ and leads to a nonunique ranking:

In [1]:
P = [ 0. 1. 0. 0. ; 1. 0. 0. 0.; 0. 0. 0. 1. ; 0. 0. 1. 0.]
(Λ,V) = eig(P')
V[:,3], V[:,4]

Out[1]:
([0.707107,0.707107,0.0,0.0],[0.0,0.0,0.707107,0.707107])

We need a way to make the page rankings unique. That's the main idea behind Google's PageRank algorithm.

One expects that Larry Page knew the following result:

Theorem: If $P$ is a stochastic matrix with strictly positive entries, then $1$ is an eigenvalue of $P$, all other eigenvalues have modulus $<1$, $\dim{\rm ker}(P-I) = 1$, and the eigenvector with eigenvalue $1$ can be chosen with strictly positive entries.

Proof: Since $P$ is a stochastic matrix, $P^T\underline{e} = \underline{e}$, where $\underline{e} = (1,\ldots,1)^T$. Therefore, $1$ is an eigenvalue of $P^T$ and hence, also $P$.

Suppose that there is an eigenvalue $\lambda$ of $P$ with eigenvector $\underline{v}$ such that $|\lambda|>1$, and normalize $\underline{v}$ so that its largest entry has modulus $1$. Then, $P^k\underline{v} = \lambda^k\underline{v}$ has entries of exponentially growing modulus as $k\rightarrow \infty$. But $P^k$ is a stochastic matrix, so every entry of $P^k$ lies in $[0,1]$ and each entry of $P^k\underline{v}$ is bounded in modulus by $N$, a contradiction. Therefore, all eigenvalues must have modulus $\leq 1$.

The rest of this theorem follows directly from the Perron-Frobenius theorem.

Motivated by the above theorem, one needs to modify the setup from the previous section so that $P$ becomes a stochastic matrix with strictly positive entries. There are many ways to do this, but Google's approach is the following (assuming there are no dangling webpages). Imagine a slightly different hypothetical random surfer of the internet: upon visiting a page, the surfer either follows an outgoing link chosen at random (with probability $\alpha$) or is magically teleported to a random webpage (with probability $1-\alpha$). Now, the equation to solve becomes $$(\alpha P + (1-\alpha)ve^T)\underline{x} = \underline{x},$$ where $e = (1,\ldots,1)^T$ and $v$ is the teleporting vector (usually $v = (1,\ldots,1)^T/N$). The stochastic matrix $\alpha P + (1-\alpha)ve^T$ has strictly positive entries, so there is a unique solution vector $\underline{x}$ with positive entries. Now, there is a unique ranking vector. For example,

In [2]:
# The eigenvalues of a stochastic matrix with strictly positive entries ensure there is a unique page ranking:
α = .85
N = 10
P = rand(N,N)
P = P./sum(P,1)
ev = ones(N)
abs(eigvals( α*P + (1-α)ev*ev'/N ))

Out[2]:
10-element Array{Float64,1}:
1.0
0.152983
0.152983
0.131925
0.131925
0.105441
0.105441
0.116122
0.0473375
0.0453916

## How to compute the ranking vector?

Most researchers treat the problem of computing the vector $\underline{x}$ such that $$(\alpha P + (1-\alpha)ve^T)\underline{x} = \underline{x}$$ as an eigenvalue problem. However, since the eigenvalue $1$ is known, we can also treat it as the linear system $$(I - \alpha P)\underline{x} = (1-\alpha)v,$$ where we used the fact that $e^T\underline{x} = 1$. Page's suggestion is equivalent to computing the vector $\underline{x}$ via the iteration $$\underline{x}^{(k+1)} = \alpha P \underline{x}^{(k)} + (1-\alpha)v, \qquad \underline{x}^{(0)} = (1,\ldots,1)^T/N.$$ This is the power iteration (if thought of as the eigenvalue problem $(\alpha P + (1-\alpha)ve^T)\underline{x} = \underline{x}$) and Richardson's iteration (if thought of as the linear system $(I - \alpha P)\underline{x} = (1-\alpha)v$).
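The equivalence of the two views is easy to check numerically. The following is a standalone NumPy sketch (the notebook's cells are Julia; this Python snippet, with a randomly generated stand-in for a web matrix, just verifies the algebra):

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha = 8, 0.85

# Random column-stochastic matrix and uniform teleporting vector.
P = rng.random((N, N))
P /= P.sum(axis=0)
v = np.ones(N) / N

# View 1: solve the linear system (I - alpha*P) x = (1 - alpha) v directly.
x_direct = np.linalg.solve(np.eye(N) - alpha * P, (1 - alpha) * v)

# View 2: run the power/Richardson iteration from the uniform starting vector.
x = np.ones(N) / N
for _ in range(200):
    x = alpha * (P @ x) + (1 - alpha) * v

assert np.allclose(x, x_direct)   # both views give the same ranking vector
```

Note that summing the linear system confirms $e^T\underline{x} = 1$: the solution automatically has entries summing to $1$.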

Let $0<\alpha<1$ and consider the error $\underline{x}-\underline{x}^{(k+1)}$:

$$\underline{x} - \underline{x}^{(k+1)} = \alpha P \underline{x} + (1-\alpha)v - \left(\alpha P \underline{x}^{(k)} + (1-\alpha)v\right) = \alpha P\left(\underline{x} - \underline{x}^{(k)}\right) = \cdots = \alpha^{k+1} P^{k+1}\left(\underline{x} - \underline{x}^{(0)}\right).$$

Therefore, since $\|P\|_1 = 1$, the iteration for finding $\underline{x}$ converges geometrically (with rate $\alpha$) as $k\rightarrow\infty$. The implementation is simple too.

In [3]:
function PageRank( P::Matrix, α::Number, v::Vector )
    # Rank pages in the web, given by P, with a teleporting parameter of α.
    n = size(P,1)
    x = ones(n)/n
    maxIter = ceil(Int,log(eps(Float64))/log(α))
    for k = 1:maxIter
        x = α*(P*x) + (1-α)*v
    end
    return x
end

Out[3]:
PageRank (generic function with 1 method)
In [4]:
α = .85
N = 10
P = rand(N,N)
P = P./sum(P,1)
v = ones(N)/N
x = PageRank(P, α, v)

Out[4]:
10-element Array{Float64,1}:
0.0783197
0.107107
0.0959074
0.105894
0.0910374
0.109549
0.104799
0.0962829
0.07653
0.134574 

## Dealing with dangling webpages

A dangling webpage is one without any outgoing links. If the surfer lands on such a page, then the surfer is magically teleported to a random webpage on the internet. We write $$(\alpha P + \alpha D + (1-\alpha)ve^T)\underline{x} = \underline{x}, \qquad v = (1,\ldots,1)^T/N,$$ where $D = ve_{\text{dangling}}^T$ and $e_{\text{dangling}}$ is the vector whose $j$th entry is $1$ if the $j$th webpage is dangling and $0$ otherwise. We modify our code below to deal with dangling webpages.

In [5]:
function PageRankDangling( P::SparseMatrixCSC, α::Float64, v::Array{Float64,1} )
    # Rank pages in the web, given by P, with a teleporting parameter of α.
    # The rank-one matrix D = v*eDangling' treats dangling webpages.
    n = size(P,1)
    s = sum(P,1)
    eDangling = ( s[:] .== 0 )
    s[ s .== 0 ] = 1          # avoid dividing the all-zero (dangling) columns by zero
    P = P./s
    x = ones(n)/n
    # Richardson's iteration to solve (αP + αD + (1-α)ve^T)x = x:
    maxIter = ceil(Int,log(eps(Float64))/log(α))
    for k = 1:maxIter
        x = α*(P*x) + α*(dot(eDangling,x)*v) + (1-α)*v
    end
    return x
end

Out[5]:
PageRankDangling (generic function with 1 method)
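It is worth checking that the correction $D = ve_{\text{dangling}}^T$ really does restore a stochastic matrix with strictly positive entries. A standalone NumPy sketch (the small random graph and names here are illustrative, not the notebook's data):

```python
import numpy as np

rng = np.random.default_rng(2)
N, alpha = 6, 0.85
v = np.ones(N) / N

# Random 0-1 link matrix with page 0 made dangling (no outgoing links).
P = (rng.random((N, N)) < 0.4).astype(float)
P[:, 0] = 0.0

col_sums = P.sum(axis=0)
e_dangling = (col_sums == 0).astype(float)
P = P / np.where(col_sums == 0, 1, col_sums)   # normalize non-dangling columns

# D = v e_dangling^T patches the dangling columns so they sum to 1.
D = np.outer(v, e_dangling)
M = alpha * (P + D) + (1 - alpha) * np.outer(v, np.ones(N))

assert np.allclose(M.sum(axis=0), 1)   # M is column-stochastic...
assert (M > 0).all()                   # ...with strictly positive entries
```

By the theorem above, $M$ therefore has a unique positive ranking vector.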

Here, we do an example with 5,000 webpages, where each webpage has on average 10 outgoing links. (We need to change the implementation slightly to do a network with a larger number of pages.)

In [6]:
using PyPlot
srand(1234)
n = 5000
α = .85
v = ones(n)/n
P = sprand(n, n, 10/n)    # sparse random web: ~10 outgoing links per page on average
@time x = PageRankDangling(P, α, v)
s = sortperm( x )
println("Top-ranking webpages: ",flipdim(s[end-9:end],1))

  4.132350 seconds (225.26 k allocations: 283.244 MB, 3.50% gc time)


## A small improvement

We note that we can store $P$ as $P = GS$, where $G$ is a $0$-$1$ matrix and $S$ is the diagonal matrix of reciprocal column sums of $G$. While $P\underline{x}$ costs ${\rm nnz}(P)$ additions and ${\rm nnz}(P)$ multiplications, $G(S\underline{x})$ costs only $N$ multiplications (for $S\underline{x}$) and ${\rm nnz}(P)$ additions (for the $0$-$1$ matrix-vector product $G(S\underline{x})$).

In [7]:
function PageRankDangling_improvement( P::SparseMatrixCSC{Bool,Int64}, α::Float64, v::Array{Float64,1} )
    # Rank pages in the web, given by the 0-1 matrix P, with a teleporting parameter of α.
    # The rank-one matrix D = v*eDangling' treats dangling webpages.
    n = size(P,1)
    s = sum(P,1)[:]
    # Catch dangling webpages and avoid dividing their columns by zero:
    eDangling = ( s .== 0 )
    s[eDangling] = 1
    x = ones(n)/n
    # Richardson's iteration to solve (αP + αD + (1-α)ve^T)x = x:
    for k = 1 : ceil(Int,log(1e-6)/log(α))
        x = α*(P*(x./s)) + (α*(dot(eDangling,x)-1) + 1)*v
    end
    return x
end

Out[7]:
PageRankDangling_improvement (generic function with 1 method)
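The factorization $P = GS$ is easy to verify numerically: $G(S\underline{x})$ gives exactly the same vector as $P\underline{x}$. A standalone NumPy sketch (dense for simplicity; the small random graph here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 8

# 0-1 link matrix with no dangling pages (every column gets at least one link).
G = (rng.random((N, N)) < 0.5).astype(float)
G[0, :] = 1.0

out_degree = G.sum(axis=0)
S = np.diag(1.0 / out_degree)      # diagonal matrix of reciprocal column sums
P = G @ S                          # the column-stochastic matrix P = G S

x = rng.random(N)
# P x computed as G (S x): N multiplications for S x, then only additions in G.
assert np.allclose(P @ x, G @ (S @ x))
assert np.allclose(P.sum(axis=0), 1)
```

The payoff is that $G$ can be stored as a boolean sparse matrix, halving the flop count of each matrix-vector product.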
In [8]:
srand(1234)
n = 100000
α = .85
v = ones(n)/n
P = sprand(Bool, n, n, 10/n)    # sparse 0-1 web matrix: ~10 outgoing links per page on average
@time x = PageRankDangling_improvement(P, α, v)

  0.565781 seconds (98.41 k allocations: 335.296 MB, 3.92% gc time)