## MyEncyclopedia Blogs

#### Month: September 2015

Originated from Coursera Machine Learning, week 2.
Code at https://github.com/MyEncyclopedia/AlgImpl/tree/master/02_Machine_Learning/Linear_Regression

# Hypothesis

Vector X = [1, x_1, x_2, ..., x_n] is the input data.

The linear hypothesis is

h_theta(x) = theta_0 + theta_1*x_1 + theta_2*x_2 + ... + theta_n*x_n

Define the cost function to be

J(theta) = 1/(2m) * Sigma_{i=1}^{m} (h_theta(x^(i)) - y^(i))^2

The task is to find the theta that minimizes J.
Gradient descent finds such a theta iteratively.
The partial derivative of J(theta) with respect to theta_j is 1/m * Sigma_{i=1}^{m} (h_theta(x^(i)) - y^(i)) * x_j^(i) (the 2 brought down from the square cancels the 2 in the 1/(2m) factor).

Therefore, the iteration equation is

theta_j = theta_j - alpha * 1/m * Sigma_{i=1}^{m} (h_theta(x^(i)) - y^(i)) * x_j^(i)

# Feature Scaling

Get every feature into approximately the [-1, 1] range:

x_i = (x_i - mu_i) / sigma_i

where mu_i is the mean and sigma_i the standard deviation of the i-th feature.
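A quick illustration of the scaling formula on one made-up feature column (values are hypothetical):

```python
import numpy as np

# z-score scaling of one feature: subtract the mean, divide by the standard deviation
x = np.array([100.0, 200.0, 300.0, 400.0])
mu, sigma = x.mean(), x.std()
x_scaled = (x - mu) / sigma
# x_scaled now has mean 0 and standard deviation 1,
# with all values falling roughly in the [-1.5, 1.5] range
```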

# Python Code

```python
import numpy as np


class LinearRegression:
    "Batch gradient descent method of linear regression"

    def __init__(self, alpha=0.01, step=100):
        self.alpha = alpha  # learning rate
        self.step = step    # number of iterations

    def distance(self, v1, v2):
        v = v1 - v2
        return np.sum(v * v)

    def h(self, X, theta):
        return np.dot(theta, X)

    def cost(self, X):
        m = len(X)
        return 1.0 / (2 * m) * np.sum(X * X)

    def featureNormalize(self, X):
        "Get every feature into approximately [-1, 1] range."
        featureN = len(X)
        mu = np.mean(X, axis=1, dtype=np.float64)
        sigma = np.std(X, axis=1, dtype=np.float64)
        X_norm = np.copy(X)  # avoid mutating the caller's array
        for i in range(featureN):
            X_norm[i] = (X[i] - mu[i]) / sigma[i]
        return X_norm, mu, sigma

    def iterate(self, data, toNormalize=False):
        m = len(data)
        data = np.transpose(data)
        featureN = len(data) - 1
        Y = data[featureN]    # shape 1 * m
        X = data[0:featureN]  # shape N * m
        if toNormalize:
            X, mu, sigma = self.featureNormalize(X)
        X = np.vstack((np.ones(m), X))  # shape (N+1) * m

        theta = np.zeros(featureN + 1, dtype=np.float64)

        for _ in range(self.step):
            theta_temp = np.zeros(featureN + 1, dtype=np.float64)
            V = self.h(X, theta) - Y
            costValue = self.cost(V)  # can be logged to watch convergence

            for i in range(featureN + 1):
                theta_temp[i] = np.sum(V * X[i])
            theta_temp *= self.alpha / m
            theta = theta - theta_temp
        return theta
```


The result of running the code above on the test data (from the homework of the Machine Learning course) is:

ex1data1.txt: theta = [-3.63029144, 1.16636235]
ex1data2.txt: theta = [338658.2492493, 103322.82942954, -474.74249522]
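As a sanity check on the update rule (not part of the original homework), batch gradient descent on a tiny synthetic dataset y = 1 + 2x recovers the true parameters:

```python
import numpy as np

# Hypothetical sanity check: fit y = 1 + 2*x on noise-free synthetic data
# using the same batch update rule; theta should approach [1, 2].
m = 100
x = np.linspace(0, 1, m)
X = np.vstack((np.ones(m), x))  # shape (2, m); first row is the constant feature
y = 1.0 + 2.0 * x
theta = np.zeros(2)
alpha = 0.5
for _ in range(2000):
    grad = np.dot(X, np.dot(theta, X) - y) / m  # (1/m) * X * (h - y)
    theta = theta - alpha * grad
# theta converges to approximately [1.0, 2.0]
```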

Algorithm origin: Coursera, Mining of Massive Datasets, week 1.
Code is at https://github.com/MyEncyclopedia/AlgImpl/tree/master/01_Mining_of_Massive_Datasets/PageRank

## Power Iteration Method

1. Init: r^(0) = [1/N, ..., 1/N]^T
2. Iterate: r^(t+1) = M * r^(t)
3. Stop when |r^(t+1) - r^(t)| < epsilon

where M_ji = 1/d_i if i points to j (d_i being the out-degree of node i), else M_ji = 0

## Naive Implementation

```python
import numpy as np


class PageRank_NoTeleport:
    "Power iteration that addresses neither the spider trap problem nor the dead end problem"

    def __init__(self, epsilon=0.0001):
        self.epsilon = epsilon

    def distance(self, v1, v2):
        v = v1 - v2
        return np.sum(v * v)

    def compute(self, G):
        "G is an N*N matrix where G[i][j] == 1 if j links to i, else G[i][j] == 0"
        N = len(G)
        d = np.zeros(N)  # out-degrees
        for i in range(N):
            for j in range(N):
                if G[j, i] == 1:
                    d[i] += 1

        r0 = np.zeros(N, dtype=np.float32) + 1.0 / N
        # construct column-stochastic M
        M = np.zeros((N, N), dtype=np.float32)
        for i in range(N):
            for j in range(N):
                if G[j, i] == 1:
                    M[j, i] = 1.0 / d[i]
        while True:
            r1 = np.dot(M, r0)
            if self.distance(r1, r0) < self.epsilon:
                break
            r0 = r1

        return r1
```


The naive version handles a well-behaved, column-stochastic matrix M, such as the one for the following graph:

G = ((0,1,1),(1,1,0),(1,0,0)) => M = ((0,0.5,1),(0.5,0.5,0),(0.5,0,0))

The result is [0.40293884, 0.39887747, 0.1981837].
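The exact stationary vector of this M can be checked independently with a few lines of standalone power iteration (a sketch, separate from the class above); it converges to [0.4, 0.4, 0.2], which the run above approximates:

```python
import numpy as np

# M for G = ((0,1,1),(1,1,0),(1,0,0)), as derived above
M = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0]])
r = np.full(3, 1.0 / 3)
for _ in range(1000):  # iterate to (near) convergence
    r = M.dot(r)
# r is now essentially the exact stationary vector [0.4, 0.4, 0.2]
```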

But it cannot handle the dead end problem (see picture below) or the spider trap problem.

In the case that

G = ((0,1,0),(1,1,0),(1,0,0)) => M = ((0,0.5,0),(0.5,0.5,0),(0.5,0,0))

node 2 has no out-links, so column 2 of M is all zeros and M is no longer stochastic. With the naive implementation, we get the wrong result [0.01896159, 0.03068034, 0.01171875].
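The leak can be seen directly: with the dead-end column all zeros, each multiplication by M loses the rank mass sitting on node 2, so the total mass decays toward zero (a standalone sketch):

```python
import numpy as np

# M for the dead-end graph G = ((0,1,0),(1,1,0),(1,0,0)); column 2 is all zeros
M = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0]])
r = np.full(3, 1.0 / 3)
for _ in range(100):
    r = M.dot(r)
# nearly all of the rank mass has leaked out through the dead end
```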

## Teleport Version

The Google matrix A is A = beta*M + (1-beta)*(1/N) * e * e^T, where e is the vector of all 1s, beta is the probability of following a link, and 1-beta is the probability of teleporting to a random page.
The iteration becomes r^(t+1) = A * r^(t).
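Constructed this way (a sketch with hypothetical values, using the dead-end graph above with its dead-end column already replaced by 1/N), every column of A sums to 1:

```python
import numpy as np

beta, N = 0.85, 3
M = np.array([[0.0, 0.5, 1.0 / 3],
              [0.5, 0.5, 1.0 / 3],
              [0.5, 0.0, 1.0 / 3]])  # dead-end column replaced by 1/N
A = beta * M + (1.0 - beta) / N * np.ones((N, N))
# A is column-stochastic: each column sums to 1
```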

```python
import numpy as np


class PageRank_Matrix:
    "Power iteration with random teleports that addresses both the spider trap problem and the dead end problem"

    def __init__(self, beta=0.85, epsilon=0.0001):
        self.beta = beta
        self.epsilon = epsilon

    def distance(self, v1, v2):
        v = v1 - v2
        return np.sum(v * v)

    def compute(self, G):
        "G is an N*N matrix where G[i][j] == 1 if j links to i, else G[i][j] == 0"
        N = len(G)
        d = np.zeros(N)  # out-degrees
        for i in range(N):
            for j in range(N):
                if G[j, i] == 1:
                    d[i] += 1

        r0 = np.zeros(N, dtype=np.float32) + 1.0 / N
        # construct column-stochastic M
        M = np.zeros((N, N), dtype=np.float32)
        for i in range(N):
            if d[i] == 0:  # i is a dead end: teleport always
                for j in range(N):
                    M[j, i] = 1.0 / N
            else:
                for j in range(N):
                    if G[j, i] == 1:
                        M[j, i] = 1.0 / d[i]

        # Google matrix: follow a link with probability beta, teleport with 1 - beta
        A = self.beta * M + (1.0 - self.beta) / N * np.ones((N, N), dtype=np.float32)
        while True:
            r1 = np.dot(A, r0)
            if self.distance(r1, r0) < self.epsilon:
                break
            r0 = r1

        return r1
```


In this version, beta is introduced to reflect the probability of teleporting. In addition, matrix M is preprocessed so that it stays stochastic at dead-end nodes (their columns are replaced by 1/N). The result for the dead-end graph above is [0.30623639, 0.43920633, 0.25455755].
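The reported result can be reproduced with a standalone sketch (iterating further than the epsilon threshold above, so the digits differ slightly in the third decimal place):

```python
import numpy as np

beta, N = 0.85, 3
# M for G = ((0,1,0),(1,1,0),(1,0,0)), with dead-end column 2 replaced by 1/N
M = np.array([[0.0, 0.5, 1.0 / 3],
              [0.5, 0.5, 1.0 / 3],
              [0.5, 0.0, 1.0 / 3]])
A = beta * M + (1.0 - beta) / N * np.ones((N, N))
r = np.full(N, 1.0 / N)
for _ in range(200):
    r = A.dot(r)
# r converges to roughly [0.308, 0.439, 0.253], close to the values above
```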

ME-Mydoc is a desktop application that runs on Windows, Linux and Mac. Compared to other personal knowledge management apps such as Evernote, it has no intrinsic support for online data sharing. So are we restricted to single-machine usage? Of course not! Let's turn it into an online data sharing application!

1. Install Google Drive/SkyDrive/Dropbox/BaiduYun Sync Drive (actually, you can install any online storage with local directory mapping support).
2. Install ME-Mydoc and configure it so that all data paths point to the local folder mapped by the online storage app. See picture below.
3. Yes! You can now use ME-Mydoc to access your online data from any Windows, Mac or Linux machine, as long as the online storage is available on that platform.
4. Note that you must pause synchronization while you are using ME-Mydoc, to avoid data corruption.