
Coursera Machine Learning

I completed this course and earned the certificate on November 17, 2015. This machine learning course, offered by Stanford University and taught by Andrew Ng, is widely regarded as the classic fundamental machine learning course for beginners. I summarized the programming assignments (please don't look if you are currently taking the course) on my GitHub, and some lecture notes are here (below).

Session 1.
Week 1 - Introduction, Linear Regression with One Variable, Linear Algebra Review
Week 2 - Linear Regression with Multiple Variables, Octave/Matlab Tutorial
Week 3 - Logistic Regression, Regularization

Programming 1: Linear Regression (Predicting house prices) [Github][Report]

Session 2.











Introduction

  • You need to know how to get the algorithms and the math to work on real problems.
  • The best way to build intelligent machines is to have some way for machines to learn things themselves.


Machine Learning Definition

Tom Mitchell (1998)
"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."

Example : Spam Filter
"Spam filter program is said to learn from watching you label emails as spam or not spam with respect to classifying emails as spam or no spam and the number of emails correctly classified as spam/not spam, if its performance on T, as measured by P, improves with experience E."


Supervised Learning

Housing price prediction"
  • "Right answers" given.
  • Regression : Predict continuous valued output(price).
  • 모든 영역에서 어떤 값들을 right answer로 결정할 것인가.  Prediction과 Actual value의 차이. 즉, error가 가장 작아지도록 하는 값.
  • 일차원 그래프이냐.. 이차원 그래프이냐..

Breast cancer (malignant, benign)
  • Classification : predict a discrete-valued output (0 or 1).
  • Many different features exist, such as clump thickness, uniformity of cell size, and uniformity of cell shape.
*Note : a learning algorithm can even use an infinite number of features for prediction; the Support Vector Machine algorithm does this with a mathematical trick (the kernel).


Unsupervised Learning
  • "No correct answers" (unlabeled data)
  • ex) clustering (k-means, hierarchical, mixture models), hidden Markov models, blind signal separation using feature extraction techniques for dimensionality reduction.
  • ex) recommended products : assume products have no intrinsic labels and cluster customers into similar groups based only on the tendencies visible in their purchase history.
  • ex) DNA microarray data : using clustering, we can group individuals by the expression levels of each gene and automatically discover structure.
  • ex) Organizing computing clusters : identify potential weak spots or distribute the workload effectively.

Cocktail party problem
English + Spanish spoken at the same time --- an overlapping combination --- the cocktail party algorithm (SVD-based) separates the sources using the different distances to each speaker --- the English-only signal can be recovered.


Linear Regression
  • Univariate (one-feature) and multivariate (multi-feature) linear regression.
  • Hypothesis : h(x) = theta0*x0 + theta1*x1 + theta2*x2 + ... ;  ( theta0 : zero condition (intercept), theta1, theta2, ... : gradients (slopes); by convention x0 = 1 )
  • Vectorized approach : with X a matrix and theta a vector, H = X*theta; (see the sketch below)
  • Think of h(x) as a "y imitator": the hypothesis tries to convert the x into y.
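A minimal Octave sketch of the vectorized hypothesis above (the house sizes and theta values are made-up examples, not taken from the course data):

% Hypothetical data: house sizes (in 1000s of square feet) and assumed parameters.
x = [1.0; 1.5; 2.0];            % one feature, m = 3 examples
X = [ones(length(x), 1) x];     % prepend the x0 = 1 intercept column
theta = [50; 100];              % assumed theta0 = 50, theta1 = 100
H = X * theta;                  % vectorized hypothesis: one prediction per row
disp(H)                         % -> 150, 200, 250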

Cost Function (= Squared Error Function)
  • DEF : Figure out how to fit the best possible straight line to our data.
  • Minimization problem (general vectorized version with X a matrix and theta, y vectors) : minimize over theta  J = 1/(2*m) * (X*theta-y)' * (X*theta-y);  The 1/m factor takes the average, and the extra 1/2 just makes the math (the derivative) a bit easier; it does not change the minimizer at all, since half of the smallest value is still the smallest value (this kind of convenient rescaling is common in practice, because the original form gains nothing and only costs more to work with).
  • The optimization objective for the learning algorithm is to find the values of theta0, theta1, ... which minimize the cost function J.
  • The cost function for linear regression is always a bowl-shaped (convex) function, so it has only one global optimum (see the sketch below).
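A minimal Octave sketch of the vectorized cost above (X, y, and theta are the same kind of made-up values as in the previous sketch):

% Hypothetical training data and parameters.
X = [1 1.0; 1 1.5; 1 2.0];      % m = 3 examples with the x0 = 1 column
y = [160; 190; 260];            % assumed target prices
theta = [50; 100];              % assumed parameters
m = length(y);
err = X * theta - y;            % prediction errors
J = 1 / (2 * m) * (err' * err); % squared error cost (J = 50 here)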

Gradient Descent
  • DEF : algorithm for minimizing the cost function J 
  • Learning rate(alpha) : a number which controls how big a step we take downhill with gradient descent.
  • - If alpha is too small, gradient descent can be slow
    - If alpha is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
  • When choosing alpha, start from a small value and increase or decrease it by a constant factor (roughly 3x) until you find the value that works best.
  • A simultaneous update is required. When computing the distance (error) contribution of one particular theta, the other thetas are treated as constants; because the model is linear, the updates that happen jointly across many dimensions can be unfolded, computed one parameter at a time, and then synchronized, exactly as if they had all happened at the same time.
  • Meaning of the derivative : take the tangent at the point and look at the slope of the line.
  • To check whether gradient descent has converged, it is better to plot the cost against the number of iterations and check by eye than to rely on an automatic convergence test (choosing the threshold is hard): you can watch in real time whether it is working properly, a good threshold is generally hard to pick, and the number of iterations needed to converge varies widely from application to application.
% Gradient descent step with a simultaneous update of all the thetas.
m = length(y);                      % number of training examples
n = length(X(1, :));                % number of features (columns of X)
temptheta = zeros(n, 1);
H = X*theta;                        % current predictions
for j=1:n
    grad_sum = X(:, j)'*(H-y);      % partial-derivative term for theta(j)
    temptheta(j) = theta(j) - alpha/m*grad_sum;
end
theta = temptheta;                  % simultaneously update all thetas
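The same simultaneous update can also be written without any explicit loop; a sketch of the fully vectorized form, assuming the same X, y, theta, alpha, and m as above:

% One fully vectorized gradient descent step (equivalent to the loop above).
theta = theta - (alpha / m) * X' * (X * theta - y);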

Normal Equation
  • DEF : it gives us a better way to solve for the optimal value of the parameters theta.
  • Gradient Descent (many steps) ↔ Normal Equation (just one shot!)
  • Unlike gradient descent, it does not creep toward the minimum step by step; it finds the point where the derivative of J(theta) is zero analytically, in a single shot.
  • With the Normal Equation there is no need to do Feature Scaling.
  • theta = pinv(X'*X)*X'*y;


Gradient Descent
  • (con) Need to choose alpha.
  • (con) Needs many iterations.
  • (pro) Works well even when n is large.
Normal Equation
  • (pro) No need to choose alpha.
  • (pro) Don't need to iterate.
  • (con) Slow if n is very large.
  • (con) Need to compute pinv(X'*X).
If n is large, use Gradient Descent; if n is small, use the Normal Equation.
The threshold for n is the size at which the matrix inversion can still be computed quickly (modern computers handle up to around n = 1000 without trouble).
Choose whichever suits the situation (i.e., the number of features).



Feature Scaling ≒ Mean normalization
  • Get every feature into approximately a -1 ≤ xi ≤ 1 range.
  • Make sure features are on a similar scale.
  • ex) x1 = size (0~2000 feet^2), x2 = # of bedrooms (1~5). When the scales of the features differ this much, gradient descent takes a long time to reach the global minimum because it oscillates back and forth (the contours are stretched along one axis); feature scaling or mean normalization helps gradient descent converge much faster.
  • Mean normalization : x1 := (x1 - u1) / s1, where u1 is the mean and s1 is the range (max - min) or the standard deviation (see the sketch below).
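A minimal Octave sketch of mean normalization as described above (the feature matrix is a made-up example; the assignment's featureNormalize function does essentially this):

% Hypothetical feature matrix: [size in square feet, number of bedrooms].
X = [2104 3; 1600 3; 2400 4; 1416 2];
mu = mean(X);                   % per-feature mean
sigma = std(X);                 % per-feature standard deviation (or use the range)
X_norm = (X - mu) ./ sigma;     % each feature now has mean 0 and a similar scale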

Polynomial Regression for non-linear function
  • For a non-linear data distribution, polynomial features such as x^2, x^3, ... (rather than just x) can help build a better model. ex) h(x) = t0 + t1*x + t2*x^2 + t3*x^3 ...
  • Even though there is only one input, a polynomial model effectively gives us several distinct pieces of information (we can make new features). ex) x1 = (size), x2 = (size)^2, x3 = (size)^3 (see the sketch below)
  • As the range of x grows, feature scaling becomes more important, and you have to watch out for overfitting.
* Note : in h(x) = theta0 + theta1*frontage + theta2*depth, frontage and depth can be combined into a single variable, area = frontage*depth. The simpler the model, the better.
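A short Octave sketch of building polynomial features from a single size input, as described above (the sizes are made-up; note how necessary feature scaling becomes):

% Hypothetical house sizes (a single input feature).
size_ft = [1000; 1500; 2000];
X_poly = [size_ft, size_ft.^2, size_ft.^3];   % new features: x, x^2, x^3
% The columns now differ by many orders of magnitude, so scale them.
mu = mean(X_poly);
sigma = std(X_poly);
X_poly = (X_poly - mu) ./ sigma;
X_poly = [ones(length(size_ft), 1) X_poly];   % prepend the intercept column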



Logistic Regression (=classification)
  • Used when the variable y we want to predict takes discrete values (0 or 1, or 0, 1, 2, 3, ...).
  • Linear regression reacts strongly to extreme values (outliers), which makes it hard to find an accurate threshold; logistic regression is not sensitive to outliers.

 Sigmoid function (Hypothesis Representation)
  • It squashes messy, non-linearly distributed values monotonically into the 0~1 range, which makes it possible to interpret the output as a probability.
  • g = 1 ./ (1+exp(-z));

Decision boundary
  • The decision boundary is not a property of the data but a property of the hypothesis, including theta0, theta1, theta2, ...
  • Once learning is finished and we have the decision boundary, the data is no longer needed and we can make predictions for new data; to keep learning, however, we still need data.
H(X) = g(X*theta);  g(z) = 1 ./ (1+exp(-z));  Suppose we predict "y=1" if h(x) ≥ 0.5; h(x) = g(X*theta) ≥ 0.5 whenever z ≥ 0 (z = X*theta);  ex) predict "y=1" if -3 + x1 + x2 ≥ 0
Because of the shape of the sigmoid, with the usual threshold of 0.5 we can assume y=1 whenever z (each element of X*theta) is greater than or equal to 0; if the threshold changes, everything changes.

Logistic Cost Function
  • It is derived from statistics using the principle of maximum likelihood estimation, and there is an underlying Gaussian assumption relating to the distribution of the features.
  • How do we choose the parameters theta? That is, how do we combine the features and theta linearly to build the decision boundary?
  • If the highly non-linear sigmoid function is plugged into the squared-error cost function, J (the cost function) becomes non-convex: it has many local minima, so there is no guarantee of reaching the global minimum. Therefore, for logistic regression (which is built on the non-linear sigmoid), the cost function is built with the log function so that J is convex.
  • Because -log is a monotonic transform on the 0~1 range, it turns J into a convex function.
% Unvectorized logistic cost (the loop handles the y=1 and y=0 terms separately).
H = sigmoid(X*theta);
J = 0;
for i=1:m
    J = J + ( -y(i)*log(H(i)) - (1-y(i))*log(1-H(i)) );
end
J = 1/m*J;
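The same cost, together with its gradient, can be written in vectorized form; a minimal sketch assuming sigmoid, X, y, theta, and m are defined as above:

% Vectorized logistic cost and gradient (unregularized).
H = sigmoid(X * theta);
J = (1 / m) * (-y' * log(H) - (1 - y)' * log(1 - H));
grad = (1 / m) * X' * (H - y);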

Regularization ( The problem of overfitting )
  • Overfitting : a wiggly curve, high variance. If we have too many features, the learned hypothesis may fit the training set very well, but it can easily fail to generalize to new examples.
Addressing overfitting
  1. Reduce the number of features
     - Keep only the important features.
     - By reducing the number of features, we may lose some information.
  2. Regularization
     - Keep all the features, but reduce the magnitude/values of the parameters theta.
     - Works well even when we have a lot of features, each of which contributes a bit to predicting y.
  • We penalize and make some thetas really small. Small values for the parameters correspond to a simpler hypothesis, and a simpler hypothesis is less prone to overfitting.
  • Lambda (the regularization parameter) controls the trade-off between fitting the training set well and keeping the parameters small (shrunk or penalized); see the combined sketch after the snippet below.
% Cost function with regularization (regsum = sum of theta(2:end).^2).
J = J + lambda/(2*m)*regsum;
% Gradient with regularization (for every j except the intercept).
grad(j) = 1/m*(X(:, j)'*(H-y)) + lambda/m*theta(j);
%% In the assignment, lambda=0 overfit, lambda=100 underfit, and lambda=1 fit well.
%% Caution : do not penalize theta0 (the intercept) when regularizing, since it is the reference axis.
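Putting the pieces together, a minimal vectorized sketch of the regularized logistic cost and gradient (assuming sigmoid, X, y, theta, m, and lambda exist; the intercept theta(1) is deliberately left unpenalized):

% Regularized logistic regression cost and gradient, fully vectorized.
H = sigmoid(X * theta);
theta_reg = [0; theta(2:end)];  % zero out the intercept so it is not penalized
J = (1 / m) * (-y' * log(H) - (1 - y)' * log(1 - H)) ...
    + (lambda / (2 * m)) * (theta_reg' * theta_reg);
grad = (1 / m) * X' * (H - y) + (lambda / m) * theta_reg;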



Neural Network ( Non-linear hypothesis )
  • Many applications that use machine learning algorithms (e.g., computer vision) have a very large number of features.
  • With logistic regression, as n grows, the polynomial fed into the sigmoid grows dramatically; this is not a good approach and is very inefficient in speed and cost. In other words, adding many quadratic features can lead to overfitting and very expensive computation. This is the motivation for neural networks.
  • A complicated system is decomposed into several layers and then learned. ex) XNOR (a complicated system) can be decomposed into simpler units across layers (e.g., an XOR-like stage followed by a NOT stage).

Feedforward Propagation and Prediction

%% Assume every theta has already been learned; a1 = input layer, a2 = hidden layer, a3 = output layer.
% X = [5000 400]; a set of 5000 images of 20x20 pixels.
m = size(X, 1);
a1 = X;
a1 = [ones(m, 1) a1];  % add the intercept (bias) unit to the input layer
% a1 = [5000 401]

for i=1:m
     a2(i, :) = sigmoid(a1(i, :)*theta1');
end

a2 = [ones(m, 1) a2]; % add the intercept (bias) unit to the hidden layer
% a2 = [5000 26]

for i=1:m
     a3(i, :) = sigmoid(a2(i, :)*theta2');
end
% a3 = [5000 10]

% Predict each label as the index of the output unit with the highest activation.
p = zeros(m, 1);
for i=1:m
     max_value = max(a3(i, :));
     for j=1:num_labels
          if a3(i, j) == max_value
               p(i) = j;
          end
     end
end

Backpropagation algorithm
  • Start with randomly initialized weights (thetas), run forward propagation, then run backpropagation -> backpropagation reduces the error at each layer. Once the training set has been learned sufficiently, apply the model in practice.
  • The backpropagation algorithm looks for the minimum of the error function in weight space using the method of gradient descent. (As before, J is partially differentiated with respect to the thetas of each layer to reduce the error.)

%% Advanced optimization algorithms can be used ( any model (linear, logistic, ...) works as long as we supply the cost function and the gradient ).
%% Computing the gradient for each layer during backpropagation.
a3 = H;                     % a3 = [5000 10], the output probabilities
d3 = a3 - Y;                % d3 = [5000 10], the output-layer error is a simple subtraction
Theta2 = Theta2(:, 2:end);  % Theta2 = [10 26] -change-> [10 25], drop the bias column
d2 = d3*Theta2 .* sigmoidGradient(z2);  % d2 = [5000 25]

D1 = d2' * a1;   % D1 = [25 401]

D2 = d3' * a2;   % D2 = [10 26]

Theta1_grad = D1/m;   % average over the m training examples

Theta2_grad = D2/m;



Advice for applying machine learning
  • Don't spend too much time just collecting more training data; in some cases it doesn't help that much.
  • Rather than blindly using a large number of features, it is better to select and use only the important ones (this helps prevent overfitting).

Evaluating a hypothesis
  • Plotting the hypothesis against each feature one by one makes it easy to see whether the model is overfitting, but with very many features this is impractical. The usual alternative is to randomly shuffle the data and split it into a training set and a test set, typically 70 to 30.
  • Besides estimating theta and measuring the generalization error, we need yet another untouched subset in order to choose the degree of the polynomial. If one subset is used for two or more purposes, the results overlap and we don't get the best estimate, so the evaluations must be done on completely separate sets.

  • Training set (60%) : used to estimate theta.
  • Cross validation set (20%) : used to select the best polynomial model.
  • Test set (20%) : gives an unbiased estimate of the generalization error.


Bias vs variance
  • Bias (underfit) : both the training set error and the cross validation error are high.
  • Variance (overfit) : the training set error is low and the cross validation error is high.
  • Usually you pick the degree of polynomial and the lambda value at the point where the cross validation error, after decreasing at first, starts to increase again, i.e., where the cv error is smallest. That point is the compromise that best avoids both overfitting and underfitting.




Machine learning system design

Building a spam classifier
x = features of emails (e.g., choose 100 words indicative of spam or not), y = spam or not. (In practice, use the n most frequently occurring words, n = 10,000~50,000.)
Error analysis
  1. Start with a simple, quick-and-dirty algorithm. Implement it and test it on your cross-validation data.
  2. Plot learning curves to decide whether more data, more features, etc. are needed.
  3. Do error analysis: manually examine the examples in the cross-validation set that your algorithm made errors on. (For text analysis, careful pre-processing (lowercasing, punctuation removal, stopwords, stemming, etc.) helps reduce errors.)
Error metrics for skewed classes
  • Normally a 1% error is so small that it could be ignored, but in a setting where only 0.5% of patients actually have cancer it becomes a very important number.
  • If the test-set accuracy goes from 99.2% (0.8% error) to 99.5% (0.5% error), we cannot be sure the classifier actually got better: the change is tiny and could be caused by many other factors. For skewed classes we have to analyze the result more carefully.
  • We need a different evaluation metric for skewed classes -> "Precision/Recall".
"Precision/Recall"
  1. Precision (TP / (TP + FP)) : of all patients where we predicted y=1, what fraction actually has cancer?
  2. Recall (TP / (TP + FN)) : of all patients that actually have cancer, what fraction did we correctly detect as having cancer? Because it involves false negatives, recall is critical in cancer diagnosis: telling a sick patient they are healthy is a very dangerous situation.
  • If a classifier is getting high precision and high recall, then we can be quite confident that the algorithm is doing well even if we have very skewed classes, but there is a trade-off between them.

Trading off precision and recall
  • With a fixed number of test samples, reducing false positives (higher precision) necessarily increases false negatives (lower recall).
  • Suppose we want to predict y=1 (cancer) only if very confident: higher precision & lower recall (e.g., threshold 0.5 -> 0.7). The patients we flag are more likely to truly have cancer, but false negatives increase, so the chance of failing to recognize a cancer patient also increases (a very critical situation!).
  • Suppose we want to avoid missing too many cases of cancer (avoid false negatives): higher recall & lower precision (e.g., threshold 0.5 -> 0.3). We are less likely to miss a cancer patient entirely, but more likely to diagnose a healthy person as having cancer.
  • Precision and recall therefore trade off against each other; this trade-off drives the choice of threshold and depends on the application. For medical diagnosis I would prefer higher recall & lower precision to avoid the critical case: wrongly flagging a healthy person can be treated as a mistake that simply leads to further testing, which is far less dangerous than failing to recognize a cancer patient at all.
F1 Score ( 2PR / (P+R) ) : Q. How do we compare precision/recall numbers and decide which algorithm is best? It is the harmonic mean of the two. (A small computation sketch follows the table below.)

          Precision   Recall   Average   F1 score
Model 1   0.5         0.4      0.45      0.444
Model 2   0.7         0.1      0.4       0.175
Model 3   0.02        1.0      0.51      0.0392
  • When evaluating various models, extreme cases like Model 3 can occur. A model with such a large gap between precision and recall is not a good model. If we ranked models by the average, the extreme case would get the highest score; the F1 score compensates for this and gives a very low score to models with a large gap.
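A minimal Octave sketch of computing precision, recall, and F1 from predictions, mirroring the definitions above (yval and yhat are hypothetical 0/1 vectors):

% Hypothetical ground truth and predictions (1 = positive class, e.g. cancer).
yval = [1 0 0 1 0 0 0 1 0 0]';
yhat = [1 0 0 0 0 1 0 1 0 0]';
tp = sum((yhat == 1) & (yval == 1));   % true positives
fp = sum((yhat == 1) & (yval == 0));   % false positives
fn = sum((yhat == 0) & (yval == 1));   % false negatives
precision = tp / (tp + fp);
recall    = tp / (tp + fn);
F1        = 2 * precision * recall / (precision + recall);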



x. Support Vector Machines



Compared to both logistic regression and neural networks, an SVM sometimes gives us a cleaner way of learning complex non-linear functions.


Logistic regression and SVM
INTUITION : we want y=0 or y=1 to hold when X*theta is much less than or much greater than 0. In this case we can have a very confident set of classifications for all the training examples, and this seems to be a nice goal to aim for.
Logistic regression
  • The error curves -log(sigmoid(X*theta)) and -log(1-sigmoid(X*theta)) are smooth: the error gradually decreases (or increases) as z increases (or decreases).
  • For y=1 we want X*theta >> 0 (and for y=0, X*theta << 0).
Support Vector Machine
  • The cost is made of straight segments: a sloped portion plus a flat portion (a safe zone). The flat portion, where the error is 0, starts at z = 1 (or z = -1 for y=0).
  • For y=1 we want X*theta ≥ 1, not just ≥ 0 (a safety margin factor).
  • In short, the SVM cost function is very similar to the logistic one, but it gives a computational advantage and an easier optimization problem.
Logistic regression : minimize A + lambda*B
Support Vector Machine : minimize C*A + B
  • A larger lambda gives more weight to B; a smaller C gives more weight to B (C plays roughly the role of 1/lambda).
  • A : the training data set term, B : the regularization term.
  • We need some way to deal with the trade-off between the regularization and data set terms -> set different values of lambda (or C) to parameterize this trade-off.
  • For SVMs the convention is to use the different parameter called C.
  • Unlike logistic regression, the SVM hypothesis doesn't give us a probability; instead we get a direct prediction of 1 or 0 (which is why the cost consists of straight segments: the zero-error region and the non-zero-error region are sharply separated).
Large Margin Intuition
  • Sometimes, people refer to SVM as large margin classifiers.
  • Unlike logistic regression, the SVM has a safe zone where the error is 0, so when minimizing C*A + B we can drive A to 0 (making C*A = 0 regardless of C); then only B, i.e., only the theta term, remains to be minimized, leaving the smallest possible residual.
  • By choosing the decision boundary with the maximum distance (margin) to the two classes of data, we get a more robust separator.
Large margin classifier in presence of outliers

Large C
  • A more sophisticated margin that tries to classify every single example, so it is very sensitive to outliers (not a great model).
  • More prone to overfitting.
Small C
  • If C is not too large, outliers can be ignored and you get a fairly naive maximized-margin approach (just the simple max margin).
  • If C is reasonably small, you can stick with the original (black) decision boundary, which sits in a sensible position.
The idea of the SVM being a large margin classifier is only really relevant when the data is easily linearly separable and outliers are absent or can be ignored (i.e., we ignore a few outliers).

SVM Decision Boundary
  • In the optimization objective, we're trying to find a setting of the parameters where the norm of theta is small, which enlarges the margin of the classifier.

Kernels Intuition
  • to develop complex nonlinear classifier.
  • to find a nonlinear decision boundary to distinguish the positive and negative examples.
  • For a non-linear training set, a high-order polynomial classifier is very expensive; we need a different way of solving this problem.
  • In any case, the goal is to build a decision boundary for the non-linear data set; instead of high-order polynomials, the Gaussian kernel lets us build a complex decision boundary efficiently.
Kernels and Similarity
  • Similarity function = Gaussian kernel function; it measures how close x is to the landmark: the value approaches 0 as x moves far from the landmark and 1 as x gets close to it (see the sketch below).
  • Landmarks (reference points) are placed at arbitrary locations, or, for complex problems, at every training example, and the similarity between each data point and every landmark is measured.
  • The landmarks define new features ( f1_i = similarity(x_i, l1), f2_i, ..., f3_i, ... ); that is, each data point gets a value for every new feature. Given a data point, compute all the f values.
  • Just like the polynomial feature vector x, x^2, x^3, x^4, ..., the feature vector is now given by f1, f2, f3, f4, ...
  • Combining kernels with the SVM, the computation is made efficient by the optimization tricks built into SVM software packages ( the number of landmarks equals the dimension of the new feature vector, since a landmark is placed at every training example ).
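A minimal Octave sketch of the Gaussian kernel similarity described above (x, the landmark l, and sigma are made-up values):

% Gaussian kernel: similarity between a data point x and a landmark l.
x = [3; 2];                     % hypothetical data point
l = [1; 1];                     % hypothetical landmark
sigma = 1;
f = exp(-sum((x - l) .^ 2) / (2 * sigma ^ 2));
% f is close to 1 when x is near l and approaches 0 as x moves far away.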

SVM parameters
C (≈ 1/lambda)
  • Large C (small lambda) : lower bias, higher variance.
  • Small C (large lambda) : higher bias, lower variance.
Sigma^2
  • Large Sigma^2 : the features f_i vary more smoothly -> higher bias, lower variance.
  • Small Sigma^2 : the features f_i vary less smoothly -> lower bias, higher variance.

Logistic regression vs SVMs
  • n = # of features, m = # of training examples
  • If n is large(relative to m), use logistic regression or SVM without a kernel("linear kernel").
  • If n is small, m is intermediate, use SVM with Gaussian kernel.
  • If n is small, m is large, add more features, then use logistic regression or SVM without a kernel.


Clustering
  • where we learn from unlabeled data instead of labeled data
  • to automatically find some structure or coherent subset in the data 

K-means algorithm
  • ITERATIVE ALGORITHM : (step 1) cluster assignment, (step 2) move centroid (see the sketch below).
  • To avoid local optima, i.e., to better find the global optimum, run multiple random initializations and then pick the run with the lowest cost (the larger the number of clusters, the more random initializations it is worth running).
  • To choose K, the number of clusters, reasonably, try the elbow method or choose based on the downstream purpose (e.g., market segmentation), but there isn't always a clear-cut answer.
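A minimal Octave sketch of one K-means iteration (the two steps listed above), assuming a data matrix X (m x n) and initial centroids (K x n) are already given:

% One K-means iteration: assign points to the closest centroid, then move centroids.
K = size(centroids, 1);
m = size(X, 1);
idx = zeros(m, 1);
% Step 1: cluster assignment.
for i = 1:m
    d = sum((centroids - X(i, :)) .^ 2, 2);   % squared distance to each centroid
    [~, idx(i)] = min(d);
end
% Step 2: move each centroid to the mean of the points assigned to it.
for k = 1:K
    centroids(k, :) = mean(X(idx == k, :), 1);
end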


Dimensionality Reduction
  • DATA COMPRESSION : not only allows us to compress the data and then use up less computer memory but also allows us to speed our learning algorithms.
  • INTUITION : when combining the features of engineering team 1 (100 features), team 2 (200 features), team 3 (300 features), ..., the correlations between features can be exploited to compress the data.
  • DATA VISUALIZATION : to show 50-dimensional data in 2D or 3D, represent it with the 2 or 3 main dimensions of variation.


Principal Component Analysis Algorithm
  • Before running PCA, do data preprocessing (feature scaling / mean normalization).
  • to reduce data from large dimensions to small dimensions
  1. Compute the "covariance matrix"
  2. Compute the "eigenvectors" by using the Singular Value Decomposition (SVD)
  • To choose k (the number of principal components), pick the smallest value of k such that 99% of the variance is retained (this can be read off the output of the svd function); see the sketch below.
  • APPLICATION : Compression, Visualization ...
  • It is not right to always run PCA before learning a model; learn on the raw data first, and use PCA only when that doesn't work well or when it is actually needed.
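A minimal Octave sketch of the PCA steps listed above (assumes X is an m x n matrix that has already been mean-normalized; k = 2 is just an assumed choice):

% PCA: covariance matrix, SVD, then projection onto the top k components.
[m, n] = size(X);
Sigma = (1 / m) * (X' * X);     % covariance matrix (n x n)
[U, S, V] = svd(Sigma);         % columns of U are the principal components
k = 2;                          % assumed number of components to keep
Z = X * U(:, 1:k);              % projected (compressed) data, m x k
s = diag(S);
retained = sum(s(1:k)) / sum(s);   % fraction of variance retained (aim for >= 0.99)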


Anomaly detection
Assume that the features x1, x2, ..., xn are distributed according to a Gaussian distribution.

Anomaly detection vs. Supervised learning
  • SPAM filtering has many different 'types' of spam, like an anomaly detection problem, but because we have a large set of spam examples, supervised learning is the better fit.
  • If you are manufacturing at very large volumes and also have a large number of bad examples, supervised learning becomes appropriate; anomaly detection (flagging examples with low p(x)) fits when anomalous examples are rare.
Choosing what features to use
  • If the raw data distribution is non-normal, apply a transform to make it closer to a normal curve.
  • PROBLEM : p(x) is comparably large for both normal and anomalous examples.
  • With a single feature, an anomalous example may still receive a high probability; in that case, add a new feature (going from one dimension to two makes the anomaly easier to tell apart), because it becomes easier to distinguish the anomalies from your good examples.
Original model  vs.  Multivariate Gaussian


Original model
  • Simply multiplies independent per-feature Gaussians, so the density contours are always axis-aligned (it is not flexible and cannot capture every data distribution).
  • You must manually create features to capture anomalies where x1 and x2 take unusual combinations of values (e.g., x3 = x1/x2).
  • Computationally cheaper (favorable when n is large).
  • OK even if m (the training set size) is small.
Multivariate Gaussian
  • Fits the parameters (mean vector and covariance matrix) to the given training set, so its shape adapts flexibly to the data distribution.
  • Automatically captures correlations between features.
  • Computationally more expensive.
  • Must have m > n (m should be sufficiently larger than n), or else the covariance matrix is non-invertible; since the covariance matrix is n x n, the cost grows roughly quadratically in n.
The original model is actually a special case of the multivariate Gaussian model.
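A minimal Octave sketch of the original (per-feature Gaussian) model: estimate the parameters from X and score a new example (X, x_new, and epsilon are assumed placeholders):

% Original anomaly detection model: independent Gaussians, one per feature.
mu = mean(X);                   % 1 x n vector of feature means
sigma2 = var(X, 1);             % 1 x n maximum-likelihood variances
x_new = X(1, :);                % hypothetical example to score
p = prod(1 ./ sqrt(2 * pi * sigma2) .* exp(-(x_new - mu) .^ 2 ./ (2 * sigma2)));
epsilon = 1e-3;                 % assumed threshold, normally chosen on a CV set
is_anomaly = (p < epsilon);     % flag as anomalous when p(x) is very small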


x. Recommender Systems
In machine learning algorithm, features are very important because the features you choose(or select) will have a big effect on the performance of your learning algorithm. There are some algorithms that can try to automatically learn a good set of features for you rather than hand code. Actually, recommender systems are sort of the algorithms which can automatically identify the crucial and relevant features.

The problem of recommender systems is that, given $r(i,j)$ and $y^{(i,j)}$, we try to predict the missing values (i.e., the "?" entries) in order to provide users with ratings for movies they have not seen before. There exist two representative approaches, content-based and collaborative filtering.


x.1 Content-based recommendations
In the content-based approach, we assume that we have a set of features for the different movies that capture their content. In particular, each movie has two features, denoted by $x_1$ and $x_2$, where $x_1$ measures the degree to which a movie is a romantic movie and $x_2$ measures the degree to which it is an action movie.
Content-based Problem
Therefore, each movie can be represented with a feature vector, $X^{(n)}$. For example, "Love at last" is represented by $X^{(1)}$ $= [ 1, 0.9, 0 ]$, where $x_0=1$ is the intercept term.

In order to make predictions, we could treat predicting the ratings of each user as a separate linear regression problem. Specifically, for each user $j$, we first learn a parameter vector $\theta^{(j)}$∈$R^3$ (more generally, $\theta^{(j)}$∈$R^{n+1}$, where n is the number of features). Second, we predict user $j$'s rating of movie $i$ with the inner product between the parameter vector and the feature vector (i.e., $(\theta^{(j)})^T x^{(i)}$). Thus, all we're doing here is applying a different copy of linear regression for each user.

In order to learn the parameter vectors $\theta^{(j)}$ for all of the users, the optimization objective (i.e., the process of minimizing our cost function) is the following:
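(A sketch of the objective, following the standard form from the lecture, for a single user $j$ and with the $m^{(j)}$ term still included:)

$$\min_{\theta^{(j)}} \; \frac{1}{2m^{(j)}} \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2m^{(j)}} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$$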
For recommender systems, they often get rid of the term, $m^{(j)}$, to simplify the subsequent math (i.e.,we regard it as just a constant). When minimizing this, we can still get the same value of $\theta^{(j)}$ as before.

Next, in order to do the minimization, we're using the gradient descent update:
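(A sketch of the update, following the lecture's form:)

$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) x_k^{(i)} \qquad (k = 0)$$

$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda\, \theta_k^{(j)} \right) \qquad (k \neq 0)$$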
We split it into two cases because we don't regularize $\theta_0^{(j)}$. The only minor difference from linear regression is that the $1/m^{(j)}$ factor has been dropped.

Drawback of Content-based
For many movies we don't actually have such features, or it may be very difficult to get them for all of our movies; it is also very time-consuming and expensive.
  • it may be very difficult to get such features for all of the movies.
  • Analyzing every movie in detail and assigning per-genre weights to build its feature vector is not easy, is not particularly accurate, and above all costs a lot.
  • OPTIMIZATION OBJECTIVE : one training example = { a movie's per-genre weights (its feature vector), some user's rating of that movie }. With many such training examples we learn the theta vector, aiming to minimize (X*theta - y). In other words, using each user's ratings, we optimize that user's weights (theta vector) over the movie genres (feature vector), i.e., we infer each user's taste per genre.


x.2 Collaborative filtering
Collaborative filtering has a very interesting property: it does "feature learning". It is possible for the algorithm to start learning for itself what features to use.

Collaborative filtering Problem
In the collaborative filtering problem, we have no idea how romantic or how action-oriented each movie is.

  • Unlike the content-based approach, we don't know the values of these features (each movie's per-genre weighting).
  • In contrast to the content-based approach, we're not using any information about the movie itself, just the similarity between users.
  • The algorithm has the very interesting property that it does what is called feature learning; in other words, it's a kind of algorithm that can start to learn for itself what features to use.
  • With every user rating some subset of the movies, every user is helping the algorithm a little bit to learn better features.
  • By rating a few movies myself, I will be helping the system learn better features and these features can be used by the system to make better movie predictions for everyone.

Mean Normalization(Pre-processing) for System
  • There are many, many users who have not rated any movies (the ratings matrix is sparse).
  • Compute the average rating that each movie has obtained.
  • Because the average rating is used instead of 0, a user who has not rated anything does minimal damage to the system's predictions.


Large scale machine learning
When the training set is very large (m = 100,000,000), training becomes computationally very expensive, so it may be worth reducing the number of examples. If the model has high variance, adding extra training examples helps performance; if it has high bias, adding extra features helps, and using a subset (m = 1,000) of the original m (billions) barely changes the result (check where the cost-versus-m curve converges).
Stochastic gradient descent
  • For a large training set, (batch) gradient descent is a very expensive procedure; a modification of gradient descent, stochastic gradient descent, is a much more efficient procedure.
  • Instead of using the errors of all training examples at each step, randomly shuffle the data and use the error of a single example per step (see the sketch below).
  • It does not converge exactly to the global minimum but keeps wandering in a region close to it. There is a small error, but that is fine (an approximation of the global minimum).
  • If alpha is set well, the oscillation (noise) shrinks as it heads toward the global minimum ( alpha = const1 / (# of iterations + const2) ).
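A minimal Octave sketch of one stochastic gradient descent pass for linear regression, following the description above (X, y, theta, and alpha are assumed to exist; randperm does the random shuffling):

% One pass of stochastic gradient descent over a randomly shuffled training set.
m = size(X, 1);
order = randperm(m);            % randomly shuffle the example indices
for idx = order
    xi = X(idx, :);             % a single training example (1 x n row)
    yi = y(idx);
    err = xi * theta - yi;      % scalar error for this one example
    theta = theta - alpha * err * xi';   % update using just this example
end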
Mini-batch gradient descent
  1. Batch gradient descent : Use all m examples in each iteration
  2. Stochastic gradient descent : Use 1 example in each iteration
  3. Mini-batch gradient descent : Use b examples in each iteration (b is typically around 10)
  • If you have a good vectorized implementation, Mini-batch can sometimes run even faster than Stochastic.

Online learning (= a large scale machine learning setting )
  • It allows us to model problems where we have a continuous flood or continuous stream of data.
  • Because the data comes flooding in (there is so much of it), the examples used for a theta update are discarded afterwards (saving memory). If the data set is small, it is better to store all the examples and run a standard machine learning algorithm rather than online learning.

Map-reduce and data parallelism
  • To learn from 4 million training examples with gradient descent, splitting the work across 4 computers and computing in parallel could in principle give a 4x speed-up; in practice, because of the overhead of combining the results caused by network latency, you get slightly less than a 4x speed-up.
  • To avoid network latency, we can use multi-core machines instead.
  1. Multi-core machine
  2. good numerical linear algebra libraries ( automatically parallelize ) 
  3. very good vectorizing implementation of the learning algorithm



Application : Photo OCR ( Photo Optical Character Recognition )
  • One of the things many developers have been interested in is how to get our computers to understand the content of pictures a little bit better.
  • The Photo OCR problem focuses on how to get computers to read the text that appears in the images we take.
Photo OCR pipeline
  1. IMAGE
  2. TEXT DETECTION : find the regions where there's text.
  3. CHARACTER SEGMENTATION : try to segment the picture into the locations of the individual characters. 
  4. CHARACTER RECOGNITION : look at the images of the individual characters.
  • When building a team, assign people per module.
  • Given the OCR problem, how you break it down into a sequence of different modules is important.

Sliding window detection ( = supervised learning classifier )
  • Using our classifier, we can determine whether or not a given window contains a pedestrian.
  • A step size (stride) of 1 is the most accurate but computationally very expensive.

Getting lots of data : Artificial data synthesis
  • One of the most reliable ways to get a high performance ML system is to take a low bias learning algorithm and to train it on a massive training set.
  • creating new data from scratch
  • using a small training set to turn that into a larger training set.
  1. Synthetic data created from scratch (limited effect on its own) : it can be made remarkably similar to real data.
  2. Amplification : by applying distortions to real examples, we can create new examples.

Ceiling analysis ( upper bound )
  • What part of the pipeline should you spend the most time trying to improve?
  • You should know each module's upside potential accurately before investing time and money in it. ex) No matter how much character segmentation is improved, the overall system gains only about 1%.






Review
It was the first course I took to learn machine learning. As a beginner I could follow it without big difficulty, but the MATLAB programming assignments were a little challenging.
I learned various types of machine learning algorithms through real-world examples such as house prices and spam classification, and dug deeper (and had more fun) through the MATLAB programming assignments, which require vectorized implementations suited to parallel hardware rather than simple for-loop logic, especially when building the gradient descent and forward propagation algorithms.
Since starter code was provided, I could concentrate only on the core algorithm functions, which saved some redundant time.
I learned basic machine learning models such as linear and logistic regression, neural networks, clustering, support vector machines, and collaborative filtering, as well as other data mining techniques such as anomaly detection and dimensionality reduction.
Lastly, since the course covers the basic concepts, I feel I need to gain deeper insight from other courses.

Reference
"Machine Learning by Stanford University on Coursera. Certificate earned on November 17, 2015".
