Deep Learning - Week 2 Lecture Notes
These are my notes for the Deep Learning course on Coursera. I jumped straight to week 2 because week 1 is an introduction I already know. In summary, week 2 covers: binary classification with logistic regression, the loss and cost functions, and the computation graph.
Binary Classification
In image classification, the input you have to process is most often an RGB image: three matrices of pixel intensities (red, green, and blue channels) that get unrolled into a single feature vector $x$.
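As a minimal sketch (the 64x64 image size is just an assumption for illustration), unrolling an RGB image into the feature vector $x$ looks like this:

```python
import numpy as np

# Toy stand-in for a real photo: a 64x64 RGB image (the size is an assumption).
image = np.random.rand(64, 64, 3)

# Unroll all pixel values of the three channels into one feature vector x.
x = image.reshape(-1, 1)
print(x.shape)  # (12288, 1), since 64 * 64 * 3 = 12288
```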
Notation
A single training example is a pair $(x, y)$ with $x \in \mathbb{R}^{n_x}$; the $i$-th of the $m$ training examples is denoted by $(x^{(i)}, y^{(i)})$.
Logistic Regression
Given $x$, we want $\hat{y} = P(y = 1 \mid x)$.
The output is $\hat{y} = \sigma(w^T x + b)$.
The sigmoid function with input variable $z$ is defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
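A minimal sketch of this prediction in NumPy (the feature, weight, and bias values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# One example with n_x = 2 features; the numbers are hypothetical.
x = np.array([[1.0], [2.0]])   # input features, shape (n_x, 1)
w = np.array([[0.5], [-0.3]])  # weights, shape (n_x, 1)
b = 0.1                        # bias, a scalar

z = np.dot(w.T, x) + b         # z = w^T x + b
y_hat = sigmoid(z)             # predicted probability P(y = 1 | x)
print(y_hat)
```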
Logistic Regression Cost Function
Usually we would use a loss function such as:
$$\mathcal{L}(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$$
but with logistic regression this makes the optimization problem non-convex (it has multiple local optima), so gradient descent is not guaranteed to find the global minimum. We need a different loss function for logistic regression!
$$\mathcal{L}(\hat{y}, y) = -\bigl(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\bigr)$$
Cost Function
$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$
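A small sketch of both functions (the predictions and labels are hypothetical):

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss L(y_hat, y) for individual examples."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    """Cost J: the average of the per-example losses."""
    m = y.shape[1]
    return np.sum(loss(y_hat, y)) / m

# Hypothetical predictions and labels for m = 3 examples, shape (1, m).
y_hat = np.array([[0.9, 0.2, 0.7]])
y     = np.array([[1.0, 0.0, 1.0]])
print(cost(y_hat, y))
```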
Gradient Descent
Gradient descent is how we train, i.e. learn, the logistic regression parameters $w$ and $b$.
Repeat:
$$w := w - \alpha \frac{dJ(w)}{dw}$$
where α is the learning rate
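In code, one update step looks roughly like this (the parameter and gradient values are hypothetical, and $dw$, $db$ stand for $dJ/dw$, $dJ/db$, which are derived via the computation graph below; the same rule is applied to $b$):

```python
import numpy as np

# One gradient-descent step with made-up parameter and gradient values.
w, b = np.array([[0.5], [-0.3]]), 0.1
dw, db = np.array([[0.2], [0.05]]), 0.01
alpha = 0.01           # learning rate

w = w - alpha * dw     # w := w - alpha * dJ/dw
b = b - alpha * db     # b := b - alpha * dJ/db
```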
Computation Graph
The computation graph explains how the backpropagation mechanism works. A backward function computes the gradient of the cost function, and backpropagation distributes/propagates that gradient to all the layers before it.
Assume we have the function $J(a, b, c) = 3(a + bc)$; we then further define:
u=bc
v=a+u
J=3v
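Evaluating these intermediate variables left to right is the forward pass; here is a small sketch using the example values implied later in these notes ($a = 5$, $b = 3$, $c = 2$ is an assumption consistent with $v = 11$ and $J = 33$):

```python
# Forward pass through the computation graph.
a, b, c = 5.0, 3.0, 2.0
u = b * c       # u = bc
v = a + u       # v = a + u
J = 3 * v       # J = 3v
print(u, v, J)  # 6.0 11.0 33.0
```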
We then want to optimize the function $J$, which in the previous lectures was the cost function. Remember the difference between the cost function and the loss function: the cost function is the average of the loss function over all training examples, while the loss function is the error made by our prediction on a single example.
Given this graph, one of the questions we would like to answer is:
how much does the value of $J$ change if we change $a$?
From $J = 3v$ we know that:
$$\frac{dJ}{dv} = 3$$
Now we want to answer: $\frac{dJ}{da} = ?$
We know that $a$ affects $v$ and $v$ affects $J$, so following the chain rule:
$$\frac{dJ}{da} = \frac{dJ}{dv}\frac{dv}{da}$$
We have $\frac{dJ}{dv} = 3$ and $\frac{dv}{da} = 1$ (this comes from basic calculus; for a more intuitive explanation please see the lecture video). Then we get:
$$\frac{dJ}{da} = \frac{dJ}{dv}\frac{dv}{da} = (3)(1) = 3$$
What does this mean? It means that if we change the value of $a$ by some amount, the final function $J$ changes by 3 times that amount. Say we change the value of $a$ from 5 to 5.001: then $v = 11.001$ and $J = 33.003$.
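A quick numerical check of this, using the same values and simply nudging $a$ in code:

```python
# Numerically checking dJ/da = 3 by nudging a, mirroring the example above.
def forward(a, b, c):
    u = b * c
    v = a + u
    return 3 * v

a, b, c = 5.0, 3.0, 2.0
J = forward(a, b, c)                 # 33.0
J_bumped = forward(a + 0.001, b, c)  # 33.003
print((J_bumped - J) / 0.001)        # ~3.0, matching dJ/dv * dv/da = 3 * 1
```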
Incorporating Computation Graph Into Logistic Regression
We have
$$z = w^T x + b$$
$$\hat{y} = a = \sigma(z)$$
$$\mathcal{L}(\hat{y}, y) = -\bigl(y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})\bigr)$$
Then we can construct the computation graph as follows.
So we get (as Andrew Ng said: don't worry about the calculus for now):
$$da = \frac{d\mathcal{L}(a, y)}{da} = -\frac{y}{a} + \frac{1 - y}{1 - a}$$
$$dz = \frac{d\mathcal{L}}{dz} = \frac{d\mathcal{L}(a, y)}{da}\frac{da}{dz} = \left(-\frac{y}{a} + \frac{1 - y}{1 - a}\right)\bigl(a(1 - a)\bigr) = a - y$$
$$dw_1 = \frac{d\mathcal{L}}{dw_1} = x_1\, dz \qquad dw_2 = \frac{d\mathcal{L}}{dw_2} = x_2\, dz$$
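As a sketch, the same computation on one hypothetical example with two features (the $db = dz$ line is not written out above, but it follows from the same chain rule):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Backward pass for a single training example; input, label, and parameter
# values are hypothetical.
x1, x2, y = 1.0, 2.0, 1.0
w1, w2, b = 0.5, -0.3, 0.1

z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)        # forward pass: a = y_hat

dz  = a - y           # dL/dz = a - y
dw1 = x1 * dz         # dL/dw1 = x1 * dz
dw2 = x2 * dz         # dL/dw2 = x2 * dz
db  = dz              # dL/db = dz (same chain rule)
```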
Then we can update the weights as follows:
$$w_1 := w_1 - \alpha\, dw_1 \qquad w_2 := w_2 - \alpha\, dw_2$$
where $\alpha$ is the learning rate that we set. All of this is for a single training example; if we want the gradient with respect to, say, $w_1$ of the overall cost function, where:
$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(a^{(i)}, y^{(i)})$$
then
$$\frac{\partial}{\partial w_1} J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial w_1}\mathcal{L}(a^{(i)}, y^{(i)})$$
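Putting the pieces together, here is a rough, unvectorized sketch of one full gradient-descent iteration over $m$ examples (random data and made-up hyperparameters; a vectorized version would avoid the explicit loop):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One gradient-descent iteration over m examples, accumulating and averaging
# the per-example gradients as in the sum above. Data is random, for illustration.
n_x, m = 2, 100
X = np.random.randn(n_x, m)                     # inputs, one column per example
Y = (np.random.rand(1, m) > 0.5).astype(float)  # labels in {0, 1}

w = np.zeros((n_x, 1))
b = 0.0
alpha = 0.01                                    # learning rate

dw = np.zeros((n_x, 1))
db = 0.0
J = 0.0
for i in range(m):
    x_i = X[:, i:i + 1]                         # i-th example, shape (n_x, 1)
    y_i = Y[0, i]
    z_i = (np.dot(w.T, x_i) + b).item()         # scalar z for this example
    a_i = sigmoid(z_i)                          # forward pass
    J += -(y_i * np.log(a_i) + (1 - y_i) * np.log(1 - a_i))
    dz = a_i - y_i                              # dL/dz for this example
    dw += x_i * dz                              # accumulate dL/dw
    db += dz                                    # accumulate dL/db
J, dw, db = J / m, dw / m, db / m               # average over the m examples

w = w - alpha * dw                              # gradient-descent update
b = b - alpha * db
print(J)
```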