
Deep Learning - Week 2 Lecture Notes

These are my notes for the Deep Learning course on Coursera. I jumped straight to week 2 because week 1 is an introduction I already know. In summary, week 2 is structured as: binary classification with logistic regression, the loss and cost functions, gradient descent, and the computation graph.

Binary Classification

In image classification you have to process an input image, which most commonly consists of RGB color channels whose pixel values are unrolled into a single feature vector $x$.

Notation

$(x, y)$, where $x \in \mathbb{R}^{n_x}$

The $i$-th of the $m$ training examples is denoted by

$(x^{(i)}, y^{(i)})$

Logistic Regression

Given $x$, we want $\hat{y} = P(y = 1 \mid x)$.

The output is $\hat{y} = \sigma(w^T x + b)$.

The sigmoid function with input variable $z$ is defined as:

$\sigma(z) = \frac{1}{1 + e^{-z}}$
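As a quick sketch of my own (not from the lecture), the sigmoid can be implemented with NumPy so it works on scalars and arrays alike:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real z into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))                             # 0.5
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[4.5e-05, 0.5, 0.99995]
```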

Logistic Regression Cost Function

A loss function we would usually think of first is the squared error:

$\mathcal{L}(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2$

but with the sigmoid this makes the optimization problem non-convex, so gradient descent can get stuck in local optima instead of finding a single global solution. We need a different loss function for logistic regression:

$\mathcal{L}(\hat{y}, y) = -\left(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right)$

Cost function

$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$
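Here is a small NumPy sketch of this cost (the function and variable names are my own choices):

```python
import numpy as np

def cross_entropy_cost(y_hat, y):
    """Average logistic (cross-entropy) loss over m examples.

    y_hat, y: arrays of shape (m,), predictions in (0, 1), labels in {0, 1}.
    """
    m = y.shape[0]
    eps = 1e-12  # avoids log(0) when a prediction saturates at exactly 0 or 1
    losses = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return np.sum(losses) / m

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7])
print(cross_entropy_cost(y_hat, y))  # ~0.228
```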

Gradient Descent

Gradient descent is used for training, i.e. learning the logistic regression parameters $w$ and $b$.

Repeat:

$w := w - \alpha \frac{dJ(w)}{dw}$

where $\alpha$ is the learning rate.
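To make the update rule concrete, here is a toy sketch of my own (not from the lecture) that runs gradient descent on the simple convex function $J(w) = (w - 3)^2$, whose derivative is $2(w - 3)$:

```python
# toy gradient descent on J(w) = (w - 3)^2, minimized at w = 3
alpha = 0.1   # learning rate
w = 0.0       # initial parameter value

for step in range(100):
    dJ_dw = 2 * (w - 3)     # derivative of J with respect to w
    w = w - alpha * dJ_dw   # gradient descent update

print(w)  # ~3.0
```

The same loop applies to logistic regression once we know how to compute $\frac{dJ}{dw}$, which is what the computation graph below is for.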

Computation Graph

The computation graph explains the backpropagation mechanism: a forward pass computes the value of a function (for us, the cost function), and a backward pass computes the gradient of that cost and propagates it back to all the variables and layers that came before.

Assume we have the function $J(a, b, c) = 3(a + bc)$; we can further define intermediate variables:

$u = bc$

$v = a + u$

$J = 3v$

With input values $a = 5$, $b = 3$, $c = 2$, the forward pass gives $u = bc = 6$, $v = a + u = 11$, and $J = 3v = 33$.
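The same forward pass in code, as a minimal sketch:

```python
# forward pass through the graph J(a, b, c) = 3 * (a + b * c)
a, b, c = 5, 3, 2

u = b * c    # u = 6
v = a + u    # v = 11
J = 3 * v    # J = 33
print(u, v, J)
```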

We then want to optimize the function $J$, which in the previous lectures was the cost function. Remember the difference between the cost function and the loss function: the cost function is the average of the loss function over all training examples, while the loss function is the error made by our prediction on a single example.

Given this graph, one question we would like to answer is:

how much does the value of $J$ change if we change $a$?

From $J = 3v$ we know that:

$\frac{dJ}{dv} = 3$

Now we want to answer: $\frac{dJ}{da} = \,?$

We know that $a$ affects $v$ and $v$ affects $J$:

$a \rightarrow v \rightarrow J$

So, following the chain rule:

$\frac{dJ}{da} = \frac{dJ}{dv}\frac{dv}{da}$

We have $\frac{dJ}{dv} = 3$ and $\frac{dv}{da} = 1$ (this comes from basic calculus; for a more practical explanation see the lecture video). Then we get:

$\frac{dJ}{da} = \frac{dJ}{dv}\frac{dv}{da} = (3)(1) = 3$

What does this mean? It means that if we change the value of $a$ by a small amount, the final value of $J$ changes by 3 times that amount. Say we change $a$ from 5 to 5.001: then $v = 11.001$ and $J = 33.003$.
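We can check this numerically with a quick sketch of my own: nudge $a$ a little and see that $J$ moves by roughly 3 times the nudge.

```python
def forward(a, b, c):
    """Forward pass of the graph: J(a, b, c) = 3 * (a + b * c)."""
    u = b * c
    v = a + u
    return 3 * v

a, b, c = 5, 3, 2
eps = 0.001
dJ_da = (forward(a + eps, b, c) - forward(a, b, c)) / eps
print(dJ_da)  # ~3.0, matching the chain-rule result
```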

Incorporating Computation Graph Into Logistic Regression

We have:

$z = w^T x + b$

$\hat{y} = a = \sigma(z)$

$\mathcal{L}(\hat{y}, y) = -\left(y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})\right)$

Then we can construct the computation graph as follows:

$x_1, x_2, w_1, w_2, b \;\rightarrow\; z = w_1 x_1 + w_2 x_2 + b \;\rightarrow\; \hat{y} = a = \sigma(z) \;\rightarrow\; \mathcal{L}(a, y)$

So we get (as Andrew Ng said: don't worry about the calculus for now):

$da = \frac{d\mathcal{L}(a, y)}{da} = -\frac{y}{a} + \frac{1 - y}{1 - a}$

$dz = \frac{d\mathcal{L}}{dz} = \frac{d\mathcal{L}(a, y)}{da}\frac{da}{dz} = \left(-\frac{y}{a} + \frac{1 - y}{1 - a}\right)\big(a(1 - a)\big) = a - y$

$dw_1 = \frac{d\mathcal{L}}{dw_1} = x_1 \, dz$

$dw_2 = \frac{d\mathcal{L}}{dw_2} = x_2 \, dz$
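As a sketch (the concrete numbers are made up; the lecture keeps this symbolic), the forward and backward pass for a single example with two features looks like this in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one training example with two features (values chosen arbitrarily)
x1, x2, y = 1.0, 2.0, 1.0
w1, w2, b = 0.1, -0.2, 0.0

# forward pass
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# backward pass, using the derivatives above
dz = a - y     # dL/dz
dw1 = x1 * dz  # dL/dw1
dw2 = x2 * dz  # dL/dw2
db = dz        # dL/db follows the same chain rule (dz/db = 1)
print(dz, dw1, dw2, db)
```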

Then we can update the weights as follows:

$w_1 := w_1 - \alpha \, dw_1$

$w_2 := w_2 - \alpha \, dw_2$

where $\alpha$ is the learning rate that we set. All of this is for a single training example; if we want the gradient for, say, $w_1$ from the overall cost function, where:

$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(a^{(i)}, y^{(i)})$

then

$\frac{\partial}{\partial w_1} J(w, b) = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial}{\partial w_1} \mathcal{L}(a^{(i)}, y^{(i)})$
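Putting it together, here is a vectorized sketch of my own that averages the per-example gradients over all $m$ examples (the shapes and names are my choices, with $X$ laid out as an $n_x \times m$ matrix):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradients(w, b, X, Y):
    """X: (n_x, m) features, Y: (1, m) labels, w: (n_x, 1), b: scalar."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)   # (1, m) predictions
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                        # (1, m), per-example dL/dz
    dw = np.dot(X, dZ.T) / m          # (n_x, 1), averaged dL/dw
    db = np.sum(dZ) / m               # scalar, averaged dL/db
    return cost, dw, db

# tiny example: 2 features, 3 training examples
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.0, 2.0]])
Y = np.array([[1.0, 0.0, 1.0]])
w, b = np.zeros((2, 1)), 0.0
cost, dw, db = cost_and_gradients(w, b, X, Y)
print(cost, dw.ravel(), db)
```

Each entry of `dw` is just the average of the single-example gradients $x_j \, dz$ from the previous section, which is exactly what the sum in the last equation says.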
Written on January 5, 2019