# Basic Architecture of Neural Network

graph LR subgraph input layer X1[x1] X2[x2] X3[x3] end subgraph hidden layer H1((h1)) H2((h2)) H3((h3)) end subgraph output layer YHAT((yhat)) end X1 --> H1 X1 --> H2 X1 --> H3 X2 --> H1 X2 --> H2 X2 --> H3 X3 --> H1 X3 --> H2 X3 --> H3 H1 --> YHAT((yhat)) H2 --> YHAT H3 --> YHAT

This example ilustrate 2 Layer Neural Network because we do not count input layer. the hidden layers can be think as multiple logistic regression nodes that passing output to one another.

Using superscript like $^{[1]}$ denotes which layer will be pointed, for example in the picture above, input layer is $^{[1]}$, hidden layer is $^{[2]}$, and output layer is $^{[3]}$.

$z^{[1]}$ = $W^{[1]}x + b^{[1]}$
$a^{[1]}$ = $\sigma ( z^{[1]} )$
$z^{[2]}$ = $W^{[2]}a^{[1]} + b^{[2]}$
$a^{[2]}$ = $\sigma ( z^{[2]} )$

but all these operations must be repeated by $n$ training sample, so in order to do that faster, we need to vectorize these operations.

# Activation Functions

## Sigmoid

Andrew’s has rarely use sigmoid activation function for hidden units, he prefer tanh for these hidden units. However, sigmoid function might be used in output layer if the ouput is binary.

## Tanh

Centering the mean towards zero will make hidden units faster to converge.

## ReLU

Rectified Linear Units

if you don’t know what activation function to use, use this.

### Leaky ReLU

But why $0.01$, sometimes just work! (no idea or whatsoever)

## Why Neural Network Need Non-Linear Function?

$z^{[1]}$ = $W^{[1]}x + b^{[1]}$
$z^{[2]}$ = $W^{[2]}z^{[1]} + b^{[2]}$
$z^{[2]}$ = $W^{[2]} (W^{[1]} x^{[1]} + b^{[1]}) + b^{[2]}$
$z^{[2]}$ = $(W^{[2]} W^{[1]}) x + (b^{[1]} + b^{[2]})$
$z^{[2]}$ = $W^{‘} x + b^{‘}$

Turns out we only computing linear function! No matter how many layers that we’ve put in, duh!.

Written on January 7, 2019