# Basic Architecture of a Neural Network

Figure: a fully connected network with an input layer (x1, x2, x3), a hidden layer (h1, h2, h3), and an output layer (yhat). Every input feeds every hidden unit, and every hidden unit feeds yhat.

This example illustrates a 2-layer neural network, because we do not count the input layer. The hidden layer can be thought of as multiple logistic regression nodes that pass their outputs on to the next layer.

A superscript in square brackets, like $^{[l]}$, denotes which layer a quantity belongs to. For example, in the picture above the input layer is $^{[0]}$, the hidden layer is $^{[1]}$, and the output layer is $^{[2]}$.

$z^{[1]} = W^{[1]} x + b^{[1]}$
$a^{[1]} = \sigma ( z^{[1]} )$
$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
$a^{[2]} = \sigma ( z^{[2]} )$

These operations must be repeated for each of the $n$ training samples, so to make this faster we need to vectorize them.
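
As a minimal sketch of what vectorizing means here, the NumPy snippet below compares a per-sample loop with a single matrix computation. The layer sizes, the random parameters, and the convention of stacking the samples as columns of a matrix `X` are assumptions made for illustration, not something fixed by the text above.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

n_x, n_h, n = 3, 3, 5            # 3 inputs, 3 hidden units, 5 samples (arbitrary)
X = rng.normal(size=(n_x, n))    # assumed layout: each column is one sample x

# Randomly initialised parameters, for illustration only.
W1, b1 = rng.normal(size=(n_h, n_x)), np.zeros((n_h, 1))
W2, b2 = rng.normal(size=(1, n_h)), np.zeros((1, 1))

# Loop version: apply the layer equations to one sample at a time.
A2_loop = np.zeros((1, n))
for i in range(n):
    x = X[:, [i]]                # column vector of shape (n_x, 1)
    a1 = sigmoid(W1 @ x + b1)
    A2_loop[:, [i]] = sigmoid(W2 @ a1 + b2)

# Vectorized version: the same equations applied to all columns at once.
A1 = sigmoid(W1 @ X + b1)        # b1 is broadcast across the n columns
A2 = sigmoid(W2 @ A1 + b2)       # shape (1, n): one prediction per sample
```

Both versions produce the same predictions; the vectorized one simply replaces the Python loop with matrix products.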

# Activation Functions

## Sigmoid

$a = \sigma(z) = \frac{1}{1 + e^{-z}}$

Andrew rarely uses the sigmoid activation function for hidden units; he prefers tanh for those. However, the sigmoid function may still be used in the output layer if the output is binary.
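
A small sketch of sigmoid used as a binary output unit. The input values and the 0.5 decision threshold are illustrative assumptions, not part of the notes above.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function, 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])    # made-up pre-activations of the output unit
a = sigmoid(z)                    # every value lies strictly between 0 and 1
y_pred = (a >= 0.5).astype(int)   # assumed convention: predict class 1 when a >= 0.5
```

Because the output lies between 0 and 1, it can be read as a probability for the positive class.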

## Tanh

$a = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

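The outputs of tanh lie in $(-1, 1)$ and are centered around zero, so the activations fed to the next layer have a mean close to zero, which tends to make the hidden units converge faster.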
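
To make the zero-centering point concrete, here is a rough numerical sketch. It assumes zero-mean, normally distributed pre-activations, which is only a convenient choice for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)           # assumed: zero-mean pre-activations

a_sigmoid = 1.0 / (1.0 + np.exp(-z))  # values in (0, 1)
a_tanh = np.tanh(z)                   # values in (-1, 1)

mean_sigmoid = a_sigmoid.mean()       # ends up near 0.5
mean_tanh = a_tanh.mean()             # ends up near 0

# tanh is a shifted and rescaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
tanh_from_sigmoid = 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0
```

The sigmoid activations hover around 0.5, while the tanh activations hover around 0, which is the centering effect described above.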

## ReLU

Rectified Linear Units

$a = \mathrm{ReLU}(z) = \max(0, z)$

If you don't know which activation function to use, use this one.
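
A minimal NumPy sketch of ReLU applied element-wise; the sample values are made up for illustration.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0.0, z)

z = np.array([-3.0, -0.5, 0.0, 2.0])   # made-up pre-activations
a = relu(z)                            # negative entries become 0, the rest pass through
```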

### Leaky ReLU

$a = \mathrm{LReLU}(z) = \max(0.01 z, z)$

Why $0.01$? It is simply a small slope that often works in practice; I have no deeper justification for this exact value.
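
A short sketch of Leaky ReLU, with the slope exposed as a hypothetical `alpha` parameter (the name and the alternative value `0.1` are assumptions for illustration; the notes only mention $0.01$).

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Element-wise max(alpha * z, z); alpha is the slope used for negative z.

    The max form only matches the intended function when 0 < alpha < 1.
    """
    return np.maximum(alpha * z, z)

z = np.array([-100.0, -1.0, 0.0, 2.0])   # made-up pre-activations
a = leaky_relu(z)                        # negatives are scaled by 0.01 instead of zeroed
a_wide = leaky_relu(z, alpha=0.1)        # a different small slope, for comparison
```

Unlike plain ReLU, negative inputs keep a small non-zero output, so the slope there is `alpha` rather than 0.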

## Why Do Neural Networks Need a Non-Linear Function?

$z^{[1]} = W^{[1]} x + b^{[1]}$
$z^{[2]} = W^{[2]} z^{[1]} + b^{[2]}$
$z^{[2]} = W^{[2]} (W^{[1]} x + b^{[1]}) + b^{[2]}$
$z^{[2]} = (W^{[2]} W^{[1]}) x + (W^{[2]} b^{[1]} + b^{[2]})$
$z^{[2]} = W' x + b'$
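
The collapse can be checked numerically. Expanding two layers with no activation by hand gives a single layer with weights $W^{[2]} W^{[1]}$ and bias $W^{[2]} b^{[1]} + b^{[2]}$; the sketch below verifies this with random parameters. The shapes and the names `W_prime` and `b_prime` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))                                 # one input sample
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=(3, 1))   # layer 1 parameters
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))   # layer 2 parameters

# Two stacked layers with no activation function in between.
z1 = W1 @ x + b1
z2 = W2 @ z1 + b2

# The single linear layer they collapse into.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
z2_single = W_prime @ x + b_prime
```

`z2` and `z2_single` agree up to floating-point error, so the extra layer adds no expressive power without a non-linearity.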

It turns out we are only computing a linear function, no matter how many layers we put in, duh!

Written on January 7, 2019