Neural networks

yueyuan
4 min read · Nov 2, 2020

Activation Function

Purpose:

Introduce non-linearity so the network can model non-linear relationships in the data.

Why non-linear activation functions in neural networks?

Without activation functions, a neural network is no different from linear regression or logistic regression: stacking linear layers still gives a linear model, as the sketch below shows.
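A quick way to see this: with no activation function between them, two linear layers collapse into one. A minimal NumPy sketch (the layer sizes and random weights are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))        # 5 samples, 3 features

# Two "layers" with no activation function in between.
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
two_layer = (x @ W1 + b1) @ W2 + b2

# The same mapping collapses into a single linear layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layer, one_layer))   # True: no extra expressive power
```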

Universal Approximation Theorem

It has been proven that a feed-forward network can approximate any continuous function. The question is what enables it to do so, and the answer is the non-linearity of the activation functions.

Vanishing and Exploding Gradient Problems

Some activation functions, such as sigmoid, are flat when the output gets close to 1 or 0, so their gradient is already very small there. If several layers all use sigmoid, these small gradients get multiplied together (compounded) during backpropagation, and the gradient vanishes.

In neural networks, during backpropagation each weight receives an update proportional to the partial derivative of the error function. In some cases this derivative term is so small that the updates become very small. Especially in the early layers of a deep network, the update is obtained by multiplying many partial derivatives together. If these partial derivatives are very small, the overall update approaches zero. In that case the weights cannot update, so convergence is slow or does not happen at all. This is known as the vanishing gradient problem.

Similarly, if the derivative term is very large then updates will also be very large. In such a case, the algorithm will overshoot the minimum and won’t be able to converge. This problem is known as the Exploding gradient problem.
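A toy illustration of the vanishing case: the sigmoid derivative is at most 0.25, so multiplying it across many layers shrinks the gradient geometrically. A minimal sketch (the depth and pre-activation value are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # maximum value is 0.25, at z = 0

z = 2.0                            # pre-activation value at each layer (arbitrary)
grad = 1.0
for layer in range(10):            # backpropagate through 10 sigmoid layers
    grad *= sigmoid_grad(z)
    print(f"after layer {layer + 1}: gradient factor = {grad:.2e}")
# The factor shrinks toward zero -- the vanishing gradient problem.
```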

There are various methods to avoid these problems. Choosing the appropriate activation function is one of them.

Activation Functions:

What issue with ReLUs does Leaky ReLU resolve?

With a standard ReLU, the derivative equals zero when z <= 0, causing some neurons to get stuck and stop learning (the "dying ReLU" problem). Leaky ReLU gives a small non-zero slope for z < 0 so gradients keep flowing.

The sigmoid function is not used often in hidden layers because its derivative approaches 0 at both tails, which leads to vanishing gradient and saturation problems.

Tanh keeps the sign of the input, in contrast with sigmoid; its output is zero-centered.
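For reference, a minimal NumPy sketch of the activations discussed above and their derivatives (the 0.01 leak slope is a common default, not something specified in this post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # saturates near 0 and 1

def tanh(z):
    return np.tanh(z)                           # zero-centered, keeps the sign of z

def relu(z):
    return np.maximum(0.0, z)                   # derivative is 0 for z <= 0

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)        # small slope keeps gradients flowing

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    return (z > 0).astype(float)

def d_leaky_relu(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(d_relu(z))         # [0. 0. 0. 1. 1.]       -- dead for z <= 0
print(d_leaky_relu(z))   # [0.01 0.01 0.01 1. 1.] -- never exactly zero
```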

Batch Normalization

Purpose: to speed up and stabilize training in neural networks.

Different features can have very different distributions (skew, mean, standard deviation); comparing them is like comparing different kinds of fruit. The cost function becomes elongated, so changes to the weights tied to each input have a varying impact on the cost. That makes training harder and slower, and highly dependent on how the weights are initialized. If the distribution of one feature changes in new training data, the location of the minimum can move and the cost function changes. Batch normalization smooths the cost function and reduces this covariate shift.

If both features are normalized (mean 0, standard deviation 1), the cost function is more balanced, and training becomes much easier and potentially much faster.

The same shift, called internal covariate shift, happens in the hidden layers. To reduce it, batch normalization normalizes all the internal nodes using statistics computed over each input batch, which smooths the cost function and makes training easier and faster.

Shift factor and scale factor: after normalizing, batch norm applies a learnable scale (gamma) and shift (beta), so the layer can still represent any desired mean and variance.

You use the batch mean and standard deviation during training, and the running statistics (computed over the training set) for testing. The running values are fixed after training.
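A minimal sketch of batch normalization for one layer, showing the learnable scale/shift and the train-versus-test statistics (the momentum and epsilon values are typical defaults, not taken from this post):

```python
import numpy as np

class BatchNorm1d:
    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)        # learnable scale factor
        self.beta = np.zeros(num_features)        # learnable shift factor
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)          # batch statistics
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var    # fixed after training
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1d(3)
batch = np.random.randn(8, 3) * 10 + 5            # features with large mean and spread
out = bn.forward(batch, training=True)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # ~0 and ~1 per feature
```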

Convolutions

Convolutions reduce the size of an image while preserving its key features and patterns.

Convolutions apply filters, learned during training, to detect key features in different areas of the image.

Stride: determines how many pixels the filter moves at each step, i.e., how the filter scans the image.

Padding: like a frame around the image; it gives the edges similar importance to the center.

Pooling: reduces the size of the input by taking the average or maximum over each window, e.g. max pooling.

Upsampling: increases the size of an image by inferring new pixels, producing a higher-resolution output, e.g. nearest-neighbor, linear, or bilinear interpolation.

Both pooling and upsampling have no learnable parameters.

Transposed convolution: upsamples with learnable parameters, i.e., it uses a learnable filter to upsample an image.

An upsampling layer computes the new pixels with a fixed rule, while a transposed convolution is an upsampling method that uses learnable parameters.

Using upsampling + convolution is more popular than transposed convolution because it avoids checkerboard artifacts. The sketch below makes the shape arithmetic concrete.
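Here is a small sketch using PyTorch (my choice of framework; the post does not name one) showing how stride, pooling, upsampling, and transposed convolution change an image's resolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)          # a batch of one 3-channel 32x32 image

conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)   # learnable filters
pool = nn.MaxPool2d(kernel_size=2)                           # no learnable parameters
up = nn.Upsample(scale_factor=2, mode='nearest')             # no learnable parameters
tconv = nn.ConvTranspose2d(8, 8, kernel_size=2, stride=2)    # learnable upsampling

print(conv(x).shape)                # torch.Size([1, 8, 16, 16]) -- stride 2 halves H and W
print(pool(conv(x)).shape)          # torch.Size([1, 8, 8, 8])   -- pooling halves again
print(up(pool(conv(x))).shape)      # torch.Size([1, 8, 16, 16]) -- fixed-rule upsampling
print(tconv(pool(conv(x))).shape)   # torch.Size([1, 8, 16, 16]) -- learned upsampling
```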

Backpropagation

To start from the back, assume that we already have optimal values for all of the parameters except the last bias term. Because we don't know the optimal value for this bias, we give it an initial value. We can quantify how well the current fitted line fits the data by calculating the sum of the squared residuals.

The main ideas of backpropagation are: when a parameter is unknown, use the chain rule to calculate the derivative of the sum of the squared residuals with respect to that parameter, initialize the parameter with a number, and then use gradient descent to optimize it.
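A minimal sketch of that idea for a single unknown bias b in a prediction y_hat = w*x + b: the chain rule gives d(SSR)/db = sum(-2 * (y - y_hat)), and gradient descent updates b. The toy data, learning rate, and step count here are made up for illustration.

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0])
y = np.array([1.1, 1.6, 2.1])        # toy observations
w = 1.0                              # assume every other parameter is already optimal
b = 0.0                              # unknown bias, given an initial value
lr = 0.1                             # learning rate

for step in range(100):
    y_hat = w * x + b
    ssr = np.sum((y - y_hat) ** 2)              # sum of squared residuals
    d_ssr_db = np.sum(-2.0 * (y - y_hat))       # chain rule: d(SSR)/db
    b -= lr * d_ssr_db                          # gradient update rule

print(round(b, 3), round(ssr, 4))               # b converges to ~1.1, SSR near 0
```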

The vanishing gradient problem also appears during backpropagation through time: the gradient is the value we use to update the neural network's weights, and it shrinks as it is propagated back through the time steps.

Gradient update rule: new weight = weight - learning rate * gradient

If the gradient value becomes small, it doesn't contribute much to learning. Because those layers stop learning, an RNN can forget what it saw earlier in the sequence, which causes short-term memory. LSTM and GRU were created to solve this problem.

State from earlier time steps gets diluted over time. This can be a problem, for example when learning sentence structure.

RNN: a neuron does not only get input from the current time step, but also the output from the previous time step. Both are fed to the neuron, and after applying the activation function we get the new output.
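A minimal sketch of a single RNN cell in NumPy: the hidden state at each time step mixes the current input with the previous step's output before applying tanh (the sizes and random weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 8, 5

W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))    # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # hidden-to-hidden weights
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                        # initial hidden state
for t in range(seq_len):
    x_t = rng.normal(size=input_size)            # input at the current time step
    h = np.tanh(W_x @ x_t + W_h @ h + b)         # also uses the previous output h
print(h.shape)                                   # (8,) -- the new output / hidden state
```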

LSTM (long short-term memory) cell: a cell state plus various gates; it maintains separate short-term and long-term states.

GRU (gated recurrent unit): simplified LSTM cell that performs about as well
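For comparison, a small sketch with PyTorch's built-in recurrent layers (again, the framework and sizes are my choice): an LSTM carries a separate cell state alongside the hidden state, while a GRU returns only a hidden state.

```python
import torch
import torch.nn as nn

seq = torch.randn(5, 1, 4)                    # (seq_len, batch, input_size)

lstm = nn.LSTM(input_size=4, hidden_size=8)
gru = nn.GRU(input_size=4, hidden_size=8)

out_lstm, (h_n, c_n) = lstm(seq)              # hidden state h_n plus long-term cell state c_n
out_gru, h_gru = gru(seq)                     # simplified: only a hidden state

print(out_lstm.shape, h_n.shape, c_n.shape)   # [5, 1, 8], [1, 1, 8], [1, 1, 8]
print(out_gru.shape, h_gru.shape)             # [5, 1, 8], [1, 1, 8]
```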
