Gradients often get smaller and smaller as the backpropagation algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers’ connection weights virtually unchanged, and training never converges to a good solution. This is called the vanishing gradients problem. Conversely, the gradients can grow…
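To make the shrinking effect concrete, here is a minimal sketch. It assumes a hypothetical chain of single-unit sigmoid layers with every weight set to 1 and no bias, which is a toy setup and not something taken from the post. Because the sigmoid's derivative never exceeds 0.25, the chain-rule product shrinks by at least a factor of 4 per layer in this setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_through_layers(x, n_layers, weight=1.0):
    """Return d(output)/d(input) for a chain of 1-unit sigmoid layers.

    Toy assumption: every layer uses the same weight and no bias.
    """
    # Forward pass: keep each layer's activation for the backward pass.
    activations = []
    a = x
    for _ in range(n_layers):
        a = sigmoid(weight * a)
        activations.append(a)

    # Backward pass: multiply the local derivatives weight * a * (1 - a).
    grad = 1.0
    for a in reversed(activations):
        grad *= weight * a * (1.0 - a)
    return grad

# Gradient reaching the input for networks of increasing depth.
grads = [gradient_through_layers(0.5, n) for n in (1, 5, 10)]
for depth, g in zip((1, 5, 10), grads):
    print(f"{depth:2d} layers: gradient = {g:.3e}")
```

With larger weights the same product can grow instead of shrink, which is the opposite failure mode.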

In the context of an optimization algorithm, the function used to evaluate a candidate solution (i.e. a set of weights) is referred to as the objective function. Typically, with neural networks, we seek to minimize the error. As such, the objective function is often referred to as a cost function…

This blog introduces the three most commonly used activation functions in hidden layers: Rectified Linear Activation (ReLU), Logistic (Sigmoid) and Hyperbolic Tangent (Tanh).

Sigmoid

The sigmoid activation function, σ(z) = 1/(1+exp(-z)), is also called the logistic function. The logistic function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress…
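As a small illustration, the sigmoid and its derivative σ'(z) = σ(z)(1 − σ(z)) can be sketched as follows. The two-branch form is my own choice to avoid overflow in exp for very negative z; it computes the same function as the formula above.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe: z < 0, so exp(z) < 1
    return e / (1.0 + e)

def sigmoid_derivative(z):
    """Derivative of the logistic function: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z), sigmoid_derivative(z))
```

The derivative peaks at 0.25 when z = 0 and becomes very small, though mathematically never zero, as |z| grows.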

An activation function in a neural network defines how the weighted sum of the input is transformed into an output from a node or nodes in a layer of the network. It is sometimes called a transfer function or a squashing function.
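A minimal sketch of that definition for a single node: compute the weighted sum of the inputs, add a bias, and pass the result through the activation function. The input values, weights, and bias below are made up purely for illustration.

```python
import math

def relu(z):
    """Rectified linear activation: max(0, z)."""
    return max(0.0, z)

def node_output(inputs, weights, bias, activation):
    """Apply an activation function to the node's weighted sum."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical values: z = 0.5*1.0 + 0.25*(-2.0) - 0.5 = -0.5
inputs = [1.0, -2.0]
weights = [0.5, 0.25]
bias = -0.5

relu_out = node_output(inputs, weights, bias, relu)
tanh_out = node_output(inputs, weights, bias, math.tanh)
print(relu_out, tanh_out)
```

The same weighted sum gives different outputs depending on the activation: ReLU clips the negative value to zero, while tanh squashes it into the range (-1, 1).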

Activation functions allow Deep Learning models to capture nonlinearity. If…

Neural Networks have two major processes: Forward Propagation and Back Propagation. During Forward Propagation, we start at the input layer and feed our data in, propagating it through the network until we’ve reached the output layer and generated a prediction. Back Propagation is essentially the opposite of Forward Propagation. In…
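The two passes can be sketched on a deliberately tiny network. The setup is an assumption for illustration only: one input, one tanh hidden unit, one linear output, and a squared-error loss. The forward pass produces a prediction; the backward pass applies the chain rule from the output back toward the input to get the gradient for each weight.

```python
import math

def forward(x, w1, w2):
    """Forward pass: input -> tanh hidden unit -> linear output."""
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    return h, y_hat

def loss(y_hat, y):
    """Squared-error loss for a single example."""
    return 0.5 * (y_hat - y) ** 2

def backward(x, y, w1, w2):
    """Backward pass: chain rule from the loss back to each weight."""
    h, y_hat = forward(x, w1, w2)
    d_y_hat = y_hat - y                # dL/dy_hat
    d_w2 = d_y_hat * h                 # dL/dw2
    d_h = d_y_hat * w2                 # dL/dh
    d_w1 = d_h * (1.0 - h * h) * x     # dL/dw1, using tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2

# Hypothetical example values.
x, y, w1, w2 = 0.7, 1.0, 0.3, -0.8
grad_w1, grad_w2 = backward(x, y, w1, w2)
print(grad_w1, grad_w2)
```

A Gradient Descent step would then subtract a small multiple of each gradient from the corresponding weight.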

An MLP (Multilayer Perceptron) is composed of one passthrough input layer, one or more layers of TLUs (threshold logic units), called hidden layers, and one final layer of TLUs called the output layer. The layers close to the input layer are usually called the lower layers, and the ones close…

An ANN (artificial neural network) is a Machine Learning model inspired by the networks of biological neurons found in the brain. An artificial neuron has one or more binary on/off inputs and one binary output. …

This blog introduces two measures, AIC (Akaike information criterion) and BIC (Bayesian information criterion), which give a comprehensive measure of model performance that takes the additional variables into account. We can try to find the model that minimizes a theoretical information criterion, i.e., AIC or BIC, which are defined as
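In their standard textbook forms, AIC = 2k − 2 ln(L̂) and BIC = k ln(n) − 2 ln(L̂), where k is the number of estimated parameters, n is the number of instances, and L̂ is the maximized value of the model's likelihood. The sketch below computes both from a log-likelihood; the function names and the example numbers are hypothetical.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood

# Hypothetical fitted model: ln(L) = -120, 3 parameters, 100 instances.
model_aic = aic(-120.0, 3)
model_bic = bic(-120.0, 3, 100)
print(model_aic, model_bic)
```

Lower is better for both. BIC's penalty per parameter, ln(n), exceeds AIC's penalty of 2 once n is larger than about 7, so BIC tends to favor simpler models.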

A Gaussian mixture model (GMM) is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown. All the instances generated from a single Gaussian distribution form a cluster that typically looks like an ellipsoid. …
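The generative assumption can be sketched directly: pick a Gaussian according to the mixture weights, then draw an instance from it. This is only a one-dimensional illustration with made-up weights, means, and standard deviations; it shows how data would be generated under the model, not how the unknown parameters are estimated.

```python
import random

def sample_mixture(n, weights, means, stds, seed=0):
    """Draw n (value, cluster_index) pairs from a 1-D Gaussian mixture."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Step 1: choose which Gaussian generates this instance.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 2: draw the instance from the chosen Gaussian.
        samples.append((rng.gauss(means[k], stds[k]), k))
    return samples

# Hypothetical mixture: 30% from N(0, 1), 70% from N(10, 2^2).
samples = sample_mixture(1000, weights=[0.3, 0.7], means=[0.0, 10.0], stds=[1.0, 2.0])
cluster0 = [v for v, k in samples if k == 0]
cluster1 = [v for v, k in samples if k == 1]
print(len(cluster0), len(cluster1))
```

Fitting a GMM works in the other direction: given only the values, it estimates the weights, means, and covariances that most plausibly produced them.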

The DBSCAN algorithm defines clusters as continuous regions of high density. For each instance, the algorithm counts how many instances are located within a small distance ε from it. This region is called the instance’s ε-neighborhood. If an instance has at least a specified minimum number of instances in its ε-neighborhood, including itself, then it…
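The counting step can be sketched in a few lines. The parameter name `min_samples` is an assumption borrowed from scikit-learn's naming for the minimum-count threshold, and "core instance" is the usual DBSCAN term for an instance that meets it. This covers only the neighborhood count, not the full clustering procedure.

```python
import math

def eps_neighborhood(points, index, eps):
    """Indices of all points within distance eps of points[index], itself included."""
    center = points[index]
    return [j for j, q in enumerate(points) if math.dist(center, q) <= eps]

def core_instances(points, eps, min_samples):
    """Indices of points whose eps-neighborhood holds at least min_samples points."""
    return [
        i for i in range(len(points))
        if len(eps_neighborhood(points, i, eps)) >= min_samples
    ]

# Hypothetical 2-D data: three nearby points and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
tight = core_instances(points, eps=1.0, min_samples=3)
loose = core_instances(points, eps=1.5, min_samples=3)
print(tight, loose)
```

Raising ε enlarges each neighborhood, so more instances reach the threshold; the isolated point never does in this example.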