In this article we are going to discuss training neural networks using particle swarm optimization (PSO). Training a neural network is an optimization problem, so the choice of optimization algorithm is of primary importance. Vaimal has three types of neural networks available:
- Multi-layer perceptrons (MLP)
- Probabilistic neural networks (PNN)
- Generalized regression neural networks (GRNN)
When training MLPs we are adjusting the weights between neurons, using an error function as our optimization objective. PNNs and GRNNs use a smoothing factor, σ, to define the network, and the objective is to find the sigmas that minimize error. The remainder of this discussion focuses on MLPs, but it applies equally to the other two networks, with sigmas in place of weights.
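For intuition about the role of σ, here is a minimal sketch of a GRNN prediction using the standard kernel-weighted-average form; the function and variable names are illustrative, and Vaimal's exact implementation may differ.

```python
import numpy as np

def grnn_predict(x, train_X, train_y, sigma):
    """Predict a target for input x as a kernel-weighted average."""
    # Squared Euclidean distance from x to every training case.
    d2 = np.sum((train_X - x) ** 2, axis=1)
    # Gaussian kernel: sigma controls how quickly influence decays with distance.
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    # Weighted average of the training targets; sigma is what the optimizer tunes.
    return np.sum(w * train_y) / np.sum(w)
```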
Particle Swarm Optimization (PSO)
To train a network, Vaimal uses a PSO algorithm. PSO uses a population of candidate solutions called a swarm; each member of the swarm is a particle. The swarm moves about the search space in a manner similar to a flock of birds or a school of fish. PSO is a heuristic: it does not use gradients (calculus) to find a solution.
Translating this terminology to the problem at hand, we get:
Particle → A set of weights for the network
Swarm → A group of particles
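To make the mapping concrete, a particle might be represented as below. This is a minimal sketch; names like Particle and n_weights are illustrative rather than Vaimal's actual code.

```python
import numpy as np

class Particle:
    """One candidate solution: a complete set of network weights."""
    def __init__(self, n_weights, rng):
        # Position in the search space: one value per network weight.
        self.weights = rng.uniform(-1.0, 1.0, n_weights)
        # Velocity starts at zero; it is updated each epoch.
        self.velocity = np.zeros(n_weights)
        # Personal best: the weights with the lowest error seen so far.
        self.pbest_weights = self.weights.copy()
        self.pbest_error = np.inf
```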
To start the optimization process, we initialize the swarm by randomly generating weights for each particle. Since this is the initial pass, each particle’s weights are saved as its personal best. Each case’s inputs are then fed into the network to calculate a total error over all cases, using each particle’s initial weights. If there are 10 particles in the swarm, we now have 10 sets of weights and 10 values of the error function. We then loop through the error values and find the lowest one; the weights associated with that lowest error are saved as the current global best.
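Here is a sketch of that initialization step, building on the Particle sketch above and assuming a hypothetical total_error(weights) function that feeds every training case through the network and returns the total error:

```python
def initialize_swarm(n_particles, n_weights, total_error, rng):
    swarm = [Particle(n_weights, rng) for _ in range(n_particles)]
    gbest_weights, gbest_error = None, np.inf
    for p in swarm:
        # Each particle's initial weights double as its personal best.
        p.pbest_error = total_error(p.weights)
        # The lowest error across the swarm becomes the global best.
        if p.pbest_error < gbest_error:
            gbest_error = p.pbest_error
            gbest_weights = p.pbest_weights.copy()
    return swarm, gbest_weights, gbest_error
```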
The main PSO loop is entered by updating all weights using the following relations:
new velocity = inertia*current velocity + Φ1*rp*(pbest - current weight) + Φ2*rg*(gbest - current weight)
new weight = current weight + new velocity
where:
inertia → A multiplier on the current velocity. A value > 1 keeps the particle moving more along its current velocity; a value < 1 diminishes the current velocity’s impact on the new velocity.
Φ1 → A local constant for personal best. Vaimal uses 2.0.
Φ2 → A global constant for global best. Vaimal uses 2.0.
rp → A random number between 0 and 1.
rg → A random number between 0 and 1.
pbest → The personal best of this weight for this particle. This weight resulted in the lowest error for this particle.
gbest → The global best of this weight. This weight resulted in the lowest global error.
We can see from the above that the updated weight is based on:
- The current weight and velocity.
- The difference between the current weight and the pbest weight with some randomness added (rp).
- The difference between the current weight and the gbest weight with some randomness added (rg).
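In code, one particle’s update might look like the sketch below (vectorized over all of the particle’s weights). The inertia default of 0.7 is an assumed illustrative value; Φ1 and Φ2 follow the 2.0 that Vaimal uses.

```python
def update_particle(p, gbest_weights, rng, inertia=0.7, phi1=2.0, phi2=2.0):
    # One random draw per weight, toward pbest and gbest respectively.
    rp = rng.uniform(0.0, 1.0, p.weights.shape)
    rg = rng.uniform(0.0, 1.0, p.weights.shape)
    p.velocity = (inertia * p.velocity
                  + phi1 * rp * (p.pbest_weights - p.weights)
                  + phi2 * rg * (gbest_weights - p.weights))
    # New position: the current weights moved by the new velocity.
    p.weights = p.weights + p.velocity
```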
Once the weights are updated, each particle’s network is fed all training cases and the error is calculated. Each particle’s error is compared to its personal best, and if it is lower, the pbest weights are updated. After all particle errors have been calculated, the lowest error among them is compared to the global best error; if it is lower, the gbest weights are updated from the particle with the lowest error so far.
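A sketch of that bookkeeping step, continuing the earlier snippets:

```python
def evaluate_swarm(swarm, total_error, gbest_weights, gbest_error):
    for p in swarm:
        error = total_error(p.weights)
        # Update this particle's personal best if it improved.
        if error < p.pbest_error:
            p.pbest_error = error
            p.pbest_weights = p.weights.copy()
        # Update the global best if this is the lowest error so far.
        if error < gbest_error:
            gbest_error = error
            gbest_weights = p.weights.copy()
    return gbest_weights, gbest_error
```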
The PSO loop continues until the error is zero or a stopping condition is met. Vaimal can stop training automatically at a maximum number of epochs, at a time limit, or when a minimum error is reached. Training can also be stopped manually at any time.
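Putting the pieces together, the main loop and the automatic stopping conditions might be sketched as follows; the parameter names and defaults are illustrative, not Vaimal’s settings.

```python
import time

def train(swarm, total_error, gbest_weights, gbest_error, rng,
          max_epochs=1000, time_limit_s=600.0, min_error=0.0):
    start = time.time()
    for epoch in range(max_epochs):
        # Move every particle, then re-evaluate the swarm.
        for p in swarm:
            update_particle(p, gbest_weights, rng)
        gbest_weights, gbest_error = evaluate_swarm(
            swarm, total_error, gbest_weights, gbest_error)
        # Stop when the minimum error is reached or time runs out.
        if gbest_error <= min_error or time.time() - start > time_limit_s:
            break
    return gbest_weights, gbest_error
```

Threading gbest through explicitly, rather than storing it globally, keeps the sketch easy to test in isolation.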
Swarm Size
Selecting a swarm size is a trade-off. A larger swarm does not necessarily converge faster just because more particles are searching: it requires more computation per epoch, which results in fewer weight updates within a given training time. A smaller swarm updates more often, but with fewer particles it may miss a new minimum-error solution. Best practice is to balance these competing factors by choosing a swarm size between the two extremes. Vaimal uses a default value of 15, which is a good starting point for experimenting with swarm size.
Training Behavior with PSO
Now let’s discuss the behavior of PSO while training. Vaimal shows the training and validation error (if used) after every epoch. These errors behave a little differently depending on the settings.
Training Error – No Regularization
After each epoch, the reported training error is the global best error. Therefore, the training error can only decrease or stay the same as training progresses.
Training Error – L1 or L2 Regularization
From the PSO algorithm’s perspective, training error with regularization is the actual error plus a penalty. For L1, the penalty is a function of the sum of the absolute values of the weights; for L2, it is a function of the sum of the squared weights. These schemes penalize larger weights, forcing the PSO algorithm to find the lowest error with the smallest weights.
The reported training error has the regularization penalty removed, so the training error may increase, decrease, or stay the same as training progresses. Let’s show this with an example using L2 regularization:
epoch_n: train error = .72; penalty = .4 → total error = .72 + .4 = 1.12
epoch_n+1: train error = .75; penalty = .35 → total error = .75 + .35 = 1.10
To the PSO algorithm, epoch_n+1 has the lower total error, so the weights of the particle that produced it are saved as gbest. The penalty is removed before reporting, so the reported training error is .72 for the first epoch and .75 for the second.
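A sketch of how the penalty enters the optimized error; the regularization strength lam and the exact scaling are assumptions, since they are not specified here.

```python
def penalized_error(weights, train_error, lam=0.01, kind="L2"):
    if kind == "L1":
        # L1: proportional to the sum of absolute weight values.
        penalty = lam * np.sum(np.abs(weights))
    else:
        # L2: proportional to the sum of squared weight values.
        penalty = lam * np.sum(weights ** 2)
    # PSO compares pbest/gbest on the total; only train_error is reported.
    return train_error + penalty
```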
Validation Error
To calculate validation error, we take the global best weights, those that have produced the lowest training error so far, and compute the error on the validation data set. Since we are not optimizing on validation error, the global best weights may do better or worse on the separate validation data as they change. Therefore, the validation error may increase, decrease, or stay the same as training progresses.
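A sketch of per-epoch reporting, assuming a hypothetical error_on(weights, cases) helper:

```python
def report_epoch(epoch, gbest_weights, gbest_error, validation_cases, error_on):
    # Validation error uses the current global best weights, but it
    # never feeds back into the optimization itself.
    val_error = error_on(gbest_weights, validation_cases)
    print(f"epoch {epoch}: train={gbest_error:.4f}  validation={val_error:.4f}")
```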
When to Stop Training
Before discussing when to stop training, we assume that overtraining is not occurring. Overtraining is a topic for another article; for this discussion, we are simply trying to find the lowest error possible.
So we’ve been training for some time and the error is not changing. How do we decide when to stop? The answer, of course, is: it depends. There are often “lulls” where the training error does not decrease for many epochs and then suddenly drops. When one of these lulls appears, we have to decide whether this is a global minimum or a new minimum simply hasn’t been found yet. A few things to consider:
- Let the training continue much longer and see if it converges on a new solution.
- Try varying inertia.
- Reinitialize the weights.
- If the network is not complex enough, it may have reached the minimum error and is not able to produce a better solution (under-training).
- If the network is too complex, training is slower and more difficult. The minimum error may still decrease, but a simpler network would train faster.
PSO vs. Backpropagation
Since PSO is a heuristic, it is not guaranteed to find a global minimum. That does not mean it is inferior to the commonly used backpropagation algorithm. Backpropagation is based on computing gradients and is susceptible to getting stuck in local minima. PSO can jump out of local minima, or the weights can be reinitialized to search a new area of the space, so if run long enough it does well at minimizing error. The drawback is that PSO can often take longer to converge on a solution than backpropagation. The advantage is that PSO can do a better job of finding a more optimal solution, bearing in mind that we don’t want to over-train.