We used the sigmoid and linear activation functions for the hidden and output layers, respectively. Training algorithms fall into two main categories: heuristic techniques (momentum, variable learning rate) and numerical optimization techniques (conjugate gradient, Levenberg–Marquardt). Various comparative studies on different problems have been carried out to establish the optimal algorithm (Demuth and Beale, 2001; Costea, 2003; Nastac and Koskivaara, 2003). The general conclusion is that it is difficult to know in advance which training algorithm will give the best (fastest) result for a given problem. A sensible choice depends on many parameters of the ANN involved, the dataset, the error goal, and whether the network is used for pattern recognition (classification) or function approximation. Statistically speaking, numerical optimization techniques appear to offer numerous advantages. Analysing the algorithms in this class, we observed that the scaled conjugate gradient (SCG) algorithm (Moller, 1993) performs well over a wide variety of problems, including the experimental dataset presented in this paper. Although SCG is not always the fastest algorithm (Levenberg–Marquardt can be faster in some situations), its great advantage is that it works very efficiently for networks with a large number of weights. SCG is something of a compromise: it does not require large computational memory, yet it still converges well and is very robust. Furthermore, we always apply the early stopping method (validation stop) during training in order to avoid overfitting. It is well known that, for early stopping, one must be careful not to use an algorithm that converges too rapidly (Hagan et al., 1996; Demuth and Beale, 2001). SCG is well suited to the validation stop method.
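As a concrete illustration of this setup (sigmoid hidden layers, a linear output layer, and validation-based early stopping), the following Python sketch uses scikit-learn. It is a hedged illustration rather than the implementation used in the paper: scikit-learn does not provide the SCG optimizer, so the "adam" solver is used purely as a stand-in, and the synthetic data, layer sizes, and hyperparameter values are assumptions made for the example.

```python
# Minimal sketch (not the authors' original code): a feed-forward network with
# sigmoid (logistic) hidden layers, a linear output layer, and validation-based
# early stopping. The solver is a stand-in for SCG, which scikit-learn lacks.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))          # 7 inputs (NI = 7), synthetic data
Y = rng.normal(size=(500, 7))          # 7 outputs (NO = 7), synthetic data

net = MLPRegressor(
    hidden_layer_sizes=(7, 7),         # two hidden layers (NH1, NH2)
    activation="logistic",             # sigmoid hidden units
    solver="adam",                     # stand-in for SCG (not available here)
    early_stopping=True,               # validation stop against overfitting
    validation_fraction=0.2,           # fraction of TR held out as VAL
    max_iter=2000,
)
net.fit(X, Y)                          # MLPRegressor's output layer is linear
```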
In our experiments we kept all ANN parameters constant (the learning algorithm, SCG; the performance goal of the classifier; the maximum number of epochs) and varied only the number of neurons in the hidden layers (NH1, NH2).
The procedure used to determine the proper values for NH1 and NH2 consists of iteratively performing the following experiment:
• Randomly split the training set (TR) into two parts: one for the effective training (TRe) and the other for validation (VAL). To avoid overfitting, we applied the early stopping method (validation stop) during the training process.
• Train the network for different values of NH1 and NH2. For each combination of NH1 and NH2, we performed four random initializations of the weights. NH1 and NH2 take values in the vicinity of the geometric mean (Masters, 1994) of the number of inputs NI and the number of outputs NO:
√(NI × NO) − 2 ≤ NHi ≤ √(NI × NO) + 2, i = 1, 2
e.g. NI = 7, NO = 7 ⇒ NH1, NH2 ∈ {5, 6, 7, 8, 9}. In this case, 5 × 5 × 4 = 100 trainings in total are performed for each experiment.
• Save the best ANN architecture in terms of the mean-square error on the effective training set, MSE_TRe, subject to the supplementary condition MSE_VAL ≤ (6/5)·MSE_TRe. This condition was imposed so that the validation error does not drift too far from the training error, thus reducing the risk of overfitting on the test set.
We ran three experiments like the one described above (3 × 100 = 300 trainings) to determine the proper values for NH1 and NH2. See the flowchart of the procedure in Appendix A, Figure A.1.
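The architecture-selection loop described above can be sketched in Python as follows. This is a hedged illustration, not the authors' original implementation: it reuses scikit-learn's MLPRegressor as a stand-in for the SCG-trained network, performs the TRe/VAL split explicitly, omits early stopping within each training run for brevity, and applies the MSE_VAL ≤ (6/5)·MSE_TRe condition when selecting the best architecture. The search range, split size, and seeds are assumptions.

```python
# Hedged sketch of the NH1/NH2 selection loop (illustrative, not the paper's code).
import math
from itertools import product

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def select_architecture(X_tr, Y_tr, n_inputs, n_outputs, n_init=4):
    center = round(math.sqrt(n_inputs * n_outputs))   # geometric mean of NI and NO
    candidates = range(center - 2, center + 3)        # vicinity: center - 2 .. center + 2
    # Split the training set TR into effective training (TRe) and validation (VAL)
    X_e, X_val, Y_e, Y_val = train_test_split(X_tr, Y_tr, test_size=0.2, random_state=0)
    best = None
    for nh1, nh2, seed in product(candidates, candidates, range(n_init)):
        net = MLPRegressor(hidden_layer_sizes=(nh1, nh2), activation="logistic",
                           solver="adam", max_iter=2000, random_state=seed)
        net.fit(X_e, Y_e)
        mse_tr_e = mean_squared_error(Y_e, net.predict(X_e))
        mse_val = mean_squared_error(Y_val, net.predict(X_val))
        # Keep the lowest MSE_TRe subject to the condition MSE_VAL <= (6/5) * MSE_TRe
        if mse_val <= (6 / 5) * mse_tr_e and (best is None or mse_tr_e < best[0]):
            best = (mse_tr_e, nh1, nh2, seed)
    return best   # (MSE_TRe, NH1, NH2, seed) of the selected network

# Usage (illustrative): best = select_architecture(X, Y, n_inputs=7, n_outputs=7)
```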
Regarding the number of output neurons, we have two alternatives when applying ANNs for pattern classification. The first alternative, which is the most commonly used, is to have as many output neurons as the number of classes. The second alternative is to have just one neuron in the output layer, which takes the different classes as its values. We chose the first approach in order to allow the network to separate the input space better.
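For clarity, the two output-coding alternatives can be illustrated as follows; the array names and the 7-class setting are assumptions made for the example.

```python
# Illustrative sketch of the two output-coding alternatives for a 7-class problem.
import numpy as np

labels = np.array([0, 3, 6, 2])                # integer class labels for four samples

# Alternative 1 (chosen here): one output neuron per class (one-hot targets)
one_hot_targets = np.eye(7)[labels]            # e.g. class 3 -> [0, 0, 0, 1, 0, 0, 0]

# Alternative 2: a single output neuron whose target is the class index itself
single_neuron_targets = labels.reshape(-1, 1).astype(float)
```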