
Based on the above discussion we formulated the following hypotheses:

H1. The training mechanism used to refine the solution obtained when determining the ANN architecture will have an influence on the classification performance of ANNs. The GA-based ANN will outperform the RT-based ANN both in training and in testing in the refining process.

H2. Data preprocessing will have an influence on both RT- and GA-based ANN training and testing performances.

H3. Data distribution will have an influence on both RT- and GA-based ANN training and testing performances.

Additional hypotheses:

H4. The crossover operator will have an influence on GA-based ANN training and testing performances.

H5. The stage at which we generate the effective training and validation sets will have an influence on RT-based ANN training and testing performances.

The main hypothesis of our paper is formulated as follows:

H6. All binary and ternary combinations of the above three factors (training mechanism, preprocessing method and data distribution) will have an influence on both RT- and GA-based ANN training and testing performances.

5.  DATASETS AND DESCRIPTIVE STATISTICS

5.1.  Telecommunications Sector Dataset

We used financial data on worldwide telecom companies. There are 88 companies structured in five groups: USA (32), Europe except Scandinavian companies (20), Asia (20), Scandinavia (10), and Canada (6). The time span is 1995–2001. For each company and for each year seven financial ratios were collected, with the Internet as the primary source. These ratios are suggested in Lehtinen’s (1996) study of financial ratios’ reliability and validity in international comparisons. The ratios measure four different aspects of companies’ financial performance: profitability, three ratios (operating margin OM; return on total assets ROTA; return on equity ROE); liquidity, one ratio, i.e. current ratio = current assets/current liabilities; solvency, two ratios (equity to capital EC; interest coverage IC); and efficiency, one ratio, i.e. receivables turnover ReT (Karlsson, 2002). In total, the dataset consists of 651 rows: 616 rows taken from companies’ financial statements in their annual reports (88 companies × 7 years), plus 35 rows containing the averages for the five groups (5 groups × 7 years). Of these 651 rows, 21 were discarded due to lack of data for calculating some ratios, resulting in a final dataset of 630 rows.
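For concreteness, the following Python sketch (ours, not part of the original study; the CR shorthand for the current ratio is hypothetical) reproduces the row-count arithmetic and lists the seven ratio columns described above.

```python
# Row-count arithmetic for the telecom dataset described above.
# Group names and counts are taken from the text; variable names are ours.
groups = {"USA": 32, "Europe excl. Scandinavia": 20, "Asia": 20,
          "Scandinavia": 10, "Canada": 6}
years = list(range(1995, 2002))                    # 1995-2001, seven years

company_rows = sum(groups.values()) * len(years)   # 88 companies x 7 years = 616
average_rows = len(groups) * len(years)            # 5 group averages x 7 years = 35
total_rows = company_rows + average_rows           # 651
usable_rows = total_rows - 21                      # 630 after discarding incomplete rows

# The seven financial ratios collected per company-year
# (CR is our shorthand for the current ratio, which has no abbreviation in the text).
ratios = ["OM", "ROTA", "ROE", "CR", "EC", "IC", "ReT"]

print(company_rows, average_rows, total_rows, usable_rows, len(ratios))
```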

In order to ease the training and to avoid the algorithms placing too much emphasis on extreme values, we removed far outliers and outliers from the data. An outlier is sometimes more technically defined as ‘a value whose distance from the nearest quartile is greater than 1.5 times the interquartile range’ (SPSS for Windows, Release 11.5.1, SPSS Inc., Chicago). To remove the outliers we calculated the quartiles for each variable. If we denote with l the lower quartile, with m the median and with u the upper quartile of variable x, then the far outliers fo, outliers o and anomalies a for that variable belong to the following intervals:

fo ∈ (−∞, l − 3d) ∪ (u + 3d, +∞)
o ∈ [l − 3d, l − 1.5d) ∪ (u + 1.5d, u + 3d]
a ∈ [l − 1.5d, l − d) ∪ (u + d, u + 1.5d]

where d = u − l is the distance from the upper quartile to the lower quartile. For example, Figure 2 shows the frequencies of far outliers, outliers and anomalies for the OM ratio. There are 30 far outliers, 30 outliers, and 17 anomalies.
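As an illustration of these intervals, the sketch below (Python with NumPy; the function name is ours, and NumPy's default quartile estimator may differ slightly from the one used in SPSS) classifies the values of a single ratio into far outliers, outliers and anomalies.

```python
import numpy as np

def classify_extremes(x):
    """Split the values of one financial ratio into far outliers, outliers
    and anomalies, following the quartile-based intervals defined above.
    Illustrative sketch only; the original analysis was done in SPSS."""
    x = np.asarray(x, dtype=float)
    l, u = np.percentile(x, [25, 75])      # lower and upper quartiles
    d = u - l                              # distance between the quartiles

    far_outliers = x[(x < l - 3 * d) | (x > u + 3 * d)]
    outliers = x[((l - 3 * d <= x) & (x < l - 1.5 * d)) |
                 ((u + 1.5 * d < x) & (x <= u + 3 * d))]
    anomalies = x[((l - 1.5 * d <= x) & (x < l - d)) |
                  ((u + d < x) & (x <= u + 1.5 * d))]
    return far_outliers, outliers, anomalies
```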

We have two alternatives once we have detected the far outliers and the outliers of each variable: to discard a sample that has at least one far outlier or outlier value, or to keep it by taking the peak(s) off. We chose the latter alternative. For example, in the case of the OM ratio, we ‘levelled’ 49 left outlier values (29 far outliers + 20 outliers) with l − 1.5d (= −22.48 for the OM ratio) and 11 right
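A minimal sketch of this ‘levelling’ step could look as follows; it assumes, by symmetry, that right-side values are capped at u + 1.5d (the text above specifies only the left-side cap l − 1.5d), and the function name is ours.

```python
import numpy as np

def level_outliers(x):
    """'Level' far outliers and outliers by clipping them to the fences
    l - 1.5d and u + 1.5d, so every row is kept rather than discarded.
    Values inside the fences (including anomalies) are left untouched."""
    x = np.asarray(x, dtype=float)
    l, u = np.percentile(x, [25, 75])
    d = u - l
    return np.clip(x, l - 1.5 * d, u + 1.5 * d)
```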