Genetic programs, data-snooping, and technical analysis, page 4

In this study, only two genetic operators are used to create rules. In reproduction, rules from the parent generation are inserted into the child generation unchanged. In recombination, two parent rules are chosen, a subtree is randomly chosen from each parent rule, and the two subtrees are exchanged. Figure 2 shows the recombination of two parent rules into two child rules. While many other genetic operations have been proposed, reproduction and recombination are the two most common, and additional operators typically offer little benefit (Koza, 1992).
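As a rough illustration of the recombination operator, the sketch below swaps one randomly chosen subtree between two parent rules. The nested-list rule representation, the function names, and the example rules are assumptions made for this illustration and are not taken from the study. The sketch also ignores any type constraints between subtrees and any limits on rule size.

```python
import copy
import random

# Hypothetical representation: a rule is a nested list [operator, child, ...];
# leaves are plain strings or numbers.

def subtree_paths(tree, path=()):
    """Return the index path of every node in the tree (the root is the empty path)."""
    paths = [path]
    if isinstance(tree, list):
        for k, child in enumerate(tree[1:], start=1):
            paths.extend(subtree_paths(child, path + (k,)))
    return paths

def get_subtree(tree, path):
    """Follow an index path down from the root and return the node found there."""
    for k in path:
        tree = tree[k]
    return tree

def replace_subtree(tree, path, new):
    """Return a deep copy of tree with the node at path replaced by a copy of new."""
    if not path:
        return copy.deepcopy(new)
    result = copy.deepcopy(tree)
    parent = get_subtree(result, path[:-1])
    parent[path[-1]] = copy.deepcopy(new)
    return result

def recombine(parent_a, parent_b, rng):
    """Exchange one randomly chosen subtree between two parents; return two children."""
    path_a = rng.choice(subtree_paths(parent_a))
    path_b = rng.choice(subtree_paths(parent_b))
    sub_a = get_subtree(parent_a, path_a)
    sub_b = get_subtree(parent_b, path_b)
    child_a = replace_subtree(parent_a, path_a, sub_b)
    child_b = replace_subtree(parent_b, path_b, sub_a)
    return child_a, child_b

if __name__ == "__main__":
    rng = random.Random(0)
    rule_a = ["and", [">", "price", ["avg", "price", 20]], ["<", "volume", 100]]
    rule_b = ["or", ["<", "price", ["min", "price", 5]], "true"]
    print(recombine(rule_a, rule_b, rng))
```

Because the parents are copied rather than modified, the same parent rule can safely be drawn again for later reproduction or recombination.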

To prevent overfitting, the rules are generated using two sets of futures price data, as in Allen and Karjalainen (1999). Rules are evaluated for selection and operation based upon their fitness in “training” data, which are prices for 2 years for a given commodity. After each generation is evaluated using the training data, the fittest rule is applied to the “selection” data, which is also 2 years of price data. If this rule is fitter than the previous rules evaluated with selection data, it is retained.
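The retention step can be sketched as follows. This is a minimal illustration, assuming that fitness is a single number where larger is better. The function name `update_best` and the two fitness callables are hypothetical placeholders rather than the study's implementation.

```python
def update_best(best, generation, training_fitness, selection_fitness):
    """Apply the training/selection check to one generation.

    best is None or a (rule, selection_score) pair from earlier generations.
    Returns the (possibly updated) pair and a flag saying whether it changed.
    """
    # The generation's champion is chosen on training-period fitness only.
    champion = max(generation, key=training_fitness)
    # The champion is then scored on the separate selection period.
    score = selection_fitness(champion)
    # It is retained only if it beats every champion scored so far.
    if best is None or score > best[1]:
        return (champion, score), True
    return best, False
```

Keeping the selection data out of the evolutionary search means that a rule which merely fits noise in the training period is less likely to be the one retained.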

An initial generation of 20,000 random rules is created. Each successive generation consists of the fittest rule from the previous generation, 1,999 randomly chosen rules that are inserted unaltered (reproduction), and 18,000 rules that are the product of recombination of randomly chosen pairs. As in the evolutionary process, rules are not chosen uniformly at random. Instead, the probability that a rule will be chosen for insertion or recombination is a function of its fitness. Specifically, the probability is a function of a rule’s rank within the population,

\[
p_i = \frac{2i}{N(N+1)} \tag{1}
\]

where p_i is the probability that the ith rule will be chosen, and i is the ordinal rank of the rule, with i = N the most fit and i = 1 the least fit. Successive generations are created until the “best” rule (when applied to the selection data) does not change for five generations, up to a maximum of 200 generations.
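The rank-based selection and the composition of a new generation can be sketched as below. The sketch assumes that the denominator of equation (1) is N(N + 1), which is the value that makes the probabilities sum to one. It also assumes that rules are drawn with replacement and that each recombination yields two children. All function and parameter names are hypothetical, and `recombine` stands for any function that takes two parent rules and returns a pair of child rules.

```python
import random

def rank_probabilities(n):
    """Return [p_1, ..., p_n] with p_i = 2i / (n(n + 1)); rank n is the fittest."""
    denominator = n * (n + 1)
    return [2 * i / denominator for i in range(1, n + 1)]

def next_generation(rules, fitness, recombine, rng,
                    n_reproduce=1_999, n_recombine=18_000):
    """Build a generation: 1 elite rule + reproduced rules + recombined rules."""
    ranked = sorted(rules, key=fitness)      # ascending, so index 0 holds rank 1
    weights = rank_probabilities(len(ranked))
    children = [ranked[-1]]                  # the fittest rule is carried over
    # Reproduction: rank-weighted draws inserted unaltered.
    children += rng.choices(ranked, weights=weights, k=n_reproduce)
    # Recombination: rank-weighted pairs, each pair assumed to yield two children.
    target = 1 + n_reproduce + n_recombine
    while len(children) < target:
        parent_a, parent_b = rng.choices(ranked, weights=weights, k=2)
        children.extend(recombine(parent_a, parent_b))
    return children[:target]

def should_stop(generations_without_change, generations_run,
                patience=5, max_generations=200):
    """Stop once the best rule is unchanged for `patience` generations or the cap is hit."""
    return (generations_without_change >= patience
            or generations_run >= max_generations)
```

Under this reading of equation (1), with N = 20,000 the fittest rule is drawn with probability 2/(N + 1), roughly 0.0001, which is about twice the probability under uniform selection, while the least fit rule is 20,000 times less likely to be drawn than the fittest.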

Because GP cannot guarantee convergence, either locally or globally, the quality of a solution is a monotonic function of its computational cost; as larger populations of larger rules are allowed to evolve longer, the probability of convergence increases. Balancing this need is the time required for estimation. For this study, the population size is 20,000 rules, each of which is constrained to 50 nodes. In the initial rule generation, the rules are constrained to be no more than 10 levels deep, but in recombination, the rules can grow to be 16 levels deep.[3] To further improve the results, 20 optimizations are performed over each set of training/selection data, differing only in the seed value to the random number generator, and the best rule of the 20 is used in the out-of-sample testing. The out-of-sample evaluation uses the year of prices following the end of the selection period.
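The size constraints can be checked with a few lines of code. The sketch below assumes a nested-list rule representation and counts the root as level 1. Both choices are assumptions of this illustration, since the text does not say how nodes and levels are counted or how a rule that exceeds the limits is handled.

```python
# Hypothetical representation: a rule is a nested list [operator, child, ...];
# leaves are plain strings or numbers.

def node_count(tree):
    """Count every operator and leaf in the rule."""
    if not isinstance(tree, list):
        return 1
    return 1 + sum(node_count(child) for child in tree[1:])

def depth(tree):
    """Number of levels in the rule, counting the root as level 1."""
    if not isinstance(tree, list):
        return 1
    return 1 + max((depth(child) for child in tree[1:]), default=0)

def within_limits(tree, max_nodes=50, max_depth=16):
    """Check a rule against the node and depth caps (use max_depth=10 for initial rules)."""
    return node_count(tree) <= max_nodes and depth(tree) <= max_depth

if __name__ == "__main__":
    rule = ["and", [">", "price", ["avg", "price", 20]], ["<", "volume", 100]]
    print(node_count(rule), depth(rule), within_limits(rule))
```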

TRADING STRATEGY EVALUATION

Because rules are selected for operation based upon their fitness, the specification of the fitness measure is crucial for the success of genetic programming. Net profits are the simplest and most common measure of the usefulness of a trading strategy. The leveraged nature of futures contracts makes the use of simple return-based measures of performance more difficult, as it is unclear what denominator should be used in computing the return. One could assume that no leverage is used, although this seems a very strong assumption, especially as leverage is frequently cited as an advantage of futures markets. Alternatively, one could use the margin requirement as the denominator. This is also problematic, as U.S. Treasury Bills can be pledged as collateral while still accruing interest for the futures holder, which reduces the forgone interest of holding futures to zero.