Numb3rs 407: Primacy

In this episode, Charlie suggests using an evolutionary algorithm to search for abnormal behavior after a murder occurs in a video game with a real-life component. Evolutionary algorithms imitate the biological process of natural selection to find optimal solutions. There is a notion of which potential solutions are "fitter" than others. The fittest solutions survive as less fit solutions are weeded out, hopefully resulting in very fit (optimal) solutions over time. On this page, we will see how a genetic algorithm (a particular kind of evolutionary algorithm) could conceivably be used with a satellite to find a point of maximal elevation on the Earth. Charlie explains:

"Think of a pond. A pond is a healthy ecosystem where predator and prey are in balance. However, if an outside predator is introduced, well then the ecosystem changes. An evolutionary algorithm highlights abnormal alterations, it discerns subtle patterns, and locates the source of a problem.''

From this description, one might confuse the uses of evolutionary algorithms with the metaphor on which they are based. As we will see, abnormal alterations (mutations) occur in the execution of such algorithms. Presumably, Charlie has in mind a way of comparing behavior patterns and determining which actions are "abnormal" for video game play. These will be the fittest for his purposes. In general, evolutionary algorithms can be potentially used anywhere there is a meaningful notion of "fitness". This has nothing to do with searching for abnormal behavior. satellite

For instance, let's suppose we have a satellite that will tell us the elevation of the Earth at any specific longitude/latitude pair we ask of it. In this example, we are only allowed to ask the satellite to check one point at a time. One way of finding a high point is to ask the satellite to check (in some order) each point on a fine grid laid over the region, like a bitmap image. A genetic algorithm may be a more efficient approach. We define a longitude/latitude pair to be more "fit" than another if it has a higher elevation.

Biology Review

Evolution Before describing such a genetic algorithm in detail, let's first review the biology on which such algorithms are modeled. Biological traits (phenotype) of an organism are based in part on its DNA (genotype), which takes the form of a sequence of four different types of bases (A,C,G,T). Variation in genotype exists across a given population. Some individuals have genes that make them more likely than others to survive, reproduce, and produce fertile offspring. Such individuals are more "fit". Fitness often depends on complicated relationships between various genes simultaneously present. Since fit organisms are more likely to pass their genes to the next generation, genes promoting fitness tend to become more prevalent with each successive generation. This process is known as natural selection.

Various factors influence the natural selection process. Random mutations can occur in the genes of individual organisms. When these have any effect on fitness at all, it tends to be negative. Nonetheless, mutations can occasionally increase the fitness of an individual. Such mutations may become more prevalent in successive generations, by natural selection. Sexual reproduction, where the geneotype of offspring depends on the genotype of two parents, tends to lead to greater diversity of offspring than asexual reproduction, where offspring have a single parent. Population diversity, by virtue of a large number of gene combinations, may lead to the existence of some very fit individuals. (Diverse populations may also be better suited to surviving changes in how fitness is determined as environmental circumstances vary.)

Genetic Algorithm

So how can we model an algorithm for our elevation problem on this biology? We already have a way of determining fitness - points where the elevation is higher are more fit. We need a way of representing candidate longitude/latitude pairs as "genetic" information. Instead of sequences of A,C,G,T, it is easier to work with sequences of 0's and 1's. One way of encoding our longitude/latitude pairs as such a string is to write both numbers in binary to some specified degree of precision (removing the decimal point), and then concatenate them (write one after the other to form a single string of 0's and 1's.) For instance, if our region is in the Northern and Western hemispheres, this encoding scheme is shown below. (Note: our organisms will only have a single chromosome, rather than pairs of them.) encoding

Now let's take an initial "population." This is a collection of such strings as above, often chosen at random. For our particular genetic algorithm, we can choose to keep the population at a constant size, as follows. We will choose half the members of the population to reproduce by sexual reproduction, in pairs, each pair producing four offspring. At this point, the parents (and organisms that did not reproduce) "die", and will be removed from the population. Such choices were somewhat arbitrary, and there are many similar algorithms.

In order to choose which half of the population reproduces, we could pair off individuals at random and let the more fit of each pair survive a given (large) proportion of the time, and the less fit survive the remainder of the time. Alternatively, we may wish to sort our population in order of fitness, letting the top third reproduce, "killing" the bottom third, and giving individuals of the middle third a 50% chance of reproduction. If we never allow the less fit organisms to reproduce, we might risk having low population diversity. Thomas Hunt Morgan's crossover diagram

Now that we have chosen which individuals "reproduce", we need to determine how their genetic information is combined. One easy way of doing this is based on an actual biological process: crossing over of chromosomes. For each pair of reproducing individuals, choose a random position in the genetic string. Now, swap the portions of the two parents' chromosomes after this position, and we have two offspring with genes based on, but (usually) not the same as either parent. Swapping at a different point, the same parents produce a second pair of offspring. crossing over

Finally, we would like to add some mutations to our population in hopes of potentially finding some fitter individuals. Choose some specific (small) number p. Now, look at each bit (0 or 1) of each string in the population, and change it to its opposite (1 or 0) with probability p. Now we have the next generation! Repeating the process of selection, reproduction, and mutation for many generations, the genetic algorithm will hopefully produce a population with high fitness, and give us good solutions to our problem.

Activity 1

Get a group of people together, and try to implement a demonstration of the above algorithm (without a computer.) If you're feeling ambitious, find a hilly terrain and mark off a rectangular region, as in a coordinate grid. In place of the satellite, choose a person to declare which points are higher than others. If you're feeling less ambitious, you may want to replace our satellite model with a simple function, like a polynomial on a given interval. Each person can hold a piece of paper with their genetic sequence written on it. People who are not selected to reproduce can wait to be "reincarnated" as the offspring of the people who are selected to reproduce.

For the random element, roll a six-sided die. When deciding which of two organisms will reproduce, let the more fit (higher) organism reproduce if the roll is 1-5, say, and the less fit organism reproduce if the roll is 6. Use dice to determine where crossovers and mutations occur as well.

Activity 2

If you are familiar with a programming language, try writing a computer program to implement a genetic algorithm. First, choose a function you would like to optimize. (Choose something complicated!) Come up with a scheme for encoding points in your search space with strings of 0's and 1's. Choose a population size. Choose how strongly you want fitness to be correlated to reproduction. How you will combine organisms to reproduce? How often will you make mutations occur?

Now, play around with the choices you made. Try changing population size. Try increasing or decreasing the probability for unfit individuals to reproduce. Change the rate of mutations. How do these choices affect the fitness of your population after a given number of generations? (You will want to run your program several times for each set of parameters, to see how much variation occurs due to the random factors.)