Numb3rs 522: Greatest Hits

In this episode Charlie uses some search algorithms to find crimes that are similar to a specific crime, so we'll discuss some abstract search algorithms and how they might be applied to finding crimes.

Search Algorithms

In this episode Charlie suspects that there is a seriel killer on the loose, and since serial killers often follow patterns, he wants to see if there are any unsolved crimes that match the patterns of the crimes he suspect the seriel killer has committed. This could be very difficult mathematically, so we'll break up the task into a few steps and talk about how some of the easier ones might work in general. Of course, for the specific application of trying to match crime patterns, there might be some specialized algorithms, but we won't talk about these.

Let's think about what it would take to get a computer to find crimes that are similar to some particular crime. First, you would have to find some way of telling the computer when two crimes are similar. This is difficult and depends strongly on what information one has about the crimes, so we'll discuss it very briefly at the end. Second, one would have to have an efficient way of searching through a large database of crimes to find ones that are similar. This could be tricky, so instead we'll describe some algorithms for searching a list of numbers to see if some particular number is in the list, and then we'll discuss possible ways to generalize this to find a number close to a particular given number.

First, let's think about looking for a word in a dictionary (which is basically the same thing as looking for a number in a list of numbers). How would you go about doing this? One way is to look through the dictionary in order, one word at a time, until you find the word you're looking for. This is obviously a dumb method because it would take a very long time and it doesn't take advantage of the fact that the words in a dictionary are in order. How long would this take? Well, if you looked up a whole lot of random words, on average you would have to look at half of the words in the dictionary for each word. So if there are N words in the dictionary, on average this method of search would take time N/2. This is a very long time.

A much better method is to open the dictionary to the middle page. Then if the first word on that page is later in the alphabet than the word you want, you can ignore the pages in the second half, pretend that you have a new dictionary which is just the first half of the original dictionary, and repeat. Otherwise, pretend you have a new dictionary which is just the second half of the original one. Now how long will this take, on average? Well, one way to answer this is to figure out the maximum number of times that the dictionary can be split in half. If there are N pages and 2^M = N, then the maximum number of times the dictionary can be split in half is M. This is called the base 2 logarithm of N, and this is almost always way smaller than N/2. This analysis confirms the obvious fact that this way of searching for a word is much faster than the first way.

Now there's actually a subtle point in the last paragraph that we glossed over. Did you spot it? It's in the first sentence, when we said "open the dictionary to the middle page." To do this precisely, a human would have to search for the page, which would be inefficient. However, computers store lists in arrays, which allow them arbitrary access to elements based on their position in the list. So if the computer wants to access the element in position i, it can do so in one computational step, which means it can use the algorithm above exactly as described. If a person were to use the algorithm, they would have to replace the step "open to the middle page" with the step "open to approximately the middle page." However, this doesn't affect the algorithm or its efficiency significantly.

Computers can also store lists in other ways. For example, they can use arrays, which we talked about in the text. However, they can also use linked lists. These are data structures where each element of the list stores a pointer to the element before it and the element behind it. Then if you have one particular element of the list, you can move to the next element or the previous one. This isn't very efficient for searching, but it can be much more efficient for memory management.

Now let's suppose that we have a list of numbers in random order. Is there any efficient way to search through this list? In this case, the only reasonable way to search for something in a randomly ordered list is to start at the beginning and go through the list one entry at a time. Just as before, if you're given a random number and the list has N entries, then on average it will take you N/2 tries to find the number. Also, if you had M different numbers that you wanted to search for, this would take M*N/2 tries, on average. If M is a large number, then this is very inefficient. How could we make this more efficient? One way is to take the list of randomly ordered numbers and sort it so that we can use the algorithm described above. Then after the list is sorted, performing M searches will take approximately M*log_2(N) tries, which is much more efficient.

  1. Can you think of an algorithm for sorting a list? Imagine taking a deck of cards and putting them in order. How would you do it? If you wanted to describe this precisely enough for a computer to do it, how would you describe it?
  2. If there are N numbers in the list, about how long would it take to sort them?
  3. You probably came up with an algorithm that takes about N*N time to sort a list of length N. Can you come up with one that would take about time N*log(N)? (Hint: try using a divide and conquer technique.)
  4. If you're interested, search for "search" (haha) with Google to find more algorithms for searching. You'll find much more information than you want :-)
One reason the data structure might be called a tree is that it's similar to a family tree of a bacteria that reproduces asexually. A tree (in the data structure sense) is also similar to a tree (in the plant sense), and much of the terminology referring to trees comes from terminology referring to the plants. For example, the words "branch, leaf, root, pruning, and grafting" all have a meaning for either type of tree.
a picture of a tree

Now we'll talk a little bit about a more advanced method for storing data that will make it easier to find numbers in a list that are close to a certain number. The data structure that we were using before was just a list of numbers, which we can picture as being on a line. However, we can also use a "2-dimensional" data structure called a tree. A tree is a special type of a graph, which have been discussed in other Numb3rs episodes. A tree is a collection of some nodes, and each node contains a piece of data (a number, a word, etc.) and links to its children. We can picture a tree as having multiple levels. On the first (top) level, there is one node called the root. It has links to one or two children, which are on the second level, and it doesn't have a parent. The children of the root each can have 0, 1, or 2 children themselves, and they can have children, etc. The picture is an example of a unordered tree, where there doesn't have to be any relationship between the parent's data and its children's data (in the picture the data that each node stores is a number). However, the type of trees that are useful for searching are ordered trees, where for each parent, the number in the left child is less than the number in the parent, which is less than the number in the right child. Now if you wanted to search for numbers in a list which were close to a particular number, you could put the list into a tree and then perform a search for that number. Even if you don't find that particular number, you will end up finding numbers that are close to that number in the list.

Building Trees
  1. If you had a list of numbers and wanted to put them into a tree data structure, how might you go about doing it?
  2. Can you think of any problems that might come up with your answer to the first question? (Hint: What would happen if the list of numbers was already in increasing order?)