In this episode Charlie uses some search algorithms to find crimes that are similar to a specific crime, so we'll discuss some abstract search algorithms and how they might be applied to finding crimes.

In this episode Charlie suspects that there is a seriel killer on the loose, and since serial killers often follow patterns, he wants to see if there are any unsolved crimes that match the patterns of the crimes he suspect the seriel killer has committed. This could be very difficult mathematically, so we'll break up the task into a few steps and talk about how some of the easier ones might work in general. Of course, for the specific application of trying to match crime patterns, there might be some specialized algorithms, but we won't talk about these.

Let's think about what it would take to get a computer to find crimes that are similar to some particular crime. First, you would have to find some way of telling the computer when two crimes are similar. This is difficult and depends strongly on what information one has about the crimes, so we'll discuss it very briefly at the end. Second, one would have to have an efficient way of searching through a large database of crimes to find ones that are similar. This could be tricky, so instead we'll describe some algorithms for searching a list of numbers to see if some particular number is in the list, and then we'll discuss possible ways to generalize this to find a number close to a particular given number.

First, let's think about looking for a word in a dictionary (which is
basically the same thing as looking for a number in a list of numbers).
How would you go about doing this? One way is to look through the
dictionary in order, one word at a time, until you find the word you're
looking for. This is obviously a dumb method because it would
take a very long time and it doesn't take advantage of the fact that the
words in a dictionary are in order. How long would this take? Well, if
you looked up a whole lot of random words, on average you would have to
look at half of the words in the dictionary for each word. So if there
are *N* words in the dictionary, on average this method of search
would take time *N/2*. This is a very long time.

A much better method is to open the
dictionary to the middle page. Then if the first word on that page is
later in the alphabet than the word you want, you can ignore the pages
in the second half, pretend that you have a new dictionary which is just
the first half of the original dictionary, and repeat. Otherwise, pretend
you have a new dictionary which is just the second half of the original
one. Now how long will this take, on average? Well, one way to answer
this is to figure out the maximum number of times that the dictionary
can be split in half. If there are *N* pages and *M*. This is called the * base 2 logarithm* of *N*,
and this is almost always way smaller than *N/2*. This analysis
confirms the obvious fact that this way of searching for a word is
much faster than the first way.

Now there's actually a subtle point in the last paragraph that we
glossed over. Did you spot it? It's in the first sentence, when we
said "open the dictionary to the middle page." To do this precisely,
a human would have to search for the page, which would be inefficient.
However, computers store lists in *arrays*, which allow them
arbitrary access to elements based on their position in the list.
So if the computer wants to access the element in position *i*,
it can do so in one computational step, which means it can use
the algorithm above exactly as described. If a person were to use
the algorithm, they would have to replace the step "open to the
middle page" with the step "open to approximately the middle
page." However, this doesn't affect the algorithm
or its efficiency significantly.

Computers can also store lists in other ways. For
example, they can use arrays, which we talked about in the text. However,
they can also use *linked lists*. These are data structures where
each element of the list stores a *pointer* to the element before
it and the element behind it. Then if you have one particular element
of the list, you can move to the next element or the previous one.
This isn't very efficient for searching, but it can be much more
efficient for memory management.

Now let's suppose that we have a list of numbers in random order.
Is there any efficient way to search through this list? In this case,
the only reasonable way to search for something in a randomly ordered
list is to start at the beginning and go through the list one entry
at a time. Just as before, if you're given a random number and the
list has *N* entries, then on average it will take you *N/2*
tries to find the number. Also, if you had *M* different numbers
that you wanted to search for, this would take *M*N/2* tries,
on average. If *M* is a large number, then this is very inefficient.
How could we make this more efficient? One way is to take the list of
randomly ordered numbers and sort it so that we can use the algorithm
described above. Then after the list is sorted, performing
*M* searches will take approximately *M*log_2(N)* tries,
which is much more efficient.

- Can you think of an algorithm for sorting a list? Imagine taking a deck of cards and putting them in order. How would you do it? If you wanted to describe this precisely enough for a computer to do it, how would you describe it?
- If there are
*N*numbers in the list, about how long would it take to sort them? - You probably came up with an algorithm that takes about
*N*N*time to sort a list of length*N*. Can you come up with one that would take about time*N*log(N)*? (Hint: try using a divide and conquer technique.) - If you're interested, search for "search" (haha) with Google to find more algorithms for searching. You'll find much more information than you want :-)

One reason the data structure might be called a tree is
that it's similar to a family tree of a bacteria that reproduces
asexually. A tree (in the data structure sense) is also similar to
a tree (in the plant sense), and much of the terminology referring
to trees comes from terminology referring to the plants. For example,
the words "branch, leaf, root, pruning, and grafting" all have a meaning
for either type of tree.

Now we'll talk a little bit about a more advanced method for
storing data that will make it easier to find numbers in a list that
are close to a certain number. The data structure that we were using
before was just a list of numbers, which we can picture as being on
a line. However, we can also use a "2-dimensional" data structure
called a **tree**. A tree is a special type of a **graph**,
which have been discussed in other Numb3rs episodes. A tree is
a collection of some *nodes*, and each node contains a piece
of data (a number, a word, etc.) and links to its *children*.
We can picture a tree as having multiple levels. On the first
(top) level, there is one node called the *root*. It has links
to one or two children, which are on the second level, and it doesn't
have a parent. The children of the root each can have 0, 1, or 2 children
themselves, and they can have children, etc. The picture is an example
of a unordered tree, where there doesn't have to be any relationship
between the parent's data and its children's data (in the picture the
data that each node stores is a number). However, the type of trees that
are useful for searching are ordered trees, where for each parent,
the number in the left child is less than the number in the parent,
which is less than the number in the right child. Now if you wanted to
search for numbers in a list which were close to a particular number,
you could put the list into a tree and then perform a search for that
number. Even if you don't find that particular number, you will end
up finding numbers that are close to that number in the list.

- If you had a list of numbers and wanted to put them into a tree data structure, how might you go about doing it?
- Can you think of any problems that might come up with your answer to the first question? (Hint: What would happen if the list of numbers was already in increasing order?)