Numb3rs 308: Hardball

The most significant mathematics in this episode was the discussion of sabermetrics, the collection and analysis of baseball statistics to evaluate players. There was also a discussion of recovering data from hard drives, and a reference to evaluating people's economic potential from variables in their lives in a way similar to sabermetrics.


There are many possible ways to measure the performance of baseball players. Before we talk about these, however, we should briefly review the rules of baseball (since it is a very American game, some foreigners aren't familiar with its rules, or with apple pie). The rules described here are only a brief summary of the main ideas. If you are already familiar with them, you can skip the rest of this summary.

In the game of baseball there are two teams, and the object of the game is for your team to score more runs than the other team. At each point in time one team is on offense trying to score runs, and the other team is on defense trying to prevent this. There are 9 players on the defensive team: 3 in the outfield, 4 in the infield, one pitcher, and one catcher. The field is approximately a quarter of a circle with a radius of approximately 400 feet. There are 4 bases arranged in a diamond pattern, with home base at the center of the circle of which the field is a quarter. The diamond whose vertices are the bases has a side length of 90 feet, and two of its edges lie on the edges of the field.

The offensive player who is at bat stands beside home plate holding a wooden stick called a bat (which cannot be filled with cork). The pitcher stands 60.5 feet away on a raised mound of dirt and throws the baseball to the catcher, who is positioned behind the batter. The batter's goal is to hit the ball with the bat, and the pitcher's goal is to prevent this. If the batter hits the ball, he gets to run towards first base, which is the closest base to him in the counterclockwise direction, and several things can happen. If a defender catches the ball before it touches the ground, the batter is out. Similarly, if a defender holding the ball touches first base before the batter reaches it, the batter is out. If the batter reaches a base and stays there, he is safe. He then gets to stay on the base, and the next batter in line gets to bat. If an offensive player ever gets back to home base, his team scores a run, and if the defense gets 3 outs, the teams switch offensive and defensive roles.

Also, when the pitcher is pitching, he must throw the ball through a specified zone. If he doesn't, it is a ball, and if he does and the batter misses, it is a strike. After 4 balls the batter automatically advances to first base, and after 3 strikes the batter is out. If the pitcher hits the batter with the ball, the batter gets to go to first base. The game consists of 9 innings, in each of which each team plays offense and defense once. At the end of the game, the team with the most runs wins (and if the score is tied at the end of 9 innings, more innings are played).

Now that we know the important rules and terminology of baseball, we can start talking about how to measure the value of different players mathematically. Some obvious choices are the number of hits, the batting average (the fraction of at bats in which a batter gets a hit), the number of bases a batter has attained by his own hitting, etc. However, these simple traditional measurements aren't necessarily all that well correlated with the actual performance of a team. In the last couple of decades, several different measurements have been devised that predict a team's success or a player's value more accurately.

One example of a sabermetric is the on base percentage, which is a measure of how often a batter reaches a base (except for some unusual circumstances). It is the number of times a batter gets to a base divided by the number of times he was trying to get to a base (sometimes it is strategically optimal for a batter to deliberately get an out so that another offensive player can move closer to home plate). Specifically, the numerator of the fraction is the number of hits plus the number of walks plus the number of times the batter was hit by a pitch. The denominator is the number of at bats plus the number of walks plus the number of times hit by a pitch plus the number of sacrifice flies (these are hits which let another runner score but are caught before they hit the ground, resulting in an out). Another example of a sabermetric is the slugging percentage, which is supposed to measure the power of a hitter. It is the total number of bases a batter earns divided by his number of at bats. Here, if the batter gets to first base on his hit, he has earned one base; if he gets to second (without another batter batting), he has earned two; and so on. A third example of a sabermetric is on base plus slugging, which is just the sum of the previous two metrics. This demonstrates that coming up with these various metrics is fairly ad hoc and not necessarily very scientific or rigorous.

The famous Babe Ruth has some of the top career scores under these measurements. His on base percentage is second ever at .4740 (Ted Williams is first with .4817). His slugging percentage is first at .6898 (Ted Williams is second with .6338). Combining these two, his on base plus slugging is first at 1.1638 (Ted Williams is second at 1.1155 and Barry Bonds is fourth at 1.0513).
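
The on base percentage, slugging percentage, and on base plus slugging described above can be written out as a short Python sketch. The function names and the season totals below are made up purely for illustration; they do not belong to any real player.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """(H + BB + HBP) / (AB + BB + HBP + SF), as described in the text."""
    reached_base = hits + walks + hit_by_pitch
    attempts = at_bats + walks + hit_by_pitch + sacrifice_flies
    return reached_base / attempts


def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """Total bases earned on hits divided by at bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def on_base_plus_slugging(obp, slg):
    """Simply the sum of the two previous metrics."""
    return obp + slg


# Hypothetical season line (illustrative numbers, not real data).
singles, doubles, triples, home_runs = 100, 30, 5, 15
hits = singles + doubles + triples + home_runs  # 150 hits in total
at_bats, walks, hit_by_pitch, sacrifice_flies = 500, 60, 5, 5

obp = on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies)
slg = slugging_percentage(singles, doubles, triples, home_runs, at_bats)
ops = on_base_plus_slugging(obp, slg)

print(f"OBP = {obp:.3f}, SLG = {slg:.3f}, OPS = {ops:.3f}")
```

Note that a home run counts four times as much as a single in the slugging percentage, which is why it rewards power, while a walk counts only towards the on base percentage.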

However, at least one of these metrics, the Pythagorean expectation, has been studied rigorously. The name of the metric comes from its similarity to the Pythagorean theorem, which says that if a right triangle has side lengths a, b and hypotenuse length c, then a^2 + b^2 = c^2. If RS and RA are the runs scored and runs allowed, respectively, by some particular team in a season, then the Pythagorean expectation formula states that the winning percentage of the team is approximately RS^2/(RS^2+RA^2). In practice, this formula works better if the power 2 is changed to 1.81. At first glance this formula seems as ad hoc as the formulas mentioned before, but Professor Steven J. Miller has written a paper which gives a theoretical derivation of this formula, assuming that the runs for each team follow a Weibull distribution.
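
As a small sketch, the Pythagorean expectation can be computed in Python as follows. The function name and the run totals are illustrative assumptions; the exponent is left as a parameter so that both 2 and the empirically better 1.81 can be tried.

```python
def pythagorean_expectation(runs_scored, runs_allowed, exponent=2.0):
    """Estimated winning fraction: RS^e / (RS^e + RA^e)."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Hypothetical team that scored 800 runs and allowed 700 (made-up totals).
classic = pythagorean_expectation(800, 700)         # exponent 2
adjusted = pythagorean_expectation(800, 700, 1.81)  # exponent 1.81

print(f"exponent 2:    {classic:.3f}")
print(f"exponent 1.81: {adjusted:.3f}")
```

A team that scores exactly as many runs as it allows gets an estimate of 0.5 for any exponent, and lowering the exponent pulls every estimate closer to 0.5.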

Finally, we should discuss one of the statements in the episode which was somewhat misleading. In the show, the head researcher who was murdered at the beginning was supposedly coming up with a method of evaluating a person's economic potential, or their potential contribution to the economy and hence to society, using techniques similar to these metrics for baseball that we have been talking about. However, one of the reasons these metrics are useful in baseball is that a player will presumably play for many seasons, and his performance in one season should be similar to his performance in the next. The difficulty of applying these ideas to economics is that each person only lives one life, and predicting later events or contributions accurately for an individual person based on some numbers derived from measurements of various quantities in their upbringing would be nearly impossible. Of course, one could in theory derive probability distributions for predicting a person's "contribution to the economy", but it is likely that these distributions would have a much higher variance than the metrics for baseball, and would hence be much less useful to apply to individuals.

Hard Drives

As you may know, hard drives are the main device which is used for storing permanent data in personal computers. They have several platters, which have a similar shape to CDs and have many concentric rings of tiny segments of magnetic material. These tiny magnets are used to store a sequence of 0's and 1's which are organized into groups of 8 bits, each of which is called a byte. Modern hard drives can hold around 100 gigabytes of information, which means they hold approximately 100 billion bytes.

During the episode, Charlie is asked to try to recover data that has been erased from the murder victim's hard drive. At first one might think this is impossible, since the data has been erased. However, when a computer erases a file, it doesn't actually write over the data; it just erases the information about where the file starts and stops. It only writes over the space that the deleted file occupied if it needs that space to store other files. This means that it is often possible to recover erased data. In addition, due to slight misalignments of the drive's read/write head, each of the bits often holds a slight trace of the magnetic state that it was in previously. That means that even if the data has been written over, it may be possible to see what had been written there before, making recovery of some of the old data possible. However, this is very difficult in practice, and rarely leads to full recovery of the data. The lesson here is that if you really want to erase data, you should erase the files and then fill your hard drive up with superfluous files, delete these files and fill it up again, and repeat this process several times. This will practically ensure that the segments that hold the data you want to completely erase will have been written over several times, which will make recovery of the old data extremely difficult at best.
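
The difference between deleting a file and overwriting it can be illustrated with a toy model in Python. The ToyDisk class below is a hypothetical simplification invented for this example; it is not how any real file system is laid out, and it does not model the residual magnetic traces mentioned above.

```python
class ToyDisk:
    """A deliberately simplified disk: raw bytes plus a table of file locations."""

    def __init__(self, size):
        self.blocks = bytearray(size)  # raw storage, initially all zero bytes
        self.directory = {}            # file name -> (start offset, length)
        self.next_free = 0             # naive allocator: always append

    def write(self, name, data):
        start = self.next_free
        if start + len(data) > len(self.blocks):
            raise ValueError("disk full")
        self.blocks[start:start + len(data)] = data
        self.directory[name] = (start, len(data))
        self.next_free = start + len(data)

    def read(self, name):
        start, length = self.directory[name]
        return bytes(self.blocks[start:start + length])

    def delete(self, name):
        # Only the bookkeeping entry is removed; the bytes stay where they were.
        del self.directory[name]

    def wipe_free_space(self, filler=0xFF):
        # Mark every byte that still belongs to a live file, then overwrite the rest.
        used = bytearray(len(self.blocks))
        for start, length in self.directory.values():
            used[start:start + length] = b"\x01" * length
        for i, in_use in enumerate(used):
            if not in_use:
                self.blocks[i] = filler


disk = ToyDisk(64)
disk.write("notes.txt", b"secret formula")
disk.write("keep.txt", b"hello")

disk.delete("notes.txt")
found_after_delete = b"secret formula" in bytes(disk.blocks)

disk.wipe_free_space()
found_after_wipe = b"secret formula" in bytes(disk.blocks)

print(found_after_delete, found_after_wipe)
```

In this model, a scan of the raw bytes still finds the deleted file's contents until the free space is overwritten, while the file that was kept remains readable throughout.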