Moneyball with Maple 2: The Search for Even More Money

The baseball season may not have started yet, but we can still have a little fun with the game using Maple.

What started as a curiosity evolved into a hobby and then a project: I have created and uploaded my BaseballLineupSimulator package. This package uses Maple to address four fundamental questions, in a natural progression, about the offensive side of baseball:

  1. Given a lineup, can we simulate the hits, base-running, and scoring using basic statistics and random number generators?
  2. Can we determine the distribution of runs scored by a lineup over many games?
  3. Can we compare two or more lineups with respect to mean runs scored?
  4. Can we compute the optimal lineup given a selection of players available for the batting order?

Simulations capture many quintessential aspects of baseball, enough to make the results fairly interesting and insightful, but the package is not intended to reproduce the many complexities and nuances of the game, especially professional baseball.

The fourth question above was addressed in a different manner, using Markov chains, here. The title of this post refers, with permission, to that earlier post. Major League Baseball, of course, has always involved careful analysis of statistics, and in recent years has, for better or worse, taken this analysis even further. See Moneyball and Sabermetrics.

Many people ask why an MLB season takes so long, to the tune of 162 games just for the regular season. One reason, I suppose, is that baseball is a backdrop for the summer months, even if most fans are unable to watch every pitch of every game of their favourite teams. Another reason, though, is that it usually takes numerous games for the batting order to be refined, and for better lineups and teams to emerge.

For a fixed set of players, an optimal batting order will, on average, maximize the strengths and minimize the weaknesses of the batters. The hitters in a lineup work in concert, yet strong lineups often get shut out, and weak lineups still pile up runs every now and again. It takes many games, or simulations in the case of the BaseballLineupSimulator package, to determine how well all the different batting orders perform and which order is indeed the best.

Installation

The BaseballLineupSimulator package is available from the MapleCloud. It can be downloaded from there, or installed by executing the following in Maple:

PackageTools:-Install( 5019395307339776, 'overwrite' );

Simulating Lineups

To simulate a lineup, you first need a text file for the roster. Each line of the roster is in the following form:

Name,1B,2B,3B,HR,BB,RR

The values 1B, 2B, 3B, and HR represent fair probabilities (strictly less than 1.0) of obtaining the outcome in a given At Bat (AB), and BB represents the fair probability of drawing a walk ("Base on Balls") during a given Plate Appearance (PA). The value of RR ("Runner Rating") is one of the values 1, 2, 3, 4, 5 which subjectively rates the player as a runner, with 1 being "poor", 3 being "average", and 5 being "great".

Second, you specify a lineup. For example, consider this one from the 1993 Toronto Blue Jays:

restart;
with( BaseballLineupSimulator );
randomize():
Lineup := [ "Henderson", "White", "Alomar", "Molitor", "Carter", "Olerud", "Fernandez", "Sprague", "Borders" ];

Note that loading the package also sets the default roster, which contains the statistics for the above lineup. Custom rosters can be set using the SetRoster command, and queried using GetRoster.

Single Simulation

Now, we can run a simulation. Here, I use only one inning of one game for simplicity:

SimulateLineup(
	Lineup,
	numsimulations = 1,
	numinnings = 1,
	playbyplay = true,
	displaysummary = false,
	simulationoutput = dataframe
):

This prints out a play-by-play of what happened:

========== Start of game 1. ==========
++++++++++ Start of inning 1. ++++++++++
--------------------
Henderson hit a single.
--------------------
White hit a single.
Henderson hustled and advanced to third base.
--------------------
Alomar hit a double.
Henderson scored.
White scored.
2 runs in total scored on the play.
--------------------
Molitor did not get a hit.
1 out.
--------------------
Carter did not get a hit.
2 out.
--------------------
Olerud hit a home run.
Alomar scored.
2 runs in total scored on the play.
--------------------
Fernandez did not get a hit.
3 out.
--------------------
Total runs scored in inning 1: 4.
--------------------
++++++++++ Game has finished. Total runs scored in game 1: 4. ++++++++++

The type of hit by a batter is determined by his/her statistics and random number generators. The runner ratings are used to determine whether a runner is doubled off when the batter gets out, or whether he/she takes an extra base when the batter gets a hit.
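
As a rough illustration of how statistics and a random number generator can decide the result of a plate appearance, here is a minimal sketch. It is not the package's code: the helper SamplePlateAppearance and the probability values are hypothetical, and base-running and runner ratings are left out entirely.

```maple
# Hypothetical probabilities for one batter (illustrative values, not taken from the package roster).
walkprob := 0.10:                                                      # BB, per plate appearance
hitprobs := [ "1B" = 0.18, "2B" = 0.06, "3B" = 0.01, "HR" = 0.05 ]:    # per at bat

# SamplePlateAppearance is a hypothetical helper, not a package command.
SamplePlateAppearance := proc( walkprob :: numeric, hitprobs :: list )
    local u, total, p;
    # rand() returns a random 12-digit nonnegative integer, so u lies in [0, 1).
    u := rand() / 10^12;
    if u < walkprob then
        return "BB";
    end if;
    # Otherwise the plate appearance counts as an at bat: draw again and walk the cumulative sums.
    u := rand() / 10^12;
    total := 0;
    for p in hitprobs do
        total := total + rhs( p );
        if u < total then
            return lhs( p );
        end if;
    end do;
    return "Out";
end proc:

# Nine sampled plate appearances for the same batter.
seq( SamplePlateAppearance( walkprob, hitprobs ), i = 1 .. 9 );
```

The two-stage draw mirrors the roster format, where BB is a per-plate-appearance probability and the hit types are per-at-bat probabilities.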

Multiple Simulations

Of course, we can simulate many games using the same lineup:

Simulations := SimulateLineup(
    Lineup,
    numsimulations = 2500,
    numinnings = 9,
    playbyplay = false, # don't worry, the default is false!
    displaysummary = false,
    simulationoutput = record,
    numgames = 150
):

Here, 2500 simulations are used, with the statistics scaled over 150 games (a typical number of games played per 162-game season).

The output record includes a Maple DataFrame for statistics, which can be formatted as a spreadsheet.

[Figure: statistics DataFrame formatted as a spreadsheet]

Furthermore, the output record stores a histogram showing the distribution of runs scored.

[Figure: histogram of runs scored]

Optimizing Lineups

Since we can simulate a given lineup many times and know, for instance, the average number of runs scored, we can also compare two or more lineups and decide which is better. The package includes the command CompareLineups for this purpose. Thinking bigger, though, we must ask, is there a best lineup? This is the purpose of the OptimizeLineup command.

Key Performance Indicators

The obvious way of defining "best", certainly in the MLB, would be to use the total runs scored as the metric, or Key Performance Indicator ("KPI"). The OptimizeLineup command accepts any user-defined KPI, with two pre-existing in the package, namely KPIs:-TotalRuns (the default) and KPIs:-FewestRunsProduced. The latter is useful in less competitive softball leagues, where the goal is to have fun, and is intended to get everyone to contribute as much as possible, by maximizing the smallest number of runs produced (R+RBI-HR) by all players in the lineup.
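
To make the runs-produced idea concrete, here is a small sketch of the quantity being maximized. The player totals are made up for illustration, and RunsProduced is a hypothetical helper rather than part of the package; how KPIs:-FewestRunsProduced is actually implemented is not shown here.

```maple
# Hypothetical totals [R, RBI, HR] per player (made-up numbers for illustration only).
stats := table( [ "Aaron" = [ 30, 25, 4 ], "Becky" = [ 22, 28, 2 ], "Chris" = [ 18, 15, 0 ] ] ):

# Runs produced as defined above: R + RBI - HR.
RunsProduced := s -> s[1] + s[2] - s[3]:

# The KPI value is the smallest runs-produced total among the players in the lineup.
fewest := min( seq( RunsProduced( stats[player] ), player in [ "Aaron", "Becky", "Chris" ] ) );
```

With these numbers the three players produce 51, 48, and 33 runs, so the KPI value is 33. A lineup that raises that minimum scores better under this KPI, even if total runs stay the same.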

Constraints

The optimizer accommodates constraints, like positions in the order and spacing. Suppose, for instance, there are 12 players on a 3-Pitch Slo-Pitch softball league team, and the pitchers are Aaron and Becky. Since the pitchers pitch to their own players, the pitchers need to be spaced apart in the lineup. Further, suppose player Chris is going to be late to the game and needs to bat towards the bottom of the order. These restrictions can be included in the optimization:

OptimizeLineup(
	...
	playerpositions = { [9..,"Chris"] },
	playerspacingabsolute = { [5,{"Aaron","Becky"}] },
	...
);

Methods

There are three methods for optimization:

  1. Random. Given a time limit, the optimizer tests as many random permutations of the original lineup as possible using the specified number of simulations per lineup. The best lineup found, with respect to the KPI, is returned.
  2. Systematic. The optimizer systematically tests every permutation of the original lineup using the specified number of simulations per lineup. The best lineup found, with respect to the KPI, is returned. For longer lineups with a sufficiently high number of simulations to give reliable results, this method is impractical and time-consuming.
  3. Tweaking. This method has the advantage of leveraging and refining baseball intuition. Given a lineup, a tweaked lineup is the result of switching two or more players. With the search=tweaking option, you also specify the tweak sizes, for instance tweaksizes={2,3}, which tells the optimizer to consider tweaks that move two or three players in the lineup.

    Recursive algorithm:

    1. Calculate the KPI value of the original lineup L[0].
    2. Stage n: Given lineup L[n-1], simulate every tweaked version of the lineup and record the tweaked lineup L[n] with the largest improvement in KPI value.
    3. When no improved lineup is found in a stage, exit.

    The tweaking method is analogous to the Gradient Method of Multivariate Calculus, and finds a locally-optimal lineup. Note that larger tweaks may be required for a globally-optimal lineup, but when time is limited, squeezing an extra run or two out of a lineup is sufficient. Moreover, if the original lineup is fairly good, then the tweaking method likely only requires tweaks of size two (or at most three), and should take less than an hour, usually less than half an hour.

    Note: I included a separate command, TweakingOptimization, for illustration purposes. One application is finding solutions of Magic Puzzles.
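
The tweaking stages described above can be sketched with tweaks of size two (swaps of two players). This is only an illustration under stated assumptions: ToyKPI is a deterministic stand-in for the simulated KPI, TweakBySwaps is a hypothetical helper, and neither reflects the internals of OptimizeLineup or TweakingOptimization. Here each "player" is just a number representing batting strength.

```maple
# Toy stand-in for a simulated KPI (an assumption, not the package's model):
# earlier batting slots get more weight, so stronger hitters should bat earlier.
ToyKPI := proc( lineup :: list )
    local n, i;
    n := nops( lineup );
    add( ( n + 1 - i ) * lineup[i], i = 1 .. n );
end proc:

# Hypothetical helper: at each stage, test every two-player swap, keep the one with
# the largest KPI improvement, and stop when no swap improves the current lineup.
TweakBySwaps := proc( lineup0 :: list, kpi :: procedure )
    local current, best, bestval, candidate, val, n, i, j, improved;
    current := lineup0;
    bestval := kpi( current );
    n := nops( current );
    improved := true;
    while improved do
        improved := false;
        best := current;
        for i from 1 to n - 1 do
            for j from i + 1 to n do
                # Swap the players in slots i and j.
                candidate := subsop( i = current[j], j = current[i], current );
                val := kpi( candidate );
                if val > bestval then
                    best, bestval := candidate, val;
                    improved := true;
                end if;
            end do;
        end do;
        current := best;
    end do;
    return current, bestval;
end proc:

TweakBySwaps( [ 3, 9, 1, 7, 5 ], ToyKPI );
```

With this toy KPI, swapping stops at the descending order [9, 7, 5, 3, 1]. A simulated KPI is noisy, which is why a real run needs many simulations per candidate lineup.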

1993 Toronto Blue Jays

Consider this typical lineup from the 1993 Toronto Blue Jays:

restart;
with( BaseballLineupSimulator );
randomize():
OriginalLineup := [ "Henderson", "White", "Alomar", "Molitor", "Carter", "Olerud", "Fernandez", "Sprague", "Borders" ];

The stretch of players two through six was popularly known as "WAMCO". Originally, White usually led off, but when Henderson joined via trade, there were the "HWAMCO" (above) and "WHAMCO" variations. The lineup was already very refined and potent, but we may be able to refine it a little more, at least with respect to the baseball model used in the simulator:

ImprovedLineup := OptimizeLineup(
    OriginalLineup,
    numinnings = 9,
    numsimulations = 2500,
    kpi = KPIs:-TotalRuns,
    optimization = maximum,
    search = tweaking,
    numtweaks = 10,
    tweaksizes = {2,3},
    retestfactor = 2,
    usethreads = true,
    numthreads = 4,
    printprogress = true
);

During the optimization, when a potentially improved lineup is detected, this lineup is re-tested (in this case, with retestfactor*numsimulations=5000 simulations), and if the lineup passes the re-test, it is accepted as an improvement. The logic behind this step is that, generally, a false positive is more difficult to reverse than a false negative.

This optimization gave, when I ran it:

BetterLineup := [ "Henderson", "White", "Alomar", "Molitor", "Olerud", "Fernandez", "Carter", "Sprague", "Borders" ];

This improved the average runs scored per game from 4.799 to 4.889, a gain of 0.090 runs per game, or about 14.58 runs over a 162-game season.
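
The season figure is simply the per-game gain multiplied by the number of games, which can be checked directly in Maple. The variable names here are only for illustration:

```maple
gainpergame := 4.889 - 4.799:          # improvement in mean runs scored per game
gainperseason := 162 * gainpergame;    # scaled to a 162-game season
```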

