Statistical Report on Run Scoring In Baseball

Joe ReganMar 9, 2009

Without perspective, baseball statistics are meaningless. But with it, even the most basic statistics can provide a wealth of information on the sport.

With the help of http://www.baseballprospectus.com's wealth of statistical information, and a whole lot of Excel usage, I bring you my analysis of run scoring in baseball.

In this study, I looked at the team totals of every season, initially just from 2004 to 2008, but eventually expanding to cover the seasons from 1967 to 2008, excluding the strike-shortened years of 1981 and 1994. Overall, I worked with 1,064 team-seasons of data in this study.

The statistics used in this research were all the basic count statistics: Outs, Walks, Singles, Doubles, Triples, Home Runs, Hit By Pitch, Stolen Bases, and Caught Stealing. The one statistic that really matters for an offense in the end, runs, was regressed on these numbers.

Before actually beginning my research, I decided to split up the outs and walks categories into more specialized data points. Walks were split into two: unintentional (UBB) and intentional (IBB).

Outs took some more digging, however. I eventually broke up outs into 4 categories: strikeouts (SO), sacrifice hits (SH), sacrifice flies (SF), and non-sacrifice outs on balls in play (NSOBIP), which covers the remaining outs made on balls put in play.

Initially only interested in researching the recent history of run production, I took the last 150 team-seasons of MLB. Using the Excel function LINEST (with explanation and output found here), I was able to derive this regression.

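                    CS        SB        SO        SH        SF       HBP       IBB       UBB        HR        3B        2B        1B    NSOBIP         b
Coefficient    -0.0450    0.0282   -0.1928   -0.2895    0.1626    0.4010    0.2609    0.2938    1.4726    1.5924    0.7325    0.5186   -0.1838  313.8798
Std. error      0.2355    0.0726    0.0479    0.1084    0.2632    0.1349    0.1412    0.0339    0.0632    0.2125    0.0793    0.0355    0.0491  211.3092

R-squared: 0.9166    Standard error of runs: 21.3023
F statistic: 114.9234    Degrees of freedom: 136
Regression sum of squares: 677961.5    Residual sum of squares: 61715.2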

My r-squared value for this regression was .916565.
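
For readers without Excel, the calculation behind LINEST can be sketched in Python. This is a minimal sketch, not the spreadsheet used for this article: it assumes NumPy is available, the function name `linest` and its dictionary keys are illustrative labels, and the data is randomly generated stand-in data rather than real team totals.

```python
import numpy as np

def linest(X, y):
    """Least-squares fit with an intercept, returning LINEST-style statistics."""
    n, k = X.shape
    A = np.column_stack([X, np.ones(n)])            # last column is the intercept b
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ coef
    df = n - (k + 1)                                # residual degrees of freedom
    ss_resid = float(((y - fitted) ** 2).sum())
    ss_reg = float(((fitted - y.mean()) ** 2).sum())
    se_y = np.sqrt(ss_resid / df)                   # standard error of the estimate
    se_coef = se_y * np.sqrt(np.diag(np.linalg.inv(A.T @ A)))
    return {
        "coef": coef, "se": se_coef,
        "r2": ss_reg / (ss_reg + ss_resid), "se_y": se_y,
        "F": (ss_reg / k) / (ss_resid / df), "df": df,
        "ss_reg": ss_reg, "ss_resid": ss_resid,
    }

# Synthetic stand-in data (NOT real baseball data): 150 rows, 3 made-up count stats.
rng = np.random.default_rng(0)
X = rng.integers(50, 200, size=(150, 3)).astype(float)
y = 0.5 * X[:, 0] + 1.4 * X[:, 1] - 0.1 * X[:, 2] + rng.normal(0, 20, size=150)

stats = linest(X, y)
print(stats["coef"].round(3), round(stats["r2"], 4), stats["df"])
```

One difference worth noting: Excel's LINEST lists coefficients in the reverse order of the input columns, with b last, while this sketch keeps them in input order.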

Translated, this means that 91.66% of the variation in run output could be explained through these basic count statistics for each team. However, problems exist in this data. For one, four different count statistics (CS, SB, SF, IBB) have a t-score (the 1st row of this output divided by the 2nd) with an absolute value under 2, meaning that the data lacked proof to suggest that any of these statistics has a real relationship with runs scored.
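
The t-score check can be reproduced in a few lines. The sketch below uses the coefficients and standard errors from the 2004 to 2008 output above; the variable names are illustrative, and the cutoff of 2 is the rule of thumb used in this article rather than an exact critical value.

```python
names = ["CS", "SB", "SO", "SH", "SF", "HBP", "IBB", "UBB",
         "HR", "3B", "2B", "1B", "NSOBIP"]
coefs = [-0.0450, 0.0282, -0.1928, -0.2895, 0.1626, 0.4010, 0.2609,
         0.2938, 1.4726, 1.5924, 0.7325, 0.5186, -0.1838]
std_errors = [0.2355, 0.0726, 0.0479, 0.1084, 0.2632, 0.1349, 0.1412,
              0.0339, 0.0632, 0.2125, 0.0793, 0.0355, 0.0491]

# t-score = coefficient divided by its standard error
t_scores = {n: c / s for n, c, s in zip(names, coefs, std_errors)}

# Flag the statistics whose t-score falls short of the rule-of-thumb cutoff
weak = [n for n, t in t_scores.items() if abs(t) < 2]
print(weak)  # ['CS', 'SB', 'SF', 'IBB']
```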

For the more casual reader of this information, the real nonsense comes from the coefficient of triples being higher than the coefficient of Home Runs. Are triples really more valuable than Home Runs? Of course not. You do not need to be a math major or a baseball guru to know this.

Because of this, I realized the need to expand my data. And as I improved my model by continually adding team-seasons, I continued to realize a need to move back further. Eventually, I stopped in 1967, after compiling 1,064 team seasons of data. Using these numbers, I came up with this result.

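                    CS        SB        SO        SH        SF       HBP       IBB       UBB        HR        3B        2B        1B    NSOBIP         b
Coefficient    -0.1334    0.1603   -0.0974   -0.0063    0.6773    0.2990    0.2216    0.3256    1.4479    1.1813    0.6054    0.4977   -0.1043    2.0437
Std. error      0.0581    0.0226    0.0084    0.0390    0.0899    0.0543    0.0479    0.0109    0.0240    0.0728    0.0233    0.0125    0.0072   28.0662

R-squared: 0.9509    Standard error of runs: 21.2318
F statistic: 1565.527    Degrees of freedom: 1050
Regression sum of squares: 9174373    Residual sum of squares: 473327.4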

My regression is much improved from before. Not only are my basic hit and out coefficients closer to what I expected, with a b coefficient close to 0, but almost all of my coefficients are now statistically significant. Also, 95.1% of the variation in runs scored from 1967 to 2008 can be explained using my regression model, a very high amount. However, odd figures still jump out at me.

One striking figure is how suddenly, sacrifice hits have a coefficient of nearly 0, and no statistical significance. This could be a result of the difference between now and 40 years ago, when low-scoring games and small ball were much more common and efficient than in modern times.

The more pressing issue, however, continued to be my stolen base category. While both SB and CS show statistical significance (the absolute values of their t-scores are both over 2), the ratio of the two coefficients is odd. Overall, the coefficients seem to suggest that any stolen base success rate above roughly 45% helps a team in the long run.
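
To make the model concrete, here is a minimal sketch of how the 1967 to 2008 coefficients turn a season's count statistics into expected runs. The coefficients come from the output above; the function name `expected_runs` and the team line are hypothetical, invented purely for illustration and not taken from any real club.

```python
COEF = {"CS": -0.1334, "SB": 0.1603, "SO": -0.0974, "SH": -0.0063,
        "SF": 0.6773, "HBP": 0.2990, "IBB": 0.2216, "UBB": 0.3256,
        "HR": 1.4479, "3B": 1.1813, "2B": 0.6054, "1B": 0.4977,
        "NSOBIP": -0.1043}
INTERCEPT = 2.0437  # the b value

def expected_runs(counts):
    """Expected runs = b + sum of (coefficient * season count)."""
    return INTERCEPT + sum(COEF[stat] * n for stat, n in counts.items())

# Hypothetical team-season (made-up numbers, for illustration only)
team = {"1B": 1000, "2B": 280, "3B": 30, "HR": 160, "UBB": 480, "IBB": 40,
        "HBP": 50, "SF": 45, "SH": 50, "SO": 1050, "SB": 100, "CS": 40,
        "NSOBIP": 2950}

print(round(expected_runs(team), 1))
```

Reading the coefficients this way, each additional home run is worth about 1.45 runs to the model and each additional single about half a run, all else held equal.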
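
The roughly 45% figure is the break-even success rate implied by the SB and CS coefficients: the rate at which the runs gained from steals exactly cancel the runs lost to caught stealings. A short sketch of that arithmetic, with `break_even_rate` as an illustrative helper name:

```python
def break_even_rate(sb_coef, cs_coef):
    """Success rate p solving p * sb_coef + (1 - p) * cs_coef = 0."""
    return -cs_coef / (sb_coef - cs_coef)

# Coefficients from the 1967-2008 regression
rate = break_even_rate(sb_coef=0.1603, cs_coef=-0.1334)
print(f"{rate:.1%}")  # 45.4%
```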

Needing answers to this issue, I ran a 3-variable regression, with nothing but stolen bases, caught stealings, and runs. My output was this:

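                    CS        SB         b
Coefficient    -2.5007  0.795302  760.0481
Std. error      0.2174  0.087976  8.967319

R-squared: 0.113476    Standard error of runs: 89.78411
F statistic: 67.90456    Degrees of freedom: 1061
Regression sum of squares: 1094783    Residual sum of squares: 8552918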

After this, I ran another regression, including all of the previously used variables except for stolen bases and caught stealing.

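                    SO        SH        SF       HBP       IBB       UBB        HR        3B        2B        1B    NSOBIP         b
Coefficient   -0.09185  -0.00336  0.795618  0.273924  0.252963  0.331472  1.418573  1.249614  0.623657  0.504617  -0.10928  3.287156
Std. error    0.008586  0.039801  0.090933  0.055237  0.048629  0.011202  0.024173  0.073827  0.023661  0.012833  0.007369  28.66922

R-squared: 0.948136    Standard error of runs: 21.80908
F statistic: 1748.345    Degrees of freedom: 1052
Regression sum of squares: 9147332    Residual sum of squares: 500369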

These two sets of results seem to suggest a very weak relationship between stolen base statistics and runs scored, if any.

While both Caught Stealing and Stolen Bases are statistically significant in the stolen-base-only regression, and its coefficients imply a much more logical break-even success rate (about 76%), the b value is extremely high (about 760 runs) and the r-squared value is extremely low (about .113), suggesting a very weak relationship between those numbers and runs scored.

In turn, the second regression uses the same data as the original 1967-2008 regression, minus SB and CS. The r-squared value dropped very slightly, from about .951 to about .948, meaning SB and CS added only about 0.3 percentage points of explained variation to the model. Likewise, the standard error of runs increased only from 21.23 to 21.81.
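
Both the r-squared values and the standard errors follow directly from the sums of squares and degrees of freedom in the LINEST output, so the comparison can be checked by hand. A minimal sketch, with `fit_stats` as an illustrative helper name:

```python
import math

def fit_stats(ss_reg, ss_resid, df):
    """Return (r-squared, standard error of runs) from LINEST's bottom rows."""
    r_squared = ss_reg / (ss_reg + ss_resid)
    std_error = math.sqrt(ss_resid / df)
    return r_squared, std_error

# Full 1967-2008 model, and the same model without SB and CS
full_r2, full_se = fit_stats(9174373, 473327.4, 1050)
reduced_r2, reduced_se = fit_stats(9147332, 500369, 1052)

print(f"r-squared: {full_r2:.4f} vs {reduced_r2:.4f}")
print(f"standard error: {full_se:.2f} vs {reduced_se:.2f}")
print(f"r-squared lost by dropping SB and CS: {full_r2 - reduced_r2:.4f}")
```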

Statistically speaking, using the original model, and assuming roughly normally distributed errors, we would expect to find approximately 95% of teams within 2 standard errors (about 42.5 runs) of their expected runs using the coefficients and variables we have available.

Given that teams averaged 716.2 runs a season in this time period, I believe that these statistics provide a strong model moving forward for understanding what wins baseball games offensively (extra-base hits, walks) and what just does not matter as much (strikeouts vs. in-play outs, steals, sacrifice hits).
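
As a rough check on the two-standard-error band, the sketch below computes its width in runs and the share of a normal distribution that falls inside it. It assumes the regression errors are approximately normal, which this article does not test.

```python
import math

se_runs = 21.2318          # standard error of runs from the 1967-2008 model
band = 2 * se_runs         # half-width of the two-standard-error band, in runs

# Share of a normal distribution within +/- 2 standard deviations of the mean
coverage = math.erf(2 / math.sqrt(2))

print(f"within +/- {band:.1f} runs: {coverage:.2%} of teams (normal assumption)")
```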
