This is a 2-part series in which I will analyze a current Cy Young Predictor formula, offer a replacement formula to account for the change in philosophy for the Cy Young voters with the growing influence of new-age statistics (sabermetrics), and use this new formula to project the Cy Young race in 2011 and beyond.
Part 1 will look into a widely accepted Cy Young Predictor formula and explain the flaws in it. As voters are considering sabermetric statistics more and more, the Cy Young formula also needs to adapt to the new ways of thinking and voting. Part 1 will examine and break down a new formula and check it against past data for accuracy.
Part 2 will look to the 2011 Cy Young race in the National League. First an analysis will be done to determine how the top pitchers will fare in the 2011 season. Next, a projection will be done to determine how these pitchers will end up placing on the Cy Young ballot. A similar analysis will be done for the American League at a later date.
PART 1: A TILT TOWARDS ADVANCED STATISTICS AND WHAT IT MEANS TO THE CY YOUNG AWARD
Baseball fans are getting smarter.
There’s been a change in the way we watch, discuss and analyze baseball in the past few years. A lot of that is due to fantasy geeks around the world and their constant striving to find an edge in the game. We have Daniel Okrent and the guys from the original Rotisserie League to thank for that. They brought the game into our lives, and now millions play it day in and day out.
Fantasy baseball is really a derivative of what Bill James was trying to do with sabermetrics. James began his work in the 70’s to try and find a better way of assigning value to players on the field and at the plate. His work won the approval of many, including Okrent. Without James developing the theory and Okrent bringing a “silly little game” to our lives, in all likelihood, baseball wouldn’t be nearly as popular as it is today, and it certainly wouldn’t be dissected as much.
James knew back in the 70’s that many highly regarded baseball statistics weren’t telling the whole story. One of them was the win/loss category. Pitchers can only do so much to win games, so if they don’t have a decent offense behind them their wins will be lower and their losses will be higher than a pitcher with the same arm on a team with a great offense. That seemed obvious to him, yet the mainstream media and baseball gurus around the league had been using certain barometers for good pitchers and bad pitchers for years, so while James had done some incredible work it took years for it to be truly recognized.
The tide has shifted lately though, and many are finally coming around to the advanced metrics James and others had been writing about for years. This is especially evident in the 2010 Cy Young voting. Felix Hernandez won the award with a startling 21 of 28 first place votes from the Baseball Writers’ Association of America even though he only won 13 games. If this happened in the 70’s he wouldn’t have even made the ballot.
THE EXISTING CY YOUNG PREDICTOR
James wrote a formula with Rob Neyer of ESPN to calculate a projected Cy Young winner prior to this shift in voting. His formula is as follows:
Cy Young Points (CYP) = ((5*IP/9)-ER) + (K’s/12) + (SV*2.5) + Shutouts + ((W*6)-(L*2)) + VB
(where VB is a Victory Bonus of 12 points awarded for leading your team to the division championship.)
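As a rough illustration, the formula above can be written as a small Python function. The function name, the argument names, and the sample stat line are all hypothetical, and innings pitched is assumed to be a true decimal (200 1/3 innings as 200.333, not the box-score style 200.1).

```python
def cy_young_points_existing(ip, er, strikeouts, saves, shutouts,
                             wins, losses, division_winner=False):
    """Cy Young Points using the James/Neyer formula as quoted in this article."""
    # Victory Bonus: 12 points if the pitcher's team won its division.
    victory_bonus = 12 if division_winner else 0
    return ((5 * ip / 9) - er
            + strikeouts / 12
            + saves * 2.5
            + shutouts
            + (wins * 6 - losses * 2)
            + victory_bonus)


# Hypothetical stat line, not a real pitcher's season.
sample_points = cy_young_points_existing(
    ip=225.0, er=60, strikeouts=216, saves=0, shutouts=2,
    wins=20, losses=8, division_winner=True)
print(round(sample_points, 2))
```

For this made-up line the win/loss term alone contributes 104 of the 201 total points, which shows how heavily the formula leans on a pitcher's record.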
Why he even bothered to write the formula is questionable in itself, as the Cy Young almost always went to the pitcher with the most wins prior to 2009. Period. But that’s beside the point. James’ formula worked great up until the past few years. But then a noticeable shift in voting occurred.
In 2008, James’ formula correctly selects Cliff Lee and Tim Lincecum.
In 2009, however, James’ formula selects Felix Hernandez and Adam Wainwright. Zack Greinke won in the A.L. and was ranked 2nd in James’ formula. Tim Lincecum won in the N.L. and was only ranked 4th in James’ formula.
In 2010, again we see the shift in voting. James’ formula selects Roy Halladay and CC Sabathia, but Felix The Kid ended up taking home the A.L. award. Felix was ranked 6th in James’ formula! Above him were CC, Price, Lester, Soriano, and Buchholz.
If the trend in voting continues down this path, it’s clear that James’ original formula needs to be modified to fit this new-age thinking. In the following study I’ll explain the flaws in the old formula and provide a new formula to account for the shift in Cy Young voting.
THE ADJUSTED CY YOUNG PREDICTOR
If we take apart James’ formula and break it into variables and constants we have this:
Cy Young Points (CYP) = ((A*IP/9)-ER) + (K’s/B) + (SV*C) + (Shutouts*D) + ((W*E)-(L*F)) + (VB*G)
(where the constants are A through G, and the variables are each pitcher’s individual stats)
As Cy Young voters are becoming more and more accepting of sabermetric statistics, this formula seems to be leaving out some key data that voters look at. While I could make the case that voters should look at advanced statistics such as WAR, CERA, or DIPS, that isn’t yet a reality. Maybe in the coming years these advanced statistics will be looked at, but that time is not now. But there is one glaring piece of information left out of James’ formula that voters are clearly looking at now: WHIP ((walks + hits)/IP).
WHIP. It even sounds cool. It’s simple enough for anyone to understand, yet very telling of a pitcher’s dominance on the mound. With a quick glance at WHIP you can get a snapshot of the pitcher and understand how much luck was involved with his ERA and overall record. With Greinke and Hernandez winning the A.L. Cy Young the last few years yet not dominating the Win category, it’s clear the voters are looking into a category that those pitchers did well in. WHIP.
Greinke had a 1.073 WHIP in 2009 (good for 2nd in the A.L.) to go with his 2.16 ERA and 242 K’s, and Hernandez had a 1.06 WHIP in 2010 (2nd in the A.L.) to go along with his 2.27 ERA and 232 K’s. Neither pitcher was in the top 5 in Wins in the A.L., and in fact Hernandez only amassed 13 throughout the entire season.
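For readers who want to compute WHIP themselves, here is a minimal Python sketch. The helper names and the sample totals are hypothetical. It also assumes innings are written box-score style, where the digit after the point counts outs (200.1 means 200 and 1/3 innings), so they are converted to a true decimal first.

```python
def innings_from_notation(text):
    """Convert box-score innings such as '200.1' (200 and 1/3) to a decimal."""
    whole, _, outs = text.partition(".")
    outs = int(outs) if outs else 0
    if outs not in (0, 1, 2):
        raise ValueError(f"invalid partial-inning digit in {text!r}")
    return int(whole) + outs / 3


def whip(walks, hits, innings_pitched):
    """Walks plus hits per inning pitched."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return (walks + hits) / innings_pitched


# Hypothetical totals, not a real pitcher's season.
sample_ip = innings_from_notation("200.0")
sample_whip = whip(walks=60, hits=180, innings_pitched=sample_ip)
print(f"{sample_whip:.3f}")
```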
If we incorporate WHIP into James’ equation and modify the constants we can find an equation much more suitable for the present day. The easiest way to explain how and why I made the changes is to show both the EXISTING equation and ADJUSTED equation, and then provide an explanation and commentary below. The basic equation, including WHIP, is:
Cy Young Points (CYP) = ((A*IP/9)-ER) + (K’s/B) + (SV*C) + (Shutouts*D) + ((W*E)-(L*F)) + (VB*G) + ((H*IP)-(IP*WHIP/J))
And the constants used in both James’ (EXIST) and my (ADJUSTED) study are as follows:
I came to these adjusted constants by analyzing the relative strength each individual constant would add to the overall total. To do this I analyzed the 2009 N.L. Cy Young race. Using James’ existing equation, the top 10 finishers should have been the following pitchers in the order shown below in Table 1a (with stats included). The CYP(exist) is the value calculated with the EXISTING equation and the CYP(adjusted) is the value shown with the ADJUSTED equation.
The CYP(adjusted) components were separated into percentages of the sum in order to understand why a certain player received a certain score, i.e. answering the question, “what did they do well in”. Table 1b summarizes those findings.
|Pitcher type (equation)||ERA term||K’s||SV||Shutouts||W/L||VB||WHIP term|
|SP(exist)||30-40%||9-12%||0.00%||0-1%||45-55%||2-5%||0.00%|
|RP(exist)||10-15%||3-7%||60-70%||0.00%||13-17%||2-5%||0.00%|
Look at the averages to make sense of it all.
As you can see by looking at the averages, roughly 50% of the overall score in the ADJUSTED equation comes from ERA + WHIP, whereas in the EXISTING equation ERA + WHIP accounts for roughly 30-40% of the score for Starting Pitchers (SP). Another big change is the dependence on Wins. In the EXISTING equation, Wins accounted for roughly 45-55% of the total score, whereas in the ADJUSTED equation, Wins account for much less (an average of 20.66% in 2009). Strikeouts are also weighted more heavily in the ADJUSTED equation, as the voters seem to value them more now too.
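The percentage breakdown described above can be sketched in Python as follows. This is only an illustration: the function names and the stat line are hypothetical, the constants are the ADJUSTED ones from the final formula in this article, and VB is assumed to be a yes/no flag worth 5 points.

```python
def adjusted_components(ip, er, strikeouts, saves, shutouts,
                        wins, losses, whip, division_winner=False):
    """Split the ADJUSTED score into its individual terms."""
    return {
        "ERA": 5 * ip / 9 - er,
        "K": strikeouts / 5,
        "SV": saves * 1.5,
        "Shutouts": shutouts * 2,
        "W/L": wins * 3 - losses * 2,
        # Assumption: VB is a 0/1 flag scaled by the constant 5.
        "VB": 5 if division_winner else 0,
        "WHIP": 0.5 * ip - ip * whip / 3,
    }


def component_shares(components):
    """Express each term as a percentage of the total score."""
    total = sum(components.values())
    if total == 0:
        raise ValueError("total score is zero, shares are undefined")
    return {name: 100 * value / total for name, value in components.items()}


# Hypothetical stat line, not a real pitcher's season.
sample_shares = component_shares(adjusted_components(
    ip=225.0, er=60, strikeouts=216, saves=0, shutouts=2,
    wins=20, losses=8, whip=1.05, division_winner=True))
for name, share in sample_shares.items():
    print(f"{name}: {share:.1f}%")
```

For this made-up starter the ERA and WHIP terms together land just over 50% of the total and the win/loss term a little over 22%. Note that a term can go negative (a pitcher with many losses, for example), in which case the shares are harder to read as simple percentages.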
To verify that this ADJUSTED equation would work for more than just one circumstance, it was tested on the past 2 years’ Cy Young races in both the N.L. and the A.L. The data is shown below:
Using the adjusted constants, we have a new Cy Young Predictor formula as shown below:
Cy Young Points (CYP) = ((5*IP/9)-ER) + (K’s/5) + (SV*1.5) + (Shutouts*2) + ((W*3)-(L*2)) + (VB*5) + ((0.5*IP)-(IP*WHIP/3))
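To compare the two versions side by side, the lettered equation can be sketched in Python with the constants passed in as a set. The class and function names are hypothetical, the constant values are read off the two formulas quoted in this article, and two assumptions are made: VB is treated as a yes/no flag multiplied by G (12 for EXISTING, 5 for ADJUSTED), and the EXISTING equation simply has no WHIP term.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Constants:
    a: float  # innings multiplier in the ERA term
    b: float  # strikeout divisor
    c: float  # points per save
    d: float  # points per shutout
    e: float  # points per win
    f: float  # points deducted per loss
    g: float  # division-title bonus (VB assumed to be a 0/1 flag)
    h: float = 0.0             # innings multiplier in the WHIP term
    j: Optional[float] = None  # WHIP divisor; None drops the WHIP term


EXISTING = Constants(a=5, b=12, c=2.5, d=1, e=6, f=2, g=12)
ADJUSTED = Constants(a=5, b=5, c=1.5, d=2, e=3, f=2, g=5, h=0.5, j=3)


def cy_young_points(constants, ip, er, strikeouts, saves, shutouts,
                    wins, losses, whip, division_winner=False):
    """Evaluate the lettered CYP equation for one pitcher."""
    k = constants
    whip_term = 0.0 if k.j is None else k.h * ip - ip * whip / k.j
    return ((k.a * ip / 9) - er
            + strikeouts / k.b
            + saves * k.c
            + shutouts * k.d
            + (wins * k.e - losses * k.f)
            + (k.g if division_winner else 0)
            + whip_term)


# Hypothetical stat line, not a real pitcher's season.
stat_line = dict(ip=225.0, er=60, strikeouts=216, saves=0, shutouts=2,
                 wins=20, losses=8, whip=1.05, division_winner=True)
existing_score = cy_young_points(EXISTING, **stat_line)
adjusted_score = cy_young_points(ADJUSTED, **stat_line)
print(round(existing_score, 2), round(adjusted_score, 2))
```

Keeping the constants in one place makes it easy to try other weightings as voting trends change.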
By looking at the results from Tables 1c-1f, it’s clear that this formula produces more accurate results than the EXISTING equation in the sabermetrics age.
The red lines in each chart indicate the Cy Young winner from that year. As you can see, the ADJUSTED equation correctly chooses the Cy Young winner from that year and league. Unfortunately, we still have a relatively small sample size with this new trend of voting, so the ADJUSTED equation doesn’t overemphasize the importance of WHIP or completely disregard the value in Wins.
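A check like this can be automated once the scores are calculated. The sketch below is a hypothetical helper, not the method used to build the tables in this article: it takes each race as a set of CYP scores plus the actual winner and reports how often the top score matched. The pitcher labels and scores are placeholders, not real results.

```python
def predicted_winner(scores):
    """Return the pitcher with the highest CYP score."""
    return max(scores, key=scores.get)


def hit_rate(races):
    """Fraction of races where the top CYP score matched the actual winner."""
    if not races:
        raise ValueError("need at least one race")
    hits = sum(1 for scores, actual in races
               if predicted_winner(scores) == actual)
    return hits / len(races)


# Placeholder races: (CYP scores by pitcher, actual award winner).
sample_races = [
    ({"Pitcher A": 190.2, "Pitcher B": 181.7, "Pitcher C": 175.0}, "Pitcher A"),
    ({"Pitcher D": 168.4, "Pitcher E": 172.9}, "Pitcher D"),
]
sample_hit_rate = hit_rate(sample_races)
print(sample_hit_rate)
```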
As years pass, even this equation will most likely need to be updated to account for the new trends in Cy Young voting. There may be an even greater importance placed on sabermetrics in the future. Only time will tell. For now, this ADJUSTED equation seems fairly accurate at predicting, with given data, who will win the Cy Young award.
In part 2, I will examine the 2011 N.L. Cy Young race with this ADJUSTED Cy Young Predictor and find each pitcher’s probability of winning the Cy Young. Will a Phillies pitcher take it home, or do the odds rest with another N.L. starter?
Written By Todd Drager
Republished With Permission From 7thAndPattison.com
Follow Him on Twitter @7thandpattison