Last night I realized that r-squared can be calculated with Excel. And graphs with trendlines are very easy to create with Office 2007. So I had a little fun:
I used 2007-2009 data from www.fangraphs.com to create the graphs to the right, specifically cumulative batting lines for the 383 hitters with over 500 plate appearances.
The average BABIP (batting average on balls in play) from the nearly half a million plate appearances captured by this study is .305 with a standard deviation of .027. Shin-Soo Choo's .376 BABIP put him at an MLB-best 2.610 deviations above the mean (1075 PA).
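For anyone who wants to check the deviation figure outside of Excel, here is a minimal Python sketch of the calculation. The function name `z_score` is my own label, and the inputs are the rounded values quoted above, so the result lands near 2.63 rather than exactly on the 2.610 figure, which presumably came from unrounded spreadsheet values.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that `value` sits above (or below) `mean`."""
    return (value - mean) / std_dev

# Rounded figures quoted in the text: Choo's BABIP, the sample mean,
# and the sample standard deviation.
choo_z = z_score(0.376, 0.305, 0.027)
print(round(choo_z, 2))
```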
When I've thought about BABIP in the past, I've figured that trends and fluctuations are largely caused by line-drive rates and a player's speed. But I started to run across samples this offseason that made me wonder if a big piece of the puzzle was being overlooked: flyballs.
While a larger sample may yield slightly different results, the data I used points to a relationship between fly balls and BABIP that may be as significant as, if not more significant than, the relationship between line drives and BABIP.
Additionally, the r-squared between FanGraphs' speed score and BABIP is .0921.
Without a system that removes much of the subjectivity in interpreting batted-ball data -- help me HitFX, you're our only hope! -- this data has the potential to be misleading. But given that it came from over 100,000 games that took place under the watchful eye of Baseball Info Solutions, I'd match it up against any publicly available information out there.
Note that the correlation coefficient for infield fly balls is .2248 compared to .1542 for outfield fly balls. If a hitter consistently hits a lot of infield fly balls, his BABIP may suffer. But relatively large quantities of outfield fly balls may not significantly lower his BABIP. And a player's infield fly ball rate may not be stable from year to year, as players do not hit many infield fly balls -- this is where a larger sample could really help to clarify our data.
Quantifying speed is difficult. While quantitatively derived speed scores may mirror actual speed at times, they are hardly a perfect match for actual speed.
totFB% and BABIP: y = -1.6201x + 0.9658
LD% and BABIP: y = 0.3944x + 0.0729
ifFB% and BABIP: y = -0.6718x + 0.3005
ofFB% and BABIP: y = -0.9483x + 0.6653
GB% and BABIP: y = 0.5544x + 0.2617
Spd and BABIP: y = 19.782x - 1.9552
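The trendline equations and r-squared values above came from Excel, but the same least-squares fit is easy to sketch in plain Python. Everything below is illustrative: `linear_fit` is a hypothetical helper and the five data points are made up, not pulled from the FanGraphs sample. Judging by the intercepts (plugging a .305 BABIP into the Spd line gives a speed score near 4), x appears to be BABIP and y the other stat, so the sketch follows that orientation.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope*x + intercept, plus r-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-variable fit, r-squared is the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Made-up sample values (assumption): BABIP on x, line-drive rate on y.
babip = [0.280, 0.295, 0.305, 0.320, 0.340]
ld_pct = [0.17, 0.19, 0.18, 0.21, 0.22]

slope, intercept, r_squared = linear_fit(babip, ld_pct)
print(f"y = {slope:.4f}x + {intercept:.4f}, r-squared = {r_squared:.4f}")
```

Swapping which list goes on x changes the slope and intercept but not the r-squared, so the orientation matters for reading the equations but not for comparing the strength of the relationships.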
Using the correlation coefficients from this study, I created an equation that yielded an r-squared of 0.4137 with BABIP. But it's a little complicated.
Rather than inputting raw batted ball percentage values, I calculated how far each player from our population was from the mean in LD%, ifFB%, ofFB%, GB% and Spd. I then used the coefficients to weight the deviation values and come up with the following equation:
X = (.2560*LDdev) - (.2248*ifFBdev) - (.1542*ofFBdev) + (.0571*GBdev) + (.0921*Spddev)
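Here is a rough Python sketch of how X could be computed for a group of players. It rests on one big assumption: that each "dev" value is the raw difference between a player's stat and the population mean. The text doesn't say whether the deviations were also divided by the standard deviation, so treat this as one possible reading. The names `WEIGHTS`, `deviations` and `composite_scores`, and the two sample players, are all hypothetical.

```python
# Weights from the equation above; the signs carry the +/- of each term.
WEIGHTS = {
    "LD": 0.2560,
    "ifFB": -0.2248,
    "ofFB": -0.1542,
    "GB": 0.0571,
    "Spd": 0.0921,
}

def deviations(values):
    """Raw distance of each value from the group mean (assumed definition)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def composite_scores(players):
    """Return X for each player, given a list of dicts keyed by stat name."""
    devs = {stat: deviations([p[stat] for p in players]) for stat in WEIGHTS}
    return [
        sum(WEIGHTS[stat] * devs[stat][i] for stat in WEIGHTS)
        for i in range(len(players))
    ]

# Two made-up players (assumption), with rates as decimals and Spd as a score.
players = [
    {"LD": 0.20, "ifFB": 0.08, "ofFB": 0.35, "GB": 0.40, "Spd": 5.0},
    {"LD": 0.18, "ifFB": 0.12, "ofFB": 0.40, "GB": 0.42, "Spd": 3.0},
]

scores = composite_scores(players)
print(scores)
```

With raw deviations, Spd sits on a much larger numeric scale than the batted-ball rates and ends up dominating X. Dividing each deviation by that stat's standard deviation would put the terms on equal footing, if that is closer to what was intended.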
It may be possible to work off of this equation and come up with a BABIP prediction system based on batted ball and speed data. Anyone want to give it a shot?