Changing The State Of Publicly Available Minor League Stats | Astromets Mind

Wednesday, February 24, 2016

These days there’s not much about a major league game that can’t be found online at one of a few sites, but minor league data is stuck back in the dark ages. So I’ve scraped everything MiLB tracks from every minor league game over the past 5 years to create the most complete minor league stat page to date.

The Problem

Thanks to PITCHf/x, Statcast, MLBAM, and other stat tracking systems and companies, baseball fans can learn nearly every piece of information about a major league baseball game, from how fast pitches were thrown (or hit) to an outfielder’s reaction time and route efficiency on a fly ball to the gap. Sites like Fangraphs and Baseball-Reference offer park- and league-adjusted stats to compare and analyze major league players, and further break down performances by every split you can imagine. However, when it comes to the minor league side of the game, unless you go to the game with your own calibrated radar gun, you’re not even guaranteed to get reliable pitch velocities, and no tracked pitch information (type/speed/location) is available for free online (and possibly not at all). Also, Baseball-Reference has a few minor league splits, but they only track the basic stats for those splits, and you can’t compare players by split. Similarly, while Fangraphs posts FIP for minor league pitchers and a league-adjusted wRC+ for minor league batters, they don’t share the constants they use or offer any minor league splits. Statcorner and Minor League Central offer a little more batted ball information (although Minor League Central hasn’t updated since 6/18/2015), and you can get spray charts from MLBFarm, but no site brings it all together… until now. Introducing the Astromets Mind Minor League Stat Page, which is my attempt at bringing the best from the aforementioned stat sites into two Tableau worksheets: one for batters, one for pitchers.

My Solution

            With the help of the pitchRx package for R, I scraped all of the information available from MLB for games played between 2011 and 2015, from AAA down to Rookie ball. The MLB site in that link is an XML version of Gameday, so I scraped everything available from MiLB Gameday, which includes a pitch log, a play log, a ball-in-play log, and other less important information. I then used that database to create a Minor League Guts page, which has the linear weights and 1-year regressed park factors I used to create the stat worksheets. I tracked a dozen splits and all of the stats available and applicable to those split breakdowns (full glossary of stats used here), and gave each split its own tab in the worksheets. The splits I currently include are: Month, L/R, Home/Away, Batted Ball, Pitch Count (currently refers to the results of PAs ending in a #-# count), Outs, Base State, Inning, Times Faced, Opponent, Field (opposite, pull, center), and Position. For comparison, the Baseball-Reference minor league splits are: Home/Away, L/R, Month, a couple of base states, and younger/older. I also included a tab that allows you to pull up player Gamelogs, a ‘Season Totals’ tab, and spray charts.
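To make the Pitch Count split concrete, here is a rough Python sketch of grouping plate appearances by the count on which they ended and computing a simple on-base rate for each group. The record layout `(balls, strikes, event)` and the event labels are assumptions made up for illustration; they are not the Gameday schema, and this is not the code behind the worksheets.

```python
from collections import defaultdict

# Hypothetical event labels (an assumption, not Gameday's own vocabulary).
ON_BASE_EVENTS = {"single", "double", "triple", "home_run", "walk", "hit_by_pitch"}


def on_base_rate_by_final_count(plate_appearances):
    """Group PAs by the count when the last pitch was thrown.

    plate_appearances: iterable of (balls, strikes, event) tuples.
    Returns {"balls-strikes": share of PAs in which the batter reached}.
    Simplification: every PA counts in the denominator.
    """
    reached = defaultdict(int)
    total = defaultdict(int)
    for balls, strikes, event in plate_appearances:
        key = f"{balls}-{strikes}"
        total[key] += 1
        if event in ON_BASE_EVENTS:
            reached[key] += 1
    return {key: reached[key] / total[key] for key in total}


sample = [
    (3, 2, "walk"),
    (3, 2, "strikeout"),
    (0, 0, "single"),
    (0, 0, "groundout"),
    (0, 0, "home_run"),
    (1, 2, "strikeout"),
]
split = on_base_rate_by_final_count(sample)
```

The same grouping idea carries over to the other splits: swap the count key for month, outs, base state, or inning.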
The Gamelog tab has a mostly hidden column that you can just ignore; it is only there so that when you click on any stat cell for any game, a link to the MiLB Boxscore for that game will appear. The batters worksheet has 3 tabs dedicated to spray charts, while the pitchers worksheet has only the one (although I’ll probably add a spray chart comparison tab to that worksheet in a future update). As the name suggests, the spray chart comparison tab (which is currently the default opening display for the batters worksheet) allows you to compare the spray charts of any two players, or of one player over multiple seasons (hat tip to Bill Petti, who created the original spray chart comparison worksheet for major league players). I also included a defense spray chart in the batters worksheet, which tracks all plays fielded by a player.
Since I only calculated 1-year regressed park factors, I created two versions of the +/- stats: one that is only league-adjusted, and one that is both league- and park-adjusted. Stats beginning with a lowercase ‘p’ have the park adjustment (pERA-, pFIP-, pxFIP-, pwRC+). Finally, I included league average rate stats that will appear when you hover over a stat (broken down by split where appropriate), which is something I’ve only ever seen at Statcorner.
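For readers who want to see how a league-adjusted versus park-adjusted "minus" stat differs, here is a minimal Python sketch built on the standard FIP formula. The park adjustment shown (dividing by a park factor where 1.0 is neutral) is one simple convention and an assumption on my part for illustration; it is not necessarily the formula used for pFIP- in the worksheets, and all of the numbers below are invented.

```python
def fip_constant(lg_era, lg_hr, lg_bb, lg_hbp, lg_k, lg_ip):
    """League constant that puts league FIP on the same scale as league ERA."""
    return lg_era - (13 * lg_hr + 3 * (lg_bb + lg_hbp) - 2 * lg_k) / lg_ip


def fip(hr, bb, hbp, k, ip, constant):
    """Standard FIP. ip must be true decimal innings (100.1 notation is not handled)."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def fip_minus(player_fip, lg_fip, park_factor=1.0):
    """100 is league average, lower is better.

    park_factor=1.0 gives the league-only version; any other value applies
    the simple (assumed) park adjustment.
    """
    return 100 * (player_fip / park_factor) / lg_fip


# Invented league and player totals, for illustration only.
c = fip_constant(lg_era=4.00, lg_hr=100, lg_bb=400, lg_hbp=50, lg_k=900, lg_ip=1200)
league_fip = fip(100, 400, 50, 900, 1200, c)   # equals league ERA by construction
player_fip = fip(hr=10, bb=30, hbp=5, k=100, ip=120, constant=c)

league_only = fip_minus(player_fip, league_fip)                       # FIP- style
park_adjusted = fip_minus(player_fip, league_fip, park_factor=1.05)   # pFIP- style
```

With a hitter-friendly home park (factor above 1.0), the park-adjusted number comes out lower than the league-only one, which is the direction you would expect for a pitcher working in a tough environment.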

Disclaimers about the data

            The standard data is reliable for all leagues, but the ball-in-play and pitch-by-pitch data appear to be less reliable for leagues/parks with no data stringers. As far as I know, if a park has a data stringer who works the games, then there will be a live MiLB Gameday link available for its home games. If not, the page for that game will only have links to a box score or game log (example), and the official scorer phones the game updates in at the end of each inning or at each pitching substitution. The data stringer inputs the pitch-by-pitch information and is responsible for marking where a ball was fielded, which is what the available minor league spray charts are built on, but I’m not sure how much of that detailed information is tracked and reported by the scorer, or whether the information reported is consistent across levels. It’s clear that in some leagues/parks (like the rookie-level GCL/AZL), the pitch-by-pitch information is not tracked at all, as all walks are 4 balls, all strikeouts are 3 strikes, and all other plays end on 1 pitch. I didn’t treat these leagues any differently when calculating stats, so pitch information will still show up in the worksheets, but the pitch counts or ball-in-play percentages should stick out as looking wrong. I don’t expect anybody will be looking too closely at splits from rookie ball anyway, but I’m giving this disclaimer in case something looks funny about pitch data from the A/A+ level. The example link above brings you to a game where the St. Lucie Mets were the home team, and the accompanying components link shows that they did not track the information at the pitch-by-pitch level. You can still get some pitch information on St. Lucie players, as there are teams in the FSL that have MiLB Gameday, so the information will show up in individual games of the gamelog, but you should otherwise just ignore the pitch information from the FSL.
I should be able to create a fix to exclude that unusable pitch log information from the dataset, but that’s just one of a few minor fixes I plan to make for future updates. You shouldn’t have to worry about that problem for AA/AAA games, and that’s when that information is of most interest anyway.
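The telltale pattern described above (every walk logged as 4 pitches, every strikeout as 3, everything else as 1) suggests one way such a filter could work. The Python sketch below flags a game as untracked when every plate appearance fits that pattern. The `(event, pitch_count)` layout, the event labels, and the game IDs are all hypothetical, and this is only an illustration of the idea, not the fix itself.

```python
def looks_untracked(plate_appearances):
    """Return True if every PA matches the placeholder pitch pattern.

    plate_appearances: list of (event, pitch_count) tuples for one game.
    An empty game is not flagged, since there is nothing to judge.
    """
    if not plate_appearances:
        return False
    for event, pitches in plate_appearances:
        if event == "walk":
            expected = 4
        elif event == "strikeout":
            expected = 3
        else:
            expected = 1
        if pitches != expected:
            return False
    return True


# Hypothetical games keyed by made-up IDs.
games = {
    "game_with_stringer": [("walk", 6), ("strikeout", 5), ("single", 2), ("groundout", 1)],
    "game_without_stringer": [("walk", 4), ("strikeout", 3), ("single", 1), ("flyout", 1)],
}
usable = {gid: pas for gid, pas in games.items() if not looks_untracked(pas)}
```

A real game with tracked pitches is extremely unlikely to match the pattern on every single plate appearance, so checking at the game level keeps false positives rare while still letting tracked road games for teams like St. Lucie through.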
            Another potential problem with the pitch log or ball-in-play information is human or CPU error. When I tracked the pitches of Thor and Matz in the minors, I’d occasionally notice a misclassified pitch result (called strike instead of swinging) or missing pitches in the Gameday pitch log. Also, some ball-in-play location information is y-shifted for some reason, which I suspect is a CPU error. I know MLBFarm is using the same data and corrections for their spray charts as I am because they have the same occasionally shifted data points in the same places. For example, the default Defense Spray Chart is for Gavin Cecchini’s 2015, and it shows that he committed an error on a groundball in shallow CF, but I’ve posted the GIF for that error (see all of them here), and you can see that he was actually near 2B.

Future Efforts

            I still have some minor updates to make to the worksheets, but they should be otherwise ready for public consumption, which should give you something fresh to look at until Spring Training games start in a couple of weeks:
       -       I’d like to add some more player information, for example DOB and positions played, and I forgot to track games played/started, so those need to be added too.
       -       A few times a player would appear on the Gameday XML page with a pitcher/batter ID, but no name listed, so I need to create a master list of IDs and names that can be used to fill in the missing names.
       -       I need to improve the ‘Fielded By’ list for the ‘Defense Spray Chart’ tab, because right now a few players are missing a little information. I had to extract fielder information from the Gameday at-bat description, and getting the full name was tricky sometimes for players with initials instead of a first name. It shouldn’t be a hard fix, but it will be time consuming (for my computer), and fixing it wasn’t high on my priority list before sharing these workbooks.
       -       I’d also like to fix the hover response to only show a league average rate stat if you are hovering over the cell in the column of that rate stat, because right now it will show you all the rate stats if you hover over any cell, and that can lead to a big pop-up box (unless I hear that people like it as is).
       -       Lastly, the list of Gameday IDs pitchRx had available for 2011-2015 appears to be missing a few games in a few leagues per year, so I’d like to scrape the missing data to complete my database. However, I’m probably not going to be tracking down those few missing games before next offseason, because the missing games are such a small fraction of the total that it just wouldn’t be worth my time right now.
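For the missing-name item in the list above, the master list idea can be sketched in a few lines of Python: collect every ID that appears with a name somewhere in the data, then use that mapping to fill the rows where the name is blank. The `(player_id, name)` layout and the IDs and names below are invented for illustration.

```python
def build_master_list(records):
    """Map player ID -> name, using every record that does carry a name.

    records: iterable of (player_id, name) where name may be None or "".
    The first name seen for an ID wins.
    """
    master = {}
    for player_id, name in records:
        if name:
            master.setdefault(player_id, name)
    return master


def fill_missing_names(records, master):
    """Return records with blank names replaced from the master list.

    IDs that never appear with a name stay unresolved (None).
    """
    return [(player_id, name or master.get(player_id)) for player_id, name in records]


# Invented IDs and names.
rows = [
    (1001, "Player A"),
    (1002, None),
    (1002, "Player B"),
    (1001, ""),
    (1003, None),
]
master = build_master_list(rows)
filled = fill_missing_names(rows, master)
```

Any ID that is still unresolved after this pass (like 1003 in the example) would have to be looked up from an outside source.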

Wrapping things up for now

            Naturally, if you have an idea for other stats or splits that you’d like to see included in the worksheets, leave a comment below or send it to @Astromets31. You may have noticed that I also have a Winter League stat page, which will track the same stats from the ABL, DWL, MPL, PRL, and VWL over the past 5 years, but the page needs to be updated and the pitchers worksheet still needs to be created. I’d also like to create a similar worksheet for the AFL data, except pitch information is available for those games, so I will be able to include even more stats and splits. I’m not sure if I will get to the AFL workbooks before the season starts, though. As for the 2016 minor league season, since Tableau Public doesn’t allow you to auto-update the data connections, I may have to create a separate worksheet with only 2016 stats (so it’s not such a huge update every time) and/or just not update the worksheets that often.
