Thursday, November 30, 2023

An Inside Look At One Of My Models

The one thing I've never really talked about with regard to my sports betting journey is the nuts and bolts of my models, for pretty obvious reasons. I've always treated them like state secrets, and it seems like just about every other sports bettor I know who models is the same way. I've been in a couple of different big betting groups in one way or another, and that is one topic that is just off limits. I remember one time during a night out at Foxwoods I asked someone in my group a very basic question about what kind of data they use, and it was the most serious I've seen the guy in over a decade. He said something like "I can't tell you a single thing about it, ever" and that was the last time it was ever brought up. And it's understandable, of course. A good model that beats major markets can be worth millions of dollars, if not more. So obviously you'd want to keep it a secret as much as possible.

But I've been thinking, as my career in sports betting is mostly behind me (the modeling aspect at least), I don't really have anything to lose by sharing it. It isn't super advanced at all, and everything in it can be found online. Oh, and by the way, the model I'll be sharing hasn't even won in like a full year's worth of sample. I still refuse to believe that it's unprofitable. Just with the amount of time and work I've put into it and how soft these lines generally are, I simply refuse to believe it's a loser. But the numbers don't lie and I don't trust it enough to use it anymore. So I don't think anyone can just take this and start making money. However, there is always something to learn when you read about other people modeling. I've learned that everyone approaches modeling quite differently. They might use the same general stats and end up at the same place, but there are countless ways to model sports.

I've always been more of the granular sort. I like cutting edge stats, I want the most recent data and I try to stay up to date on any and all lineup/injury changes (TRY being the operative word there). I'm looking for increases or decreases in velocity in a pitcher's last start, a hockey player running bad with getting his shots blocked, or a receiver suddenly getting a higher percentage of throws. Others take a more 'thousand feet up' approach. I remember reading "Trading Bases" by Joe Peta, which was a super interesting book and a great look into his model. He was a stock guy who got hit by a car and had to spend a year or something in bed. And so he took up trying to beat baseball. He made this model that graded every team in terms of expected winning percentage just based on their runs scored and runs given up, and would place bets all season accordingly. He just updated the model as games ended. (At least this is what I remember of it. There may have been more to it but the general idea is right.) He didn't know who was the starting pitcher, who was out that day, etc. And he crushed it. I just finished reading Billy Walters' new book (which is great by the way and I'll talk about it more later) and he somewhat goes into his models. I'll probably make a separate post about it and a couple of other books I've read recently, but he makes power ratings for every team and bets when he has an edge over the posted line. He actually has a power rating for every PLAYER. His edge seems to come from superior info. He said something like "everyone knows who's injured, but not everyone knows HOW injured they are".

I started modeling props after reading "Attacking Las Vegas" by Justin7. He gives some really simple and easy examples for modeling NFL player props and I just took the logic from them and applied it to different things. He also taught me about the Poisson distribution in that book which is like a literal cheat code for pricing props. I first started with pen and paper and all I did really was normalize their number for the opponent. So for example, if a QB averaged 200 yards per game, I would just take 200, divide by the league average yards given up by defenses, then multiply that by what the opponent gives up. And I basically made a living doing that for a few years. Obviously it was more due to the lines being super soft, but that really is all it took for a while. Like I said before, once DFS came onto the scene, the prop party dried up, for me at least. It coincided with me getting married and more into my real job so it all worked out. 
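That pen and paper normalization can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above: the function name is made up, and apart from the 200 yards per game, the numbers (league defenses allowing 220, this opponent allowing 242) are hypothetical.

```python
def opponent_adjusted(player_avg, league_avg_allowed, opponent_avg_allowed):
    """Scale a player's per-game average by how the opponent compares to league average."""
    if league_avg_allowed <= 0:
        raise ValueError("league average must be positive")
    return player_avg / league_avg_allowed * opponent_avg_allowed


# Hypothetical inputs: only the 200 yards per game comes from the example above.
qb_yards_per_game = 200
league_yards_allowed = 220    # assumed league average allowed per game
opponent_yards_allowed = 242  # assumed: this defense gives up 10% more than average

projection = opponent_adjusted(qb_yards_per_game, league_yards_allowed, opponent_yards_allowed)
print(round(projection, 1))
```

A defense that gives up 10% more than average bumps the projection up 10%, and a league-average defense leaves it unchanged.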

Once I learned Excel, that was like a whole different ballgame. It really is amazing the stuff you can do in Excel. I made models for props in all four major sports and have multiple models for a few different sports. The one I'll be talking about today is the one I made for MLB starting pitcher strikeouts.

When I first started betting this prop, it really was as easy as taking the pitcher's average K's per start, normalizing for the opponent and betting when it was way off (you can use Poisson for this, kind of, but I usually wouldn't bet anything that wasn't obviously off). But as things progressed and I moved everything into Excel, I began thinking about it differently and slowly developed what I think was a pretty neat little model that won me plenty of money for years until it basically became obsolete a couple of years ago.

Whenever I start with a model, I always try to think of it like an alien would. Like someone who's never seen or heard of the sport before. What is it that I'm trying to do here? Come up with an expected number of strikeouts for a starting pitcher. Ok, what is a strikeout, exactly? What are the events that lead up to that? What stats do I know I need? Number one, the pitcher's expected K rate. Aka, the percentage chance that each plate appearance ends in a strikeout. I need to know how long the pitcher is going to stay in the game. Do I use pitches thrown? Innings pitched? I need to know the K rate of the opponent and the K rate of the average team.

The biggest part of this model, the most 'artsy' I would say, is coming up with the expected K%. My first instinct would be to simply use the pitcher's career or current season K% (or some combo), and that's probably ok. But I did a ton of reading about baseball analytics back then and I found out that the best predictor of strikeouts wasn't actually past strikeouts, but swinging strike percentage, which is the percent of pitches that end in the batter swinging and missing. The general logic is that getting a swing and miss is the best possible outcome for a single pitch, and the best 'message' that the hitter is overmatched. More swinging strikes = more strikeouts.

Another good predictor of strikeouts is raw Contact %. I also found the best fit for Contact% to K% and would use a mix of those two to come up with K%. If I had the time, I'd also look at his last few starts and at pitch usage and velocity. Velocity is huge and also a really good predictor of an injury.

At a certain point, raw K% is better to use than SWS% and Contact%, but early in the season and with new pitchers, SWS% can be deadly. You'll see all kinds of pitchers who are either running good or running bad with it early and it's a great stat to use. (Interesting aside: there are a few pitchers who seemed to be able to outperform their SWS% for their entire career. Aaron Nola was one of them. I always used to try to see him pitch to see what it was exactly that he was doing differently but I never really could find it. It was also somewhat surprising how useless called strikes were for predicting strikeouts.)

Anyway, I would get the pitcher's SWS% and Contact%, but then you have to turn that into K%. I read a bunch of FanGraphs articles back then, and one of them gave the average number for SWS% into K%. I'm not going to give it here but you can find it pretty easily if you want to. I also kept a list in Excel. Every time I priced a pitcher I would put in his SWS% and K% for his past 4 years and keep the list going all the time so I had my own updated number. It was always very close to the FanGraphs number anyway. I did the same for Contact% (which isn't as important as SWS%).
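For anyone who wants to keep that kind of running list outside of Excel, here is a rough sketch of one way to do it: a straight-line least squares fit of K% on SWS%, another of K% on Contact%, then a weighted mix of the two estimates. To be clear about what is assumed here: the rows below are made-up placeholder numbers, not real pitcher seasons and not the FanGraphs number; the function names are made up; and the 0.7 weight only reflects that SWS% matters more than Contact%, it is not the actual mix.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def blended_k_rate(swstr, contact, swstr_fit, contact_fit, swstr_weight=0.7):
    """Mix the two K% estimates; swstr_weight is an assumed, tunable weight."""
    k_from_swstr = swstr_fit[0] * swstr + swstr_fit[1]
    k_from_contact = contact_fit[0] * contact + contact_fit[1]
    return swstr_weight * k_from_swstr + (1 - swstr_weight) * k_from_contact


# Placeholder rows of (SWS%, Contact%, K%) as fractions. NOT real data.
rows = [
    (0.08, 0.82, 0.17),
    (0.10, 0.79, 0.21),
    (0.12, 0.76, 0.25),
    (0.14, 0.72, 0.30),
]
swstr_fit = fit_line([r[0] for r in rows], [r[2] for r in rows])
contact_fit = fit_line([r[1] for r in rows], [r[2] for r in rows])

projected_k = blended_k_rate(0.13, 0.74, swstr_fit, contact_fit)
print(round(projected_k, 3))
```

Every time a pitcher gets priced, his rows get appended and the fits get rerun, which is the same idea as keeping the list going in a spreadsheet.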

This was the part of the model that I was always tinkering with. Some pitchers would be way over- or underperforming their expected K% all year, and you'd start to wonder if it was luck or a skill. That relationship between SWS% and K%, and to a lesser extent Contact% and K%, was an elusive one and I don't think I ever really quite nailed it down.

But anyway, now that I had the pitcher's expected K%, the rest was pretty easy, honestly. First, you need the expected K% for the average opponent. I had four separate numbers: AL vs R, AL vs L, NL vs R, and NL vs L. I would update these numbers every week at least. Then you take the opponent's K%. (Side note: you can drill down into anything as far as you want. For example, for a while I was taking the day's posted lineup and using their K% instead of taking a team stat. My thinking was this would be more accurate, obviously. But everything you do takes time. And I've always lived by Justin7's credo: do lots of props quickly. So I eventually stopped doing this.)

You normalize the pitcher's K% with the opponent, and now you have the true expected K% for that pitcher for that day. (Again, we're obviously approximating here and I'm sure a true 'data scientist' would find some glaring errors in the way I do things. But they were good enough to win for a while, and again, you can spend more time and get more accurate on any number there is.) Now that you have the K%, you need to get a number for how long he's going to be in the game. Or, more accurately, how many batters he's going to face. (This was a slight misstep for me at first. I kept trying to come up with a number for innings pitched and then even percentage of game pitched, for a reason I can't explain now. Batters faced is very clearly the number you're looking for.) For this, I just took the median number that he had faced that season with a mix of total batters faced per start. This had a little bit of a fudge factor to it, as I would consider the opponent and his last few starts and then just adjust it manually, which probably isn't great.
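Here is a small sketch of those two steps in Python. It assumes the same kind of normalization as the old pen and paper method (divide by the league number, multiply by the opponent's) and a simple blend of median and mean batters faced. The function names, the 50/50 weight and every number in the example are hypothetical, and the manual fudge factor is left out.

```python
from statistics import mean, median


def matchup_k_rate(pitcher_k, opponent_k, league_k):
    """Scale the pitcher's expected K% by the opponent's K% relative to league average."""
    if not 0 < league_k < 1:
        raise ValueError("league K% must be a fraction between 0 and 1")
    return pitcher_k * opponent_k / league_k


def expected_batters_faced(batters_faced_by_start, median_weight=0.5):
    """Blend the median and the mean of batters faced per start (weight is an assumption)."""
    if not batters_faced_by_start:
        raise ValueError("need at least one start")
    return (median_weight * median(batters_faced_by_start)
            + (1 - median_weight) * mean(batters_faced_by_start))


# Hypothetical inputs, all fractions: pitcher 26% K, opponent 24% K,
# league average 22% K for the matching league/handedness split.
k_rate = matchup_k_rate(0.26, 0.24, 0.22)
bf = expected_batters_faced([24, 26, 22, 27, 25, 18])
print(round(k_rate, 4), round(bf, 2))
```

Using the median keeps one short outing (the 18 above) from dragging the number down too far, while the mean still gives it some say.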

So now you have the SP's expected K% and the total batters he's expected to face. Multiply those and you have expected strikeouts. I added an ump adjustment as well later on.

Now you just compare your number with the book's number. I'm actually not sure if you can use Poisson for this. The conditions for Poisson are that the event has to be very rare and not have the rate change over time, which isn't really true for this. But I usually didn't need to use Poisson. I'd get something like 5.323 and see 6.5 even on the under. You don't need Poisson to know that's a bet.
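For the closer calls, here is a sketch of how the under could be priced both ways. It treats 5.323 as the Poisson mean, and, as an alternative that takes the batters faced number seriously, treats each batter as an independent trial with the same K% (a binomial). The 24 batters faced is a hypothetical number picked for illustration, and the function names are made up.

```python
from math import comb, exp, factorial


def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson variable with the given mean."""
    return sum(exp(-mean) * mean ** i / factorial(i) for i in range(k + 1))


def binomial_cdf(k, n, p):
    """P(X <= k) for n independent trials that each succeed with probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(min(k, n) + 1))


expected_ks = 5.323         # the model's number from the example above
batters_faced = 24          # hypothetical
k_rate = expected_ks / batters_faced

# Under 6.5 wins with 6 or fewer strikeouts.
under_poisson = poisson_cdf(6, expected_ks)
under_binomial = binomial_cdf(6, batters_faced, k_rate)

breakeven = 0.5             # even money needs better than a coin flip
print(round(under_poisson, 3), round(under_binomial, 3))
print("bet the under" if min(under_poisson, under_binomial) > breakeven else "pass")
```

With the same mean, the binomial is a bit less spread out than the Poisson, so the two give slightly different prices. When both sit well clear of the breakeven, the choice of distribution doesn't change the decision.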

Like I said before, this model, in all its iterations, did very well for lots of years. I guess it's similar to the rest of my models: it's certainly nothing groundbreaking, but I like to think it has a certain elegant simplicity. There are other things I've tinkered with on it: I tried to factor in called strikes more, and I tried to copy and paste an equation I got from Baseball Reference, which, by the way, never works. I once tried out a 'model' that was straight from FanGraphs for expected home runs. It was showing big edges but it never worked. And some guy from Football Outsiders tried to go tout a while ago and did terrible and was a big baby about it. For some reason, the stat guys don't always translate directly to the betting world even though everyone uses their numbers to make bets. It really is a different skill set.

Anyways, that's it for today. I have some more things to talk about, so check back soon.