Inukchuk wrote:That would be the case if the player pool was small. The first year I messed around with it, I tried doing it by position and Brad Wilkerson was my 3rd rated first baseman because he stole 13 bases the year before...
Lately, I've joined some pretty big leagues and therefore had to increase the number of players I project. As the sample size gets larger, all data tends to approach the normal distribution and flatten out the values of SB/SV guys.
This is not true. Not all things in this world are normally distributed in large samples. The data I presented included all players rostered in my league. If you increase the sample size, you will be including more players with little to no stolen bases/saves and the data would get even more skewed.
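To make the point about a deeper player pool concrete, here is a rough sketch in Python. The stolen base totals are made up for illustration (they are not the league data mentioned above), and it assumes the high-steals guys are already rostered, so the players being added are mostly fringe guys with few or no steals. The `skewness` helper is the usual third standardized moment, written out by hand.

```python
from statistics import mean, median

def skewness(values):
    """Population skewness: third central moment over sd cubed."""
    mu = mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical SB projections for 80 rostered players (illustration only).
rostered = [0] * 10 + [3] * 20 + [8] * 20 + [15] * 15 + [25] * 8 + [45] * 4 + [60] * 3

# Assumption: going deeper mostly adds bench/fringe players with few steals.
fringe = [0] * 60 + [1] * 30 + [2] * 20 + [5] * 10
deeper = rostered + fringe

for label, pool in (("rostered", rostered), ("deeper", deeper)):
    print(f"{label}: n={len(pool)} mean={mean(pool):.2f} "
          f"median={median(pool)} skew={skewness(pool):.2f}")
```

Under those assumptions the mean sits above the median in both pools, and the skewness goes up rather than down as the pool gets deeper. If the added players included a lot of new high-steals guys, the result could differ.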
jcook3127 wrote:To Rookies and Cream: Would it matter that a stat like steals is not normally distributed?
All that means is the standard deviation will be significantly higher because the sample data lies significantly farther from the mean...so even though Reyes will have probably 45-50 more steals than the mean, I would expect a much higher standard deviation with a category like steals...
Obviously you can't make inferences like 95% will be within two standard deviations, etc. etc..but I think the method of calculating the st. deviation for any category and adding up a total amount of st. deviations above or below makes logical sense with any category..simply because you wouldn't be treating any two categories alike...every category of projections would have its own st. dev. and mean
The z-score method assumes all stats are normally distributed. Stats like saves and SB's are positively skewed, therefore the mean is not an accurate measure of central tendency.
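For anyone following along, the z-score method being discussed can be sketched roughly like this. The player names, categories, and numbers are hypothetical, and this is just one plain reading of "add up the standard deviations above or below the mean in each category", not anyone's actual spreadsheet.

```python
from statistics import mean, pstdev

# Hypothetical projections for illustration only; not real players or data.
projections = {
    "Player A": {"HR": 30, "SB": 5},
    "Player B": {"HR": 10, "SB": 50},
    "Player C": {"HR": 20, "SB": 8},
    "Player D": {"HR": 20, "SB": 1},
}

def z_score_totals(players):
    """Sum each player's z-scores across all categories."""
    categories = list(next(iter(players.values())).keys())
    totals = {name: 0.0 for name in players}
    for cat in categories:
        values = [stats[cat] for stats in players.values()]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # everyone is identical here, so the category adds nothing
        for name, stats in players.items():
            totals[name] += (stats[cat] - mu) / sigma
    return totals

totals = z_score_totals(projections)
for name, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {total:+.2f}")
```

Each category gets its own mean and standard deviation, which is the part that makes sense with any category. The catch with a skewed category like SB is that one big total pulls the mean up, so most of the pool ends up slightly below average while the outlier gets a large positive score.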
rookies and cream wrote:This is not true. Not all things in this world are normally distributed in large samples. The data I presented included all players rostered in my league. If you increase the sample size, you will be including more players with little to no stolen bases/saves and the data would get even more skewed.
I agree, standard deviation is intended for use with normal distributions. So I can see this is not the intended application of z-scores. Doing so yields a pretty high number for guys like Reyes and Crawford. But when so many players are clustered in the 0-10 range, a guy who steals 40+ is a category killer by himself. I would expect to see a pretty high value there.
What other technique would make sense to use here instead?
Thanks for all the tips. I've never done my own projections before, so it should be fun.
Now what about guys coming out of the minors? For example, a guy like Evan Longoria in 2008. Are minor leaguers a crapshoot, or is it possible to project their stats accurately?
Polar Bear wrote:Thanks for all the tips. I've never done my own projections before, so it should be fun.
Now what about guys coming out of the minors? For example, a guy like Evan Longoria in 2008. Are minor leaguers a crapshoot, or is it possible to project their stats accurately?
The beta (or risk) is higher for a player with less MLB service time. It's easier to project a seasoned veteran than a rookie, so beta is used to highlight the risk involved in the projection.
TheRock wrote:I agree, standard deviation is intended for use with normal distributions. So I can see this is not the intended application of z-scores. Doing so yields a pretty high number for guys like Reyes and Crawford. But when so many players are clustered in the 0-10 range, a guy who steals 40+ is a category killer by himself. I would expect to see a pretty high value there.
What other technique would make sense to use here instead?
I'd probably just knock the extreme outliers down a notch or two in my rankings. If you're looking for something more objective, I've read about others assigning weights for each category as a method of correction (http://baseball-lab.blogspot.com/2007/0 ... ation.html). I would imagine something else can be done using the skewness/kurtosis of the distributions as well, though I'm not sure about this.
rookies and cream wrote:I'd probably just knock the extreme outliers down a notch or two in my rankings. If you're looking for something more objective, I've read about others assigning weights for each category as a method of correction (http://baseball-lab.blogspot.com/2007/0 ... ation.html). I would imagine something else can be done using the skewness/kurtosis of the distributions as well, though I'm not sure about this.
Yeah, weighting kinda makes sense.
My wife works with a bunch of actuaries, I'm having her ask around.
Inukchuk wrote:That would be the case if the player pool was small. The first year I messed around with it, I tried doing it by position and Brad Wilkerson was my 3rd rated first baseman because he stole 13 bases the year before...
Lately, I've joined some pretty big leagues and therefore had to increase the number of players I project. As the sample size gets larger, all data tends to approach the normal distribution and flatten out the values of SB/SV guys.
This is not true. Not all things in this world are normally distributed in large samples. The data I presented included all players rostered in my league. If you increase the sample size, you will be including more players with little to no stolen bases/saves and the data would get even more skewed.
I don't agree that increasing sample size would skew the data more, because while you would be adding more low steals you'd also be adding more higher steals guys as well. But whatever...
The problem with showing distributions of only rostered players for only one season is that the sample size is still far too small. Yes, the data is skewed positively in 2008 (which I understand was your point and is valid...more on that later), but that's a pretty insignificant amount of data in the grand scheme of things. It's like showing a distribution of temperature, but only using the summer months. Were you to take every player's stolen base totals from every season played, SB distribution would approach a normal curve. This is in essence what the central limit theorem implies. But whatever...
OK, all this nerdery is driving the ladies away. Let's get back to the issue. The problem of analyzing data on a year by year case is that there are fluctuations in distributions, specifically with SB these days. As far as that goes, you are correct that there is a bit of a bias toward high steals guys (although I have noticed it's become less pronounced over the past few years). I think there is a bit of common sense that needs to be applied in these cases. As you said, you could simply bump down the high steals outliers a few spots. What I've noticed that also works pretty well is simply inflating the value of a SB standard deviation by 10-15%. I find that flattens out the outliers quite nicely.
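The "inflate the SB standard deviation by 10-15%" trick can be sketched as follows. The SB totals are hypothetical and the `sd_inflation` parameter name is made up for this example. Note that dividing by a larger standard deviation scales every SB z-score by the same factor (1/1.15 for a 15% bump), so in effect it works like putting a weight of a bit under 1 on the whole category.

```python
from statistics import mean, pstdev

def z_scores(values, sd_inflation=1.0):
    """Z-scores with the standard deviation optionally inflated.

    sd_inflation is a hypothetical knob: 1.15 means "bump the SD by 15%".
    """
    mu = mean(values)
    sigma = pstdev(values) * sd_inflation
    return [(v - mu) / sigma for v in values]

# Hypothetical SB projections; the last player is the big outlier.
sb = [0, 2, 3, 5, 5, 8, 10, 12, 20, 60]

plain = z_scores(sb)
damped = z_scores(sb, sd_inflation=1.15)

print(f"outlier z-score, plain:  {plain[-1]:.2f}")
print(f"outlier z-score, damped: {damped[-1]:.2f}")
```

Since the outlier has the largest z-score to begin with, it loses the most in absolute terms, which is why this flattens the high steals guys more than everyone else.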
Inukchuk wrote:That would be the case if the player pool was small. The first year I messed around with it, I tried doing it by position and Brad Wilkerson was my 3rd rated first baseman because he stole 13 bases the year before...
Lately, I've joined some pretty big leagues and therefore had to increase the number of players I project. As the sample size gets larger, all data tends to approach the normal distribution and flatten out the values of SB/SV guys.
This is not true. Not all things in this world are normally distributed in large samples. The data I presented included all players rostered in my league. If you increase the sample size, you will be including more players with little to no stolen bases/saves and the data would get even more skewed.
I don't agree that increasing sample size would skew the data more, because while you would be adding more low steals you'd also be adding more higher steals guys as well. But whatever...
The problem with showing distributions of only rostered players for only one season is that the sample size is still far too small. Yes, the data is skewed positively in 2008 (which I understand was your point and is valid...more on that later), but that's a pretty insignificant amount of data in the grand scheme of things. It's like showing a distribution of temperature, but only using the summer months. Were you to take every player's stolen base totals from every season played, SB distribution would approach a normal curve. This is in essence what the central limit theorem implies. But whatever...
OK, all this nerdery is driving the ladies away. Let's get back to the issue. The problem of analyzing data on a year by year case is that there are fluctuations in distributions, specifically with SB these days. As far as that goes, you are correct that there is a bit of a bias toward high steals guys (although I have noticed it's become less pronounced over the past few years). I think there is a bit of common sense that needs to be applied in these cases. As you said, you could simply bump down the high steals outliers a few spots. What I've noticed that also works pretty well is simply inflating the value of a SB standard deviation by 10-15%. I find that flattens out the outliers quite nicely.
How many players are we talking here? I included a fair amount in my analyses and I'm pretty sure most if not all the high steals guys were included. Send me your data and I'll run it through SPSS and post the distribution. I promise you it will be positively skewed. You also need to consider the saves cat, which will surely get more skewed as more players are added to the database. We all know how people love to chase saves, leaving no saves on the waiver wire.
Trust me, the central limit theorem does not apply to saves and stolen bases. Traits (stats) only approach a normal distribution if they are normally distributed in the population (MLB).
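A quick simulation may help separate the two claims here. The central limit theorem is a statement about averages (or sums) of samples, not about the raw totals themselves. The sketch below uses a made-up, positively skewed "population" of SB totals and a fixed random seed; none of it is real league data.

```python
import random
from statistics import mean

def skewness(values):
    """Population skewness: third central moment over sd cubed."""
    mu = mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical, positively skewed SB totals for 100 players.
population = [0] * 40 + [2] * 25 + [5] * 15 + [10] * 10 + [20] * 5 + [40] * 3 + [60] * 2

# More data with the same shape: the raw distribution keeps its skew.
raw_skew = skewness(population)
bigger_pool_skew = skewness(population * 10)

# Averages of random 50-player samples: this is what the CLT is about.
rng = random.Random(0)  # fixed seed so the run is repeatable
sample_means = [mean(rng.choices(population, k=50)) for _ in range(2000)]
means_skew = skewness(sample_means)

print(f"raw skew:              {raw_skew:.2f}")
print(f"10x bigger pool skew:  {bigger_pool_skew:.2f}")
print(f"skew of sample means:  {means_skew:.2f}")
```

In this sketch, piling on more seasons with the same shape leaves the raw skew unchanged, while the sample averages come out much closer to symmetric. Rankings are built from individual player totals, not averages of samples, so it is the raw distribution that matters for the z-score question.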