Steals are definitely not normally distributed. I ran a quick graph in Excel on 300+ players from last year (the top 300). I didn't make it pretty like RNC did, but you can clearly see the shape: the bulk of the players are piled up on the left, with a long tail stretching out to the right (in stats terms, positively skewed).

And it makes sense. Stealing a base in the majors is hard. Many players will never or hardly ever steal. And since you can't have fewer steals than none, they will cluster right there in the 0-10 range.
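For anyone who wants to put a number on the shape rather than eyeball a chart, here is a minimal sketch of a skewness calculation. The steal totals are invented for illustration (they are not the Excel data), and `skewness` is just a helper name I made up. A positive result means a long right tail, which is what you'd expect when everyone is bunched up against a floor of zero.

```python
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores.

    Positive means a long right tail, negative a long left tail.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])

# Hypothetical steal totals, invented for illustration (not real player data).
steals = [0, 0, 0, 1, 1, 2, 2, 3, 4, 5, 6, 8, 10, 14, 20, 31, 45, 62]

print(f"mean = {mean(steals):.1f}, skewness = {skewness(steals):.2f}")
```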

Still not sure what that means mathematically. Surely there's a more accurate way to reflect their value than take a z-score and knock it down a bit. How much is 'a bit'?

As I mentioned, R and C, you're definitely correct about steals in '08 being positively skewed. That's why I agreed that you would have to either adjust your speed-guy rankings or inflate the standard deviation you use for stolen bases. On that point, we are in total agreement.

The only area in which we disagree is your assessment that adding more data in general would skew the distribution away from the normal. Frankly, I think it's more of a math thing than a baseball thing, and I think it's because we're arguing different points. All I was saying was that the distribution would definitely move toward the normal, although you would need vast amounts of data (far more than the 300 that Rock provided) for it to happen.

For example, if we were having this conversation in 1887 (when there were 23 players that stole over 70 bags), you'd be inclined to think that the more players you add to your data analysis, the more negatively skewed your data would become. That is clearly not what has happened since then.

Again, I think the problem is that we're talking about different populations. I was assuming a normal approximation if all players from all times were taken into account, while you are speaking of just 2009 players. Either way, it doesn't really matter for the '09 season...

I am curious as to why you think high saves guys are skewed, though. Mine aren't too bad. If anything, I don't think my system values saves enough. Do you lump all your pitchers together, or do you separate starters and relievers?

abrunn11... the place to go for all your sig needs...

Inukchuk

General Manager

Posts: 4014

Joined: 24 Jan 2006

Home Cafe: Baseball

Location: Coming down on this hospital like the hammer of Thor

As far as SP/RP go, I've always looked at them separately. I would imagine that's why I don't notice the variance in saves. To me, the difference between starters and relievers in a 5x5 is big enough (IP difference, wins, saves, etc) to warrant 2 separate analyses.

I actually just went back and threw all my pitchers together (including ERA/WHIP values weighted on IP) and the top 20 popped out as such:

8SP, then 4RP, then 1SP, 1RP, 2SP, 1RP, 2SP, 1RP

That works out to 13 starters and 7 relievers in the top 20. In the top 50, however, only 14 were relievers, so the RP share dropped from 35% to 28%.

This doesn't seem too bad, as I wouldn't fault someone for taking Papelbon over James Shields. This assumes that every RP got 0 wins, however, since I haven't done RP win projections yet. I think once those get punched in, relievers could conceivably be a bit inflated. This would, of course, lead to inflating the SD to level it off. I actually think I like this way a little better than calculating separately. Normally, figuring out when to pull the trigger on a closer is tough, but I think I might be able to finagle some things around to make it easier...
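To make the pooled approach concrete, here is a rough sketch of one way to read "ERA values weighted on IP". It is an assumption, not necessarily the exact formula used above: ERA is converted to earned runs saved against the pool's IP-weighted ERA, and those run totals are z-scored across starters and relievers together. The pitchers, innings, and ERAs are invented, and `earned_runs_saved` and `pooled_era_z` are hypothetical helper names.

```python
from statistics import mean, pstdev

# Hypothetical pitchers: (name, role, innings pitched, ERA). Invented numbers.
pitchers = [
    ("Starter A", "SP", 210.0, 3.20),
    ("Starter B", "SP", 190.0, 4.10),
    ("Starter C", "SP", 175.0, 4.60),
    ("Closer A", "RP", 68.0, 2.30),
    ("Closer B", "RP", 62.0, 3.40),
]

def earned_runs_saved(ip, era, baseline_era):
    """Earned runs prevented relative to a baseline ERA over the same innings."""
    return (baseline_era - era) * ip / 9.0

def pooled_era_z(pool):
    """Z-score of IP-weighted ERA value with SP and RP in a single pool."""
    total_ip = sum(p[2] for p in pool)
    # Assumption: the baseline is the pool's own IP-weighted ERA.
    baseline = sum(p[2] * p[3] for p in pool) / total_ip
    saved = [earned_runs_saved(p[2], p[3], baseline) for p in pool]
    mu, sigma = mean(saved), pstdev(saved)
    return {p[0]: (s - mu) / sigma for p, s in zip(pool, saved)}

z = pooled_era_z(pitchers)
for name, score in sorted(z.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:10s} {score:+.2f}")
```

With these made-up numbers, the 210-inning starter edges out the closer with the much better ERA, which is the whole point of weighting on IP: a great ERA over 68 innings moves a team's ERA less than a good one over 210.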

I have to say, this thread has really made me think more carefully about my ranking system, and consequently has led to some nice adjustments. I appreciate the constructive criticism!

Nerfherders wrote:My projection system consists of a 3x5 photo of every player, a large cork board on the wall, and about a dozen or so darts. So far, I've never come in last place using this system.

rookies and cream wrote:Trust me, the central limit theorem does not apply to saves and stolen bases. Traits (stats) only approach a normal distribution if they are normally distributed in the population (MLB).

More nerdery. Back to stats class for you. The only requirement for convergence to the normal distribution by the CLT is that the random variables be i.i.d. (independent and identically distributed) with finite variance. The CLT, remember, is a statement about the distribution of the sample average: "The central limit theorem states that as the sample size increases, the distribution of the sample average of these random variables approaches the normal distribution with a mean µ and variance σ²/n irrespective of the shape of the original distribution."

So, applied here, think about each fantasy team as a sample of 20 position players. By the CLT, the average steals per team would be approximately normally distributed, even though individual steal totals are not.
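If you'd rather see it than take the theorem's word for it, here is a small simulation sketch. Everything in it is made up: the "player pool" is drawn from an exponential distribution purely because that shape piles up near zero with a long right tail, and rosters are drawn with replacement so the draws really are i.i.d. (real drafts are neither random nor with replacement, so treat this as an idealized picture).

```python
import random
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])

rng = random.Random(2009)  # fixed seed so the run is repeatable

# Synthetic player pool: 300 steal totals, bunched near zero, long right tail.
pool = [int(rng.expovariate(1 / 10)) for _ in range(300)]

# 5,000 hypothetical 20-man rosters, drawn with replacement (i.i.d. draws).
team_means = [mean(rng.choices(pool, k=20)) for _ in range(5000)]

pool_skew = skewness(pool)
team_skew = skewness(team_means)
print(f"skewness of individual players: {pool_skew:.2f}")
print(f"skewness of 20-man team averages: {team_skew:.2f}")
```

The team averages should come out far less skewed than the individual players, and the gap widens as the roster size `k` goes up.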

"I don't want to play golf. When I hit a ball, I want someone else to chase it."

Yeah, but does it matter that we're not talking about multiple samples? I was speaking in terms of one sample (all players in MLB in one year), not teams in a fantasy league. We also discussed whether SBs would be normally distributed over the history of baseball. However, even if you consider each year as an individual sample, they would not be independent of one another due to player overlap. Does the CLT even apply to what we are discussing?

rookies and cream wrote:Yeah, but does it matter that we're not talking about multiple samples? I was speaking in terms of one sample (all players in MLB in one year), not teams in a fantasy league. We also discussed whether SBs would be normally distributed over the history of baseball. However, even if you consider each year as an individual sample, they would not be independent of one another due to player overlap. Does the CLT even apply to what we are discussing?

Two things. First, there's no problem using z-scores with non-normal distributions. The z-score is a widely applicable way to standardize any distribution with a finite mean and standard deviation. The difference with a non-normal distribution is that you cannot interpret the z-score using the standard normal tables.
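A quick sketch of that first point, using invented steal totals (not real player data): standardizing always gives you mean 0 and standard deviation 1, but it does not change the shape, so the normal-table percentages no longer line up. `z_scores` is just a helper name for this example.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical steal totals, invented for illustration (not real player data).
steals = [0, 0, 0, 1, 1, 2, 2, 3, 4, 5, 6, 8, 10, 14, 20, 31, 45, 62]

def z_scores(values):
    """Standardize: subtract the mean, divide by the (population) SD."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

z = z_scores(steals)

# Share of players more than 2 SD above the mean, versus the normal table.
share_above_2 = sum(1 for s in z if s > 2) / len(z)
normal_tail = 1 - NormalDist().cdf(2)

print(f"mean z = {mean(z):.2f}, sd z = {pstdev(z):.2f}")
print(f"lowest z = {min(z):.2f}, highest z = {max(z):.2f}")
print(f"share with z > 2: {share_above_2:.3f} (normal table: {normal_tail:.3f})")
```

Because nobody can steal fewer than zero bases, the lowest z-score in a set like this sits well short of one SD below the mean, while the top end runs far above it. The scores are still perfectly usable for ranking; you just can't read them as normal percentiles.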

Second, in essence, we are talking about multiple samples. I'm competing against a bunch of guys and we are each drawing a sample of players. We get measured in terms of total production in each category (or our average production, in the case of rate stats). Those totals or averages are going to be approximately normally distributed (if anyone has played in a keeper league for several years, take a look at the final standings across those years and see if they look normal). So, each player can be assessed in terms of their standard score.

Z-scores are good (note in the discussion the importance of using actual data, rather than projections, to get the s.d. and the importance of replacement value) and the SGP method is good, imo. I tend to use the second, but z-scores are essentially the same thing in unitless measures.
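For anyone who hasn't seen SGP (standings gain points) worked out, here is a bare-bones sketch. The team totals are invented, the denominator is taken as the average gap between adjacent teams in the category standings (one simple estimate among several), and the replacement level of 10 steals is an arbitrary placeholder. `sgp_denominator` and `sgp` are hypothetical helper names.

```python
# Hypothetical final SB totals for a 12-team league (invented numbers).
team_sb = [98, 112, 120, 127, 133, 140, 146, 151, 160, 168, 179, 195]

def sgp_denominator(totals):
    """Average gap between adjacent teams in the category standings.

    Roughly: how many steals it takes to gain one place in the category.
    """
    ranked = sorted(totals)
    return (ranked[-1] - ranked[0]) / (len(ranked) - 1)

def sgp(player_stat, replacement_stat, totals):
    """Standings points a player is worth over a replacement-level player."""
    return (player_stat - replacement_stat) / sgp_denominator(totals)

REPLACEMENT_SB = 10  # placeholder replacement level, an assumption

print(f"one standings point costs about {sgp_denominator(team_sb):.1f} SB")
print(f"a 40-SB player is worth {sgp(40, REPLACEMENT_SB, team_sb):.2f} SGP")
```

Structurally it is the same move as a z-score: subtract a baseline, divide by a measure of spread. The difference is that the spread comes from actual league standings rather than from the player pool.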
