Return to Baseball Leftovers

## Spreadsheet help!

Moderator: Baseball Moderators

question: i want to run a regression right now:
1) can i do it with access?
2) how?

what i want to do is look at how good the previous three years (n-1, n-2, n-3) are at predicting the next year's performance for a player. the regression would be simple enough; three explanatory variables (last three years) and year n's numbers are the dependent variable. i'll have to deal with the covariance somehow (i really should remember more from that class...).

my inspiration is all of the "use the previous three years" predictor models being mentioned. i want to see if the 50/33/17 ratio is the best.

thanks for your help!
"I don't buy everything I read,
I havn't even read everything I've bought"

"I find it more comforting to believe that this [life] isn't simply a test."
George_Foreman
General Manager

Posts: 4351
Joined: 16 Apr 2004
Home Cafe: Baseball
Location: at Morimoto's, eating \$50 worth of sushi

George_Foreman wrote:question: i want to run a regression right now:
1) can i do it with access?
2) how?

Much easier to do it with Excel. AFAIK to do it directly in Access you would have to A) buy a 3rd-party add-in or B) code it yourself in VBA.
Maine has a good swing for a pitcher but on anything that moves, he has no chance. And if it's a fastball, it has to be up in the zone. Basically, the pitcher has to hit his bat. - Mike Pelfrey
Amazinz
Mod in Retirement

Posts: 18800
Joined: 16 Mar 2004
Home Cafe: Baseball
Location: in Canada, toughening up figure skaters

Tavish wrote:Ok here's a (somewhat) quick example of how to do a real basic 3 year average using Lahman's database in Access 2000. If you have any experience with Excel then changing the equations shouldn't be too hard to suit wht you are wanting to do.

Open the DB and go to the Query menu. Create a New query and choose Design View. That will create a new blank query. Now you have to pick the tables that you will pull data from. Add the Batting and Master tables.

IIRC Lahman's doesn't have relationships setup by default so you need to create one real quick so the DB understands what relates between the tables (the keys). The two tables should be shown at the top of the query view. Simply drag playerID from the Hitting table to playerID in the Master table and now your tables are related.

Next choose the information you want to pull. First we will pull information from the Hitting table. Double-click on playerID in the Hitting table. We want to group all the information for each of the playerID together so go down to the main part of the query form and right click playerID (it should have been added to the first column when you double-clicked on the label in the table) and select Totals. This will show the Totals setting for each data column. playerID should be set to "Group By" automatically (which is what we want).

Next add the yearID from the Hitting table (double-click or drag it to the second column). We only want the previous 3 seasons so in the criteria section of put >1999 (I'm still using the 2003 Lahman version). Also change the Totals from "Group By" to "Where".

Now the easiest parts. Add the other data. For pure fantasy averages I'll just add AB, H, R, RBI, HR, and SB all from the Hitting Table. You want the average of these so change all the Totals from "Group By" to Avg. Since we don't want to see all the spot players and pitcher stats add criteria to the AB column. Put >200 (or whatever you deem fit) into the Criteria section.

Now some quick minor things to make reading your output easier. From the Master table add nameFirst and nameLast. Remove the Show checkmark from playerID. You can also right click each of the columns of data ( AB, H, HR, etc) choose Properties and change the format to Fixed Number and select 0 decimals in order to round each number off.

All that is left is Query/Run and you have your information.

Wow, that's awesome. Thanks!

Also, I realized all the stuff I have is for excel so when I went to get the Access database from Baseball1.com, I found that the 2005 version was released today!

I think the PlayerID's might already be linked too, but I'm not sure - apparently it wasn't like that on the older versions. Not that it makes much difference, but just so everybody knows.

Again - thanks a lot Tavish!

"Jack, will you call me, if you're able?"

"I've got your phone number written, in the back of my Bible."
LBJackal
Hall of Fame Hero

Posts: 9196
Joined: 1 Jul 2003
Home Cafe: Baseball
Location: The Hotel Yorba

Hah, well, looks like Tavish beat me to it. If anyone's still interested in a somewhat more detailed (and probably more complicated ) tutorial, I can post mine. It's currently running at 3,671 words.
Pedantic
Hall of Fame Hero

Posts: 6725
Joined: 5 Dec 2003
Home Cafe: Baseball
Location: Funny movie quote that everyone knows

The more, the merrier. You don't want that 10 page essay to go to waste

And if you know anything about doing regressions.... that'd be a big help, too.

"Jack, will you call me, if you're able?"

"I've got your phone number written, in the back of my Bible."
LBJackal
Hall of Fame Hero

Posts: 9196
Joined: 1 Jul 2003
Home Cafe: Baseball
Location: The Hotel Yorba

If by regression you mean the 17/33/50 thing, then you're in luck. Otherwise, someone is going to have to explain that to me.
Pedantic
Hall of Fame Hero

Posts: 6725
Joined: 5 Dec 2003
Home Cafe: Baseball
Location: Funny movie quote that everyone knows

Regressions are pretty easy if you're using Excel.

Open up Excel and in the menu click Tools->Add-Ins...

The Add-Ins dialog box will pop up. There will be a list of add-ins ranging from a few to many depending on your version of Excel. Near the top of the list you should see "Analysis ToolPak" and "Analysis ToolPak - VBA". Check both of those and click OK.

Now go back to the menu and click Tools->Data Analysis...

A new dialog will pop up and give you a list of statistical tools. The list is alphabetical so work your way down to "Regression", highlight it and click OK.

A regression dialog box will pop up and allow you to set the ranges on your spreadsheet and set the other variables neccesary for your regression. HTH.
Maine has a good swing for a pitcher but on anything that moves, he has no chance. And if it's a fastball, it has to be up in the zone. Basically, the pitcher has to hit his bat. - Mike Pelfrey
Amazinz
Mod in Retirement

Posts: 18800
Joined: 16 Mar 2004
Home Cafe: Baseball
Location: in Canada, toughening up figure skaters

Introduction To Setting Up a Three Year Weighted Statistics Database

by Pedantic

Preface: Sorry it took so long, but I wanted to be, uh, thorough. The example database should really help though, or so I hope. If any points need clarification, do bring them up, as I have the tendency to skim over things sometimes. Anyway, put on your thinking cap and, uh, good luck.

I haven't set up a database for three years' worth of stats before now, however, I know what it would encompass in principle. The first part would be to make six separate tables (or import them, this would be much more desirable Lol) for each year. Three tables will hold batting stats to '02, '03, and '04, respectively--the other three, pitching. If the databases you have downloaded have the three year stats in one table I would highly recommend going to the Tools menu -> Analyze -> Table. This will automatically split your table for you.

First, I have set up an example database with a few fictional batters and pitchers and queries that evaluate weighted some basic batting and pitching categories. You can download the (virus-free) database in Access 2000 format here. My apologies for the file size, but there wasn't much I could do in that respect. I may refer to this during my explanations, so it would be best to have this loaded while reading this.

Each table must have what is called a primary key. A primary key is a unique (it must be unique, so you can't use names, as names can be the same) identifier for each row (record) in each table. This is how the database can pull distinct rows from a table. In Access, if you don't specify a primary key when creating a table, then it will assign an AutoNumber as the key (e.g. 1, 2, 3,... etc.). You can also assign your own unique identifier; it doesn't necessarily have to be a number, for example, you could have a row with the key "P099".

There are also what are called foreign keys. These are primary keys from another table that correspond to a particular column other than the primary key in the table in question (for instance, a customer ID might correspond to a customer ID in a table that shows orders, where the primary key is the order ID). Fortunately, since we are only creating or manipulating a player database, there is really only one key we need to worry about: the key that identifies a particular player.

If you are creating a new table or importing a spreadsheet from Excel, it is imperative that the primary keys used to identify a player's name or his stats are consistent, because while cells are specified in Excel by a column position and a row position, a value in a database is identified only by a column and its primary key. (If you will allow me a brief digression, the difference between a database and a spreadsheet reminds me a bit of the difference between vector and non-vector mathematics. A spreadsheet only records magnitude, but a database records the magnitude plus the column and primary key attributes. Back to your regularly scheduled program.) Therefore, it might be possible to end up with A-Rod as player #576 in the 2002 stats table, while Scott Spiezio is #576 in another year's table. I'm sure you see the problem--how could we attribute the sum of the stats to any player if it is a combination of A-Rod and Spiezio's stats? However, if you are importing statistics that are already in a database, this is something that you probably don't need to worry about.

If you open a table, like batters_02, you can see that it looks much like a spreadsheet, except in that fact that each row is unique and no matter how you order the table, the stats will remain with that row. I also chose to only include the basic fantasy stats, to avoid getting too complicated. You can also see that I put the names of the players in a separate table. This is because if I am far more likely to make a mistake with a name if I have to list them in three different tables. Also, if I want to make a change to a name, I only have to change it in one location, rather than in three separate tables.

If you are familiar with Excel, then you are no doubt used to programming formulas and the like directly into a cell in a spreadsheet. In a database, tables are used exclusively for storing information. If we want to make a calculation on a table, we must use an aptly named query. A query is just that: a statement that requests information from a table. In this statement, we can perform calculations, group the results a certain way (e.g. by position), rank the results, and much more. Compared to a spreadsheet, where you have to alter all of your formulas if you change the rows/columns you want to perform calculations on, there is no change needed if you want to add a row (or record) in a table. Also, if you want to alter a formula (or statement), you only have to do it in one place.

Open the weighted_hitting query. You will see the hitting statistics, weighted 16.667/33.333/50% (try manually calculating one of the values from the tables--it works). Of course, looking at the stats doesn't help much--you really need to know how to replicate this in your own database.

Open up a new query by selecting "New" in the "Queries" window and selecting "Design view". Close the Show Table window that pops up (we'll get to this in a bit). You can see the window is separated into two main sections. In the top section, there is a large blank space. Below is a spreadsheet-looking space. Now, right-click in the top section and select "Show Tables". You will see the window that initially popped up. Notice that you can insert either a table or a query by switching tabs. Select the table batters_02[i] and click "Add". To put a column or perform calculations on a column in the query, you must add the table that contains the column to the query or the query will not recognize the name of the column.

You will see a small window in the top section now. The title bar reads "batters_02". This is, of course, the name of the table. In the mini-window are a number of fields: an asterisk, a bolded [i]player_id
, etc. Click and drag the asterisk to the first column in the spreadsheet-esque bottom section. In the row labeled "Field" is the expression batters_02.*. The portion before the period is the name of the table. The part after is the column name. In this case, the asterisk is telling the database to show all columns in the original table.

Switch to Datasheet View. You can do this by clicking the leftmost icon in the toolbar or by right-clicking in the top section and selecting "Datasheet view". The results of the query should be identical to the original table. Switch back to Design view. Hover over the little bar above the column that contains batters_02.*. You should see a black arrow pointing down. Click and hit Delete. This will clear the column. Now, drag and drop runs onto the column. You should see runs in the Field row. Switch back to Datasheet view. Now all you see is the runs column from the batters_02 table. Switch back to Design view. Under the rows labeled Field and Table, there is a row labeled "Sort". Click in the rightmost portion of the first cell to open the drop-down menu. The database automatically sorted the column by descending. Click "ascending" and quickly check to see the results (don't worry about that zero at the bottom--we only have five rows--the sixth there is what would be a new row).

Back in Design view, drag and drop the bolded player_id to the second column. Note that it is bolded because it is the aforementioned primary key. Also note the row labeled "Criteria". In the second column in this row, type the number "5" without the quotes. In Datasheet view, you will see that the query excluded all player_id values except for 5. Now go back and type "Not 5" without quotes. You will see that the query excluded the value 5. Clear the player_id criteria. In the runs criteria, type ">= 80". The query excludes all values for runs that are not greater than or equal to eighty. In the same vein, "=" means equal to, "<>" means not equal to, "<" means less than, "<=" means less than or equal to, and ">" is greater than. You can also include more than one criterion by using the operator "AND" or "OR". A separate row is provided for "OR" criteria in which you do not have to use the operator "OR".

Now, clear the player_id column and all criteria in the query. In the toolbar, there is a sigma icon (looks like a funky "E"). Click this icon. Notice it adds a row labeled "Total" to the bottom section. This is for calculating totals on columns. Experiment with some of the options in the drop-down menu in that row.

However, I'm sure the ability to calculate mins and maxs on a column is hardly enough to satisfy anyone. This is where some advanced techniques come in. Close the query you created (you can save it if you want, but you don't need to). Open the weighted_hitting query again. Now, if you have a sharp eye, you might have noticed the table names in the table windows in the top section don't match the actual table names. This is because I am using an "alias" for each table. I'm not sure if you can use table aliases when constructing a query in Design view, but we will explore ways you can later.

You can, however, use aliases with column names right now. Look at the first two columns in the bottom section. Expand the columns so you can view the full Field. You'll see "first_name" followed by a colon followed by "player_fname". Look in the names table window. You'll see that player_fname is a column (the first name column, specifically). The syntax for assigning an alias to a column is alias: column_name. Under the Total row in the first two columns is the option "Group By". This instructs the query to group any calculations by the values in these columns. I could have used player_id, but a name is more meaningful.

The other five columns use aliases also, however, because these columns are not the original columns but the result of a calculation they are not actually aliases and therefore are required; if a name isn't specified, the database will assign one automatically. Also notice that for these five columns, the option selected in Total is "Expression". This is because it is a custom expression and not a straight sum, etc. Using any other option in Total with a custom expression will result in an error (e.g. "Cannot have aggregate function in GROUP BY clause"). Also, a custom expression must be enclosed in a Sum() function.

By displaying all four tables in the top section, I can perform calculations on any column in any of the tables. The reason I added the names tables is so I could add the players' names to the query. Take a look at some of the expressions. They are actually very straightforward. They just multiply the values in a particular column by a certain scalar and add the results from each table together. Of course, you must multiply the correct column by the correct scalar. 0.16667 goes with batters_02 (aka two), etc. If you are used to Excel, you must remember that while in Excel you must specify the "cell" you want a calculation performed on, in Access, when writing an expression, it is only necessary (and possible) to specify the column. The advantage of this is, of course, that it then doesn't matter if you have five rows or 500. A last note on expressions: you may find Access' Expression Builder useful. To start the builder, you must be editing a cell in the Field row (i.e. the cursor is in the cell of the expression you want to build). Then, click the magic wand icon in the toolbar. I find it easier to type it directly into the cell, personally, but the builder can be useful for inserting column names (especially from different tables) into an expression. Just be sure the table is shown in the top section if you insert a column from a different table.

You might also be wondering what the black lines connecting the table windows together are for. These do not really serve a purpose except to denote a relationship between tables. In this case, the primary key player_id is in every table, therefore, each table can relate to another. All a relationship does is say that two or more tables have something in common (e.g. a primary key matches a column in another table). If tables have nothing in common, it would not be possible to perform calculations on data from both. You could perform separate calculations in separate queries, but that would be a royal pain in the ass.

Now, I'm going to talk about some of the more advances features of Access and SQL in particular. If you hate coding of any kind, you could plug your ears (or eyes in this case) from this point on and still make decent queries, however, you'd be missing a lot. If you're brave enough, take a deep breath, as we're about to dive into Access' seamy underground.

Microsoft Access is a simplified, limited-function SQL DBMS (DataBase Management System). Microsoft makes a more robust version known simply as Microsoft SQL Server at a much more robust price (yes, believe it or not all commercial DBMS' cost far more than the \$500 bones you'll shell out for a copy of MS Office Premium). SQL doesn't actually stand for anything; it's just the name of a standard maintained by a special division of the U.S. government for software developers to use. This ensures that the SQL language is used the same by everyone. Actually, SQL isn't a true programming language; it's far too simple (luckily, for you and I). It's more of a way to tell the DBMS what you want outputted. The DBMS decides for its own self how to execute the command, unlike other languages, where you tell the computer implicitly what to do. Naturally, different DBMS will execute commands differently (for instance, I believe Access does not support subqueries, or queries embedded in other queries).

I wrote the queries for the sample database in SQL, partly because I didn't know how exactly how to do what I wanted with Access' graphical interface (I do now, and it looks far less daunting than before, but as they say, hindsight is 20/20). However, writing in SQL opens up your options considerably and is actually remarkably simple.

Open up the weighted_hitting query and go to "SQL view" via the drop-down menu on the icon you used to go back and forth between Datasheet and Design views (or, similarly, you can right-click and select SQL view). You'll see a block of code that upon first view may look impenetrable. However, have no fear, I'll break it down piece by piece.

The first thing you should know is that words in all caps are SQL commands. The first command you will see is SELECT. In fact, the SELECT command, along with the FROM command, form the backbone of SQL. The following code is as simple as a query gets:

<pre>SELECT column_name
FROM table_name</pre>

Doesn't seem too hard, now does that? In fact, it almost makes sense if you read it out loud. You're SELECTing a particular column FROM a particular table. The next command you see is AS. This is the part where you're specifying an alias for a column. Aliases are helpful if you want to rename a column and/or make it more meaningful. Also, I should note that in SQL view, if you're using data from multiple tables (as we are here) you should write columns names in the following format: table_name.column_name, otherwise, if you're using data from only one table, you can just write the column name. In this case, we're SELECTing the column player_fname FROM the table names (this is a table alias, which I'll cover in a bit) and renaming it the more meaningful first_name for this query. The concordantly, names.player_lname is renamed last_name.

You may SELECT as many columns as you want by separating column names (including the corresponding aliases) with commas. In this case, I have SELECTed seven columns in all, two being actual columns, the other five being the expressions. Speaking of the expressions, to SELECT the data from an expression for use in a query, all you have to do is write the expression in the form:
<pre>Sum(expression) AS alias</pre>

The alias is mandatory in this case.

Now is where it gets a bit tricky. The FROM command tells the query which table the column(s) is/are coming from. This is also where we can assign an alias to a table. We can then use the alias in the earlier parts of the query, despite not having been defined until the FROM command. However, in this case, we want to perform calculations on multiple tables. Normally, in SQL, for two tables you could use this code:

<pre>FROM table_1
INNER JOIN table_2
ON table_1.column_1 oper table_2.column_2</pre>

where oper is an operator (i.e. "=", ">=", AND, etc.). The ON command tells the database to JOIN (i.e. combine two tables into one for the purposes of the query) table_1 and table_2 on the condition specified by the operator (e.g. in the weighted_hitting, the database JOINs table two and table three if and only if the columns two.player_id and three.player_id are equal [and we know they are]). Generally, the equals sign is the operator used. An INNER JOIN is a certain (and the most common) type of JOIN and only returns the rows where the condition specified by the operator is satisfied. There are other types of JOINs, but they aren't really very important at this point.

However, we do not want to JOIN just two tables, we need to JOIN four. With Access, we have to nest (or embed all four tables into "layers") the JOINs. If you know what u-substitution is, this is somewhat similar. It looks complicated, but here's the long and short of what we're actually doing, borrowing some math:

<pre>FROM batters_02 AS two
INNER JOIN (u)
ON two.player_id = three.player_id

/* The part above is very similar to the basic code a couple paragraphs above--if you*/
/*think of u as table_2. Now we'll set the value of u (remember the*/ /*FROM is still in effect): */

batters_03 AS three
INNER JOIN (v)
ON three.player_id = four.player_id

/* And now we'll set the value of v and JOIN the last table: */

batters_04 AS four
INNER JOIN batters_names AS names
ON four.player_id = names.player_id

/* Whew, that was the last iteration, now let's combine them all */

FROM batters_02 AS two
INNER JOIN (batters_03 AS three
INNER JOIN (batters_04 AS four
INNER JOIN batters_names AS names
ON four.player_id = names.player_id)
ON three.player_id = four.player_id)
ON two.player_id = three.player_id</pre>

It seems like a big tangled mess, but at least there's no chain rule. It just takes some thinking. Also, I should mention that the first INNER JOIN applies to tables two and three, therefore, since the first INNER JOIN is the outermost, the ON condition must also be the outermost, and therefore the last, or you will get an error. The same goes for the other ON commands. Just think of it as peeling an onion--every part of the layer must correspond. Also, like an onion, you can add as many tables as you want using this method.

Fortunately, it's all downhill from here. The GROUP BY command tells the query how to group totals (by which column[s]). You can add several columns to GROUP BY by separating column names with commas. Also, there are some essential commands I didn't use in this query, but which you might find useful. The first is WHERE. It follows the FROM command and goes before GROUP BY. In the WHERE command you can specify conditions to be satisfied if a value is to be included. The Criteria row in the Design view is based on the WHERE command.
Another command is the ORDER BY command. Similar to GROUP BY, you can specify by which column you want to order the query. Also, you can specify whether you want to ORDER BY column_name ASC (ascending) or DESC (descending). The Sort row in the Design view is based on the ORDER BY command. The ORDER BY always goes last in a query.

Last, but certainly not least, if you are writing SQL, you must not forget a semicolon at the end of the query; this is imperative. Otherwise, your query will not run.

I've reached the end of my "tutorial". I certainly hope I've at least helped your understanding of Access and databases, and I certainly hope I have not confused you further. Good luck writing queries and feel free to ask if you need help. Access is certainly harder than writing the average Word document. But don't let the difficulty make you hesitate to convert to a database. It's more than worth it. Hanging onto a spreadsheet when you could unleash the power of a database is akin to insisting on using your familiar old abacus when you could be using a graphing calculator. Once again, good luck.
Last edited by Pedantic on Fri Nov 19, 2004 7:16 pm, edited 1 time in total.
Pedantic
Hall of Fame Hero

Posts: 6725
Joined: 5 Dec 2003
Home Cafe: Baseball
Location: Funny movie quote that everyone knows

Sorry the text isn't wrapping. I'll try to find a way to correct it.

Well, at least it fit on the freaking screen now.
Pedantic
Hall of Fame Hero

Posts: 6725
Joined: 5 Dec 2003
Home Cafe: Baseball
Location: Funny movie quote that everyone knows

Thanks for the help Pedantic.
Music2004Man
Major League Manager

Posts: 1092
Joined: 22 Oct 2003
Home Cafe: Baseball

PreviousNext

Return to Baseball Leftovers

### Who is online

Users browsing this forum: No registered users and 5 guests