Thursday, October 21, 2010

Accessing a MS SQL database in R

I'm going to preface this post with a disclaimer: I'm not really a programmer - I'm actually an ecologist. That being said, I am slowly learning R in the course of my work. R is a programming language and a software environment for doing statistics and making pictures with data. A lot of programmers have never heard of R and those who come to R after learning other languages may complain bitterly about it. Since R is the first language I've learned, I can't really tell you how it's different from other languages, but I can tell you that I struggled quite a bit with the syntax because there are many ways to seemingly do the same thing. But the very subtle differences will kill your code.

With all this talk of death, you're probably as frightened of R as the average ecologist, but there are very good reasons to use R (even the NY Times thinks so!). If you're doing data analysis, you can do it with R and you can probably do it better. Plus, it's free.

To get you familiar with R and some of its capabilities, I'm going to walk you through one of my current projects over a few posts. My lab group has data from several remote weather stations stored in a Microsoft SQL Database. Temperature and precipitation have been measured hourly for about eight years. For a variety of reasons, we have quite a few missing measurements. I need to create a model that (reasonably) predicts those missing values.

The first step is to actually get to the data. Those of you familiar with databases probably wouldn't have had to spend an entire afternoon trying to do that. Those of you not familiar with databases hopefully won't spend all afternoon trying to do that after reading this post. A quick note: I use Windows at work. If you're using something else, this may not be as helpful as I'd hope.

Create Data Source Name
We're going to use ODBC to get R to communicate with the database. Before today, I didn't have a good handle on ODBC. Wikipedia was totally there for me:
Open Database Connectivity (ODBC) provides a standard software interface for accessing database management systems (DBMS). The designers of ODBC aimed to make it independent of programming languages, database systems, and operating systems. Thus, any application can use ODBC to query data from a database, regardless of the platform it is on or DBMS it uses. ODBC accomplishes this by using a driver as a translation layer between the application and the DBMS. The application thus only needs to know ODBC syntax, and the driver can then pass the query to the DBMS in its native format, returning the data in a format the application can understand.
Cool, right? So to get this all working, the first thing you need to do is create a Database Source Name for your database. I didn't really know what that was either, but of course Wikipedia did:

A DSN specifies a data structure that contains the information about a specific data source (database or other data source) that an Open Database Connectivity (ODBC) driver needs in order to connect to it.
Luckily, this is really easy to do and Microsoft will hold your hand the whole way:

Create a System DSN in Windows XP

  1. Click Start, point to Control Panel, double-click Administrative Tools, and then double-click Data Sources(ODBC).
  2. Click the System DSN tab, and then click Add.
  3. Click the database driver that corresponds with the database type to which you are connecting, and then click Finish.
  4. Type the data source name. Make sure that you choose a name that you can remember. You will need to use this name later.
  5. Click Select.
  6. Click the correct database, and then click OK.
  7. Click OK, and then click OK.

Install RODBC package
The next step is even easier. Assuming you already have R installed, you need to install and load the RODBC package. There are a lot of ways to install packages in R, but I installed and loaded the package like so:

install.packages('RODBC') #install
library('RODBC') #load
Anything after a # is a comment in R.

Connect to the database
And we're ready to connect! All I need to do now is give the odbcConnect function the name of the DSN I created and my username and password. I'll call this connection con. You can call it anything you like, as long as it doesn't have spaces or start with a number.
con <- odbcConnect('DSN', uid='username', pwd='password')
The quotes are important. If you don't use them, R thinks you're referring to an object.

Now that I have access to the database through con, I want to know a little bit about the database. odbcGetInfo(con) will tell you some basic information about the database like what database management system you're using, what version you're using, and the name. That's not especially helpful since you should already know these things.

The sqlTables function will return all table like objects. I just want to see what the actual tables are, so I pass it an additional argument, like so:
sqlTables(con, tableType='TABLE')
The parameter tableType is case sensitive, but the argument TABLE isn't. That won't always be the case with an R function. Figuring out errors caused by using the wrong case is a great time.

If I want to know the column names and data types in a particular table I can use the function sqlColumns.
sqlColumns(con, sqtable='hrlyweather')
I could have returned the same output with
sqlColumns(con, 'hrlyweather')
R functions will usually let you get away with leaving off the parameter name (sqtable in this case) as long as you pass the function arguments in the order it expects, but it's a good habit to use the parameter names. It'll make it easier to figure out problems and make your code much more readable.

You can learn more about what you can do with the RODBC package in this handy document (pdf). In my next post I'll query the database through the connection I've created and do some basic data manipulation in preparation for running a model.

No comments:

Post a Comment