-->
Google BigQuery is an enterprise data warehouse that allows users to store massive datasets and access them with super-fast SQL queries that harness the processing power of Google’s infrastructure. BigQuery is also home to large public datasets including the Medicare dataset used here. The Google Cloud project overall is a great place to start exploring open data and learning more about the various Google APIs created for the access and presentation of Big Data (Terrabyte+).
In this tutorial we will set up a Google BigQuery account, set up and test a sample SQL query on the Medicare data in the BigQuery API, then set up and access the data in R to report a simple table and plot of the sampled Medicare data. The examples include a walk-through of setting up the Google BigQuery account properly, two sample SQL queries of the Medicare dataset, and tables and barplots that highlight the provider specialties and states with the ten highest expenditures for Medicare Part D prescriptions.
You will need a current Google or gmail account to get started. Log in to your Google/Gmail account and then follow the first link below to start a Google Cloud account and project. Follow this short list below. For more a more detailed walkthrough of setting up Google BigQuery check out how the instructions at this blog on Setting up Google BigQuery
Go to Google APIs Console
Clink on BigQuery API.
If you see a warning or pop-up for Create a project, accept it then name your project. Use only lower-case letters, dashes and numbers when naming your project. Otherwise, bigrquery may not play nice with BigQuery. If you do not see a pop-up warning asking you to start a new project, look for a Create a project pull-down or button in the BigQuery console.
View your new project dashboard.
Now lets look at the Medicare table we will be using in this excercise.
Medicare Part D Prescribers 2014, Schema
To find the table size and number of rows in the Medicare Part D table, follow the link above and then click on ‘Details’. The dataset is 3.17GB and has over 24 million rows. Until you have a design plan, it’s probably a good idea to start on the smaller side to explore this data. Query processing is free up to 1 TB. Carefully planned exploratory queries will prevent charges from the Google API service.
Use the table schema above to pick fields to use in the first go at exploring the data.
Since the SQL queries will import aggregated numbers as integers, import them using SQL’s ROUND feature to insure that your numbers are floats with two decimals. Do this with the query to save time later.
Next, let’s add our SQL query and save it to an r object. We will simply assign the whole query as a string and then call it using the query_exec function in ‘bigrquery’.
If your query ran without error then you should see a newly generated preview of your results in a table below where you wrote your query.
Working with BigQuery requires additional setup. First install devtools, then ‘httpuv’ for listening to and interacting with HTTP and WebSocket clients. Then install The bigrquery package.
Then attach the required libraries for querying Google BigQuery.
Add your BigQuery account as objects that we will call on in the next steps.
Next, run a query. The query below is using the objects we saved earlier, and results in a new data file called “data1”.
The result of this first query is the top ten prescription costs by care provider specialty. The SQL query for BigQuery aggregated the Medicare Part D data into totals for claims, Medicare Part D Beneficiaries, and drug costs(prescriptions) and the average (mean) drug cost.
Here is a first look at what is in the first data query.
Specialty | Claims | Beneficiaries | Total Cost | Avg. Cost | |
---|---|---|---|---|---|
1 | Internal Medicine | 385,878,260 | 75,632,169 | 22,984,632,037 | 3,348.88 |
2 | Family Practice | 361,432,334 | 67,269,092 | 18,255,475,412 | 2,554.00 |
3 | Nurse Practitioner | 77,806,940 | 15,976,609 | 6,402,619,812 | 3,088.91 |
4 | Neurology | 18,611,279 | 2,712,173 | 5,087,737,347 | 13,004.47 |
5 | Psychiatry | 40,916,821 | 5,427,422 | 4,191,868,687 | 6,190.20 |
6 | Cardiology | 67,063,399 | 14,138,640 | 4,039,385,824 | 4,098.10 |
7 | Hematology/Oncology | 5,323,398 | 981,381 | 3,467,281,174 | 20,675.38 |
8 | Gastroenterology | 10,338,099 | 3,126,280 | 3,099,043,158 | 13,645.56 |
9 | Physician Assistant | 40,437,509 | 10,237,398 | 3,020,290,520 | 2,695.39 |
10 | Rheumatology | 10,880,367 | 2,211,941 | 2,457,641,511 | 15,470.10 |
And a simple bar plot of the top ten provider categories.
The second query below is grouped by state name and will query the top 10 states with the highest total Medicare Part D drug cost.
One more time with another query and a table view of the data.
A quick table view of the second query data.
State | Claims | Beneficiaries | Total Cost | Avg. Cost | |
---|---|---|---|---|---|
1 | CA | 116,446,642 | 25,742,811 | 9,633,893,459 | 4,162.26 |
2 | NY | 79,575,358 | 16,066,653 | 7,522,274,092 | 4,557.87 |
3 | FL | 91,446,623 | 22,514,043 | 6,969,934,015 | 4,325.70 |
4 | TX | 75,687,823 | 17,798,120 | 6,462,968,508 | 4,389.77 |
5 | PA | 62,949,047 | 11,754,376 | 4,842,161,770 | 3,939.50 |
6 | OH | 53,497,321 | 11,156,950 | 4,050,839,095 | 4,043.17 |
7 | NC | 46,022,957 | 9,272,742 | 3,358,005,546 | 3,831.88 |
8 | MI | 39,464,395 | 9,251,248 | 3,193,749,462 | 3,874.71 |
9 | IL | 42,529,201 | 8,206,059 | 3,120,751,393 | 3,601.20 |
10 | NJ | 29,243,495 | 6,216,323 | 2,980,197,009 | 4,582.87 |
A quick plot of the data for a quick visual reference.
Using BigQuery with R takes a little setup but it offers a great deal of flexibility for your workflow. The above was just the basics of how to get started and see the data. If your work environment and wallet will support it you can query and save subsets or large sets of big data and complete your analysis reporting in R.