-->

An Introduction to Using Google BigQuery in R

Google BigQuery is an enterprise data warehouse that allows users to store massive datasets and access them with super-fast SQL queries that harness the processing power of Google’s infrastructure. BigQuery is also home to large public datasets including the Medicare dataset used here. The Google Cloud project overall is a great place to start exploring open data and learning more about the various Google APIs created for the access and presentation of Big Data (Terrabyte+).

In this tutorial we will set up a Google BigQuery account, set up and test a sample SQL query on the Medicare data in the BigQuery API, then set up and access the data in R to report a simple table and plot of the sampled Medicare data. The examples include a walk-through of setting up the Google BigQuery account properly, two sample SQL queries of the Medicare dataset, and tables and barplots that highlight the provider specialties and states with the ten highest expenditures for Medicare Part D prescriptions.


Quick Setup Instructions for BigQuery Account

You will need a current Google or gmail account to get started. Log in to your Google/Gmail account and then follow the first link below to start a Google Cloud account and project. Follow this short list below. For more a more detailed walkthrough of setting up Google BigQuery check out how the instructions at this blog on Setting up Google BigQuery

Go to Google APIs Console

  • Clink on BigQuery API.

  • If you see a warning or pop-up for Create a project, accept it then name your project. Use only lower-case letters, dashes and numbers when naming your project. Otherwise, bigrquery may not play nice with BigQuery. If you do not see a pop-up warning asking you to start a new project, look for a Create a project pull-down or button in the BigQuery console.

  • View your new project dashboard.


Testing our Query in the BigQuery API

Now lets look at the Medicare table we will be using in this excercise.

Medicare Part D Prescribers 2014, Schema

To find the table size and number of rows in the Medicare Part D table, follow the link above and then click on ‘Details’. The dataset is 3.17GB and has over 24 million rows. Until you have a design plan, it’s probably a good idea to start on the smaller side to explore this data. Query processing is free up to 1 TB. Carefully planned exploratory queries will prevent charges from the Google API service.

Use the table schema above to pick fields to use in the first go at exploring the data.

Since the SQL queries will import aggregated numbers as integers, import them using SQL’s ROUND feature to insure that your numbers are floats with two decimals. Do this with the query to save time later.


SQL Query

Next, let’s add our SQL query and save it to an r object. We will simply assign the whole query as a string and then call it using the query_exec function in ‘bigrquery’.

sql1 = "SELECT specialty_description AS specialty,
  ROUND(SUM(total_claim_count), 2) AS total_claims,
  ROUND(SUM(bene_count), 2) AS total_beneficiaries,
  ROUND(SUM(total_drug_cost), 2) AS total_cost,
  ROUND(AVG(total_drug_cost), 2) AS avg_cost
  FROM [bigquery-public-data.medicare.part_d_prescriber_2014]
  GROUP BY specialty
  ORDER BY total_cost DESC
  LIMIT 10;"

If your query ran without error then you should see a newly generated preview of your results in a table below where you wrote your query.


Setting up R for BigQuery

Working with BigQuery requires additional setup. First install devtools, then ‘httpuv’ for listening to and interacting with HTTP and WebSocket clients. Then install The bigrquery package.


Install Required Packages

install.packages('devtools')
devtools::install_github("assertthat")
install.packages('httpuv')
# BigQuery Package
devtools::install_github("bigrquery")


Load Required Libraries

Then attach the required libraries for querying Google BigQuery.

library(devtools)
library(httpuv)
library(bigrquery)


BigQuery Account information

Add your BigQuery account as objects that we will call on in the next steps.

projectID = "R-and-BigQuery-Medicare"
datasetID = "medicare_partD_prescribers"


Run and Save a Query

Next, run a query. The query below is using the objects we saved earlier, and results in a new data file called “data1”.

data1 = query_exec(sql1, project = projectID)


View the New Data

The result of this first query is the top ten prescription costs by care provider specialty. The SQL query for BigQuery aggregated the Medicare Part D data into totals for claims, Medicare Part D Beneficiaries, and drug costs(prescriptions) and the average (mean) drug cost.

load("~/bigquery_example.RData")

Here is a first look at what is in the first data query.


Table 1: Top 10 Specialties with the Highest Drug Costs

library(htmlTable)
htmlTable(format(data1[1:5], big.mark=",", scientific=FALSE),
          header =  paste(c("Specialty", "Claims",
                            "Beneficiaries", "Total Cost", "Avg. Cost")),
                          css.cell = "padding-left: .5em; padding-right: .2em;")


Specialty Claims Beneficiaries Total Cost Avg. Cost
1 Internal Medicine 385,878,260 75,632,169 22,984,632,037 3,348.88
2 Family Practice 361,432,334 67,269,092 18,255,475,412 2,554.00
3 Nurse Practitioner 77,806,940 15,976,609 6,402,619,812 3,088.91
4 Neurology 18,611,279 2,712,173 5,087,737,347 13,004.47
5 Psychiatry 40,916,821 5,427,422 4,191,868,687 6,190.20
6 Cardiology 67,063,399 14,138,640 4,039,385,824 4,098.10
7 Hematology/Oncology 5,323,398 981,381 3,467,281,174 20,675.38
8 Gastroenterology 10,338,099 3,126,280 3,099,043,158 13,645.56
9 Physician Assistant 40,437,509 10,237,398 3,020,290,520 2,695.39
10 Rheumatology 10,880,367 2,211,941 2,457,641,511 15,470.10


Figure 1: Top 10 Specialty Providers with the Highest Drug Costs

And a simple bar plot of the top ten provider categories.

library(ggplot2)
library(scales)
ggplot(data1, aes(x = reorder(specialty, -total_cost), y = total_cost, fill = specialty)) +
    geom_bar(stat = "identity") +
    xlab("Provider Specialty") +
    ylab("Total Cost") +
    scale_y_continuous(labels = dollar) +
    theme(axis.ticks.x=element_blank(),
          legend.position = "none",
          axis.text.x=element_text(angle=90,hjust=1.05,vjust=0.5))


Top 10 Specialty Providers with the Highest Drug Costs


Top 10 States with the Highest Part D Costs

The second query below is grouped by state name and will query the top 10 states with the highest total Medicare Part D drug cost.

sql2 =
"SELECT nppes_provider_state AS state,
ROUND(SUM(total_claim_count), 2) AS total_claims,
ROUND(SUM(bene_count), 2) AS total_beneficiaries,
ROUND(SUM(total_drug_cost), 2) AS total_cost,
ROUND(AVG(total_claim_count), 2) AS avg_claims,
ROUND(AVG(bene_count), 2) AS avg_beneficiaries,
ROUND(AVG(total_drug_cost), 2) AS avg_cost
FROM
 [bigquery-public-data.medicare.part_d_prescriber_2014]
GROUP BY state
ORDER BY total_cost DESC
LIMIT 10;"

One more time with another query and a table view of the data.

data2 = query_exec(sql2, project = projectID)


Table 2: Top 10 Specialties with the Highest Drug Costs

A quick table view of the second query data.

library(htmlTable)
htmlTable(format(data2[1:5], big.mark=",", scientific=FALSE),
          header =  paste(c("State", "Claims",
                            "Beneficiaries", "Total Cost", "Avg. Cost")),
                          css.cell = "padding-left: .5em; padding-right: .2em;")


State Claims Beneficiaries Total Cost Avg. Cost
1 CA 116,446,642 25,742,811 9,633,893,459 4,162.26
2 NY 79,575,358 16,066,653 7,522,274,092 4,557.87
3 FL 91,446,623 22,514,043 6,969,934,015 4,325.70
4 TX 75,687,823 17,798,120 6,462,968,508 4,389.77
5 PA 62,949,047 11,754,376 4,842,161,770 3,939.50
6 OH 53,497,321 11,156,950 4,050,839,095 4,043.17
7 NC 46,022,957 9,272,742 3,358,005,546 3,831.88
8 MI 39,464,395 9,251,248 3,193,749,462 3,874.71
9 IL 42,529,201 8,206,059 3,120,751,393 3,601.20
10 NJ 29,243,495 6,216,323 2,980,197,009 4,582.87


Figure 2: Top 10 States with the Highest Drug Costs

A quick plot of the data for a quick visual reference.

ggplot(data2, aes(x = reorder(state, -total_cost), y = total_cost, fill = state)) +
    geom_bar(stat = "identity") +
    xlab("US States") +
    ylab("Total Cost") +
    scale_y_continuous(labels = dollar) +
    theme(axis.ticks.x=element_blank(),
          legend.position = "none",
          axis.text.x=element_text(angle=90,hjust=1.05,vjust=0.5))


Top 10 States with the Highest Drug Costs

Using BigQuery with R takes a little setup but it offers a great deal of flexibility for your workflow. The above was just the basics of how to get started and see the data. If your work environment and wallet will support it you can query and save subsets or large sets of big data and complete your analysis reporting in R.