email us at Bioinformaticsolutions @ gmail . com

Thursday, June 21, 2012

Google BigQuery - Hospital Episode Statistics data analysis made easy (ish)

We had a visit from PA Consulting last week. They were very excited about their application of Google's new Big Data solution, BigQuery, to the vast dataset that is the UK's Hospital Episode Statistics (HES).

We've wrestled with HES data before and come off the worse for it, and that was looking at only one year's inpatient data: about 18 million lines. We encountered all kinds of issues in setting up the bespoke database we created to hold and analyse this data. These started with the published data dictionary's divergence from the fields in our extract, took in the discovery of duplicated unique episode identifiers (see our post on the response from the NHS Information Centre) and ended with our discovery that the processing power required to run some of our composite queries in a timely fashion was beyond our meagre infrastructure (this was at the now defunct National Cancer Research Institute's Informatics Initiative).

We could really have used the facility which BigQuery is set to provide. The guys from PA described how they had obtained the entire start-to-finish HES dataset across all three areas of collection (inpatient, outpatient and A&E) and loaded it into BigQuery. This was the most arduous part of the process: the data arrived on 27 DVDs and took a couple of weeks to upload. They then demonstrated the speed with which BigQuery was able to provide answers, and how the data could be linked to Google Maps and the Google Docs spreadsheet application to dynamically produce visual and graphical analyses.

BigQuery dynamically calls in servers to assist in running a query, based on the processing power required, and then releases them once the query has been executed. With access to Google's immense army of servers, the user can execute a query against billions of data points in seconds, without any of the usual time-consuming optimisation (indexing etc.) that supports enhanced performance on traditional database technologies. If you're working to some degree 'in the dark', uncertain of how you wish to structure your data and what analyses you will require it to support, you can experiment on vast datasets without waiting hours for queries to produce results (or fail!) - a facility which would have delivered huge time-savings to us in our HES analysis.
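To make the idea concrete, here is a toy Python sketch of the general pattern described above: split an unindexed table into chunks, scan the chunks concurrently on workers that exist only for the duration of the query, and merge the partial results. This is our own illustration and not BigQuery's actual implementation. The records, the diagnosis codes and the function names (`scan_chunk`, `parallel_count`) are all hypothetical, and threads on one machine stand in for Google's servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical episode records: (episode_id, diagnosis_code). Not real HES data.
EPISODES = [(i, "C50" if i % 3 == 0 else "I21") for i in range(12)]


def scan_chunk(chunk):
    """Scan one chunk in full (no index) and count episodes per diagnosis code."""
    return Counter(code for _, code in chunk)


def parallel_count(records, workers=4):
    """Split the table into chunks, scan them concurrently, merge the partial counts."""
    size = max(1, -(-len(records) // workers))  # ceiling division, at least 1
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    total = Counter()
    # The worker pool is created for this query only and released afterwards.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(scan_chunk, chunks):
            total += partial
    return dict(total)


if __name__ == "__main__":
    print(parallel_count(EPISODES))
```

Each worker does a brute-force scan of its own slice, so no indexing is needed up front; the speed comes purely from how many workers can be thrown at the table at once.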

PA have a video describing their work and approach here and Google provide further information about BigQuery here.
