Build and share reliable datasets with Analytics Hub
By Syah Ismail, 14 July 2021
Sharing and exchanging data with other organisations is a critical element of analytics strategy, but it is hamstrung by unreliable data and processes, and it is only getting harder as security threats and privacy regulations increase.
Furthermore, traditional data sharing techniques use batch data pipelines that are expensive to run, deliver data late and can break whenever the source data changes. They also create multiple copies of data, which brings unnecessary costs and can bypass data governance processes. These techniques do not offer features for data monetisation, such as managing subscriptions and entitlements. Altogether, these challenges mean that organisations are unable to realise the full potential of transforming their business with shared data.
To address these limitations, Google Cloud is introducing Analytics Hub, a new fully managed service, available in preview in Q3, that helps you unlock the value of data sharing, leading to new insights and increased business value. With Analytics Hub you get:
- A rich data ecosystem by publishing and subscribing to analytics-ready datasets.
- Control and monitoring over how your data is being used because data is shared in one place.
- A self-service way to access valuable and trusted data assets including data provided by Google. For example, a unique dataset from Google Search Trends will be available that you can query and combine with your own data.
- An easy way to monetise your data assets without the overhead of building and managing the infrastructure.
Built on BigQuery
While Analytics Hub is a new service, it builds on BigQuery, Google’s petabyte-scale, serverless cloud data warehouse. BigQuery’s architecture separates compute from storage, so data publishers can share data with as many subscribers as they want without making multiple copies of it. With BigQuery, there are no servers to deploy or manage, which means that data consumers get immediate value from shared data. Data can be provided and consumed in real time using BigQuery’s streaming capabilities. You can also use BigQuery’s built-in machine learning, geospatial and natural language capabilities, or take advantage of its native business intelligence support with tools like Looker, Google Sheets and Data Studio.
BigQuery has had cross-organisational, in-place data sharing capabilities since it was introduced in 2010. Over a 7-day period in April, more than 3,000 different organisations shared over 200 petabytes of data in BigQuery. These numbers don’t include data sharing between departments within the same organisation.
Raising the bar on data sharing
To make data sharing easier and more scalable in BigQuery, Analytics Hub introduces the concepts of shared datasets and exchanges. As a data publisher, you create shared datasets that contain the views of data that you want to deliver to your subscribers. Next, you create exchanges that are used to organise and secure shared datasets. By default, exchanges are completely private, which means that only the users and groups that you give access to can view or subscribe to the data. You can also create internal exchanges or leverage public exchanges provided by Google. Finally, you publish shared datasets into an exchange to make them available to subscribers.
Data subscribers search through the datasets that are available across all exchanges to which they have access and subscribe to relevant datasets. Subscribing creates a linked dataset in the subscriber’s project that they can query and join with their own data. Subscribers pay for the queries that they run against the data, while the publisher pays for the storage of the data. Publishers can add new data, new tables or new columns to the shared dataset, and these are immediately available to subscribers. In addition, the publisher can track subscribers, disable subscriptions and see aggregated usage information for the shared data.
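The publish and subscribe flow described above can be pictured with a small in-memory model. This is only an illustrative sketch, not the Analytics Hub API: the class names (SharedDataset, Exchange, LinkedDataset), their methods and the example principals are all hypothetical, and the model assumes that a linked dataset is simply a reference to the publisher's data rather than a copy.

```python
from dataclasses import dataclass, field


@dataclass
class SharedDataset:
    """Hypothetical publisher-owned dataset: table name -> list of rows."""
    name: str
    tables: dict = field(default_factory=dict)


@dataclass
class LinkedDataset:
    """Hypothetical subscriber-side reference to a shared dataset (no copy)."""
    subscriber: str
    source: SharedDataset
    active: bool = True

    def query(self, table):
        # Reads go straight to the publisher's data, so changes show up at once.
        if not self.active:
            raise PermissionError("subscription disabled by publisher")
        return self.source.tables[table]


@dataclass
class Exchange:
    """Hypothetical exchange, private by default: only `allowed` may subscribe."""
    name: str
    allowed: set = field(default_factory=set)
    listings: dict = field(default_factory=dict)
    subscriptions: list = field(default_factory=list)

    def publish(self, dataset):
        self.listings[dataset.name] = dataset

    def subscribe(self, principal, dataset_name):
        if principal not in self.allowed:
            raise PermissionError(f"{principal} has no access to {self.name}")
        linked = LinkedDataset(principal, self.listings[dataset_name])
        self.subscriptions.append(linked)  # lets the publisher track subscribers
        return linked


# Publisher: create a shared dataset, then an exchange, then publish into it.
shared = SharedDataset("demographics", {"customers": [{"id": 1, "region": "APAC"}]})
exchange = Exchange("internal", allowed={"analyst@example.com"})
exchange.publish(shared)

# Subscriber: subscribing yields a linked dataset that can be queried.
linked = exchange.subscribe("analyst@example.com", "demographics")
rows_before = linked.query("customers")

# Publisher adds a table; the subscriber sees it without resubscribing.
shared.tables["orders"] = [{"id": 1, "total": 42}]
orders = linked.query("orders")

# Publisher lists who is subscribed.
subscribers = [s.subscriber for s in exchange.subscriptions]
```

In this sketch the publisher disables a subscription by setting `active` to `False` on the linked dataset, after which queries raise an error. The real service handles access control, billing and usage reporting itself; none of that is modelled here.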
Analytics Hub makes it easy for you to publish, discover and subscribe to valuable datasets that you can combine with your own data to derive unique insights. Here are some types of data that will be available through Analytics Hub:
- Public datasets: Easy access to the existing repository of over 200 public datasets, including data about weather and climate, cryptocurrency, healthcare and life sciences, and transportation.
- Google datasets: Unique, freely available datasets from Google. One example of this is the COVID-19 community mobility dataset. Another example is the forthcoming Google Trends dataset, which will provide the top 25 search terms and top 25 rising search terms over a 5-year window in 210 distinct locations in the US. Trends data can be used by everyone in the organisation to gain insights into what customers care about.
- Commercial (paid for) datasets: Google is working with leading commercial data providers to bring their data products to Analytics Hub. If you are interested in delivering your data via Analytics Hub, Google is also introducing Data Gravity, an initiative that provides storage benefits and new distribution paths for data published through Analytics Hub.
- Internal datasets: In larger organisations, Analytics Hub can be used for internal data, for example, to share standardised customer demographics with your sales engineering and data science teams.
What’s next for Analytics Hub
This is just the beginning for Analytics Hub. As it moves to preview and general availability, Google will add further capabilities: workflows for publishing and subscribing; the ability to publish analytics assets (Looker Blocks, Data Studio reports, Connected Google Sheets) along with the shared data; the ability for data publishers to specify query restrictions on the usage of their data; and an easy way for data publishers to create sandbox environments so that subscribers can work with their data even if they are not yet on Google Cloud. Analytics Hub will also include features for data monetisation, including managing subscriptions, data entitlements and billing.