The missing piece in enforcing access policies on Google BigQuery

The missing piece in enforcing access policies on Google BigQuery

By Bob van den Hoogen, engineer @ STRM.

In any modern data context, especially within large organizations, managing access to sensitive data like employee salaries is critical. Getting it wrong proves costly in one of two ways, either data is controlled too restrictively causing missed potential value, or permissions are too loose, leaving residual risk on the table. The challenge lies in balancing data protection with the utility of the data for different departments. BigQuery offers a security approach that allows users to balance these demands via secure views, policy tags and column masking, but as we found out this approach has notable limitations.

Let’s make it concrete

Consider the following case, we have a table with employee data and several departments within an organization having varied needs regarding this data. The access needs of different departments for sensitive employee data could be outlined as follows:

  • HR Analytics Department: They focus on macro-level insights, like departmental salary expenditures. Access to generalized salary data is provided, rounded to the nearest 5,000. Sensitive data like names or exact individual salaries are excluded to protect privacy. This setup allows the department to perform the desired analysis without compromising individual employee confidentiality.
  • Finance Department: This department requires detailed financial data, including exact salary amounts and banking details (e.g., IBAN) for tasks like payroll processing. However, they don't need access to HR-related data such as performance evaluations. The emphasis here is on the accuracy and security of financial transactions, ensuring compliance with legal and internal standards.
  • General Employee Access: They can view broad, non-sensitive information about the company, such as overall department structures, general company announcements, or high-level statistics. This information is useful for employees to stay informed about the broader context of their work environment. Evidently, there is no need to grant access to employee salaries.

This scenario, while specific, illustrates a common philosophy in role and policy creation. Establishing proper safeguards is crucial for safe and secure role execution. Typically, various teams contribute to these requirements - from business owners to security, legal, and risk management teams. Often, the data team is tasked with implementing these measures, and roles like data stewards are assigned for access management.

Now, let's dive into our aforementioned case. Considering all these transformations we want to apply on a single column, we need a way to represent this in BigQuery.

BigQuery comes with a policy tag taxonomy to apply column masking. It provides an hierarchical way of applying a UDF (or a preset one) to a column that has the corresponding tag. The UDF can be defined in the BigQuery Studio using SQL. For example, when creating a UDF that should mask all characters but the first eight from an International Bank Account Number (IBAN), it could be defined using a RegEx replace:

CREATE OR REPLACE FUNCTION <DATASET>.first_eight_char(iban STRING) RETURNS STRING
OPTIONS (data_governance_type="DATA_MASKING") AS 
(SAFE.REGEXP_REPLACE(iban, '^([a-zA-Z0-9]{8}).*$', '\\1'));

This function's definition is part of the specified dataset, but is publicly available within the BigQuery project and can be used for all tags in the taxonomy. Note the data_governance_type is set to DATA_MASKING as then, and only then, the function can be used for column masking via policy tags. Up to this point, it looks like a walk in the park to enforce the policy on a dataset (provided that you know your way around IAM roles and permissions in your project of course). However, this is the point where we find two limitations of column masking in BigQuery.

Limitations

1. UDF limitation

The first limitation we have come across here is that basic arithmetic and rounding operations (multiply, subtract, round, ceil, etc.) are not allowed in these functions. Even though it seems only logical to apply actual masking techniques instead of performing arithmetic it would not be enough to cover this use case. Without the arithmetic, the only option to be compliant with the privacy policies, would be to hide the column altogether for the specified set of users, resulting in an unnecessary loss of data.

2. Policy tag limitations

The second limitation is in associating the UDFs with the policy tags. For each policy tag holds that, even though you can apply any number of preset functions, at most one UDF can be applied, regardless of the set of users/groups/service accounts it applies to. Since columns can be tagged with at most one policy tag, we cannot use multiple tags to circumvent this limitation.

Updating & explaining access policies

BigQuery's policy tag taxonomy and UDFs offer a systematic solution for establishing initial data masking and access control procedures. However, managing and modifying these policies becomes more complex when adjustments are required or challenging questions arise.

Policy Management Complexity: Due to their complex nature, managing policies demands an in-depth understanding of BigQuery's architecture, involving SQL, data taxonomies, and policy components. Hence, updating policies is time-consuming, requiring careful planning and alignment with data governance strategies.

Technical Expertise Required: The complexity of policy management thus requires a high level of technical skill. This often goes beyond the scope of roles involved in the process of setting policies and giving access to data like data stewards. These barriers are created between technical and business policy management.

How PACE solves this

We create a view on the original table.

By restricting read access on the original table, no user (unless explicitly permitted) can view the sensitive data.

Via a data policy we define which sets of users (we call them principals) are able to read exactly what data. The data policy is a set of rules contained in a yaml.

These rules are enforced in the definition of the resulting view.
Why should you not simply write the underlying SQL yourself?
Of course you can, but with large datasets with a lot of columns, large sets of rules and ever changing policies, it is an exhausting and error prone task. Let us do it for you!

Interested in what more we can offer? Check out our documentation

PACE is not just about helping you with these challenges in BigQuery; its open-source nature means endless possibilities. We are keen to adapt to your unique challenges. The above is based on our experience so if there are specific issues we have not yet addressed, let us know. Your insights help us evolve and better serve your needs and the wider community. Reach out and explore the potential with us. Feel free to join our Slack.