This is an article from DZone’s 2022 Data Pipelines Trend Report.
Cloud data warehouses (CDWs) are rapidly changing the way organizations analyze data at scale. Because cloud storage is elastic and cheap, and because modern data pipelines simplify ETL processes, CDWs commonly grow to store more data than on-premises data warehouses. That data includes sensitive data, which leads to challenges in security, privacy, governance, and compliance. In this article, we will discuss the move to CDWs and the security considerations when using them.
The Move to Cloud Data Warehouses
Over a decade ago, public cloud companies started releasing data warehouses in a Platform-as-a-Service (PaaS) model. Notably, Google BigQuery in 2010 and Amazon Redshift in 2012 enabled organizations to deploy a CDW in minutes, without the need to install databases or configure servers. This was followed by the launch and gradual growth of other vendors, most visibly Snowflake, whose 2020 listing with its data cloud became the largest software IPO up to that point.
The move to a CDW (which is considered part of the modern data stack) makes it significantly easier for data consumers and producers to access data. Cloud computing provides elasticity, storage becomes cheaper and easier to increase (or decrease), and database administration becomes much simpler (for example, in some CDW platforms, there are no indexes for DBAs to maintain).
Considerations in Cloud Data Warehouses Security
When organizations use CDWs, the basics of information security remain the same, but some things differ from running an on-premises data warehouse, or even a data warehouse you install manually on cloud infrastructure. Part of the difference stems from the shared responsibility model. Your provider assumes full responsibility for certain things, such as physical security, operating system security and patching, and even maintaining the database software. This leaves you with narrower areas of focus.
However, in many cases, the move to the cloud also allows more users to access and derive value from your data (in what's referred to as "data democratization"). Moving from a few specific teams using data to many teams accessing data that keeps changing brings more challenges.
In addition, some assumptions about on-premises data storage (including that it's disconnected from the internet by default) do not always hold for cloud data storage. Part of the complexity of securing CDWs also lies in who manages them. These are mostly data engineers rather than security professionals, and in many companies, this makes the CDW a "black box" for security teams, who don't have full control over its security policies.
Let’s discuss some of the main security considerations.
Encryption
Organizations must make sure that both stored data (at rest) and data moving between clients and the warehouse (in transit) are encrypted. This is important from a security point of view, because it reduces risks such as man-in-the-middle (MITM) attacks and unauthorized access to the stored data. It is also important when facing compliance audits. In some CDW platforms, data is encrypted by default at rest and in transit. In other platforms, you need to configure encryption of stored data and enforce the use of encrypted protocols when accessing the data.
Access Control in Cloud Data Warehouses
A large part of meeting security requirements (as well as compliance and privacy requirements) is establishing effective access control over the data you're storing. The details depend a lot on your company, the security policies you have, the types of data, etc. However, let's discuss some of the main aspects of access control in CDWs.
Network Access Control
Setting network access policies is, in most cases, a simple, effective way to reduce the risk level of your CDW. Some platforms come without public internet access by default, and in some platforms, you need to configure network policies. In some of the cases, you will also want to set more specific network access policies for specific users or groups of users. Using Snowflake as an example, here is how to set a network access policy and apply it to a specific user:
```sql
CREATE OR REPLACE NETWORK POLICY us_employees
  ALLOWED_IP_LIST = ( '184.108.40.206/24', '220.127.116.11/24', '18.104.22.168' )
  BLOCKED_IP_LIST = ( '22.214.171.124', '126.96.36.199' )
  COMMENT = 'US employees offices, excluding guest WiFi gateways';

/* Assigning the policy to a user */
ALTER USER us_marketing_analysts SET NETWORK_POLICY = US_EMPLOYEES;
```
Authentication
Authentication of users to CDWs differs by platform. For example, not all platforms support OAuth for individual user authentication in BI tools. Furthermore, it is quite common for organizations not to use the most secure authentication options available. In many companies, for example, applications connect to the data warehouse with a username and password when they could use stronger key-based authentication.
This usually needs to be addressed in collaboration between the data and security teams (and sometimes the IT identity teams as well) to make sure that there are clear security guidelines on which authentication type to use. A good example is to use identity provider integration, when possible, so that human users abide by the organization's authentication policy (including two-factor authentication).
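As an illustration of the key-based option mentioned above, here is a minimal sketch of registering a public key for a service user in Snowflake. The user name `etl_service` is hypothetical, and the key value is a truncated placeholder rather than a real key; the key pair itself would be generated outside the warehouse (for example, with OpenSSL).

```sql
-- Hypothetical service user; replace with your own application user.
-- The key below is a truncated placeholder, not a usable value.
ALTER USER etl_service SET RSA_PUBLIC_KEY = 'MIIBIjANBgkqh...';

-- Check that a key fingerprint (RSA_PUBLIC_KEY_FP) is now set on the user.
DESC USER etl_service;
```

The application would then connect with the matching private key instead of a password. Whether password authentication should also be disabled for such users is a policy decision for the data and security teams.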
Authorization
Perhaps the hardest part of data warehouse security is authorization (i.e., once users are authenticated to the data warehouse, what data can they access, and at what level?). Different CDWs have different authorization mechanisms. For example, Snowflake has a strict role-based access control (RBAC) model, and Amazon Redshift recently introduced an RBAC model of its own.
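To make the RBAC model more concrete, the following is a minimal sketch of granting read-only access through a role in Snowflake and later revoking it. The role `marketing_read` and the `analytics.marketing` database and schema are hypothetical names used only for illustration; the user name is reused from the network policy example above.

```sql
-- Hypothetical read-only role for one schema.
CREATE ROLE IF NOT EXISTS marketing_read;

-- A role needs USAGE on the containing database and schema
-- before it can read the tables inside them.
GRANT USAGE ON DATABASE analytics TO ROLE marketing_read;
GRANT USAGE ON SCHEMA analytics.marketing TO ROLE marketing_read;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.marketing TO ROLE marketing_read;

-- Users receive access only through the role, never directly.
GRANT ROLE marketing_read TO USER us_marketing_analysts;

-- Review what the role can do, and revoke it once it is no longer needed.
SHOW GRANTS TO ROLE marketing_read;
REVOKE ROLE marketing_read FROM USER us_marketing_analysts;
```

In practice, the role would also need usage on a virtual warehouse to run queries. That grant is omitted here to keep the sketch short.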
Here are some of the common security challenges in CDW authorization:
- Many users require changes in their data access, which, in many cases, has to be done by data engineering teams, creating a bottleneck.
- There is often no good process to revoke access to data that is no longer needed by users.
- Tracking access to sensitive data by users (which is often needed for compliance and security reasons) is hard to implement.
- Users often get overly broad access rights.
- After a while, without a clear access design, access permissions may become complicated and harder to manage.
There are solutions to these problems, such as enabling self-service data access, creating and enforcing clear security policies, and applying security access policies in an environment separate from the data warehouse itself.
Fine-Grained Access Control
In addition to managing data access to "coarse" objects such as tables, views, schemas, or databases, in many cases, there is also a need to apply fine-grained security. This may mean that you want certain users to be able to access only specific rows from a table (i.e., row-level security), or that you want to mask data dynamically based on the user (or their role). Different CDWs offer different capabilities here. On some platforms, you may need to engineer your way to such policies by using functions and views, and on other platforms, you can create policies and apply them to data objects.
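As a sketch of the "create policies and apply them to data objects" approach, here is what dynamic masking and row-level security can look like in Snowflake. The table `customers`, its `email` and `region` columns, and the role names are all hypothetical, and the availability of these features depends on your Snowflake edition.

```sql
-- Dynamic masking: only a privileged role sees the real value.
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_READER') THEN val
    ELSE '***MASKED***'
  END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;

-- Row-level security: a row is visible only when the expression returns TRUE.
CREATE OR REPLACE ROW ACCESS POLICY region_policy AS (region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'GLOBAL_ANALYST'
  OR (CURRENT_ROLE() = 'US_ANALYST' AND region = 'US');

ALTER TABLE customers ADD ROW ACCESS POLICY region_policy ON (region);
```

Hard-coding role names inside policies is the simplest form. At larger scale, the row filter is often driven by a mapping table instead, which is one reason these policies become hard to manage.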
Note that such policies have different implementations in each of these platforms. In addition, it is often hard to manage these capabilities at scale. Oftentimes, when this is done at scale, the company either builds an in-house overlay to automate access control or uses a data access solution.
Auditing and Monitoring
Auditing and monitoring your data access is an intrinsic part of your security, and a compliance requirement. Once again, different CDWs offer different levels of audit logs, as well as different steps needed to enable them. For example, in Snowflake, data access logs are available out of the box in the snowflake.account_usage schema (and can be read using SQL select queries), while with Amazon Redshift, you need to configure the export of query logs to S3 buckets.
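For example, a query along the following lines could list who has read a given table over the past week, using the `access_history` view in the `snowflake.account_usage` schema. This is a sketch: the table name `ANALYTICS.PUBLIC.CUSTOMERS` is hypothetical, and the availability and latency of this view depend on your Snowflake edition and account.

```sql
-- Who accessed a (hypothetical) sensitive table in the last 7 days?
SELECT
    ah.query_start_time,
    ah.user_name,
    obj.value:"objectName"::STRING AS object_name
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.base_objects_accessed) AS obj
WHERE obj.value:"objectName"::STRING = 'ANALYTICS.PUBLIC.CUSTOMERS'
  AND ah.query_start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY ah.query_start_time DESC;
```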
Prioritizing and Protecting Sensitive Data
Ensuring data security for CDWs requires resources and collaboration between data and security teams, which are already very busy. In many cases, one of the most important steps is knowing where your sensitive data is and prioritizing resources toward securing it. The controls discussed above are only a few examples of how security is handled differently across CDWs. This is a summary only, and we recommend checking each platform against your own specific requirements.
CDW security — and securing access to data in an age where more users are accessing more data that keeps on changing — is challenging. On the other hand, some areas of focus are taken out of the equation, allowing data teams to concentrate on increasing data-driven value in the company. All modern CDWs allow companies to manage data in a secure way. When evaluating CDWs, it’s important to bring security teams into the discussion and understand how the different security capabilities will come into play. In addition, some of the security elements may be handled outside of the CDW itself (in BI tools or data access platforms).
Regardless of the CDW you choose (a choice that involves many other factors, not only security capabilities), a company can set itself up for success in CDW security by having clear security policies, collaboration between data and security teams, and a plan to reduce risk continuously.