Google Cloud Storage Data Lake

The Google Cloud Storage (GCS) data lake destination uses Google Cloud Storage to store your event data within your GCP infrastructure, where you can access it later. It relies on the performance, scalability, security, and privacy controls that GCS provides.

Setting user permissions in GCS data lake

To set up GCS data lake as a destination in RudderStack, you will need to create a new user role and grant the required permissions to create schemas and temporary tables.

Creating the role and adding permissions

Go to the Google Cloud IAM Admin console and click CREATE ROLE.
Fill in the role details and click ADD PERMISSIONS.
Under Filter permissions by role, select Storage Object Admin and grant the required permissions.

The GCS data lake destination requires the following permissions:

storage.objects.create
storage.objects.get

Then, click CREATE to create the role.
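If you prefer to keep the role definition in a file rather than clicking through the console, the following is a minimal sketch of what such a definition could look like. It assumes the field names used by IAM custom role definition files (`title`, `description`, `stage`, `includedPermissions`); the role title, description, and the helper name `build_role_definition` are hypothetical and only serve as an illustration.

```python
import json

# The two permissions this guide lists for the GCS data lake destination.
REQUIRED_PERMISSIONS = ["storage.objects.create", "storage.objects.get"]


def build_role_definition(title, description):
    """Build a custom role definition as a plain dictionary.

    The key names follow the IAM custom role definition format; the title
    and description passed in are placeholders chosen by the caller.
    """
    return {
        "title": title,
        "description": description,
        "stage": "GA",
        "includedPermissions": list(REQUIRED_PERMISSIONS),
    }


# Hypothetical title and description, for illustration only.
role = build_role_definition(
    "RudderStack GCS Data Lake",
    "Lets RudderStack create and read objects in the data lake bucket",
)
print(json.dumps(role, indent=2))
```

The printed JSON can be saved to a file and reviewed before you create the role, which makes it easier to confirm that only the two required permissions are granted.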

Creating the service account and attaching the role

Go to the Service Accounts option in the Google Cloud IAM Admin console.
Then, select the project containing the dataset that you want to use.
Next, click CREATE SERVICE ACCOUNT.
Fill in the details as shown below and click CREATE.

Then, select the previously created role and click CONTINUE.

Finally, click DONE.

Creating and downloading the JSON key

Click the three dots under Actions for the service account that you just created and select Manage keys.

Click ADD KEY, followed by Create new key.

In the resulting pop-up, select JSON and click CREATE.

Finally, download this JSON file. This file is required while setting up the GCS data lake destination in RudderStack.
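Before pasting the key into RudderStack, it can help to check that the downloaded file is intact. The sketch below assumes the usual layout of a Google service account key file, with `type`, `project_id`, `private_key`, and `client_email` fields; the function name `check_service_account_key` is hypothetical, and the check only looks at the structure of the file, not at whether the key is still valid.

```python
import json

# Fields assumed to be present in a service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(raw_json):
    """Return a list of problems found in the key text (empty if none)."""
    try:
        key = json.loads(raw_json)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]

    # Report every required field that is absent or empty.
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if not key.get(name)
    ]
    # A user-account or other credential type is not a service account key.
    if key.get("type") and key["type"] != "service_account":
        problems.append(f"unexpected type: {key['type']}")
    return problems
```

For example, calling `check_service_account_key(open("key.json").read())` on a hypothetical `key.json` returns an empty list when all expected fields are present.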

Configuring GCS data lake destination in RudderStack

To send event data to GCS data lake, you first need to add it as a destination in RudderStack and connect it to your data source. Once the destination is enabled, events will automatically start flowing to GCS data lake via RudderStack.

To configure GCS data lake as a destination in RudderStack, follow these steps:

In your RudderStack dashboard, set up the data source. Then, select Google Cloud Storage Data Lake from the list of destinations.
Assign a name to your destination and then click Next.

Connection settings

GCS Storage Bucket Name: The name of the GCS bucket used to store data before loading it into the GCS data lake.
Prefix: If specified, RudderStack will create a folder in the bucket with this prefix and push all the data within that folder. For example, https://storage.googleapis.com/<bucketName>/<prefix>/.
Namespace: If specified, all the data for the destination will be pushed to https://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>. If you don't specify a namespace in the settings, RudderStack sets it to the source name, by default.
Table Suffix: This optional setting lets you define a custom path under which your table data is stored. For example, if you set this field to rs, your data will be pushed to https://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>/<table-name>/rs.
Choose time window layout: This option lets you set the timestamp-defined layout structure under which the table data will be stored.

GCS data lake destination settings in RudderStack

For example, if you choose the option Upto Hour (year=YYYY/month=MM/day=DD/hour=HH) and an event called Product Clicked is received at 2022-08-06T17:30:00.000Z, then the data will be stored in the following location:

https://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>/<table_name>/<table_suffix>/year=2022/month=08/day=06/hour=17/

The default value for this setting is YYYY/MM/DD/HH.

Credentials: Enter the content of the downloaded credentials JSON file in this field.
Sync Frequency: Specify how often RudderStack should sync the data to your GCS data lake.
Sync Starting At: This optional setting lets you specify the particular time of the day (in UTC) when you want RudderStack to sync the data to the data lake.
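To see how the Prefix, Namespace, Table Suffix, and time window layout settings combine, the following sketch assembles a storage location from those values. It is an illustration of the path patterns described above, not RudderStack's implementation: the function name `build_location`, the layout keys, and the sample bucket and namespace values are all hypothetical, and only the two layouts mentioned in this guide are modelled.

```python
from datetime import datetime, timezone

BASE_URL = "https://storage.googleapis.com"

# Illustrative layout keys mapped to strftime patterns for the two layouts
# mentioned in this guide.
LAYOUTS = {
    "default": "%Y/%m/%d/%H",
    "upto_hour": "year=%Y/month=%m/day=%d/hour=%H",
}


def build_location(bucket, namespace, table_name, received_at,
                   prefix="", table_suffix="", layout="default"):
    """Assemble the folder location for one table and one hour.

    received_at must be a timezone-aware datetime; it is converted to UTC
    because the time-based folders use UTC values.
    """
    utc_time = received_at.astimezone(timezone.utc)
    segments = [
        BASE_URL,
        bucket,
        prefix,              # skipped when empty
        "rudder-datalake",
        namespace,
        table_name,
        table_suffix,        # skipped when empty
        utc_time.strftime(LAYOUTS[layout]),
    ]
    return "/".join(part.strip("/") for part in segments if part) + "/"


# Hypothetical settings, matching the Product Clicked example above.
print(build_location(
    bucket="my-bucket",
    namespace="my_source",
    table_name="product_clicked",
    received_at=datetime(2022, 8, 6, 17, 30, tzinfo=timezone.utc),
    prefix="events",
    table_suffix="rs",
    layout="upto_hour",
))
```

Empty settings are dropped from the path in this sketch, which mirrors the statement that an unspecified prefix is omitted.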

Finding your data in the GCS data lake

RudderStack converts your events into Parquet files and dumps them into the configured GCS bucket. Before dumping the events, RudderStack groups the files by the event name based on the time (in UTC) they were received.

The folder structure is similar to the following format:

https://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>/<tableName>/YYYY/MM/DD/HH

As specified in the Connection settings section above:

<prefix> is the GCS prefix used while configuring the GCS data lake destination in RudderStack. If not specified, RudderStack will omit this value.
<namespace> is the namespace specified in the destination settings. If not specified, RudderStack sets it to the source name.
<tableName> is set to the event name.
YYYY, MM, DD, and HH are replaced by the actual time values.

A combination of the `YYYY`, `MM`, `DD`, and `HH` values represents the UTC time.

Use case

Suppose that RudderStack tracks the following two events:

Event Name	Timestamp
Product Purchased	`"2019-10-12T08:40:50.52Z" UTC`
Cart Viewed	`"2019-11-12T09:34:50.52Z" UTC`

RudderStack then converts these events into Parquet files and dumps them into the following folders:

Event Name	Folder Location
Product Purchased	`https://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>/product_purchased/2019/10/12/08`
Cart Viewed	`https://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>/cart_viewed/2019/11/12/09`
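The mapping in the tables above can be reproduced with a short sketch. It assumes that the table name is the event name in lowercase with non-alphanumeric characters replaced by underscores, which matches the two examples shown but is not a documented rule; the helper names `to_table_name` and `hour_folder` are hypothetical.

```python
import re
from datetime import datetime, timezone


def to_table_name(event_name):
    """Lowercase the event name and replace other characters with underscores.

    Assumption: this matches the examples in this guide
    ("Product Purchased" becomes "product_purchased").
    """
    return re.sub(r"[^a-z0-9]+", "_", event_name.lower()).strip("_")


def hour_folder(event_name, timestamp):
    """Return the <tableName>/YYYY/MM/DD/HH part of the folder location.

    timestamp is expected in the form used above, for example
    "2019-10-12T08:40:50.52Z", and is interpreted as UTC.
    """
    received = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    received = received.replace(tzinfo=timezone.utc)
    return to_table_name(event_name) + "/" + received.strftime("%Y/%m/%d/%H")


for name, ts in [
    ("Product Purchased", "2019-10-12T08:40:50.52Z"),
    ("Cart Viewed", "2019-11-12T09:34:50.52Z"),
]:
    print(name, "->", hour_folder(name, ts))
```

The returned value is only the trailing part of the location; the bucket, prefix, and namespace segments come before it as described earlier.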

IPs to be allowlisted

To enable network access to RudderStack, allowlist the following RudderStack IPs depending on your region and RudderStack Cloud plan:

Plan	US region	EU region
Free, Starter, and Growth	23.20.96.9, 18.214.35.254, 52.38.160.231, 34.211.241.254	18.198.90.215, 18.196.167.201
Enterprise	34.198.90.241, 54.147.40.62, 3.216.35.97, 100.20.239.77, 44.236.60.231	3.66.99.198, 3.64.201.167

All the outbound traffic is routed through these RudderStack IPs.
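When you audit firewall rules or access logs, a small lookup can confirm whether a given address belongs to the list above. This sketch uses Python's standard `ipaddress` module; the plan keys and the function name `is_rudderstack_ip` are hypothetical, and the addresses are copied from the table in this section, so they need updating if that list changes.

```python
import ipaddress

# Addresses copied from the table above, keyed by (plan group, region).
RUDDERSTACK_IPS = {
    ("free_starter_growth", "US"): [
        "23.20.96.9", "18.214.35.254", "52.38.160.231", "34.211.241.254",
    ],
    ("free_starter_growth", "EU"): ["18.198.90.215", "18.196.167.201"],
    ("enterprise", "US"): [
        "34.198.90.241", "54.147.40.62", "3.216.35.97",
        "100.20.239.77", "44.236.60.231",
    ],
    ("enterprise", "EU"): ["3.66.99.198", "3.64.201.167"],
}


def is_rudderstack_ip(address, plan, region):
    """Return True if address is listed for the given plan group and region."""
    allowed = {ipaddress.ip_address(ip) for ip in RUDDERSTACK_IPS[(plan, region)]}
    return ipaddress.ip_address(address) in allowed
```

Parsing the addresses with `ipaddress.ip_address` rather than comparing strings also rejects malformed input with a `ValueError`.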

FAQ

What are the files written in the location `<namespace>/rudder-warehouse-staging-logs/`?

RudderStack collects all the unprocessed data flowing to the warehouse destinations as staging files. It stores these files in the object storage at the location https://storage.googleapis.com/rudder-warehouse-staging-logs/.

Once the staging files are processed, RudderStack separates them by the event name and sends them to the specified destination in the following format:

https://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>/<tableName>.

For a more comprehensive FAQ list, refer to the Warehouse FAQ guide.

Contact us

For more information on the topics covered on this page, email us or start a conversation in our Slack community.
