Connect to Cloud-Based Data
Data connections link your LiveRamp Clean Room organization to your data at your cloud provider so that it can be accessed in a clean room. Once connected, the data can be queried for lists and reports within clean rooms.
Data connections can be configured for any cloud-based storage or data warehouse location, including AWS, GCS, Azure Blob, Snowflake, Google BigQuery, and Databricks. Connections are specific to both the cloud provider and the clean room type, so the exact configuration depends on the source storage location and the data types and structures involved. For example, Snowflake data connections must be configured differently for use in hybrid clean rooms versus Snowflake clean rooms, depending on whether the data being used lives in different clouds and/or cloud regions.
Note
Your Clean Room representative will work with you to determine the type(s) of data connections you’ll need for your situation.
Once you've determined the type of data connections you'll need (based on your data source and preferred configuration type), select the appropriate article to see specific configuration steps.
Each data connection results in a single dataset within LiveRamp Clean Room. All data files in a data connection job must have the same schema in order to process successfully.
To make distinct tables or sets of files available as separate datasets, create a data connection for each table or set of files.
Data Connection Prerequisites
Before creating a new data connection, prepare the desired data and have it present in your cloud location. This helps speed up connecting to the data.
When creating a data connection, you will need to either use existing credentials that you've previously created for that cloud provider or add a new credential during the process.
Next Steps After Connecting Your Data
After you’ve created the data connection and Clean Room has validated it by connecting to the data in your cloud account, you will need to map the fields before the data connection is ready to use. This is where you specify which fields should be queryable across clean rooms, which fields contain identifiers to be used for matching, and any columns by which you want to partition the dataset for questions.
After fields have been mapped, you’re ready to provision the resulting dataset to your desired clean rooms. Within each clean room, you’ll be able to set dataset analysis rules, exclude or include columns, filter for specific values, and set permission levels.
Data Connection FAQs
See the FAQs below for common data connection questions.
Why does partitioning matter?
Partitioning optimizes the dataset even before you get to the query stage: because data processing during question runs occurs only on the relevant filtered data, query performance improves. For more information, see “Data Connection Partitioning”.
What are the best practices for partitioning?
Data partitioning (dividing a large dataset into smaller, more manageable subsets) is recommended because it optimizes query performance and leads to faster processing times. By indicating partition columns for your data connections, data processing during question runs occurs only on the relevant filtered data, which reduces query cost and execution time. Best practices include:
Partition at the source: When configuring your data connection to LiveRamp Clean Room, define partition columns.
Consider the collaboration context: Make sure that the partition columns make sense for the types of questions that a dataset is likely to be used for. For example:
If you anticipate questions that analyze data over time, partition the dataset by a date field (e.g., event_date or impression_date). This allows queries that filter by date ranges to scan only relevant partitions, reducing processing time and costs.
If the main use case is to analyze data by different brands or products, then partitioning by a brand or product_id column makes sense. This strategy ensures that queries filtering by brand will only access the necessary subset of the data.
Verify column data types: Partitioning supports date, string, integer, and timestamp field types. Complex types (such as arrays, maps, or structs) are not allowed.
Cloud-specific formatting: For cloud storage sources like S3, GCS, and Azure, structure your buckets and file paths in a partitioning format based on the partition column. For BigQuery and Snowflake, make sure columns are indicated as partition keys in your source tables. (See the sketches following this list.)
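For cloud storage sources, "partitioning format" means a hive-style folder layout in which each partition value gets its own subdirectory. Here's a minimal sketch of how such a layout can be produced, assuming a pandas/pyarrow export step; the dataset, column, and path names are illustrative only:

```python
import pandas as pd

# Illustrative events table with a date column to partition by.
df = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3"],
        "brand": ["acme", "acme", "globex"],
        "event_date": ["2024-06-01", "2024-06-01", "2024-06-02"],
    }
)

# partition_cols writes one subdirectory per partition value (hive-style), e.g.:
#   events/event_date=2024-06-01/<file>.parquet
#   events/event_date=2024-06-02/<file>.parquet
# The same layout applies when writing directly to s3://, gs://, or abfs:// paths
# (with the appropriate filesystem library installed).
df.to_parquet("events", partition_cols=["event_date"], engine="pyarrow")
```

Because each partition value becomes its own folder, questions that filter on the partition column only need to read the matching folders.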
For more information, see "Data Connection Partitioning".
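For warehouse sources, the partition key is declared on the source table itself. As a minimal sketch using the google-cloud-bigquery Python client (the project, dataset, table, and column names here are hypothetical), a BigQuery source table partitioned by a date column could be defined like this:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical source table; replace the project, dataset, and table names with your own.
table = bigquery.Table(
    "my-project.my_dataset.events",
    schema=[
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("brand", "STRING"),
        bigquery.SchemaField("event_date", "DATE"),
    ],
)

# Declare event_date as the partition key so queries that filter on it
# scan only the relevant daily partitions.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
)

client.create_table(table)
```

For Snowflake, similarly make sure the intended partition column exists on the source table so it can be indicated as a partition key when configuring the connection.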
How should I map fields if my data contains a RampID column?
If your data contains a column with RampIDs, do not slide the PII toggle for that column. Instead, mark the RampID column as a User Identifier and select "RampID" as the identifier type. If the data contains a RampID column, no other columns can be enabled as PII.