
Data Connection Partitioning

LiveRamp Clean Room enables you to designate partition columns for your data connections to optimize query performance. Data partitioning is the process of dividing a large dataset into smaller, more manageable subsets. For partitioned data connections, question runs process only the relevant data, which leads to faster processing times.

Supported Data Sources and Field Types

LiveRamp Clean Room currently supports the configuration of partition columns for the following data sources:

  • BigQuery

  • Snowflake

  • S3

  • GCS

  • Azure

LiveRamp Clean Room currently supports the configuration of partition columns for the following field types:

  • Date

  • String

  • Integer

  • Timestamp

Note

LiveRamp Clean Room doesn’t currently support ArrayType, MapType, or StructType field types. Here are two common approaches we've seen clients use as a substitute for array field types:

  • Bring the segment/attribute data field in as a STRING field type with comma-separated values. Once the data is in LiveRamp Clean Room, you can flatten the data within your SQL queries using functions like explode or split (see the example following this list).

  • Denormalize the data using your customID as the key, which results in multiple rows for each customID, one for each segment.
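
For example, here's a minimal Spark-style SQL sketch of the first approach; the table name crm_segments and the columns custom_id and segments are hypothetical placeholders for your own schema:

  -- segments holds comma-separated values such as 'sports,travel,finance'
  SELECT
    custom_id,
    explode(split(segments, ',')) AS segment
  FROM crm_segments;

The result contains one row per custom_id and segment pair, which is also the denormalized shape described in the second approach.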

For more information on supported field types, see "Field Types for Data Connections".

Prerequisites

  • Understand your collaboration context: Currently, defining partition columns at the data connection level means you or your partners will always be asked to define the partition values (the values on which to filter data) when executing a question run using the associated dataset. This means you must ensure that the partition columns you define make logical sense for the questions to which the dataset is likely to be assigned.

    For example, publishers who define question logic and operate multiple clean rooms with the same questions can typically assign partition columns to their data connections without issue.

    Advertisers working in multiple clean rooms may wish to assign partition columns only if they know filtering on those columns will always work for the types of questions to which the dataset will likely be assigned.

  • Understand best practices for publisher measurement use cases: For these use cases, we recommend partitioning on the following fields:

    • Exposure Date or Impression Date (for exposure or impression data)

    • Date or Timestamp (for transaction or conversion data)

    • Brand (if applicable)

  • Partitioned data: If you plan to define partition columns, make sure your data is partitioned before configuring the connection to LiveRamp Clean Room. For BigQuery and Snowflake, make sure the columns are indicated as partition keys. For S3, GCS, and Azure, make sure your buckets/file paths are structured in a partitioning format based on the partition column (see the example below).
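
    For instance, a date-partitioned cloud storage layout might look like the following (the bucket and folder names are illustrative only):

    s3://your-bucket/conversions/date=2024-06-01/part-0001.parquet
    s3://your-bucket/conversions/date=2024-06-02/part-0001.parquet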

Configure Partition Columns

To use partition columns, indicate which columns you would like to use for partitioning when you configure a new data connection. The process is as follows:

  1. Select the appropriate data source from the list of options above.

  2. Fill in the configuration fields appropriate to the data connection, optionally choosing to point to a sample file path from which LiveRamp Clean Room will define the schema for the dataset. If a sample file path is not provided, LiveRamp Clean Room will attempt to infer the schema.

    Note

    If you're partitioning by date using the file path, note that only the "YYYY-MM-DD" date format is supported.

    For example:

    s3://habu-client-org-1234/brand-id=123/1.parquet

    CR-Data_Connection_Partitioning-configurations.png
  3. Once the data connection has been saved and verified and has moved to the "Mapping Required" status, configure the schema fields and their data types, and indicate their PII or unique identifier status. During this step, you will also be prompted to indicate which columns should be used as partition columns by sliding the toggle in the "Allow Partitions" column to the right.

    Note

    • Any columns you select will be used to filter data for question runs of any question to which the associated dataset is assigned.

    • As a best practice, we recommend you partition the following fields for publisher measurement use cases:

      • Exposure Date or Impression Date (for exposure or impression data)

      • Date or Timestamp (for transaction or conversion data)

      • Brand (if applicable)

    CR-Data_Connection_Partitioning-map_fields.png

Partitioned Question Runs

Once you have allowed partitions for a particular data connection, you and your partners will be prompted to indicate partition filter values when requesting question runs using the associated dataset.

For example, if you have a CRM dataset with partition column = city, you will be asked to supply a value for city when running the assigned question like so:

CR-Data_Connection_Partitioning-partition_values.png
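
Conceptually, supplying a partition value such as city = 'Seattle' restricts the run to that partition, roughly as if the question's SQL included a filter like the following (the dataset name and value below are illustrative only):

  -- Only the city = 'Seattle' partition is scanned during the question run
  SELECT COUNT(*) AS matched_records
  FROM crm_dataset
  WHERE city = 'Seattle';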