Format Your Data
See the sections below for information on formatting your data, including the recommended schema and format for each dataset type you might need to connect to LiveRamp Clean Room.
Note
If you’re utilizing LiveRamp Embedded Identity in your cloud environment, the formatting guidelines are somewhat different. For more information, see “Embedded Identity in Cloud Environments” in our Identity docs site.
While not all fields might be required for your current workflows and collaborations, we recommend including as many of these fields as possible now. This makes it easier to expand your collaborations in the future without you (or your IT team) having to perform additional work on your datasets.
The recommended field names shown aren’t mandatory, but using the recommended names can reduce the number of transformations you and your collaborators might need to do during dataset provisioning or query writing.
All data files or tables in a data connection job must have the same schema to process successfully. If you have multiple schema types, create a new data connection for each one. The order of the columns should match the schema.
Recommended Dataset Types
When preparing data for collaboration, consider the following recommended dataset types:
Universe dataset: This represents your full audience and likely includes all user identifiers (PII touchpoints or online identifiers) that will be used to resolve your data to RampIDs.
Note
For brands, you might consider your CRM dataset to be your “universe” of consumers.
CRM dataset: Includes attribute (segment) data on consumers.
Conversions dataset: Includes data on conversions, such as transactions.
Impressions dataset: Includes data on impressions served or other ad events, along with metadata on the impressions or events.
Publisher audience dataset: For publishers, this includes data on user segments in your platform.
Product dimensions dataset: For Clean Room users with products, this includes metadata on your products.
Store dimensions dataset: For Clean Room users with stores, this includes metadata on your stores.
With the exception of datasets for product and store dimensions, all datasets should include a field for a CID (a unique customer identifier you leverage for deduplicating individuals across your datasets):
For PII-based universe datasets, we recommend that plaintext CIDs (not hashed) be sent (if you must hash the CIDs, we ask that you consistently use the same hashing for CIDs in other datasets for interoperability).
For all other datasets (including universe datasets based on online identifiers), if you sent plaintext CIDs in your universe dataset, send MD5-hashed CIDs (see the hashing sketch below). If you hashed the CIDs in your universe dataset, use the same hashing type.
Note
If you use a combination of LiveRamp (SFTP) and Embedded Identity (Cloud Hosting), or if you send different dataset types, you must ensure that any hashed data is hashed using MD5 hashing in all files.
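For illustration, here is a minimal Python sketch of MD5-hashing a CID. The function name, the sample value, and the trim/lowercase normalization are assumptions; the requirement from the guidelines above is simply that the same hashing be applied consistently in every file:

```python
import hashlib

def md5_cid(cid: str) -> str:
    # Normalize the same way everywhere (here: trim, lowercase, UTF-8)
    # so an identical CID produces an identical lowercase hex digest
    # across all datasets.
    normalized = cid.strip().lower().encode("utf-8")
    return hashlib.md5(normalized).hexdigest()

print(md5_cid("customer-12345"))
```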
General Formatting Guidelines
When formatting your data before creating data connections in LiveRamp Clean Room, keep the following guidelines in mind:
CSV and Parquet files supported: For file-based data connections, LiveRamp Clean Room supports CSV and Parquet files, as well as multi-part files.
File size:
Parquet files and CSV files for cloud-based data connections should not be larger than 500 MB.
Files for CSV uploads should not be larger than 50 MB.
Use UTF-8 character encoding: To ensure the greatest match rate between collaborator datasets, we recommend using UTF-8 character encoding (including before hashing identifiers).
Identifiers: Each row in a file or table in a data connection job must contain identifiers. LiveRamp Clean Room supports multiple identifier types, including the identifiers listed in “Formatting Identifiers”. If your file or table does not include an identifier field, add a new field and insert a random ID for every record (see the sketch after this list). This placeholder field must be mapped as the user identifier during data setup.
Hashing identifiers: It is best practice to select a hashing algorithm which matches the NIST standard and results in a final hash value that uses lowercase characters. For more information on hashing identifiers, see "Hashing Identifiers".
File headers: All files must have a header in the first row. The name of each header column in a dataset file constitutes an attribute, which is a key for each record in the dataset. For more information on formatting headers, see the “Guidelines for Standard Dataset Headers, File Names, and Folder Names” section below.
Delimiters: If you’re using CSV files, delimiters must be specified.
Remove special characters from CSV files: When using CSV files, remove any special characters from your data files. For example, if your file contains integers formatted with commas (such as “123,456”), remove the comma formatting (“123456”).
Date formatting: The folder where the data files or tables are dropped includes a date macro, {yyyy-MM-dd}. When the files are dropped, replace the macro with the appropriate date string, without the curly brackets (for example, “2025-07-01”). LiveRamp Clean Room does not accept datetime strings.
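To make a few of these guidelines concrete, here is a minimal Python sketch (the function names are hypothetical) covering the random placeholder ID, comma removal, and the folder date string:

```python
import uuid
from datetime import datetime, timezone

def placeholder_id() -> str:
    # Random ID for records with no identifier; map this field as the
    # user identifier during data setup.
    return str(uuid.uuid4())

def strip_comma_formatting(value: str) -> str:
    # "123,456" -> "123456"
    return value.replace(",", "")

# Date string in the form expected in place of the {yyyy-MM-dd} macro:
# no curly brackets and no time component, e.g. "2025-07-01".
folder_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
```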
Guidelines for Standard Dataset Headers, File Names, and Folder Names
File and table names, folder names, and headers for files and tables for standard dataset types (such as CRM, transaction, or ad log) must meet the following guidelines:
Start with a letter (not a digit or a special character, such as an underscore)
End with either a letter or a digit
Not contain special characters (hyphens (-) and underscores (_) are allowed)
Not contain whitespace, such as spaces and control characters (such as form feed (\f), carriage return (\r), line feed/new line (\n), or tab (\t))
Not contain duplicate header names within the same data connection
Not be longer than 255 characters
Here are a few examples of valid names (header examples shown):
Header1
header_name
HEADER-NAME
HEADER___NAME__HERE-1
Here are a few examples of invalid names (header examples shown):
123header: Starts with a non-alphabetic character
header123_: Ends with a non-alphanumeric character
header@name: Contains a special character
header name: Contains a space
header\nname: Contains non-space whitespace
\uFEFFHeaderName: Contains a BOM character
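As a rough self-check, here is a minimal Python sketch (the pattern and function name are illustrative, not an official validator) that tests a single name against the rules above; duplicate names still need to be checked across the whole header row:

```python
import re

# Starts with a letter; may contain letters, digits, hyphens, and
# underscores; ends with a letter or digit; at most 255 characters.
HEADER_PATTERN = re.compile(r"^[A-Za-z](?:[A-Za-z0-9_-]{0,253}[A-Za-z0-9])?$")

def is_valid_header(name: str) -> bool:
    return bool(HEADER_PATTERN.fullmatch(name))

assert is_valid_header("header_name")
assert not is_valid_header("123header")         # starts with a digit
assert not is_valid_header("header name")       # contains a space
assert not is_valid_header("\ufeffHeaderName")  # leading BOM character
```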
Managing BOM Characters
BOM (Byte Order Mark) characters can occur in the header of a CSV file due to the file initially being saved or written out with encoding that includes the BOM, often UTF-8 with BOM.
When a file is saved with UTF-8 encoding, some systems prepend the contents with BOM to indicate that it's a UTF-8 file. BOM can cause problems when reading the file because it might be interpreted as part of the actual data (like the header), rather than metadata about the byte order and encoding.
If a BOM character is detected at the start of your file, an error will appear in the data connection within LiveRamp Clean Room. If this occurs, remove the BOM character from your data as described below before uploading the file again (you can re-upload within the same data connection):
Unix-Based OS (Mac OS, Linux)
Confirm the existence of the BOM character by running the following shell command in a terminal session:
cat /path/to/your/input.csv | head
If a BOM character is present, the characters <U+FEFF> (or a similar sequence for other UTF encodings) will appear at the beginning of the output.
Alternatively, for a more precise diagnosis, run the following shell command. The BOM byte sequence that appears depends on the file's encoding; refer to the “BOM Character Sequences” section below for more information.
head -n 1 /path/to/your/input.csv | od -t x1 | head -n 1 | awk '{print $2,$3,$4}'
To remove a UTF-8 BOM character within the terminal session, run the following shell command:
sed '1s/^\xEF\xBB\xBF//' /path/to/your/input.csv > /path/to/your/output.csv
Windows OS
Confirm the existence of the BOM character by running one of the following PowerShell commands in a Windows PowerShell session:
Get-Content inputFile.csv -Encoding Byte -TotalCount 3 # for UTF-8 and UTF-16
Get-Content inputFile.csv -Encoding Byte -TotalCount 4 # for UTF-32
The BOM byte sequence that appears depends on the file's encoding. Refer to the table in the “BOM Character Sequences” section below.
If the file is UTF-8 encoded, either remove the BOM character with the following PowerShell commands or re-export the file as UTF-8 without BOM:
$inputFile = 'path\to\your\inputFile.csv'
$outputFile = 'path\to\your\outputFile.csv'
$encoding = New-Object System.Text.UTF8Encoding $false
[System.IO.File]::WriteAllLines($outputFile, (Get-Content $inputFile), $encoding)
Re-upload the fixed file to the same data connection location. You may need to adjust the date folder based on the current UTC date.
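As a cross-platform alternative to the shell and PowerShell commands above, here is a minimal Python sketch (the file paths are placeholders) that detects and strips a UTF-8 BOM:

```python
import codecs

def strip_utf8_bom(in_path: str, out_path: str) -> bool:
    """Return True if a UTF-8 BOM was found and removed."""
    with open(in_path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):  # b"\xef\xbb\xbf"
        with open(out_path, "wb") as f:
            f.write(data[len(codecs.BOM_UTF8):])
        return True
    return False

strip_utf8_bom("input.csv", "output.csv")
```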
BOM Character Sequences
Encoding | Windows Sequence (decimal bytes) | UNIX Sequence (hex bytes) |
---|---|---|
UTF-8 | 239 187 191 | ef bb bf |
UTF-16 (Little Endian) | 255 254 | ff fe |
UTF-16 (Big Endian) | 254 255 | fe ff |
UTF-32 (Little Endian) | 255 254 0 0 | ff fe 00 00 |
UTF-32 (Big Endian) | 0 0 254 255 | 00 00 fe ff |
Formatting Datasets
See the sections below for information on the recommended schema and format for each dataset type you might need to connect to LiveRamp Clean Room.
Format a Universe Dataset
The universe dataset should represent your full audience and should include all user identifiers (PII touchpoints or online identifiers) that will be used during identity resolution to resolve to RampIDs.
LiveRamp uses this dataset to create a mapping between your CIDs and their associated RampIDs. This mapping lives in a linked dataset and allows you to use RampIDs as the join key between the various datasets in queries.
For information on formatting and hashing identifiers, see “Formatting Identifiers”.
Note
When sending PII, it’s important that as many PII touchpoints as possible are provided for LiveRamp’s identity resolution capabilities to yield the best results.
Your CRM dataset might also be able to function as a universe dataset.
Field Contents | Recommended Field Name | Field Type | Values Required? | Description/Notes |
---|---|---|---|---|
A unique user ID | cid | string | Yes | |
Consumer’s first name | first_name | string | Yes (if Name and Postal is used as an identifier) | |
Consumer’s last name | last_name | string | Yes (if Name and Postal is used as an identifier) | |
Consumer’s address | address_1 | string | Yes (if Name and Postal is used as an identifier) | |
Consumer’s additional address information | address_2 | string | No | |
Consumer’s city | city | string | Yes (if Name and Postal is used as an identifier) | |
Consumer’s state | state | string | Yes (if Name and Postal is used as an identifier) | |
Consumer’s ZIP Code or postal code | zip | string | Yes (if Name and Postal is used as an identifier) | |
Consumer’s best email address | email_1 | string | Yes (if email is used as an identifier) | |
Consumer’s additional email address | email_2 | string | No | |
Consumer’s additional email address | email_3 | string | No | |
Consumer’s additional email address | email_4 | string | No | |
Consumer’s best phone number | phone_1 | string | Yes (if phone is used as an identifier) | |
Consumer’s additional phone number | phone_2 | string | No | |
Consumer's mobile device ID (MAID) | maid | string | Yes (if MAIDs are used as identifiers) | |
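For example, a PII-based universe file using these recommended field names might begin like this (the values are fabricated purely for illustration):

```
cid,first_name,last_name,address_1,address_2,city,state,zip,email_1,email_2,email_3,email_4,phone_1,phone_2,maid
C-0001,Jane,Doe,123 Main St,,Springfield,IL,62704,jane.doe@example.com,,,,5555550123,,
```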
Format a CRM Dataset
Your CRM dataset should contain a CID for each consumer, as well as attribute (segment) data fields.
Field Contents | Recommended Field Name | Field Type | Values Required? | Description/Notes |
---|---|---|---|---|
A unique user ID | cid | string | Yes | |
Consumer attribute category | <User Attribute 1> | varies | No | |
Consumer attribute category | <User Attribute 2> | varies | No | |
Consumer attribute category | <User Attribute 3> | varies | No | |
Consumer attribute category | <User Attribute 4> | varies | No | |
Format a Conversions Dataset
The conversions dataset should include information on the desired conversions (such as transactions or downloads).
Note
If you do not have values for any fields in this schema, we recommend that you still include those fields as placeholders.
Field Contents | Recommended Field Name | Field Type | Values Required? | Description/Notes |
---|---|---|---|---|
A unique user ID | cid | string | Yes | |
Whether the consumer is a loyalty customer | is_card_holder | integer | No | |
Country where the conversion occurred | country_id | string | No | |
Unique identifier for the transaction | transaction_id | string | Yes | |
UTC timestamp of the conversion event | conversion_timestamp | timestamp | Yes | |
Unique identifier for SKU or product | product_id | string | Yes | Foreign key for the product dimensions table |
Barcode number | barcode | string | No | |
Unique identifier for the sales channel | store_id | string | No | |
Unique identifier for a line within a receipt | trans_line_number | string | No | |
Units of a particular SKU/product ID in transaction | trans_line_quantity | integer | No | |
Value of a particular SKU/product ID in transaction | trans_line_value | numeric | No | |
Line-level discounts | trans_line_disc_value | numeric | No | |
Whether this line represents a promotion product | trans_line_promo | integer | No | |
Total number of items in an order/basket | trans_total_quantity | integer | No | |
Total value of order/basket | trans_total_value | numeric | No | |
Total discounts | trans_total_disc_value | numeric | No | |
Whether this line represents returned product | return_flag | integer | No | |
ISO currency code | currency | string | No | |
Conversion rate to USD | currency_rate | numeric | No | |
Units × unit price of a particular SKU/product ID in transaction | gross_amt | double | Yes | |
Name of the SKU or product | product_name | string | No | |
Where the transaction took place | sales_channel | string | No | |
Product categorization | division | string | No | |
Brand name of the product | brand_name | string | No | |
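Because gross_amt is defined as units multiplied by unit price, here is a quick hypothetical check in Python (unit_price is an assumed source value, not a field in the schema above):

```python
# Hypothetical line item: 3 units at 4.99 each.
trans_line_quantity = 3
unit_price = 4.99  # assumed source value; not part of the schema
gross_amt = trans_line_quantity * unit_price
assert round(gross_amt, 2) == 14.97
```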
Format an Exposures Dataset
An exposures dataset should include data on impressions served or other ad events, along with metadata on those impressions or events.
Note
If you do not have values for any fields in this schema, we recommend that you still include those fields as placeholders.
Field Contents | Recommended Field Name | Field Type | Values Required? | Description/Notes |
---|---|---|---|---|
A unique user ID | cid | string | Yes | |
UTC timestamp of the event | event_timestamp | timestamp | Yes | |
Unique ID for the creative served | creative_id | string | No | |
Name of the creative | creative_name | string | No | |
Unique ID for the placement, ad group, ad set, or line item | placement_id | string | No | |
Name of the placement, ad group, or line item | placement_name | string | No | |
Unique ID for the campaign | campaign_id | string | No | |
Name of the campaign | campaign_name | string | No | |
Unique ID of the publisher or site where the exposure occurred | property_id | string | No | |
Location on a publisher site, the channel, or media type where the exposure occurred | property_name | string | No | |
Segment ID that the user belonged to at the time of exposure | segment_id | string | No | |
Unique ID for the advertiser account | account_id | string | No | |
Name of the advertiser account | account_name | string | No | |
Unique ID for the brand | advertiser_id | string | No | |
Name of the brand | advertiser_name | string | No | |
Type of placement or line item | placement_type | string | No | |
Type of event | event_type | string | No | |
Cost of the event to the advertiser | cost | double | No | |
Unique identifier for the event | event_id | string | No | |
Format a Publisher Audience Dataset
For publishers, a publisher audience dataset includes data on user segments in your platform.
Field Contents | Recommended Field Name | Field Type | Values Required? | Description/Notes |
---|---|---|---|---|
A unique user ID | cid | string | Yes | |
Unique identifier of audience segment used for targeting this impression | segment_id | string | | |
The name of the segment | segment_name | string | | |
The name of the category | segment_category | string | | |
Format a Product Dimensions Dataset
For Clean Room users with products, a product dimensions dataset includes metadata on your products.
Field Contents | Recommended Field Name | Field Type | Values Required? | Description/Notes |
---|---|---|---|---|
Product ID | product_id | string | Yes | A 1-to-N relationship between product_id and barcode is accepted (for example, product_id 123 linked to barcode ABC). |
Barcode number | barcode | string | Yes | |
The product name | product_name | string | Yes | |
The product description | product_desc | string | No | |
Country ID | country_id | string | No | |
The name of the manufacturer | supplier_name | string | Yes | Should be the name of the company manufacturing the brand/item, not a potential third-party supplier. |
Supplier key | supplier_key | integer | No | |
The name of the brand | brand_name | string | Yes | |
Brand key | brand_key | integer | No | i.e., brand_id (a unique identifier for the product brand) |
The name of the first-level hierarchy category this product belongs to | hierarchy_level1_name | string | Yes | Such as the business group |
The name of the second-level hierarchy category this product belongs to | hierarchy_level2_name | string | Yes | Such as the sector |
The name of the third-level hierarchy category this product belongs to | hierarchy_level3_name | string | Yes | Such as the department |
The name of the fourth-level hierarchy category this product belongs to | hierarchy_level4_name | string | Yes | Such as the class |
The name of the fifth-level hierarchy category this product belongs to | hierarchy_level5_name | string | Yes | Such as the category (product category) |
The name of the sixth-level hierarchy category this product belongs to | hierarchy_level6_name | string | Yes | Such as the subcategory (product subcategory) |
Need unit of product | need_unit | string | No | |
Whether this is a discontinued product | discontinued_flag | integer | No | |
Whether this is a white-label product | is_white_label | integer | No | |
Whether this is a promotion product | is_promo | integer | No | |
Format a Store Dimensions Dataset
For Clean Room users with stores, a store dimensions dataset includes metadata on your stores.
Field Contents | Recommended Field Name | Field Type | Values Required? | Description/Notes |
---|---|---|---|---|
Main identifier for stores | store_main_id | string | No | |
Unique identifier for sales channel | store_id | string | No | |
Store name | store | string | No | |
Country of store | country_id | string | No | |
Store region | store_region | string | No | |
Store state | store_state | string | No | By default, use the state or the official region name for the store (or store DMA). |
City of store | store_city | string | No | |
Store's postal code | postal_code | string | No | |
Whether the store is closed | discontinued_flag | integer | No | |
Channel type | channel_type | string | No | |
Channel name | channel_name | string | No | |
Delivery type | delivery_type | string | No | |
Latitude of store location | latitude | decimal | No | |
Longitude of store location | longitude | decimal | No | |