Configure a Data Preparation Process

To maximize identity resolution outcomes, you can configure LiveRamp Identity Engine's data preparation process to:

  • Create and declare a data source.

    Note

    If you have multiple data sources, you can create multiple data preparation processes for a workflow.

  • Map data sources to Identity Engine's standard data model.

  • Transform your data, such as removing null or invalid values, applying data hygiene, and mapping to a common taxonomy.

  • Enable data enrichment with deterministic signals and metadata from the LiveRamp Known Identity Graph.

You cannot configure certain default tasks, such as data normalization and ID enrichment.

Configure Data Preparation

Procedure. To configure a data preparation process:
  1. From the Workflow Editor, click the yellow caution icon and select Configure Process. Optionally, to reuse an existing configuration, click Upload and select a JSON file that contains a data preparation configuration.

    Identity_Engine-Configure_Data_Preparation.png

    The Configure Data Preparation dialog is displayed. If you uploaded a JSON file, the fields and options are prefilled, and you can modify them or leave them as is.

  2. In the Source Name box, enter a unique name for the source file.

  3. From the File Format list, select one of the following source file formats:

    • Avro

    • CSV

    • JSON

    • Parquet

    • Text

    For more information, see "File Schema".

  4. From the Refresh Mode list, select one of the following:

    • Replace existing data (Full): Every update is a full refresh that replaces the file as a whole, without comparison to the previous source.

    • Append to existing data (Incremental): Each update contains only changes to the file: updated lines, additional lines, and removed lines. Any data not present in the updated file remains unchanged in the source.

  5. From the Data Type list, select one of the following:

    • Consumer: Indicates data that will be used in resolution to create your graph and Enterprise IDs.

    • Consent: Indicates data that won't be used for resolution purposes, but that will instead inform consent decisions on your graph build and export processes.

    • Enrichment: Use this option when you need to match a data source to your graph but not build new Enterprise IDs based on those records.

  6. In the Source Prefix box, enter a string of up to 10 characters that identifies a specific source in a workflow. This can be helpful if your workflow has multiple data sources. The prefix persists with the source once it is onboarded.

  7. If you want to enable LiveRamp's Known Identity Graph and enrich input data with AbiliTec links and metadata, select the Match to LiveRamp Known Identity Graph check box.

  8. In the Source File Location box, enter a path to the source file. For example, gs://my-home-bucket/myfolder/mydata/*.csv.

    If the source file is already ingested in Identity Engine, exclude gs:// and the bucket name. For example, myfolder/mydata/*.csv.

    The files will then be copied and sorted to find the correct file based on its date.

    Note

    If your instance is deployed remotely, the retention window for the data source bucket must be longer than the interval between identity graph builds.

  9. Click Next. The File Format Configuration step is displayed.

  10. Depending on the file format you declared, select one of the following options:

    • Upload Schema: Upload a JSON blob or a text file.

    • Paste Schema: Enter a list of header names separated by comma, pipe, tab, or semicolon.

    Identity_Engine-Configure_Data_Preparation_File_Format-Upload_Schema.png

    Tip

    • CSV format is typically the simplest way to copy and paste the file schema (also known as "headers"). The values must use the same delimiter that is used within the CSV file. If header names include quotes, the quote characters are automatically removed.

    • For JSON and Avro file formats, you can provide a file schema instead of headers. To generate the JSON schema, create a small sample file of input data and read it locally in Spark to auto-infer the schema. Then run println(df.schema.prettyJson) to print the JSON schema.

    Identity_Engine-Configure_Data_Preparation_File_Format.png
  11. (Optional) Enter format values to customize how certain input files are read and to help Spark properly read the input data. For example, you can specify multi-line, encoding, and time format options as key-value pairs. For more information, see Spark's "JSON Files" documentation.

  12. Click Next. The Entity Mapping step appears, which allows you to add and customize entities to match your data source.

  13. From the Add Entity list, select one or more of the following entity types to configure their field mappings:

    • Email

    • Postal

    • Phone

    • Custom

    • Person

    • Consent

    • Identifier

    Identity_Engine-Configure_Data_Preparation_Entity_Mapping-Add_Entity.png

    If you add an entity type, you must configure it or delete it. You cannot click Next if you have an empty entity configuration.

    Note

    If you do not map a column from your source data to a field, it will pass through and will not be used during the resolution process.

  14. Add any needed entities and then select Configure Entity.

    Identity_Engine-Configure_Data_Preparation_Entity_Mapping-Configure_Entity.png
    • Depending on the selected entity, the Entity Configuration section displays options to customize the entity to match your data source. You can map several columns to a field by selecting several options.

    • You can optionally enter a description and add transformations.

    Identity_Engine-Configure_Data_Preparation_Entity_Mapping-Entity_Configuration_Section.png

    For more information, see "Entity Mapping" and "Transformations".

  15. Click Next, review the summary of your completed data preparation process, and then click Confirm.

    Identity_Engine-Configure_Data_Preparation-Summary_Step.png

    Tip

    You can reuse the saved configuration for other processes of the same type by downloading it as a JSON file. From the Workflow Editor, click the three dots on the desired process and select Download.

File Schema

For the most common Identity Engine use cases, CSV format lets you copy and paste the file header as the file schema.

Note

For CSV format, the file schema must use the same delimiter that is used within the file. This allows the file header to be used as the file schema input. If header names are quoted, the quotes are removed automatically.
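The header handling described in this note can be approximated locally before you paste a schema. The following is a minimal sketch, not the product's actual parser: the function name parse_header is hypothetical, and it assumes header names are wrapped in double quotes at most.

```python
import csv
import io
from typing import List


def parse_header(header_line: str, delimiter: str = ",") -> List[str]:
    """Split a header line on the file's delimiter and drop quote characters.

    Hypothetical helper for checking a header locally; it is not part of
    Identity Engine.
    """
    if not header_line.strip():
        return []
    # csv.reader removes double quotes that wrap a header name.
    reader = csv.reader(io.StringIO(header_line), delimiter=delimiter, quotechar='"')
    return [name.strip() for name in next(reader)]


print(parse_header('"CCID"|"FIRSTNAME"|"EMAIL"', delimiter="|"))
# ['CCID', 'FIRSTNAME', 'EMAIL']
```

Pass the same delimiter that the data rows use (comma, pipe, tab, or semicolon) so the resulting names line up with the file's columns.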

If you provide a schema instead of headers, generate the JSON schema and enter it in the File Schema box:

Identity_Engine-Configure_Data_Preparation_File_Format_Configuration-Schema.png

Sample File Schema

{
  "type" : "struct",
  "fields" : [ {
    "name" : "CCID",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "FIRSTNAME",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "MIDDLENAME",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "LASTNAME",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "ADDRESS1",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "ADDRESS2",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "CITY",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "STATE",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "ZIP",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "EMAIL",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "PHONE",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "CARD_NBR",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "HOUSEHOLD_ID",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "COUNTRY_CODE",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "COUNTRY_NAME",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "GENDER",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "BIRTH_DT",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "START_EFF_DT",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "MAIL_ALLOWED_ID",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  } ]
}
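When every column is a nullable string, as in the sample above, a schema of the same shape can also be assembled without Spark. This is a minimal sketch under that assumption; build_string_schema is a hypothetical helper, and files with non-string types are better served by Spark's own schema inference.

```python
import json
from typing import List


def build_string_schema(column_names: List[str]) -> dict:
    """Return a struct schema with one nullable string field per column.

    Mirrors the layout of the sample file schema; assumes all columns
    are strings.
    """
    return {
        "type": "struct",
        "fields": [
            {"name": name, "type": "string", "nullable": True, "metadata": {}}
            for name in column_names
        ],
    }


schema = build_string_schema(["CCID", "FIRSTNAME", "EMAIL"])
print(json.dumps(schema, indent=2))
```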

Entity Mapping

Entity mapping enables you to map each field from source data to a predetermined set of fields supported within Identity Engine and apply data transformation functions to your fields.

To accurately configure entity mapping, analyze your file to understand the specifics of each field, determine their corresponding mappings, and apply needed transformations. If the input file contains multiple email addresses, phone numbers, or addresses, you need to map each of these to a new entity. You cannot map multiple email addresses, phone numbers, or addresses to the same entity.

If your input file has fields that are not needed for resolution, such as passthrough fields that are only required for your export, you do not need to map them. Passthrough fields can be retrieved by calling rawRecord in the export stage.

The default entities include:

| Entity Type | Field | Comment |
| --- | --- | --- |
| Consent | Opt-Out: The record will be excluded from exports. | Represents data subject requests (DSRs). For more information, see "Data Compliance". |
| Consent | Deletion: Indicates an order to delete the record. | |
| Consent | Last Updated: The timestamp when the record was last modified. | |
| Custom | Any raw data field | Define any raw data field to use in resolution. |
| Email | Email | Contains all consumer emails. You can have several emails mapped in the Email entity. |
| Email | Privacy Transformation | Contains the type of transformation applied to the email address. Options include MD5, SHA256, SHA1, and Plaintext. |
| Identifier | Primary ID | Unique ID for this data source. |
| Identifier | Customer ID | An ID that can be common to several data sources and can have duplicates within a data source. |
| Identifier | Household ID | An ID that can be common across several sources and can have duplicates within a source. |
| Person | First Name, Middle Name, Last Name, Suffix, Date of Birth, Title, Gender | Contains information directly related to an individual. |
| Phone | Phone Number | Contains all phone numbers. You can have several phone numbers mapped in the Phone entity. |
| Phone | Privacy Transformation | Contains the type of transformation applied to the phone number. Options include MD5, SHA256, SHA1, and Plaintext. |
| Postal | Address Line 1, Address Line 2, Address Line 3, Address Line 4, Address Line 5, Postal Code, City, Country | Contains all fields for a postal address. |
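When choosing a Privacy Transformation value for an Email or Phone entity, it can help to check what the values in the column look like. The sketch below guesses from the length of a hexadecimal digest (32 characters for MD5, 40 for SHA-1, 64 for SHA-256). It is a heuristic only, and guess_privacy_transformation is a hypothetical helper, not an Identity Engine function.

```python
import hashlib
import re

# Hex digest lengths for the hash options listed for Privacy Transformation.
_DIGEST_LENGTHS = {32: "MD5", 40: "SHA1", 64: "SHA256"}
_HEX_ONLY = re.compile(r"[0-9a-fA-F]+")


def guess_privacy_transformation(value: str) -> str:
    """Guess the Privacy Transformation option from the shape of a value."""
    value = value.strip()
    if _HEX_ONLY.fullmatch(value):
        # Hex strings of other lengths (for example, a phone number made of
        # digits) fall back to Plaintext.
        return _DIGEST_LENGTHS.get(len(value), "Plaintext")
    return "Plaintext"


email = "testuser@gmail.com"
print(guess_privacy_transformation(email))
print(guess_privacy_transformation(hashlib.sha256(email.encode()).hexdigest()))
```

Sampling several rows rather than one reduces the chance that an unusual plaintext value is mistaken for a digest.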

Transformations

You can optionally add a transformation to an entity to create new fields from the source data. These fields can be used in downstream resolution processes.

To add a transformation, click Add Transformation in the Entity Configuration side panel.

Identity_Engine-Configure_Data_Preparation_Entity_Mapping-Add_Transformation.png

The following example shows a configuration that defaults all email fields to null when a test email, such as testuser@gmail.com, exists. Transformations will be applied during the mapping process.

Identity_Engine-Configure_Data_Preparation_Entity_Mapping-Transformation_Options.png

For each transformation, the Transformations section provides the following options:

  • Transformation: A list of transformation functions to apply, such as Concatenate, Convert to Lowercase, and so on

  • Origin Field: Field to apply transformation to

  • Column Selection: Raw field to apply transformation from

  • Destination Field: Required only when creating a cascading transformation. Otherwise, leave this empty.

When you select certain transformation options, additional options are displayed such as the following:

  • Value: This field is displayed if you choose certain transformations, such as Set null if or Set Constant Value.

  • Pattern: This field is displayed if you choose certain transformations, such as Replace (Regex).

  • Replacement: This field is displayed if you choose certain transformations, such as Replace (Regex).

  • Algorithm: This field is displayed if you choose certain transformations, such as Hash.

Identity_Engine-Configure_Data_Preparation_Entity_Mapping-Transformation_Options.png

Transformation functions include:

| Transformation Functions | Descriptions | Arguments | Examples |
| --- | --- | --- | --- |
| Concatenate | Link fields together in a single string. If you select the Mandatory Parts check box, both fields must be present; otherwise, the transformation returns null. | Separator: String; Mandatory Parts: Boolean | Separator: ","; Input: "123", "456"; Output: "123,456" |
| Convert Unix (Date) | Convert a numeric Unix timestamp to date format. | | Input: "1710486000"; Output: Date(2024-03-15) |
| Convert Unix (Datetime) | Convert a numeric Unix timestamp to datetime format. | | Input: "1710514783"; Output: "2024-03-15 07:59:43" |
| Convert to Lowercase | Convert the entire field value to lowercase. | | Input: "TeSt"; Output: "test" |
| Convert to Uppercase | Convert the entire field value to uppercase. | | Input: "TeSt"; Output: "TEST" |
| Format using Pattern | Format a string by replacing {} placeholders with values from the specified columns in sequential order. | Output Format: String | Output Format: "test {} {}"; Input: "abc", "def"; Output: "test abc def" |
| Generate Unique ID | Generate a UUID. | | Output: "{uuid}" |
| Greatest | Return the highest value from the specified source fields. | | Input: "2023-01-01", "2022-01-01"; Output: "2023-01-01" |
| Hash | Generate a hash value based on the specified algorithm. Options include SHA256, SHA1, and MD5. | Algorithm: String | |
| Join | Join values from the specified source fields. | Separator: String | Separator: ","; Input: "col1", "col2"; Output: "col1,col2" |
| Least | Return the lowest value from the specified source fields. For example, calculate the earliest join date from two date fields. | | Input: "2023-01-01", "2022-01-01"; Output: "2022-01-01" |
| Left Pad Column Value | Add a specified character to the left of the string until the final string has the specified length. | Character: The character to be added; Output Length: The length of the final string | Input: "123"; Character: "0"; Output Length: 6; Output: "000123" |
| Parse Date | Parse a string to a date using a specified format, such as yyyy-MM-dd. | | |
| Remap | Match input values against a provided map. Returns the corresponding value for a match. For values without a match, you can choose to return null or keep the original value. | Map: Dict; blankMismatch: Boolean | Map: {"1":"M", "2":"F"}; blankMismatch: True; Input: [1, 2, 3]; Output: [M, F, null] |
| Remove Spaces Before/After | Remove all spaces from the start and the end of the string. | | Input: "    value  "; Output: "value" |
| Replace | Replace characters matching the specified value with another specified value. | From: String; To: String | From: "-"; To: " "; Input: "test-string"; Output: "test string" |
| Replace (Regex) | Replace any text that matches the regex pattern with the specified value. | Pattern: String; Replacement: String | Pattern: "[^a-zA-Z]"; Replacement: ""; Input: "abc123"; Output: "abc" |
| Set Constant Value | Set a constant value for the field based on the parameter value. | Value: String | Value: "US"; Output: "US" |
| Set null if | Return null if the input matches the specified value. Otherwise, keep the original value. | Value: String | Value: "N/A"; Input: "N/A"; Output: null |
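To preview how a few of these functions behave on sample values before configuring them, the logic can be approximated in plain Python. This is a rough sketch based only on the descriptions in the table. The function names are hypothetical, and edge-case behavior (marked as assumptions in the comments) may differ from Identity Engine.

```python
import re
from typing import Dict, List, Optional


def concatenate(parts: List[Optional[str]], separator: str,
                mandatory_parts: bool = False) -> Optional[str]:
    """Concatenate: with Mandatory Parts, any missing part yields None."""
    if mandatory_parts and any(part is None for part in parts):
        return None
    # Assumption: without Mandatory Parts, missing parts are skipped.
    return separator.join(part for part in parts if part is not None)


def left_pad(value: str, character: str, output_length: int) -> str:
    """Left Pad Column Value: pad on the left up to output_length."""
    # Assumption: values longer than output_length are returned unchanged.
    return value.rjust(output_length, character)


def remap(value: str, mapping: Dict[str, str], blank_mismatch: bool) -> Optional[str]:
    """Remap: look up the value; unmatched values become None or pass through."""
    if value in mapping:
        return mapping[value]
    return None if blank_mismatch else value


def replace_regex(value: str, pattern: str, replacement: str) -> str:
    """Replace (Regex): substitute every match of pattern."""
    return re.sub(pattern, replacement, value)


def set_null_if(value: Optional[str], match: str) -> Optional[str]:
    """Set null if: return None when the value equals match."""
    return None if value == match else value


print(concatenate(["123", "456"], ","))
print(left_pad("123", "0", 6))
print([remap(v, {"1": "M", "2": "F"}, blank_mismatch=True) for v in ["1", "2", "3"]])
# Case is preserved: only the non-letter characters are removed.
print(replace_regex("abc123", "[^a-zA-Z]", ""))
print(set_null_if("N/A", "N/A"))
```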

Cascading Transformations

If a field requires multiple transformations, you can click Add Transformation again to apply cascading transformations. For example, you can concatenate two values and then convert the result to uppercase.

To apply multiple transformations in a cascade:

  1. Use the Destination Field to specify a column name that will receive the output from the first transformation. The next transformation will then receive the value in its Column Selection and apply the second transformation.

  2. Add as many transformations as you need.

  3. Leave the Destination Field of the last transformation empty.
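The cascade described in these steps can be pictured as a chain in which each step writes to its Destination Field and the next step reads that column. The sketch below is only an illustration of that flow: apply_cascade, the intermediate column full_name_tmp, and the origin field full_name are hypothetical names, and the way the last step writes to the origin field is an assumption based on the steps above.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]
# (column selection, transformation, destination field)
Step = Tuple[List[str], Callable[..., Optional[str]], str]


def apply_cascade(row: Row, steps: List[Step], origin_field: str) -> Row:
    """Apply transformations in order, passing results through destination columns."""
    working = dict(row)  # leave the input row untouched
    for columns, transform, destination in steps:
        result = transform(*[working.get(column) for column in columns])
        # An empty destination marks the last step, which fills the origin field.
        working[destination or origin_field] = result
    return working


steps = [
    # Step 1: concatenate two columns into an intermediate column.
    (["FIRSTNAME", "LASTNAME"], lambda first, last: f"{first} {last}", "full_name_tmp"),
    # Step 2: read the intermediate column and convert it to uppercase.
    (["full_name_tmp"], lambda text: text.upper(), ""),
]

row = {"FIRSTNAME": "Ada", "LASTNAME": "Lovelace"}
result = apply_cascade(row, steps, origin_field="full_name")
print(result["full_name"])  # ADA LOVELACE
```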