Skip to main content

Configure an Export Process

An export process extracts elements of your first-party graph into a file for consumption by downstream clients of the identity graph. The export process relies on SQL statements using Spark SQL syntax to specify the fields you want to extract and create different views of the output for systems that have distinct requirements or use cases.

When you run a workflow:

  1. The export process extracts data from your first-party graph based on your configured SQL statements.

  2. Any relevant privacy suppressions are applied based on the configured consent behavior.

  3. The refined data is output into a Google Could Storage (GCS) bucket or table where it can be accessed by other processes.

Once you configure a data preparation process and a resolution process, you can configure one or more export processes. If the option to configure an export process is grayed out, it indicates that your data preparation process or resolution process have not yet been configured.

Procedure. To configure an export process:
  1. Once you drag and drop an export process onto your workflow canvas, a yellow warning icon Identity_Engine-Workflow_Warning_Icon.png to indicates that it is not yet configured.

  2. Click the icon and select Configure. If this option is grayed out, it indicates that your data preparation process or resolution process have not yet been configured.

    Identity_engine_export_process.png

    Optionally, to reuse an existing configuration from another export process, click Upload and select a JSON file the desired configuration.

    The Configure Export Process wizard displays the Process Details step. If you've uploaded a JSON file, all the fields and options will be pre-defined which you can modify or leave as is.

    Identity_Engine-Export_Process_Configuration.png
  3. Configure the following process details:

    • Process Name: Enter a name for this export process.

    • File Format: Select CSV, Text, or JSON format.

      Note

      Support for nested field types, such as arrays or structs, requires the JSON format because nested fields are not compatible with CSV.

    • Field Delimiter: If you selected the CSV format, select one of the following delimiters: comma, pipe, tab, or semicolon.

    • Add Headers: If you selected the CSV format, you can select this check box to include a heading row in the first line of the exported CSV file.

    • Export Mode: Choose one of the following modes

      • Full Export: Export the entire first-party graph.

      • Last File Related: For file-triggered builds, filter on records exclusively from the input data that triggered the build.

      • Graph Change: Filter on records that include changed Enterprise ID formations.

    • Destination Location: Specify the path to the GCP bucket and folder for your export file. For example, gs://mybucket/myfolder/myexportfolder

    • File Partitioning: Select either "Auto (Spark partitioning)" or a value from 1 through 20.

    • File Prefix: If you specified a numeric partitioning value, you must define a file prefix.

    • (Optional) Description: Enter any important details that other users may need to know about this export configuration.

  4. Click Next. The Export Definition step displays the Export Definition (SQL) box. By default, it includes the following query statement:

    SELECT portraitId
    FROM genDf

    Where portraitId refers to Enterprise IDs and genDf is the name of a temporary Spark view. For a list of fields in this view, see "genDf Fields".

  5. Enter a SQL query statement in the Export Definition (SQL) box that is based on the structure of your input data, the applicable genDf fields, your data extraction needs, and other requirements. For more details, see "Export Process SQL Query Guidelines".

    Note

    The Export Definition (SQL) box is not a SQL editor. It does not run queries and does not validate SQL code that you enter. You must compose valid SQL code using your preferred tools.

  6. Click Next. The Export Process Configuration wizard displays the Summary step.

    Identity_Engine-Export_Process_Configuration-Summary.png
  7. Review the configuration and then click Confirm.

    Tip

    You can reuse the saved configuration for other processes of the same type by downloading it as a JSON file. From the Workflow Editor, click the three dots on a desired process and select Download.

Export Process SQL Query Guidelines

When configuring an export process, consider the following SQL query guidelines when preparing code to paste into the Export Definition (SQL) box:

  • Use Spark SQL syntax to create simple queries.

  • The Export Definition (SQL) box is not a SQL editor. It does not run queries and does not validate SQL code that you enter. You must compose valid SQL code using your preferred tools.

  • Avoid overloading the export process with query statements such as SELECT * FROM genDf

  • Any fields referenced by your query must be included in the input data and mapped by the data preparation process.

  • Support for nested field types, such as arrays or structs, typically requires the JSON format because nested fields are not compatible with CSV format. This includes fields such as portraitIdHistory, emails, phones, addresses, and links.

  • Consider the data types of queried columns. For example, arrays and structs might require additional handling using the explode and transform functions before exporting to CSV format.

  • Test your queries against your BigQuery tables to verify its columns are available for the export.

Sample Query

SELECT
  portraitid,
  portraitIdHistory,
  rawRecord.CCID AS cid,
  rawRecord.COUNTRY_NAME AS country,
  rawRecord.GENDER AS gender
FROM genDf

Sample Query for Retrieving Nested Values

SELECT
  portraitId,
  groupId AS sourceId,
  customerId AS customerId,
  struct(
    people.firstName,
    people.lastName
  ) AS people,
  emails.email AS emails,
  phones.phone AS phones,
  transform(links, x -> struct(
    x.matchConfidence,
    x.cl
  )) AS links
FROM genDf

Sample Query for Flattening Nested Values

SELECT
  portraitId,
  explode(transform(phones, x -> phone)) phone
FROM genDf

genDf Fields

The genDf view provides the following fields for your queries:

  • portraitid: Enterprise IDs

  • portraitidHistory: An array with elements containing the history of Enterprise IDs, portraitidUpdateDate, and reasons

  • generationId: Identifies the build

  • rawRecord: A struct field for information extracted from the input data, which is structured based on the input schema. All fields from raw input data will persist under rawRecord.

  • recordId: Unique identifier associated with each PII or AbiliTec link per Enterprise ID

  • groupId: A three-digit string identifying a specific data source in a workflow

  • tenantName: A unique name for the tenant

  • customerId: The unique ID of the customer

  • people: A nested field containing information about people (if any), such as title, suffix, firstName, middleName, lastName, gender, dateOfBirth, yearOfBirth, firstSeenDate, lastSeenDate, and isValid

  • emails: An array of structs containing email information (if any), such as email, isotopes, isValid, isfake, firstSeenDate, lastSeenDate, and hashFormat

  • phones: An array of structs containing phone information (if any), such as phone, phoneType, internalNumber, isValid, firstSeenDate, lastSeenDate, and hashFormat

  • addresses: An array of structs containing information about physical addresses (if any), such as addressLine{1-4}, addressLineFull, postalCode, zip4, city, state, country, firstSeenDate, and lastSeenDate

  • links: An array of structs containing link information (if any), such as cl (for ConsumerLink), al (for AbiliTec link), hhl (for householdLink), matchComponents, and matchConfidence

Note

With the exception of raw input data under the rawRecord field, all these fields must be mapped by the data preparation process. For example, if the data preparation process does not map the postalCode field, it will not be available to the export process.

Sample Export Field Hierarchy

| -- portraitId: string
| -- portraitIdHistory: array
| | -- element: struct
| | | -- portraitId: string
| | | -- portraitIdUpdateDate: string
| -- recordId: string
| -- generationId: long
| -- groupId: string
| -- tenantName: string
| -- customerId: string
| -- rawRecord: struct
| | -- field_1: string
| | -- field_2: string
| | -- field_3: string
| | -- ...
| -- people: struct
| | -- firstName: string
| | -- middleName: string
| | -- lastName: string
| | -- gender: string
| | -- dateOfBirth: string
| | -- yearOfBirth: long
| -- emails: array
| | -- element: struct
| | | -- email: string
| | | -- isFake: boolean
| | | -- isValid: boolean
| -- phones: array
| | -- element: struct
| | | -- phone: string
| | | -- internalNumber: string
| | | -- phoneType: string
| | | -- isValid: boolean
| -- addresses: array
| | -- element: struct
| | | -- addressLine1: string
| | | -- addressLine2: string
| | | -- addressLine3: string
| | | -- addressLine4: string
| | | -- addressLine5: string
| | | -- city: string
| | | -- country: string
| | | -- postalCode: string
| | | -- zip4: string
| | | -- state: string
| -- links: array
| | -- element: struct
| | | -- al: string
| | | -- cl: string
| | | -- hhl: string
| | | -- matchConfidence: long
| | | -- matchType: string