Configure an Export Process
An export process extracts elements of your first-party graph into a file for consumption by downstream clients of the identity graph. The export process relies on SQL statements using Spark SQL syntax to specify the fields you want to extract and create different views of the output for systems that have distinct requirements or use cases.
When you run a workflow:
The export process extracts data from your first-party graph based on your configured SQL statements.
Any relevant privacy suppressions are applied based on the configured consent behavior.
The refined data is output into a Google Could Storage (GCS) bucket or table where it can be accessed by other processes.
Once you configure a data preparation process and a resolution process, you can configure one or more export processes. If the option to configure an export process is grayed out, it indicates that your data preparation process or resolution process have not yet been configured.
Once you drag and drop an export process onto your workflow canvas, a yellow warning icon to indicates that it is not yet configured.
Click the icon and select
. If this option is grayed out, it indicates that your data preparation process or resolution process have not yet been configured.Optionally, to reuse an existing configuration from another export process, click
and select a JSON file the desired configuration.The Configure Export Process wizard displays the Process Details step. If you've uploaded a JSON file, all the fields and options will be pre-defined which you can modify or leave as is.
Configure the following process details:
Process Name: Enter a name for this export process.
File Format: Select CSV, Text, or JSON format.
Note
Support for nested field types, such as arrays or structs, requires the JSON format because nested fields are not compatible with CSV.
Field Delimiter: If you selected the CSV format, select one of the following delimiters: comma, pipe, tab, or semicolon.
Add Headers: If you selected the CSV format, you can select this check box to include a heading row in the first line of the exported CSV file.
Export Mode: Choose one of the following modes
Full Export: Export the entire first-party graph.
Last File Related: For file-triggered builds, filter on records exclusively from the input data that triggered the build.
Graph Change: Filter on records that include changed Enterprise ID formations.
Destination Location: Specify the path to the GCP bucket and folder for your export file. For example,
gs://mybucket/myfolder/myexportfolder
File Partitioning: Select either "Auto (Spark partitioning)" or a value from 1 through 20.
File Prefix: If you specified a numeric partitioning value, you must define a file prefix.
(Optional) Description: Enter any important details that other users may need to know about this export configuration.
Click
. The Export Definition step displays the Export Definition (SQL) box. By default, it includes the following query statement:SELECT portraitId FROM genDf
Where
portraitId
refers to Enterprise IDs andgenDf
is the name of a temporary Spark view. For a list of fields in this view, see "genDf Fields".Enter a SQL query statement in the Export Definition (SQL) box that is based on the structure of your input data, the applicable genDf fields, your data extraction needs, and other requirements. For more details, see "Export Process SQL Query Guidelines".
Note
The Export Definition (SQL) box is not a SQL editor. It does not run queries and does not validate SQL code that you enter. You must compose valid SQL code using your preferred tools.
Click
. The Export Process Configuration wizard displays the Summary step.Review the configuration and then click
.Tip
You can reuse the saved configuration for other processes of the same type by downloading it as a JSON file. From the Workflow Editor, click the three dots on a desired process and select
.
Export Process SQL Query Guidelines
When configuring an export process, consider the following SQL query guidelines when preparing code to paste into the Export Definition (SQL) box:
Use Spark SQL syntax to create simple queries.
The Export Definition (SQL) box is not a SQL editor. It does not run queries and does not validate SQL code that you enter. You must compose valid SQL code using your preferred tools.
Avoid overloading the export process with query statements such as
SELECT * FROM genDf
Any fields referenced by your query must be included in the input data and mapped by the data preparation process.
Support for nested field types, such as arrays or structs, typically requires the JSON format because nested fields are not compatible with CSV format. This includes fields such as portraitIdHistory, emails, phones, addresses, and links.
Consider the data types of queried columns. For example, arrays and structs might require additional handling using the explode and transform functions before exporting to CSV format.
Test your queries against your BigQuery tables to verify its columns are available for the export.
Sample Query
SELECT portraitid, portraitIdHistory, rawRecord.CCID AS cid, rawRecord.COUNTRY_NAME AS country, rawRecord.GENDER AS gender FROM genDf
Sample Query for Retrieving Nested Values
SELECT portraitId, groupId AS sourceId, customerId AS customerId, struct( people.firstName, people.lastName ) AS people, emails.email AS emails, phones.phone AS phones, transform(links, x -> struct( x.matchConfidence, x.cl )) AS links FROM genDf
Sample Query for Flattening Nested Values
SELECT portraitId, explode(transform(phones, x -> phone)) phone FROM genDf
genDf Fields
The genDf
view provides the following fields for your queries:
portraitid: Enterprise IDs
portraitidHistory: An array with elements containing the history of Enterprise IDs, portraitidUpdateDate, and reasons
generationId: Identifies the build
rawRecord: A struct field for information extracted from the input data, which is structured based on the input schema. All fields from raw input data will persist under rawRecord.
recordId: Unique identifier associated with each PII or AbiliTec link per Enterprise ID
groupId: A three-digit string identifying a specific data source in a workflow
tenantName: A unique name for the tenant
customerId: The unique ID of the customer
people: A nested field containing information about people (if any), such as title, suffix, firstName, middleName, lastName, gender, dateOfBirth, yearOfBirth, firstSeenDate, lastSeenDate, and isValid
emails: An array of structs containing email information (if any), such as email, isotopes, isValid, isfake, firstSeenDate, lastSeenDate, and hashFormat
phones: An array of structs containing phone information (if any), such as phone, phoneType, internalNumber, isValid, firstSeenDate, lastSeenDate, and hashFormat
addresses: An array of structs containing information about physical addresses (if any), such as addressLine{1-4}, addressLineFull, postalCode, zip4, city, state, country, firstSeenDate, and lastSeenDate
links: An array of structs containing link information (if any), such as cl (for ConsumerLink), al (for AbiliTec link), hhl (for householdLink), matchComponents, and matchConfidence
Note
With the exception of raw input data under the rawRecord field, all these fields must be mapped by the data preparation process. For example, if the data preparation process does not map the postalCode field, it will not be available to the export process.
Sample Export Field Hierarchy
| -- portraitId: string | -- portraitIdHistory: array | | -- element: struct | | | -- portraitId: string | | | -- portraitIdUpdateDate: string | -- recordId: string | -- generationId: long | -- groupId: string | -- tenantName: string | -- customerId: string | -- rawRecord: struct | | -- field_1: string | | -- field_2: string | | -- field_3: string | | -- ... | -- people: struct | | -- firstName: string | | -- middleName: string | | -- lastName: string | | -- gender: string | | -- dateOfBirth: string | | -- yearOfBirth: long | -- emails: array | | -- element: struct | | | -- email: string | | | -- isFake: boolean | | | -- isValid: boolean | -- phones: array | | -- element: struct | | | -- phone: string | | | -- internalNumber: string | | | -- phoneType: string | | | -- isValid: boolean | -- addresses: array | | -- element: struct | | | -- addressLine1: string | | | -- addressLine2: string | | | -- addressLine3: string | | | -- addressLine4: string | | | -- addressLine5: string | | | -- city: string | | | -- country: string | | | -- postalCode: string | | | -- zip4: string | | | -- state: string | -- links: array | | -- element: struct | | | -- al: string | | | -- cl: string | | | -- hhl: string | | | -- matchConfidence: long | | | -- matchType: string