Clean Compute on Apache Spark

Clean Compute is the mechanism for executing your custom Python code against your own data and partner data from within a clean room while respecting the clean room's architectural integrity. When you run a Clean Compute question, Clean Compute enables multi-node processing via customizable Spark jobs.

Note

  • Clean Compute is only available for Hybrid Confidential Compute clean rooms and Hybrid clean rooms.

  • Clean Compute is in closed beta. To request this feature, contact your LiveRamp representative.

Overall Steps

Perform the following overall steps to create a Python-based Spark job as a question in a clean room:

  1. Prepare your custom code as a Wheel package that follows the provided template (see "Prerequisites").

  2. Add the credentials for the artifact store where the package is stored.

  3. Create the data connection for the Clean Compute package.

  4. Configure the datasets for Clean Compute questions.

  5. Create the Clean Compute Spark question.

  6. Assign datasets to the question.

  7. Run the question.

For information on performing these steps, see the sections below.

For information on the limitations of the closed beta, see the "Limitations" section below.

Prerequisites

To create Python-based Spark jobs as questions in your clean rooms, follow the guidelines listed below when preparing your code:

  • The code must be available to the clean room in a Wheel package format.

  • The code must follow the format of the included files in the template.zip package provided by your LiveRamp representative. This includes:

    • requirements.txt: Lists the packages required to run your custom code; these are referenced while the job runs.

    • setup.py: Defines the package metadata (including name and version) used to generate the Wheel file.

    • transformation.py: Referenced in the setup.py file. This file must be named transformation.py.

    • custom_job directory with an optional custom_code.py file: Either include your business logic directly in the transformation.py script or invoke the custom_code.py file from the transformation.py script.

    • __init__.py files

    • data_handler.py: Enables clean rooms to create dataframes from configured datasets.

  • The code must follow the templatized structure we've provided in the template.zip package, including:

    • Leave the data_handler.py file as-is.

    • For the transformation.py file (a hedged sketch follows this list):

      • Include the Transformation class based on the provided template.

      • Include a def transform() function that wraps the core business logic.

        • Define ${dataset} dataframes to account for the datasets. These correspond to macros in the question definition and act as placeholders for assigned datasets. Define them as you expect the macros to appear during dataset assignment.

        • In the example package, these are referred to as the partner and owner variables.

      • The result variable defines the desired output of the job. It can be computed directly or via the custom_code.py script, as illustrated in the template.

      • Include the core business logic in the def transform() function, acting on the defined dataset dataframes, or reference it via the custom_code option.

      • If you would like to include accompanying metrics on model performance as an output file (for downstream consumption or consumption from the View Reports screen), refer to the example template's use of the has_multiple_outputs Boolean and include self.data_handler.save_output in your transformation.py file. You will also need to indicate that the code results in an output file when configuring Outputs in the Question Builder.

      • End with self.data_handler.write(result) to ensure the output of the job is written to memory in the Spark session.

  • If you wish to generate both a results dataframe and additional file-based outputs such as model metrics in JSON format, you must write the output object to clean room storage using self.data_handler.save_output(output_file_path). An example of this is outlined in the template README.

  • If you want to use dynamic run-time parameters for Clean Compute questions, you must use the run_params function of the DataHandler class and then retrieve the parameters at run-time as part of your Transformation class. See below for more details on configuring run-time parameters for Clean Compute questions.

  • Each code artifact must be its own independent package. This means that if you wish to run independent jobs, these should each be included in their own artifact store data connection and should respect the above requirements.

    Do not update the name of the .whl file after it is generated. If you would like to make changes, do that in the setup.py file and then re-generate the .whl file.

    Note

    If you want to dictate the name of the .whl file, you can do that before generating the .whl by specifying the setup.py name and version parameters, as shown in the setup.py example following this list.

  • Determine how your code will be translated to dataset and field macros within the clean room before configuring the package. A good rule of thumb is that the dataframes used will correspond to datasets or dataset macros, and columns will correspond to field macros.
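
For orientation, the following is a minimal sketch of how a transformation.py file might look when it follows the structure above. The exact class and method signatures come from the template.zip package provided by your LiveRamp representative, so treat the dataframe accessor name, the macro names, and the column names below as illustrative assumptions rather than the template's exact API.

    # transformation.py (sketch only; follow the actual template.zip for exact signatures)
    from data_handler import DataHandler  # assumed import path; leave data_handler.py as-is


    class Transformation:
        def __init__(self, data_handler: DataHandler):
            # The data handler creates dataframes from the datasets configured in the clean room.
            self.data_handler = data_handler

        def transform(self):
            # ${dataset} placeholders: these must match the dataset macros you create in the
            # question definition (the template refers to them as the owner and partner variables).
            owner = self.data_handler.get_dataframe("${owner_dataset}")      # assumed accessor name
            partner = self.data_handler.get_dataframe("${partner_dataset}")  # assumed accessor name

            # Dynamic run-time parameters, if any, can be read via self.data_handler.run_params
            # (see the run-time parameter example later in this article).

            # Core business logic, either inline here or delegated to custom_job/custom_code.py.
            result = owner.join(partner, on="${match_key}", how="inner")  # ${field} macro for the join column

            # Optional extra outputs (e.g., model metrics in JSON) for the View Reports screen,
            # used together with the template's has_multiple_outputs Boolean.
            # self.data_handler.save_output("model_metrics.json")

            # Always end by writing the result so the job's output is captured in the Spark session.
            self.data_handler.write(result)

Likewise, if you want to control the generated .whl file name (per the Note above), the name and version parameters in setup.py are what determine it. A sketch, using a hypothetical package name:

    # setup.py (sketch): name and version determine the generated .whl file name,
    # e.g., my_clean_compute_job-1.0.0-py3-none-any.whl
    from setuptools import setup, find_packages

    setup(
        name="my_clean_compute_job",  # hypothetical package name
        version="1.0.0",
        packages=find_packages(),     # picks up the custom_job directory and its __init__.py files
        py_modules=["transformation", "data_handler"],  # assumption: top-level modules from the template
    )

The Wheel itself is typically generated with a standard build command such as python setup.py bdist_wheel or python -m build; do not rename the resulting .whl file afterward.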

Add the Credentials

Procedure. To add credentials:
  1. Determine where you want to store the Clean Compute Wheel package and place it in the appropriate bucket. This will be your artifact store for the Clean Compute package:

    • Amazon Web Services Simple Storage Service (AWS S3)

    • Azure Data Lake Storage (ADLS)

    • Google Cloud Storage (GCS)

  2. From the LiveRamp Clean Room navigation pane, select Data Management > Credentials.

  3. Click Add Credential.

  4. Enter a descriptive name for the credential.

  5. For the Credentials Type, select either GCS or S3 (depending on what you chose in step 1) for access to the artifact store.

  6. Click Save Credential.

Create the Data Connection

Note

You will need to create one data connection per Clean Compute package that you want to use in questions.

Procedure. To create the data connection:
  1. From the LiveRamp Clean Room navigation pane, select Data Management > Data Connections.

  2. From the Data Connections page, click New Data Connection.

  3. From the New Data Connection screen, select "Artifact Store".

  4. Select the credentials you created in the "Add the Credentials" section above.

  5. Enter the following metadata for the data source:

    • Name: Enter the name that will identify the Clean Compute package when authoring questions.

      Note

      This name is displayed when assigning the "dataset" to the question.

    • Category: Enter a category of your choice.

    • Dataset Type: Select WHEEL.

      Note

      JAR files are not currently supported.

    • Artifact Location: Provide the bucket path for where the file will be stored.

    • Default Spark Configuration: If you would like to specify a default job configuration for questions using this data connection, you can optionally include it as semicolon-separated specifications. These are based on Spark configuration properties (see the Spark configuration property documentation) and can also be used to define environment variables.

      An example format for this configuration is spark.property_1=value;spark.property_2=value2 (a fuller example follows this procedure).

      Note

      If left blank, LiveRamp uses a bare-bones default configuration to execute the job.

  6. Review the data connection details and click Save Data Connection.

    All configured data connections can be seen on the Data Connections page.
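
For example, a default configuration that requests larger executors and sets an environment variable for the job might look like the following (the property names are standard Spark configuration properties; the values are illustrative only):

    spark.executor.memory=8g;spark.executor.cores=4;spark.driver.memory=4g;spark.executorEnv.MY_ENV_VAR=some_value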

Configure Datasets for Clean Compute Questions

Procedure. To configure datasets for Clean Compute questions:
  1. Make sure all parties have configured the datasets required to complete your Spark job in the specified clean room.

  2. Make sure to configure the Artifact Store-based data connection that you created in the sections above as a dataset within the clean room.

Create a Clean Compute Spark Question

Procedure. To create a Clean Compute Spark question:
  1. From your organization or from within the clean room in which you are creating the Clean Compute question, go to Questions.

  2. Click New Question.

  3. Enter values for the following metadata fields:

    • Question Name: Enter a descriptive name that will make it easy for internal and external users to quickly understand the context of the question.

    • Category: Consider a naming convention for ease of search.

    • Description (optional): You can create descriptions for different audiences, including business, data science, and technical users. Descriptions explain how the report should be read, provide insights derived from the report, and can be tailored for different user profiles.

    • Tags (optional): Add a tag to help with question categorization and filtering of questions. To add a tag, type the desired value and press Enter.

    • Question Type: Select User List or Analytical Question. Once you save a question, you cannot change its question type.

  4. Select the check box for "Clean Compute on Spark Question".

  5. For Question Type, select Analytical Question.

  6. Click Next.

  7. Create the appropriate macros:

    • If the question supports specifying datasets, create macros for the relevant datasets.

    • If the question is an organization-level templated question, create macros for the wheel file and the expected datasets.

    • Be sure to create dataset-type macros for expected partner dataset inputs.

    Note

    These macros must match those created in the def transform() function you specified in the "transformation.py" file. You should also include field macros where appropriate.

  8. (Optional) Include run-time parameters below the dataset and field macros. Run-time parameters are used to insert values for dynamic variables in your code at run-time. If you use run-time parameters, label them in the UI with the same names and data types that your code expects (see the example following this procedure).

  9. For Output Format, select the desired format:

    • Report: Appears in the UI based on the dimensions and measures you configure (typically related to a dataframe's contents).

    • Data Location: This will write the output to a location managed by LiveRamp and is typically used for binary outputs. This option operates similarly to list questions in that the output can then be set up for export via an export channel and picked up for use elsewhere in your systems as desired.

    Note

    If you included extra outputs such as model metrics in your code, toggle Enable Multiple Outputs on the right-hand menu. Supported output formats include JSON, PDF, and PNG.

  10. Click Create.
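
As an example of step 8, suppose your transformation.py reads a run-time parameter via the DataHandler run_params mechanism mentioned in the prerequisites. The parameter name and data type you declare in the Question Builder should match what the code expects. The snippet below assumes a hypothetical integer parameter named lookback_days and assumes run_params exposes the declared parameters dictionary-style:

    # Inside Transformation.transform() (sketch): read the run-time parameter declared
    # in the Question Builder as "lookback_days" with an integer data type.
    lookback_days = int(self.data_handler.run_params.get("lookback_days", 30))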

Assign Datasets to a Clean Compute Spark Question

This follows the same process as required for other question types. Make sure to assign your wheel file data connection as part of the process.

When you assign datasets, you can override the default Spark configuration from the Artifact Store data connection. This is done at the dataset level if you have additional knowledge about the scale of the dataset which would alter requirements from the default configuration. Once saved, the specified Spark configuration will be used for all runs of the corresponding question.

Procedure. To change the Spark configuration from the default:
  1. Go to Manage Datasets for the relevant question.

  2. In the Assign Datasets step, for the Wheel dataset type, click the pencil icon to edit the default Spark configuration.

  3. Edit as necessary and save.

    Note

    Editing the default Spark configuration is optional and not typically required.

Run the Clean Compute Spark Question

Request a report as you would for any other question type.

Limitations

LiveRamp must maintain the principles of clean room architecture, regardless of the code run on the data. This means we cannot allow all possible Spark job configurations or desired code inputs. We limit the types of configurations that can be run as follows:

  • To mitigate risks associated with data extraction, logging initiated by the Clean Compute code is not permitted.

  • To ensure the mapping between partners' RampIDs cannot be leaked back to the question author, transcoding is not enabled for Clean Compute questions.

  • Crowd size (k-min) cannot be enforced or verified on arbitrary code. If partners have a crowd-size requirement, the code author must enforce it in their code. However, you can view configured k-min values on the Configuration tab of the Question Details page (see "View Question Details").