
Create a Job

From the Job Management page in Safe Haven, you can click Create a Job to configure a PySpark or Python job that specifies a file containing the Python code that you want to schedule or run immediately. Once you create a job, it appears on the Job Management page, where you can view its status, see the next time it will run, and disable, enable, or delete it.

Before you create a job, do the following:

  • Create the Python code file that you want to run in your code repository bucket.

  • Determine any optional Dataproc arguments.

  • Decide when the job should run and at what interval if it is a recurring job.
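For reference, a minimal code file for the first prerequisite might look like the following sketch. The file name, logic, and argument handling are illustrative, not required by Safe Haven:

```python
# hello_job.py -- hypothetical example of a code file you might upload
# to your code repository bucket.
import sys
from datetime import datetime, timezone


def main(args):
    # Log a UTC timestamp so each scheduled run is visible in the job output.
    print(f"Job started at {datetime.now(timezone.utc).isoformat()}")
    print(f"Received arguments: {args}")


if __name__ == "__main__":
    main(sys.argv[1:])
```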

To create a job:
  1. In the left navigation bar, click Job Management.


    The Job Management page appears.

  2. Click Create a Job.

    The Create a Job page displays the Details step.

  3. Enter a unique name for the job and a description so that you can remember what the job does, and then click Next.

    The Job Settings step appears.

  4. From the Job Type list, select PySpark (the default option) or Python.

  5. If you select PySpark, select a Dataproc cluster size from the Cluster list depending on how resource-intensive your PySpark job is:

    • Small: n1-standard-4

    • Medium: n1-standard-8

    • Large: n1-standard-16

    Similarly, if you select Python, select a small, medium, or large server type.

    For more information, see Google Cloud's "N1 machine series."

  6. (Optional) If your job requires additional code files to run, click Add Code File and browse to the file that you want to run. The file must exist in your code repository bucket.

  7. (Optional) In the Arguments box, enter a comma-delimited list of Dataproc Spark job arguments to pass to the main class and to any additional Python files.
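    The arguments you enter are passed to your code file on the command line, so your script can read them with standard tools such as argparse. A small sketch, with hypothetical argument names and values:

    ```python
    import argparse

    # Hypothetical: if the Arguments box contained
    #   --input,gs://bucket/in.csv,--dry-run
    # the job would receive them as command-line arguments, which
    # argparse can parse (names and values are illustrative):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input")
    parser.add_argument("--dry-run", action="store_true")

    opts = parser.parse_args(["--input", "gs://bucket/in.csv", "--dry-run"])
    print(opts.input)    # gs://bucket/in.csv
    print(opts.dry_run)  # True
    ```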

  8. (Optional) In the Additional File box, enter the name of an additional file needed to run your job. The file must exist in your code repository bucket. To specify more files, click Add Additional Files.

  9. (Optional) If your job requires any non-standard Python packages to run, enter each package in <package_name>==<version_number> syntax in the Additional Packages box. For more information, see "Supported Python Packages."
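    For example, pinned package entries follow the same form as a pip requirements line. The package names and versions below are illustrative, and the sanity check is only a sketch of the expected syntax:

    ```python
    import re

    # Hypothetical entries for the Additional Packages box, one package
    # per entry, in <package_name>==<version_number> form:
    packages = ["numpy==1.24.4", "requests==2.31.0"]

    # Simple sanity check of the pinned-version syntax:
    SPEC = re.compile(r"^[A-Za-z0-9._-]+==\d+(\.\d+)*$")
    for spec in packages:
        print(spec, "ok" if SPEC.match(spec) else "invalid")
    ```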

  10. Click Next. The Schedule step appears.

  11. From the Repeats Every list, select an interval at which you want to repeat the job.


    If you want to specify a custom schedule, you can enter a cron schedule expression to instruct the cron utility to run your PySpark job on a specified day, at a specified time, and at a recurring interval. A cron expression is a string of five space-delimited fields, each containing integers or special characters (, - * /), in the following order:

    • Minute (0−59)

    • Hour (0−23)

    • Day of the month (1−31)

    • Month of the year (1−12)

    • Day of the week (0−6 with 0=Sunday)

    For more information, see Google Cloud's "Configuring cron job schedules."
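    As a sketch of the five-field format described above, here are some example expressions with their meanings (the schedules are illustrative, and the validation pattern is a simplified approximation of full cron syntax):

    ```python
    import re

    # Hypothetical example cron expressions
    # (minute, hour, day of month, month, day of week):
    examples = {
        "0 2 * * *": "every day at 02:00",
        "30 8 * * 1": "every Monday at 08:30",
        "*/15 * * * *": "every 15 minutes",
    }

    # Simplified check: each of the five fields may be *, a number,
    # a range (n-m), a comma list, optionally with a /step suffix.
    FIELD = r"(?:\*|\d+(?:-\d+)?)(?:,\d+(?:-\d+)?)*(?:/\d+)?"
    CRON = re.compile(rf"^{FIELD}(?: {FIELD}){{4}}$")

    for expr, meaning in examples.items():
        print(f"{expr!r:16} -> {meaning} ({'valid' if CRON.match(expr) else 'invalid'})")
    ```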

  12. As needed, enter the start run time and time zone options.


    Job Management currently supports only the default time zone of UTC+2 (Western Europe).

  13. Click Next, review your job information, and then click Create.

    You are returned to the Job Management page and your job displays the Processing status.

    If you want to disable your job at any time, you can click its Disable/Enable switch.