BigQuery Guidelines

This topic provides tips and technical considerations for Analytics Environment's BigQuery.

BigQuery Limitations

Both Google and LiveRamp set limitations on the use of BigQuery.

Google has some hard limitations on queries, tables, views, and user-defined functions to reduce spikes in resource use and maintain availability. For more information, see "Limit Partition Values" below and Google's "Quotas and Limits".

Google limits queries to a 6-hour timeout in BigQuery. Timeout errors typically indicate an inefficient query, which you can resolve by improving query efficiency (for example, by filtering on the _PARTITIONTIME pseudocolumn, as shown in "SQL Guidelines" below). For more information, see Google's "Query Jobs".

To manage security and performance, LiveRamp sets the following limitations on top of Google's:

  • You can only access BigQuery from Analytics Environment's virtual machine desktop.

  • Only users with certain personas can access data in Analytics Environment's BigQuery. For information, see "Applications Available".

  • You only have data permissions to specific BigQuery datasets.

  • Certain BigQuery features, such as CSV exports, are disabled.

  • For some large tables (such as retail sales transaction tables), you cannot select from the entire table without first specifying a date range.

Use Coderepo Buckets

A coderepo bucket is a Google Cloud Storage (GCS) bucket designated for storing code repositories and managing code files, such as PySpark scripts needed for running Spark jobs on a Dataproc cluster.

You can use coderepo buckets to store scripts and then reference them when submitting jobs to the cluster. You can also use the command line to perform the following operations:

  • List the contents of a bucket

  • Copy to the coderepo bucket

  • Copy from the coderepo bucket

For example, in JupyterHub, you can push your PySpark code or a Jupyter notebook to a coderepo bucket using the gsutil cp command, which copies local files to the specified GCS bucket. This enables you to share and manage code files within your team, and Dataproc jobs can then use those code files.
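
A minimal sketch of these operations using Python's subprocess module from a Jupyter notebook or terminal is shown below. The bucket and file names are placeholders; substitute your organization's coderepo bucket and your own files.

    import subprocess

    # Placeholder bucket name; use your organization's coderepo bucket
    bucket = "gs://your_org_coderepo"

    # List the contents of the coderepo bucket
    subprocess.run(["gsutil", "ls", bucket], check=True)

    # Copy a local PySpark script or notebook to the coderepo bucket
    subprocess.run(["gsutil", "cp", "my_spark_job.py", f"{bucket}/my_spark_job.py"], check=True)

    # Copy a file from the coderepo bucket back to the local working directory
    subprocess.run(["gsutil", "cp", f"{bucket}/my_spark_job.py", "."], check=True)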

From the Jupyter terminal, you can also use the command-line interface to perform coderepo operations. For information, see "Use the Command-Line Interface".

SQL Guidelines

  • Avoid selecting all columns (SELECT *): In the SELECT clause, select only the columns you need, and filter on partitions when applicable. If the table is large, check how much data the query will read before running it (see the sketch after this list).

  • Be careful with joins on multi-occurrence keys: A query that joins on duplicate keys (an unbalanced join) can take a long time to run, or eventually fail, depending on the amount of data manipulated. A typical example is a join against a huge web user activity table containing duplicate RampIDs. To resolve this issue, you may need to drop some duplicates, such as dropping all keys with more than 15 occurrences (see the sketch after this list).

  • Limit partition values: To improve query performance, your BigQuery SQL statements should limit the partition values scanned by filtering on the _PARTITIONTIME pseudocolumn. Using _PARTITIONTIME allows you to limit the query to specific partitions, reducing the volume of data scanned and lowering costs. BigQuery only scans the relevant partitions, which results in faster query execution because fewer rows need to be processed.

    For partitioned tables, _PARTITIONTIME acts as a column representing the partition's date. You can use it directly in queries for date-based filtering instead of parsing a timestamp or date column in the table.

    When using _PARTITIONTIME in a WHERE clause, the system will only bring in data that fits the specified time window, rather than first joining all of the data and then reducing the amount of data after it is joined together.

    Example WHERE clause using _PARTITIONTIME to optimize a query for data from the last year:

    FROM `lranalytics-us.organization_name_wh.customer-profiles_segment_reference`
    WHERE TIMESTAMP_TRUNC(_PARTITIONTIME, DAY) > TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH))
      AND segment_name IN (...

    Example WHERE clause using _PARTITIONTIME to limit data to partitions after a specific date:

    FROM `lranalytics-us.organization_name_wh.customer-profiles_segment_reference`
    WHERE TIMESTAMP_TRUNC(_PARTITIONTIME, DAY) > TIMESTAMP("2024-12-04")
      AND segment_name IN (...

    For more information on _PARTITIONTIME, see Google's "Query Partitioned Tables".

  • Incorrect number of columns returned or inserted: If the underlying tables have been updated (for example, a column has been added), run your query manually and use SELECT * EXCEPT(...) to bring in the full table while removing specific columns from the result.
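
As referenced in the first two guidelines above, the sketch below uses the BigQuery Python client to dry-run a query (to check how much data it would read) and then run a version that selects only the needed columns, filters on _PARTITIONTIME, and drops keys with more than 15 occurrences. The table and column names are hypothetical placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical table and column names; substitute your own
    sql = """
    SELECT ramp_id, event_date, event_type
    FROM `lranalytics-us.organization_name_wh.web_activity`
    WHERE TIMESTAMP_TRUNC(_PARTITIONTIME, DAY) > TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 3 MONTH))
    -- Keep only keys with 15 or fewer occurrences before using this result in a join
    QUALIFY COUNT(*) OVER (PARTITION BY ramp_id) <= 15
    """

    # Dry run: estimate how much data the query would scan without running it
    dry_run_job = client.query(
        sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    )
    print(f"This query would process {dry_run_job.total_bytes_processed / 1e9:.2f} GB")

    # Run the query only if the estimated scan size is acceptable
    for row in client.query(sql).result():
        print(row)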

For more information, see Google's "Optimize Query Computation".

Find Errors with Scheduled Queries in BigQuery

To view errors for BigQuery jobs:
  1. Open BigQuery in Analytics Environment.

  2. Select Scheduled queries.

  3. In the list of scheduled queries, sort by the query status to display all unsuccessful queries at the top of the list.

  4. Click the name of the failed query you're interested in to view the recent runs of that scheduled query, and identify any errors that need to be corrected in the code or any modifications that would improve query efficiency (such as adding filtering).

Use the BigQuery and GCS Command Line from Your Cluster

Encapsulating BigQuery and GCS command-line operations in Python subprocess calls provides flexibility, automation, and better integration with other tools and libraries, enhancing the overall efficiency and maintainability of data workflows.

This approach also allows you to automate and script complex workflows, which is useful for managing repetitive tasks or for specifying steps in a script.

Some operations might be available only through the command line and not through the Jupyter or BigQuery Python client. By using Python subprocesses, you can use command-line tools that might not be available with the Python client.

Python also provides error handling and logging: you can use subprocesses to capture command-line output, handle errors, and log execution details for debugging and monitoring, as shown in the sketch below.
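
A minimal sketch of this pattern, assuming the bq and gsutil command-line tools are available on the cluster, is shown below. The table and bucket names are placeholders.

    import logging
    import subprocess

    logging.basicConfig(level=logging.INFO)

    def run_command(args):
        """Run a command-line tool, capture its output, and log any failure."""
        result = subprocess.run(args, capture_output=True, text=True)
        if result.returncode != 0:
            logging.error("Command %s failed: %s", " ".join(args), result.stderr)
            raise RuntimeError(f"Command failed: {' '.join(args)}")
        logging.info("Command %s succeeded", " ".join(args))
        return result.stdout

    # Run a BigQuery query from the command line (table name is a placeholder)
    print(run_command([
        "bq", "query", "--use_legacy_sql=false",
        "SELECT COUNT(*) FROM `lranalytics-us.organization_name_wh.customer-profiles_segment_reference` "
        "WHERE TIMESTAMP_TRUNC(_PARTITIONTIME, DAY) > TIMESTAMP('2024-12-04')",
    ]))

    # List the contents of a GCS bucket (bucket name is a placeholder)
    print(run_command(["gsutil", "ls", "gs://your_org_coderepo"]))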
