BigQuery

The BigQuery query runner allows you to connect to and query Google BigQuery datasets, with support for authentication via service accounts and Application Default Credentials (ADC).

Configuration

The BigQuery query runner supports the following configuration:

{
  "projectId": "your-project-id",
  "jsonKeyFile": "base64-encoded-service-account-key",
  "loadSchema": true,
  "useStandardSql": true,
  "location": "US",
  "totalMBytesProcessedLimit": 1000,
  "maximumBillingTier": 1,
  "userDefinedFunctionResourceUri": "gs://bucket/date_utils.js,gs://bucket/string_utils.js",
  "useQueryAnnotation": false
}

Configuration Schema

Parameter                      | Type    | Required | Description
projectId                      | String  | Yes      | Google Cloud project ID
jsonKeyFile                    | String  | No       | Base64-encoded service account key JSON (if omitted, ADC is used)
loadSchema                     | Boolean | No       | Whether to load schema information
useStandardSql                 | Boolean | No       | Whether to use Google Standard SQL (true) or Legacy SQL (false) (default: true)
location                       | String  | No       | BigQuery processing location (e.g., "US", "EU")
totalMBytesProcessedLimit      | Number  | No       | Maximum data processed in MB for query execution
maximumBillingTier             | Number  | No       | Maximum billing tier for query execution
userDefinedFunctionResourceUri | String  | No       | URIs of JavaScript UDFs, comma-separated
useQueryAnnotation             | Boolean | No       | Whether to enable query annotation (default: false)

Example Usage

from legion_query_runner.query_runner import QueryRunner

# Initialize with BigQuery configuration
bq_config = {
    "projectId": "your-gcp-project",
    "useStandardSql": True,
    "location": "US"
}

# Create a QueryRunner instance
runner = QueryRunner('big_query', bq_config)

# Test connection
runner.test_connection()

# Execute a query
results = runner.run_query("SELECT * FROM `project.dataset.table` LIMIT 10")

Authentication

The BigQuery query runner supports two authentication methods:

  1. Service Account Key - provide a base64-encoded service account key JSON in jsonKeyFile
  2. Application Default Credentials (ADC) - omit jsonKeyFile to use ADC

To create a base64-encoded service account key:

# Generate base64-encoded key
cat service-account-key.json | base64 -w 0
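
The same encoding can be done in Python with the standard library. The sketch below builds a configuration from an encoded key; the key file path is a placeholder:

import base64

# Read the service account key and base64-encode it for the jsonKeyFile setting
with open("service-account-key.json", "rb") as f:
    encoded_key = base64.b64encode(f.read()).decode("ascii")

bq_config = {
    "projectId": "your-gcp-project",
    "jsonKeyFile": encoded_key,  # omit this key entirely to fall back to ADC
}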

Type Mapping

The BigQuery query runner maps BigQuery data types to standardized types:

BigQuery Type | Legion Type
INTEGER       | INTEGER
FLOAT         | FLOAT
BOOLEAN       | BOOLEAN
STRING        | STRING
TIMESTAMP     | DATETIME
DATETIME      | DATETIME
DATE          | DATE
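
For example, a query returning each of these types produces columns reported under the mapped types. The query below is illustrative; the exact result structure depends on the QueryRunner result format:

# int_col -> INTEGER, float_col -> FLOAT, bool_col -> BOOLEAN,
# str_col -> STRING, ts_col -> DATETIME, date_col -> DATE
results = runner.run_query(
    "SELECT 1 AS int_col, 1.5 AS float_col, TRUE AS bool_col, "
    "'abc' AS str_col, CURRENT_TIMESTAMP() AS ts_col, CURRENT_DATE() AS date_col"
)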

Query Optimization

To optimize query performance and reduce costs:

  1. Use the totalMBytesProcessedLimit parameter to set an upper bound on data processed
  2. The query runner will first perform a dry run to estimate data processed and will reject queries exceeding the limit
  3. Use the maximumBillingTier parameter to control billing tier escalation for complex queries (see the example configuration below)
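
A sketch of a configuration that applies both limits (the values are illustrative, not recommendations):

# Reject queries estimated (via dry run) to scan more than ~1000 MB,
# and cap complex queries at billing tier 1
bq_config = {
    "projectId": "your-gcp-project",
    "useStandardSql": True,
    "totalMBytesProcessedLimit": 1000,
    "maximumBillingTier": 1,
}
runner = QueryRunner('big_query', bq_config)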

Schema Retrieval

The BigQuery query runner extracts schema information using INFORMATION_SCHEMA views:

# Get schema for all tables (only if loadSchema is true)
schema = runner.get_schema()

# Get columns for a specific table
columns = runner.get_table_columns("dataset.table")

# Get column types for a specific table
column_types = runner.get_table_types("dataset.table")
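
Schema retrieval depends on the loadSchema flag in the configuration; a minimal sketch:

# Enable schema loading so get_schema() can return table and column metadata
bq_config = {
    "projectId": "your-gcp-project",
    "loadSchema": True,
}
runner = QueryRunner('big_query', bq_config)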

User-Defined Functions

To use JavaScript user-defined functions (UDFs):

  1. Upload your JavaScript UDF files to Google Cloud Storage
  2. Specify the GCS URIs in the userDefinedFunctionResourceUri parameter as a comma-separated string, as in the sketch below
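
For example (the bucket and file names are placeholders):

# Reference JavaScript UDFs stored in Cloud Storage; separate multiple URIs with commas
bq_config = {
    "projectId": "your-gcp-project",
    "useStandardSql": True,
    "userDefinedFunctionResourceUri": "gs://your-bucket/date_utils.js,gs://your-bucket/string_utils.js",
}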

Implementation Details

The BigQuery query runner uses the Google Cloud client libraries to connect to BigQuery. These libraries must be installed for the query runner to be enabled:

pip install google-api-python-client google-auth google-cloud-bigquery
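
A quick sanity check that the dependencies are importable (this is only a check, not part of the query runner's API):

# Import names provided by the three packages above
import googleapiclient.discovery   # from google-api-python-client
import google.auth                 # from google-auth
from google.cloud import bigquery  # from google-cloud-bigquery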

The query runner includes support for:

  • Query cost estimation and limiting
  • Cancellation of running queries
  • Tracking of data-scanned metadata for executed queries
  • Automatic pagination for large result sets
  • Handling of repeated fields (ARRAY types)
  • Custom timeouts for long-running queries