# BigQuery
The BigQuery query runner allows you to connect to and query Google BigQuery datasets, with support for authentication via service accounts and Application Default Credentials (ADC).
## Configuration
The BigQuery query runner supports the following configuration:
```json
{
  "projectId": "your-project-id",
  "jsonKeyFile": "base64-encoded-service-account-key",
  "loadSchema": true,
  "useStandardSql": true,
  "location": "US",
  "totalMBytesProcessedLimit": 1000,
  "maximumBillingTier": 1,
  "userDefinedFunctionResourceUri": "gs://bucket/date_utils.js,gs://bucket/string_utils.js",
  "useQueryAnnotation": false
}
```
### Configuration Schema
| Parameter | Type | Required | Description |
|---|---|---|---|
| `projectId` | String | Yes | Google Cloud project ID |
| `jsonKeyFile` | String | No | Base64-encoded service account key JSON (if omitted, ADC is used) |
| `loadSchema` | Boolean | No | Whether to load schema information |
| `useStandardSql` | Boolean | No | Whether to use Google Standard SQL (`true`) or legacy SQL (`false`) (default: `true`) |
| `location` | String | No | BigQuery processing location (e.g., `"US"`, `"EU"`) |
| `totalMBytesProcessedLimit` | Number | No | Maximum data processed, in MB, allowed for query execution |
| `maximumBillingTier` | Number | No | Maximum billing tier for query execution |
| `userDefinedFunctionResourceUri` | String | No | URIs of JavaScript UDFs, comma-separated |
| `useQueryAnnotation` | Boolean | No | Whether to enable query annotation (default: `false`) |
## Example Usage
```python
from legion_query_runner.query_runner import QueryRunner

# Initialize with BigQuery configuration
bq_config = {
    "projectId": "your-gcp-project",
    "useStandardSql": True,
    "location": "US"
}

# Create a QueryRunner instance
runner = QueryRunner('big_query', bq_config)

# Test the connection
runner.test_connection()

# Execute a query
results = runner.run_query("SELECT * FROM `project.dataset.table` LIMIT 10")
```
## Authentication
The BigQuery query runner supports two authentication methods:
- Service Account Key - provide a base64-encoded service account key JSON in `jsonKeyFile`
- Application Default Credentials (ADC) - omit `jsonKeyFile` to use ADC
To create a base64-encoded service account key:
```shell
# Generate a base64-encoded key (-w 0 disables line wrapping; GNU coreutils)
cat service-account-key.json | base64 -w 0
```
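The same value can be generated from Python with the standard library. This is a sketch; the key path is a placeholder:

```python
import base64

def encode_key_file(path: str) -> str:
    # Read the downloaded service account key file and base64-encode it
    # for use as the "jsonKeyFile" configuration value.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```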
## Type Mapping
The BigQuery query runner maps BigQuery data types to standardized types:
| BigQuery Type | Legion Type |
|---|---|
| INTEGER | INTEGER |
| FLOAT | FLOAT |
| BOOLEAN | BOOLEAN |
| STRING | STRING |
| TIMESTAMP | DATETIME |
| DATETIME | DATETIME |
| DATE | DATE |
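Expressed in code, the table above amounts to a simple lookup. This is an illustrative sketch, not the runner's internal table; falling back to STRING for unlisted types is an assumption:

```python
# Illustrative mapping of BigQuery types to Legion's standardized types.
BIGQUERY_TO_LEGION = {
    "INTEGER": "INTEGER",
    "FLOAT": "FLOAT",
    "BOOLEAN": "BOOLEAN",
    "STRING": "STRING",
    "TIMESTAMP": "DATETIME",
    "DATETIME": "DATETIME",
    "DATE": "DATE",
}

def to_legion_type(bigquery_type: str) -> str:
    # Fall back to STRING for types not listed above (assumption).
    return BIGQUERY_TO_LEGION.get(bigquery_type.upper(), "STRING")
```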
## Query Optimization
To optimize query performance and reduce costs:
- Use the `totalMBytesProcessedLimit` parameter to set an upper bound on data processed. The query runner first performs a dry run to estimate the data processed and rejects queries exceeding the limit.
- Use the `maximumBillingTier` parameter to control billing tier escalation for complex queries.
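The limit check described above can be sketched as follows. With the google-cloud-bigquery client, the byte estimate would come from a dry-run query job (`QueryJob.total_bytes_processed`); the helper below shows only the comparison logic, and the raised error message is illustrative:

```python
def exceeds_processing_limit(estimated_bytes: int, limit_mb: float) -> bool:
    """Return True if a dry-run byte estimate exceeds totalMBytesProcessedLimit."""
    return estimated_bytes / (1024 * 1024) > limit_mb

# Sketch of obtaining the estimate with google-cloud-bigquery (not run here):
# job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
# job = client.query(sql, job_config=job_config)
# if exceeds_processing_limit(job.total_bytes_processed, limit_mb=1000):
#     raise RuntimeError("query would exceed totalMBytesProcessedLimit")
```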
## Schema Retrieval
The BigQuery query runner extracts schema information using INFORMATION_SCHEMA views:
```python
# Get the schema for all tables (only if loadSchema is true)
schema = runner.get_schema()

# Get columns for a specific table
columns = runner.get_table_columns("dataset.table")

# Get column types for a specific table
column_types = runner.get_table_types("dataset.table")
```
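Column metadata of this kind is exposed by BigQuery's `INFORMATION_SCHEMA.COLUMNS` view. The helper below builds such a query string as a sketch; the exact query the runner issues may differ:

```python
def columns_query(project: str, dataset: str) -> str:
    # INFORMATION_SCHEMA.COLUMNS returns one row per column in the dataset.
    return (
        f"SELECT table_name, column_name, data_type "
        f"FROM `{project}.{dataset}`.INFORMATION_SCHEMA.COLUMNS "
        f"ORDER BY table_name, ordinal_position"
    )
```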
## User-Defined Functions
To use JavaScript user-defined functions (UDFs):
- Upload your JavaScript UDF files to Google Cloud Storage
- Specify the GCS URIs in the `userDefinedFunctionResourceUri` parameter as a comma-separated string
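Since the parameter is a single comma-separated string, splitting it into individual GCS URIs is straightforward. This is a sketch of how such a value might be parsed; trimming whitespace and dropping empty entries are assumptions:

```python
def parse_udf_uris(value: str) -> list[str]:
    # Split the comma-separated configuration value into individual GCS URIs.
    return [uri.strip() for uri in value.split(",") if uri.strip()]
```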
## Implementation Details
The BigQuery query runner uses the Google Cloud client libraries to connect to BigQuery. These libraries must be installed for the query runner to be enabled:
```shell
pip install google-api-python-client google-auth google-cloud-bigquery
```
The query runner includes support for:
- Query cost estimation and limiting
- Cancellation of running queries
- Tracking of data-scanned metadata
- Automatic pagination for large result sets
- Handling of repeated fields (ARRAY types)
- Custom timeouts for long-running queries