Retrieving BigQuery validation errors when loading JSONL data via the Python API

2024-03-02

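When loading a JSONL file into BigQuery, how can I retrieve more information about the validation errors? (The question is not about fixing the underlying problem.)

Example code: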

import logging

from google.cloud.bigquery import (
    Client,
    LoadJobConfig,
    SourceFormat,
    WriteDisposition,
)

LOGGER = logging.getLogger(__name__)

# variables depending on the environment
filename = '...'
gcp_project_id = '...'
dataset_name = '...'
table_name = '...'
schema = [ ... ]

# loading data
client = Client(project=gcp_project_id)
dataset_ref = client.dataset(dataset_name)
table_ref = dataset_ref.table(table_name)
job_config = LoadJobConfig()
job_config.source_format = SourceFormat.NEWLINE_DELIMITED_JSON
job_config.write_disposition = WriteDisposition.WRITE_APPEND
job_config.schema = schema
LOGGER.info('loading from %s', filename)
with open(filename, "rb") as source_file:
    job = client.load_table_from_file(
        source_file, destination=table_ref, job_config=job_config
    )

    # Wait for the load job to complete
    job.result()

I am using bigquery-schema-generator (https://pypi.org/project/bigquery-schema-generator/) here to generate the schema (otherwise BigQuery only looks at the first 100 rows).

Running this may fail with the following error message (google.api_core.exceptions.BadRequest):

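400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.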

Looking at the errors property of the exception provides essentially no new information:

[{'reason': 'invalid',
  'message': 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.'}]

I also looked at the __dict__ of the exception, but that did not reveal any further information.

Trying to load the table with the bq command line tool (without an explicit schema in this case) yields a much more helpful message:

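BigQuery error in load operation: Error processing job '...': Provided Schema does not match Table <table name>. Field <field name> has changed type from TIMESTAMP to DATE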

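My question now is: how can I retrieve such a helpful message from the Python API?

Solution based on the accepted answer

Here is a copy-and-paste workaround that can be added so that more information is shown by default. (It may have downsides.)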

import google.cloud.exceptions
import google.cloud.bigquery.job


def get_improved_bad_request_exception(
    job: google.cloud.bigquery.job.LoadJob
) -> google.cloud.exceptions.BadRequest:
    errors = job.errors
    result = google.cloud.exceptions.BadRequest(
        '; '.join([error['message'] for error in errors]),
        errors=errors
    )
    result._job = job
    return result


def wait_for_load_job(
    job: google.cloud.bigquery.job.LoadJob
):
    try:
        job.result()
    except google.cloud.exceptions.BadRequest as exc:
        raise get_improved_bad_request_exception(job) from exc

Then calling wait_for_load_job(job) instead of job.result() directly results in a more helpful exception (both the error message and the errors property).


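To show a more helpful error message, you can import google.api_core.exceptions.BadRequest to catch the exception, and then use the LoadJob errors property (https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.LoadJob.html#google.cloud.bigquery.job.LoadJob.errors) to get the detailed error messages from the job.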

from google.api_core.exceptions import BadRequest
...
...
try:
    load_job.result()  # Waits for the job to complete.
except BadRequest:
    for error in load_job.errors:
        print(error["message"])  # error is of type dictionary
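
If the messages should be logged or attached to an exception rather than printed, a slightly more defensive variant of the loop above may help. This is only a minimal sketch: collect_error_messages is a hypothetical helper name (not part of the BigQuery client), and it assumes that each entry is a dictionary with an optional "message" key and that the errors value may be None when the job reported nothing.

```python
from typing import Iterable, List, Mapping, Optional


def collect_error_messages(
    errors: Optional[Iterable[Mapping[str, str]]]
) -> List[str]:
    """Return the unique, non-empty error messages in their original order.

    `errors` is expected to look like the value of `load_job.errors`:
    a list of dictionaries, or None (assumption: None means no errors).
    """
    seen = set()
    messages = []
    for error in errors or []:
        message = error.get("message", "")
        if message and message not in seen:
            seen.add(message)
            messages.append(message)
    return messages


# Example with hand-written dictionaries shaped like job error entries.
example_errors = [
    {"reason": "invalid", "message": "first message"},
    {"reason": "invalid", "message": "first message"},  # duplicate, dropped
    {"reason": "invalid", "message": "second message"},
]
for message in collect_error_messages(example_errors):
    print(message)
```

Inside the except BadRequest block, the call would be collect_error_messages(load_job.errors).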

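For testing, I used the sample code for loading JSON data into BigQuery (https://github.com/googleapis/python-bigquery/blob/HEAD/samples/load_table_uri_json.py) and changed the input file to produce an error. In the file, I changed the value of "post_abbr" from a string to an array.

File used: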

{"name": "Alabama", "post_abbr": "AL"}
{"name": "Alaska", "post_abbr":  "AK"}
{"name": "Arizona", "post_abbr": [65,2]}

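Applying the code snippet above produces the output below. The last error message shows the actual error: "post_abbr" received an array for a non-repeated field.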

Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 3; errors: 1. Please look into the errors[] collection for more details.
Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 3; errors: 1; max bad: 0; error percent: 0
Error while reading data, error message: JSON parsing error in row starting at position 78: Array specified for non-repeated field: post_abbr.
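
The last message identifies the bad row only by a position. As a rough sketch, assuming that this position is a 0-based byte offset into the source file (which is consistent with the sample file above: the first two rows are 39 bytes each including the newline, so the third row starts at 78), the offset can be mapped back to a line number. The names extract_position and find_row_at_position are hypothetical helpers, not part of any BigQuery library.

```python
import re
from typing import Optional, Tuple


def extract_position(message: str) -> Optional[int]:
    """Pull the number after 'position' out of an error message, if present."""
    match = re.search(r"position (\d+)", message)
    return int(match.group(1)) if match else None


def find_row_at_position(data: bytes, position: int) -> Tuple[int, str]:
    """Return (1-based line number, row text) for the row containing `position`.

    Assumption: `position` is a 0-based byte offset into `data`.
    """
    offset = 0
    for line_number, raw_line in enumerate(data.splitlines(keepends=True), start=1):
        if offset <= position < offset + len(raw_line):
            return line_number, raw_line.decode("utf-8").rstrip("\r\n")
        offset += len(raw_line)
    raise ValueError(f"position {position} is outside the data ({offset} bytes)")


# The three-row test file from above, reproduced byte for byte.
sample = (
    b'{"name": "Alabama", "post_abbr": "AL"}\n'
    b'{"name": "Alaska", "post_abbr":  "AK"}\n'
    b'{"name": "Arizona", "post_abbr": [65,2]}\n'
)
message = (
    "Error while reading data, error message: JSON parsing error in row "
    "starting at position 78: Array specified for non-repeated field: post_abbr."
)

position = extract_position(message)
if position is not None:
    line_number, row = find_row_at_position(sample, position)
    print(line_number, row)  # prints: 3 {"name": "Arizona", "post_abbr": [65,2]}
```

For a real load, the bytes would come from reading the same local file that was passed to load_table_from_file; for very large files, iterating over the open file in binary mode avoids holding everything in memory.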