bigquery url 解码

2024-05-02

有没有一种简单的方法可以在 BigQuery 查询语言中进行网址解码?我正在使用一个表,该表有一列包含某些值中的 URL 编码字符串。例如:

http://xyz.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz

我像这样提取“url”参数:

SELECT REGEXP_EXTRACT(column_name, "url=([^&]+)") as url 
from [mydataset.mytable]

这给了我:

http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345

我想做的是这样的:

SELECT URL_DECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) as url 
from [mydataset.mytable]

从而返回:

http://www.example.com/hello?v=12345

如果可能的话,我想避免使用多个 REGEXP_REPLACE() 语句(替换 %20、%3A 等...)。

Ideas?


下面是基于@sigpwned答案构建的,但稍微重构并用SQL UDF包装(没有限制,JS UDF可以安全使用)

#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
  SELECT SAFE_CONVERT_BYTES_TO_STRING(
    ARRAY_TO_STRING(ARRAY_AGG(
        IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
      ), b''))
  FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i 
));
SELECT 
  column_name, 
  URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url
FROM `project.dataset.table`

可以用下面问题中的示例进行测试

#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
  SELECT SAFE_CONVERT_BYTES_TO_STRING(
    ARRAY_TO_STRING(ARRAY_AGG(
        IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
      ), b''))
  FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i 
));
WITH `project.dataset.table` AS (
  SELECT 'http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz' column_name
)
SELECT 
  URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url,
  column_name
FROM `project.dataset.table`    

有结果

Row url                                     column_name  
1   http://www.example.com/hello?v=12345    http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz   

使用进一步优化的 SQL UDF 进行更新

CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
  SELECT STRING_AGG(
    IF(REGEXP_CONTAINS(y, r'^%[0-9a-fA-F]{2}'), 
      SAFE_CONVERT_BYTES_TO_STRING(FROM_HEX(REPLACE(y, '%', ''))), y), '' 
    ORDER BY i
    )
  FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}(?:%[0-9a-fA-F]{2})*|[^%]+")) y
  WITH OFFSET AS i 
));
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)

bigquery url 解码 的相关文章

随机推荐