GCP Dataproc: problems downloading dependencies via spark.jars.packages

2024-05-04

When creating a Dataproc Spark cluster, we pass `--properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6` to the `gcloud dataproc clusters create` command.

This is so that our PySpark scripts can save to Cloud SQL.

Apparently this property does nothing at cluster-creation time; instead, the first `spark-submit` attempts to resolve the dependency.

Technically it does seem to resolve and download the necessary jar, but the first job on the cluster fails because `spark-submit` raises:

Exception in thread "main" java.lang.RuntimeException: [download failed: mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar]
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1177)
    at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:298)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The full output is:

Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
mysql#mysql-connector-java added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found mysql#mysql-connector-java;6.0.6 in central
downloading https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar ...
:: resolution report :: resolve 527ms :: artifacts dl 214ms
    :: modules in use:
    mysql#mysql-connector-java;6.0.6 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   1   |   1   |   1   |   0   ||   1   |   0   |
    ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
        [FAILED     ] mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar: Downloaded file size doesn't match expected Content Length for https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar. Please retry. (212ms)

        [FAILED     ] mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar: Downloaded file size doesn't match expected Content Length for https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar. Please retry. (212ms)

    ==== central: tried

      https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar

        ::::::::::::::::::::::::::::::::::::::::::::::

        ::              FAILED DOWNLOADS            ::

        :: ^ see resolution messages for details  ^ ::

        ::::::::::::::::::::::::::::::::::::::::::::::

        :: mysql#mysql-connector-java;6.0.6!mysql-connector-java.jar

        ::::::::::::::::::::::::::::::::::::::::::::::

However, subsequent jobs on the cluster show this output:

Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
mysql#mysql-connector-java added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found mysql#mysql-connector-java;6.0.6 in central
:: resolution report :: resolve 224ms :: artifacts dl 5ms
    :: modules in use:
    mysql#mysql-connector-java;6.0.6 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 1 already retrieved (0kB/7ms)

So my questions are:

  1. What causes this, and can the good folks at GCP fix it?
  2. In the meantime, is there a workaround other than running a throwaway job at cluster startup that is allowed to fail?

How consistently can you reproduce this? After trying to reproduce it with different cluster setups, my best theory is that an overloaded server is returning 5xx errors.

As far as workarounds go:

1) Download the jar from Maven Central yourself and pass it via the `--jars` option when submitting jobs. If you create new clusters frequently, staging the file on the cluster via an initialization action is the way to go.
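A minimal sketch of workaround 1. The bucket name, cluster name, and script name below are placeholders, not from the original post; the jar URL and coordinates come from the error output above.

```shell
# Fetch the connector jar once from Maven Central.
curl -fLO https://repo1.maven.org/maven2/mysql/mysql-connector-java/6.0.6/mysql-connector-java-6.0.6.jar

# Stage it in GCS so any cluster or job can reach it (bucket is a placeholder).
gsutil cp mysql-connector-java-6.0.6.jar gs://my-bucket/jars/

# Submit with --jars instead of relying on spark.jars.packages resolution.
gcloud dataproc jobs submit pyspark my_script.py \
  --cluster=my-cluster \
  --jars=gs://my-bucket/jars/mysql-connector-java-6.0.6.jar
```

An initialization action would instead copy the jar from GCS into a local directory (e.g. `/usr/lib/spark/jars/`) on each node at cluster-creation time.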

2) Provide an alternate Ivy settings file via the `spark.jars.ivySettings` property, pointing at Google's Maven Central mirror (this should reduce/eliminate the chance of 5xx errors).
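A sketch of workaround 2, assuming the GCS-hosted Maven Central mirror URL that Google documents for Dataproc (verify it for your region); `/tmp/ivysettings.xml` is an arbitrary path chosen for illustration.

```shell
# Write an Ivy settings file that resolves artifacts from Google's
# Maven Central mirror instead of repo1.maven.org.
cat > /tmp/ivysettings.xml <<'EOF'
<ivysettings>
  <settings defaultResolver="gcs-maven-central-mirror"/>
  <resolvers>
    <ibiblio name="gcs-maven-central-mirror"
             m2compatible="true"
             root="https://maven-central.storage-download.googleapis.com/maven2/"/>
  </resolvers>
</ivysettings>
EOF

# Sanity check that the resolver made it into the file.
grep -q "gcs-maven-central-mirror" /tmp/ivysettings.xml && echo "settings written"
```

Then point Spark at it per job with `spark-submit --conf spark.jars.ivySettings=/tmp/ivysettings.xml`, or cluster-wide at creation time with `--properties spark:spark.jars.ivySettings=/tmp/ivysettings.xml`.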

See this article: https://www.infoq.com/news/2015/11/maven-central-at-google

