I am running Airflow 1.9 on Kubernetes in AWS. I want logs to go to S3, since the Airflow containers themselves are not long-lived.
I have read the various threads and documents describing the process, but I still can't get it working. First, a test that proves to me the S3 configuration and permissions are valid. This was run on one of our worker instances.
Write a file to S3 using Airflow:
airflow@airflow-worker-847c66d478-lbcn2:~$ id
uid=1000(airflow) gid=1000(airflow) groups=1000(airflow)
airflow@airflow-worker-847c66d478-lbcn2:~$ env |grep s3
AIRFLOW__CONN__S3_LOGS=s3://vevo-dev-us-east-1-services-airflow/logs/
AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_logs
AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://vevo-dev-us-east-1-services-airflow/logs/
airflow@airflow-worker-847c66d478-lbcn2:~$ python
Python 3.6.4 (default, Dec 21 2017, 01:37:56)
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import airflow
>>> s3 = airflow.hooks.S3Hook('s3_logs')
/usr/local/lib/python3.6/site-packages/airflow/utils/helpers.py:351: DeprecationWarning: Importing S3Hook directly from <module 'airflow.hooks' from '/usr/local/lib/python3.6/site-packages/airflow/hooks/__init__.py'> has been deprecated. Please import from '<module 'airflow.hooks' from '/usr/local/lib/python3.6/site-packages/airflow/hooks/__init__.py'>.[operator_module]' instead. Support for direct imports will be dropped entirely in Airflow 2.0.
DeprecationWarning)
>>> s3.load_string('put this in s3 file', airflow.conf.get('core', 'remote_base_log_folder') + "/airflow-test")
[2018-02-23 18:43:58,437] {{base_hook.py:80}} INFO - Using connection to: vevo-dev-us-east-1-services-airflow
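Incidentally, the deprecation warning goes away if the hook is imported from its own module, which I believe is the supported path in 1.9. A sketch of the same test:

import airflow
from airflow.hooks.S3_hook import S3Hook  # direct module import avoids the deprecation warning

s3 = S3Hook('s3_logs')
s3.load_string('put this in s3 file',
               airflow.conf.get('core', 'remote_base_log_folder') + "/airflow-test")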
Now let's retrieve the file from S3 and check its contents. Everything looks fine here. (Note the doubled slash in the key, logs//airflow-test: it comes from concatenating the trailing-slash base folder with "/airflow-test"; S3 accepts it.)
root@4f8171d4fe47:/# aws s3 cp s3://vevo-dev-us-east-1-services-airflow/logs//airflow-test .
download: s3://vevo-dev-us-east-1-services-airflow/logs//airflow-test to ./airflow-test
root@4f8171d4fe47:/# cat airflow-test
put this in s3 fileroot@4f8171d4fe47:/stringer#
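The same read-back can also be done through the hook instead of the aws CLI; a minimal sketch, assuming 1.9's S3Hook.read_key parses a full s3:// URL when no bucket_name is given:

from airflow.hooks.S3_hook import S3Hook

# Read the test object back via the hook; should print "put this in s3 file".
print(S3Hook('s3_logs').read_key('s3://vevo-dev-us-east-1-services-airflow/logs//airflow-test'))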
So the Airflow S3 connection seems fine; the problem is only that Airflow jobs do not use S3 for logging. Below is my setup; I assume something in it is wrong, or I am missing something.
The environment variables on the running worker/scheduler/master instances are:
airflow@airflow-worker-847c66d478-lbcn2:~$ env |grep -i s3
AIRFLOW__CONN__S3_LOGS=s3://vevo-dev-us-east-1-services-airflow/logs/
AIRFLOW__CORE__REMOTE_LOG_CONN_ID=s3_logs
AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://vevo-dev-us-east-1-services-airflow/logs/
S3_BUCKET=vevo-dev-us-east-1-services-airflow
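For reference, env vars of the form AIRFLOW__&lt;SECTION&gt;__&lt;KEY&gt; override airflow.cfg options, while connections are picked up from env vars named AIRFLOW_CONN_&lt;CONN_ID&gt; (single underscores). So the two AIRFLOW__CORE__ entries above should be equivalent to having this in airflow.cfg (a sketch; remote_logging itself is True in our airflow.cfg, as verified further down):

[core]
remote_logging = True
remote_log_conn_id = s3_logs
remote_base_log_folder = s3://vevo-dev-us-east-1-services-airflow/logs/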
This shows that the s3_logs connection exists in Airflow:
airflow@airflow-worker-847c66d478-lbcn2:~$ airflow connections -l|grep s3
│ 's3_logs' │ 's3' │ 'vevo-dev-us-...vices-airflow' │ None │ False │ False │ None │
I put the file https://github.com/apache/incubator-airflow/blob/master/airflow/config_templates/airflow_local_settings.py into my Docker image. You can see an example here from one of our workers:
airflow@airflow-worker-847c66d478-lbcn2:~$ ls -al /usr/local/airflow/config/
total 32
drwxr-xr-x. 2 root root 4096 Feb 23 00:39 .
drwxr-xr-x. 1 airflow airflow 4096 Feb 23 00:53 ..
-rw-r--r--. 1 root root 4471 Feb 23 00:25 airflow_local_settings.py
-rw-r--r--. 1 root root 0 Feb 16 21:35 __init__.py
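(For context on how this file is supposed to get picked up: my reading of the 1.9 UPDATING notes is that Airflow only loads a custom logging config when core.logging_config_class names an importable object, with the config/ directory on PYTHONPATH. Adapted to our file and dict names, that would look something like the following; this is an assumption from the docs, not our verified setup. The notes also mention a task_log_reader option whose value must match the task handler name defined in the file.)

[core]
logging_config_class = airflow_local_settings.DEFAULT_LOGGING_CONFIG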
We edited that file to define the REMOTE_BASE_LOG_FOLDER variable. Here is the diff between our version and the upstream version:
index 899e815..897d2fd 100644
--- a/var/tmp/file
+++ b/config/airflow_local_settings.py
@@ -35,7 +35,8 @@ PROCESSOR_FILENAME_TEMPLATE = '{{ filename }}.log'
# Storage bucket url for remote logging
# s3 buckets should start with "s3://"
# gcs buckets should start with "gs://"
-REMOTE_BASE_LOG_FOLDER = ''
+REMOTE_BASE_LOG_FOLDER = conf.get('core', 'remote_base_log_folder')
+
DEFAULT_LOGGING_CONFIG = {
'version': 1,
Here you can see that the setting is correct on one of our workers:
>>> import airflow
>>> airflow.conf.get('core', 'remote_base_log_folder')
's3://vevo-dev-us-east-1-services-airflow/logs/'
Given that REMOTE_BASE_LOG_FOLDER starts with "s3" and that REMOTE_LOGGING is True:
>>> airflow.conf.get('core', 'remote_logging')
'True'
I expect this block https://github.com/apache/incubator-airflow/blob/master/airflow/config_templates/airflow_local_settings.py#L122-L123 to evaluate to true and send the logs to S3.
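For reference, the block in question is roughly the following (paraphrased; the exact code and line numbers in master may have drifted):

# REMOTE_HANDLERS holds the s3/gcs task-handler definitions.
if REMOTE_LOGGING and REMOTE_BASE_LOG_FOLDER.startswith('s3://'):
    DEFAULT_LOGGING_CONFIG['handlers'].update(REMOTE_HANDLERS['s3'])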
Could anyone who has S3 logging working on 1.9 point out what I am missing? I would like to submit a PR to the upstream project to update the docs, since this seems to be a very common problem and, as far as I can tell, the upstream docs either don't work or are frequently misunderstood.
Thanks! G.