The Low-Level Process Function: ProcessFunction

2023-10-27

Original link: https://zhuanlan.zhihu.com/p/130708277

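1. Definition of ProcessFunction

ProcessFunction is a low-level stream processing operation that gives access to the basic building blocks of all (acyclic) streaming applications:

Events (stream elements)
State (fault-tolerant and consistent, only on keyed streams)
Timers (event time and processing time, only on keyed streams)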

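A ProcessFunction can be thought of as a FlatMapFunction with access to keyed state and timers. It is invoked once for each event received on the input stream.

For fault-tolerant state, ProcessFunction accesses keyed state through the RuntimeContext, in the same way that other stateful functions do.

Timers let an application react to changes in processing time and event time. Every call to processElement() receives a Context object that gives access to the element's event-time timestamp and to the TimerService. The TimerService can register callbacks for future event-time or processing-time instants. When a timer's time is reached, the onTimer() method is called. During that call, all state is again scoped to the key with which the timer was created, so the timer can read and modify keyed state.

To access keyed state and timers, the ProcessFunction must be applied on a KeyedStream.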
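
The timer behaviour described above can be sketched in plain Java. This is only an illustrative model, not Flink code: MiniTimerService and MiniTimerDemo are hypothetical names, and the watermark is simply a number passed in by hand. It assumes two properties of event-time timers: a timer fires once the watermark reaches its timestamp, and there is at most one timer per key and timestamp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stand-in for an event-time timer service; not a Flink class. */
class MiniTimerService {
    // Pending timers: timestamp -> keys that registered a timer for that timestamp.
    private final TreeMap<Long, TreeSet<String>> timers = new TreeMap<>();

    /** Registers a timer; a repeated (key, timestamp) pair is stored only once. */
    void registerEventTimeTimer(String key, long timestamp) {
        timers.computeIfAbsent(timestamp, t -> new TreeSet<>()).add(key);
    }

    /** Fires, in timestamp order, every timer whose timestamp is <= the watermark. */
    List<String> advanceWatermark(long watermark) {
        List<String> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            Map.Entry<Long, TreeSet<String>> entry = timers.pollFirstEntry();
            for (String key : entry.getValue()) {
                // In Flink, this is the point where onTimer() would run, scoped to this key.
                fired.add(key + "@" + entry.getKey());
            }
        }
        return fired;
    }
}

class MiniTimerDemo {
    static List<String> run() {
        MiniTimerService service = new MiniTimerService();
        service.registerEventTimeTimer("a", 1_000L);
        service.registerEventTimeTimer("a", 1_000L); // duplicate, kept once
        service.registerEventTimeTimer("b", 2_000L);
        service.registerEventTimeTimer("a", 5_000L); // stays pending in this demo

        List<String> fired = new ArrayList<>(service.advanceWatermark(500L)); // nothing due yet
        fired.addAll(service.advanceWatermark(2_000L)); // fires a@1000, then b@2000
        return fired;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [a@1000, b@2000]
    }
}
```

The sorted map keeps timers ordered by timestamp, so firing everything up to the watermark is a matter of polling entries from the front.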

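2. Built-in process functions

ProcessFunction: for a DataStream
KeyedProcessFunction: for a KeyedStream, i.e. the stream produced by keyBy
CoProcessFunction: for two streams combined with connect
ProcessJoinFunction: for interval joins between two keyed streams
BroadcastProcessFunction: for a non-keyed stream connected with a broadcast stream
KeyedBroadcastProcessFunction: for a keyed stream (after keyBy) connected with a broadcast stream
ProcessWindowFunction: for keyed windows; it receives all elements of a window at once (full-window processing, not incremental aggregation)
ProcessAllWindowFunction: the same full-window processing for non-keyed windows (windowAll)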

Of these, ProcessFunction behaves like a FlatMapFunction with access to keyed state and timers, and it is called once for every event received on the input stream.

To access keyed state and timers during processing, apply the function on a keyed stream:

stream.keyBy(...).process(new MyProcessFunction())

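3. Usage example

KeyedProcessFunction extends the ProcessFunction concept by giving access to the key of a timer in its onTimer(...) method. In the Flink API it is a separate abstract class rather than a Java subclass of ProcessFunction. The template looks like this: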

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<OUT> out) throws Exception {
    K key = ctx.getCurrentKey();
    // ...
}

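In the following example, a KeyedProcessFunction maintains a count per key and emits a key/count pair whenever one minute (in event time) passes without an update for that key:

The count, the key, and the last-modification timestamp are stored in a ValueState, which is implicitly scoped by key.
For each record, the KeyedProcessFunction increments the counter and sets the last-modification timestamp.
The function also schedules a callback one minute into the future (in event time).
On each callback, it compares the callback's event-time timestamp with the last-modification time of the stored count and emits the key/count pair if they match (i.e., no further update occurred during that minute).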
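
Before looking at the Flink code, the four points above can be modelled in plain Java to show why the timestamp comparison filters out stale timers. This is a simplified sketch, not Flink code: CountWithTimeoutSim is a hypothetical class, state is an ordinary map, and the watermark is advanced by hand.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java model of the count-with-timeout logic; none of these types are Flink classes. */
class CountWithTimeoutSim {
    static final long TIMEOUT_MS = 60_000L;

    // Per-key state: index 0 holds the count, index 1 the last-modification timestamp.
    private final Map<String, long[]> state = new HashMap<>();
    // Pending event-time timers: timestamp -> keys.
    private final TreeMap<Long, TreeSet<String>> timers = new TreeMap<>();
    private final List<String> output = new ArrayList<>();

    /** Mirrors processElement(): bump the count, record the timestamp, schedule a callback. */
    void processElement(String key, long eventTime) {
        long[] current = state.computeIfAbsent(key, k -> new long[2]);
        current[0]++;
        current[1] = eventTime;
        timers.computeIfAbsent(eventTime + TIMEOUT_MS, t -> new TreeSet<>()).add(key);
    }

    /** Mirrors onTimer(): emit only if this timer belongs to the latest update of the key. */
    private void onTimer(String key, long timestamp) {
        long[] result = state.get(key);
        if (timestamp == result[1] + TIMEOUT_MS) {
            output.add(key + "=" + result[0]);
        }
    }

    /** Fires all timers up to the watermark in timestamp order and returns the output so far. */
    List<String> advanceWatermark(long watermark) {
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            Map.Entry<Long, TreeSet<String>> due = timers.pollFirstEntry();
            for (String key : due.getValue()) {
                onTimer(key, due.getKey());
            }
        }
        return output;
    }

    static List<String> demo() {
        CountWithTimeoutSim sim = new CountWithTimeoutSim();
        sim.processElement("a", 0L);      // timer at 60000
        sim.processElement("b", 10_000L); // timer at 70000
        sim.processElement("a", 30_000L); // timer at 90000; the 60000 timer is now stale
        return new ArrayList<>(sim.advanceWatermark(Long.MAX_VALUE));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [b=1, a=2]
    }
}
```

The timer registered for key "a" at 60000 still fires, but by then the stored last-modification time is 30000, so the comparison fails and nothing is emitted. Only the timer at 90000 produces the pair for "a".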

Example: the following four steps implement this count-with-timeout function.

1) First, import the required dependencies

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.util.Collector;

2) Define the data structure (data model) that is stored in state

/**
 * The data type stored in state
 */
public class CountWithTimestamp {

    public String key;           // the key
    public long count;           // the count
    public long lastModified;    // time of the last modification
}

3) Define a custom process function that extends KeyedProcessFunction:

public class CountWithTimeoutFunction
        extends KeyedProcessFunction<Tuple, Tuple2<String, String>, Tuple2<String, Long>> {

    /** The state maintained by this process function */
    private ValueState<CountWithTimestamp> state;

    // Obtain the handle to the state maintained by this process function.
    // Flink's keyed state is accessed through the RuntimeContext.
    @Override
    public void open(Configuration parameters) throws Exception {
        state = getRuntimeContext().getState(new ValueStateDescriptor<>("myState", CountWithTimestamp.class));
    }

    // Called once for every event received on the input stream.
    // For each record, increment the counter and set the last-modification timestamp.
    @Override
    public void processElement(
            Tuple2<String, String> value,
            Context ctx,
            Collector<Tuple2<String, Long>> out) throws Exception {

        // Retrieve the current count
        CountWithTimestamp current = state.value();
        if (current == null) {
            current = new CountWithTimestamp();
            current.key = value.f0;
        }

        // Update the count
        current.count++;

        // Set the state's timestamp to the event-time timestamp assigned to the record
        current.lastModified = ctx.timestamp();

        // Write the state back
        state.update(current);

        // Schedule the next timer 60 seconds after the current event time
        ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
    }

    // Emit the key/count pair if there was no further update within one minute
    @Override
    public void onTimer(
            long timestamp,
            OnTimerContext ctx,
            Collector<Tuple2<String, Long>> out) throws Exception {

        // Get the state for the key that scheduled this timer
        CountWithTimestamp result = state.value();

        // Check whether this is a stale timer or the latest one
        if (timestamp == result.lastModified + 60000) {
            // Emit the state on timeout
            out.collect(new Tuple2<String, Long>(result.key, result.count));
        }
    }
}

4) Apply the custom process function in the main method of the streaming job

public class StreamingJob {
    public static void main(String[] args) throws Exception {
        // Set up the stream execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In Flink versions before 1.12 the default is processing time,
        // so event time is enabled explicitly by setting the time characteristic:
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Source data stream
        DataStream<Tuple2<String, String>> stream = env
                .fromElements("good good study", "day day up", "you see see you")
                .flatMap(new FlatMapFunction<String, Tuple2<String, String>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, String>> collector) throws Exception {
                        for (String word : line.split("\\W+")) {
                            collector.collect(new Tuple2<>(word, "1"));
                        }
                    }
                });

        // The sample data carries no timestamps, so assign timestamps and watermarks here
        DataStream<Tuple2<String, String>> withTimestampsAndWatermarks =
                stream.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Tuple2<String, String>>() {
                    @Override
                    public long extractAscendingTimestamp(Tuple2<String, String> element) {
                        return System.currentTimeMillis();
                    }
                });

        // Apply the process function on the keyed stream
        DataStream<Tuple2<String, Long>> result = withTimestampsAndWatermarks.keyBy(0).process(new CountWithTimeoutFunction());

        // Print the result
        result.print();

        // Execute the streaming program
        env.execute("Process Function");
    }
}
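
As a rough check on what the job should print, the per-word counts for the sample input can be recomputed in plain Java. This sketch does not run Flink; ExpectedCounts is a hypothetical helper that reuses the same tokenization as the flatMap step. It assumes that the bounded source ends with a final watermark that fires every pending event-time timer, in which case one would expect one key/count pair per distinct word. The order of lines in the real job output depends on parallelism and is not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Recomputes the per-word counts for the sample input; plain Java, no Flink involved. */
class ExpectedCounts {
    static Map<String, Long> count(String... lines) {
        // LinkedHashMap keeps words in first-seen order, which makes the result easy to read.
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            // Same tokenization as the flatMap step in the job above.
            for (String word : line.split("\\W+")) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {good=2, study=1, day=2, up=1, you=2, see=2}
        System.out.println(count("good good study", "day day up", "you see see you"));
    }
}
```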
