Blob.decode with replacement does not seem to work



my $þor-blob = Blob.new("þor".ords);
$þor-blob.decode( "ascii", :replacement("0"), :strict(False) ).say


Will not decode invalid ASCII (code point > 127 found)␤


my $euro = Blob.new("3€".ords);
$euro.decode( "latin1", :replacement("euro") ).say

Replacing € with Ø does not seem to work at all.

Admittedly, these methods are not tested https://github.com/perl6/roast/issues/524, but is the syntax correct?


  • Only samcv or another core dev can give an authoritative answer. This is my understanding of the code, comments, and results I see.

  • If my understanding is correct, some doc and/or code needs to be sorted out to render this SO moot.1

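  • Specifying the $replacement argument matches a different P6 core multi method than omitting it. Let's call that the "replacer" code path.

  • The "replacer" code path passes the $replacement and $strict arguments on to a code path in nqp, which in turn passes them on to the backend code path that handles replacement.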

  • On the MoarVM backend, the replacement and strict arguments are passed onto the decoders for the windows1252, windows1251, and shiftjis encodings but not for other encodings.2


Your code calls this code in Buf.pm6 https://github.com/rakudo/rakudo/blob/b394b63c27c22bf6495f7bb2348fc56a47ead45d/src/core/Buf.pm6#L297:

multi method decode(Blob:D: $encoding,
                    Str    :$replacement!,
                    Bool:D :$strict = False) {
        # ... (excerpt: the opening of the nqp::decoderepconf call is omitted here) ...
        $replacement.defined ?? $replacement !! nqp::null_s(),
        $strict ?? 0 !! 1))

The nqp::decoderepconf function maps directly to the corresponding function in the backend.

On the MoarVM backend, it is MVM_string_decode_from_buf_config in ops.c https://github.com/MoarVM/MoarVM/blob/6c7810ce7ca905d772ac2a3e47e73cf7c7c41ed8/src/strings/ops.c#L1781.

This in turn calls MVM_string_decode_config https://github.com/MoarVM/MoarVM/blob/6c7810ce7ca905d772ac2a3e47e73cf7c7c41ed8/src/strings/ops.c#L1642 in the same file. The comments on that routine say:


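Unlike MVM_string_decode, it will not pass through codepoints which have no official mapping.

For now, windows-1252 and windows-1251 are the only ones this makes a difference on.

Digging through the code and commits in the repositories suggests that the latter comment is slightly out of date, because it looks like this should also make a difference for shiftjis.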

Also, to be clear, if one specifies the $replacement argument in P6 then the $strict argument is going to end up being ignored (and $strict = True assumed) if decoding any encoding other than the windows or shiftjis encodings.2

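What happens with ascii and latin1 in particular

The current code for MVM_string_decode_config does not pass the replacement/strictness arguments on to the MVM_string_ascii_decode and MVM_string_latin1_decode functions.

So, with the "ascii" encoding, the blob must contain only values from 0 to 127, and with "latin1" the values must be between 0 and 255.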

say "þor".ords; # (254 111 114)
say "3€".ords;  # (51 8364)
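The same range check can be written out explicitly. The following is only a minimal sketch: out-of-range is a hypothetical helper made up for illustration, not part of Rakudo, and it only inspects code points; it says nothing about what the decoders do internally.

```raku
# Hypothetical helper (not a Rakudo built-in): return the code points of $s
# that are larger than $max, the highest value the target encoding accepts.
sub out-of-range(Str $s, Int $max) {
    $s.ords.grep(* > $max).List
}

say out-of-range("þor", 127);  # (254)   too large for ascii
say out-of-range("þor", 255);  # ()      fits in latin1
say out-of-range("3€",  255);  # (8364)  too large for latin1
```

An empty list means every code point fits within the limit; anything listed is a value the ascii or latin1 range cannot represent.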

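The first string (as a Buf) fails to decode and produces an error message instead, because 254 is greater than 127 and the ascii decoder code in MoarVM https://github.com/MoarVM/MoarVM/blob/master/src/strings/ascii.c reacts to invalid values by throwing an exception with an "invalid ASCII" message.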

The second replaces € with ¬. This is because by default a Buf is an 8 bit array, so a value above 255 gets truncated to its low byte, which for € (8364, 0x20AC) is 172 (0xAC), the code point of ¬ (in both latin1 and Unicode).3
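The arithmetic behind that truncation can be checked by hand. This sketch only reproduces the bit masking with ordinary integer operations; it does not touch Buf or the decoder, and the variable names are mine.

```raku
my $code-point = "€".ord;               # 8364, i.e. 0x20AC
my $low-byte   = $code-point +& 0xFF;   # keep only the lowest 8 bits

say $low-byte;            # 172
say $low-byte.base(16);   # AC
say $low-byte.chr;        # ¬
```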

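But things are no better if you use a Buf with a larger element size. The result is still a ¬, combined with tofu https://english.stackexchange.com/questions/296505/where-is-tofu-for-font-fallback-box-glyph-coming-from. I can't really read C, but I can see well enough that the MVM_string_latin1_decode function in MoarVM https://github.com/MoarVM/MoarVM/blob/master/src/strings/latin1.c, which decodes latin1, does not throw exceptions. So presumably, when it encounters a character value outside the 0-255 range, it turns the high byte into tofu.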


1 Of course the very thing JJ is doing that led them to post this SO in the first place is fixing the doc. I added this footnote so that other later readers would understand that context and realize that this SO is leading to changes in the doc, and may lead to changes in the code, that will presumably render this SO moot due to the work done.

2 It would be nice if there were multis that rejected use of the $replacement argument if the decoder for the specified encoding doesn't do anything with it.

3 See timotimo++'s comment below.


