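They represent the most common codepoints for printable text, plus newlines, spaces, carriage returns and the like. ASCII is covered up to 0x7F, and standards such as Latin-1 or Windows Codepage 1251 use the remaining 128 byte values for accented characters and so on.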
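You'd expect text to use only those codepoints. Binary data would use all codepoints in the range 0x00-0xFF; a text file, for example, will probably not use \x00 (NUL) or \x1F (Unit Separator in the ASCII standard).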
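It is a heuristic at best, though. Some text files may still use C0 control codes (https://en.wikipedia.org/wiki/C0_and_C1_control_codes) outside the 7 characters explicitly named, and I'm sure binary data exists that happens not to include any of the byte values missing from the textchars string.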
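To make the heuristic concrete, here is a minimal sketch. The definition of textchars is an assumption reconstructed from the description above (the 7 named control characters plus everything from 0x20 to 0xFF), and is_binary_string is a hypothetical helper name, not necessarily what the code you found uses.

```python
# Assumed reconstruction of textchars: bell, backspace, tab, line feed,
# form feed, carriage return and esc, plus every byte from 0x20 to 0xFF.
textchars = bytes([7, 8, 9, 10, 12, 13, 27]) + bytes(range(0x20, 0x100))


def is_binary_string(data: bytes) -> bool:
    """Return True if data contains any byte that is not in textchars."""
    # bytes.translate(None, delete) removes every byte listed in `delete`;
    # whatever is left over is a byte the heuristic does not treat as text.
    return bool(data.translate(None, textchars))
```

With this sketch, is_binary_string(b"hello\r\n") is False while is_binary_string(b"\x00\x1f") is True. Binary data that happens to avoid the excluded bytes would still be misreported as text, which is exactly why this is only a heuristic.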
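The author of that range probably based it on the text_chars table (https://github.com/file/file/blob/master/src/encoding.c#L151-L228) from the file command. It marks bytes as non-text, ASCII, Latin-1 or non-ISO extended ASCII, and includes documentation on why those codepoints were chosen: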
/*
* This table reflects a particular philosophy about what constitutes
* "text," and there is room for disagreement about it.
*
* [....]
*
* The table below considers a file to be ASCII if all of its characters
* are either ASCII printing characters (again, according to the X3.4
* standard, not isascii()) or any of the following controls: bell,
* backspace, tab, line feed, form feed, carriage return, esc, nextline.
*
* I include bell because some programs (particularly shell scripts)
* use it literally, even though it is rare in normal text. I exclude
* vertical tab because it never seems to be used in real text. I also
* include, with hesitation, the X3.64/ECMA-43 control nextline (0x85),
* because that's what the dd EBCDIC->ASCII table maps the EBCDIC newline
* character to. It might be more appropriate to include it in the 8859
* set instead of the ASCII set, but it's got to be included in *something*
* we recognize or EBCDIC files aren't going to be considered textual.
*
* [.....]
*/
Interestingly enough, that table excludes 0x7F, which the code you found does not.
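The difference can be checked with a small comparison. This is only a sketch: FILE_ASCII_TEXT is a hypothetical set built from the quoted comment (ASCII printing characters, taken here as 0x20-0x7E, plus the listed controls), not the actual table from encoding.c, and TEXTCHARS is an assumed reconstruction of the range in the code you found.

```python
# Controls the quoted comment counts as text: bell, backspace, tab, line feed,
# form feed, carriage return, esc and nextline (0x85).
FILE_TEXT_CONTROLS = {0x07, 0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x1B, 0x85}

# Assumption: "ASCII printing characters" means 0x20 through 0x7E.
FILE_ASCII_TEXT = FILE_TEXT_CONTROLS | set(range(0x20, 0x7F))

# Assumed reconstruction of the range used by the code in the question.
TEXTCHARS = {7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x100))

# Compare only the 7-bit range; the high bytes belong to the Latin-1 and
# extended categories, which this sketch does not model.
only_in_textchars = sorted(b for b in TEXTCHARS - FILE_ASCII_TEXT if b < 0x80)
print([hex(b) for b in only_in_textchars])  # ['0x7f']
```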