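I am trying to parse a UTF-8 string into "bite-sized" segments. For example, I would like to break a text into "sentences".
Is there a comprehensive set of characters (or a regular expression) that corresponds to sentence endings in all languages? I am looking for something that catches Latin periods, exclamation marks, and question marks, Chinese and Japanese full stops, and so on.
Something like the above, but for the equivalent of a comma, would also be nice.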
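You need to look at code points that have the \p{Sentence_Break=STerm} or \p{Sentence_Break=ATerm} property and that also have the \p{Terminal_Punctuation} property. Running the unichars script against Unicode v6.1 shows that these code points meet all of those criteria: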
$ unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}'
U+00021 ! GC=Po SC=Common EXCLAMATION MARK
U+0002E . GC=Po SC=Common FULL STOP
U+0003F ? GC=Po SC=Common QUESTION MARK
U+00589 ։ GC=Po SC=Common ARMENIAN FULL STOP
U+0061F ؟ GC=Po SC=Common ARABIC QUESTION MARK
U+006D4 ۔ GC=Po SC=Arabic ARABIC FULL STOP
U+00700 ܀ GC=Po SC=Syriac SYRIAC END OF PARAGRAPH
U+00701 ܁ GC=Po SC=Syriac SYRIAC SUPRALINEAR FULL STOP
U+00702 ܂ GC=Po SC=Syriac SYRIAC SUBLINEAR FULL STOP
U+007F9 ߹ GC=Po SC=Nko NKO EXCLAMATION MARK
U+00964 । GC=Po SC=Common DEVANAGARI DANDA
U+00965 ॥ GC=Po SC=Common DEVANAGARI DOUBLE DANDA
U+0104A ၊ GC=Po SC=Myanmar MYANMAR SIGN LITTLE SECTION
U+0104B ။ GC=Po SC=Myanmar MYANMAR SIGN SECTION
U+01362 ። GC=Po SC=Ethiopic ETHIOPIC FULL STOP
U+01367 ፧ GC=Po SC=Ethiopic ETHIOPIC QUESTION MARK
U+01368 ፨ GC=Po SC=Ethiopic ETHIOPIC PARAGRAPH SEPARATOR
U+0166E ᙮ GC=Po SC=Canadian_Aboriginal CANADIAN SYLLABICS FULL STOP
U+01803 ᠃ GC=Po SC=Common MONGOLIAN FULL STOP
U+01809 ᠉ GC=Po SC=Mongolian MONGOLIAN MANCHU FULL STOP
U+01944 ᥄ GC=Po SC=Limbu LIMBU EXCLAMATION MARK
U+01945 ᥅ GC=Po SC=Limbu LIMBU QUESTION MARK
U+01AA8 ᪨ GC=Po SC=Tai_Tham TAI THAM SIGN KAAN
U+01AA9 ᪩ GC=Po SC=Tai_Tham TAI THAM SIGN KAANKUU
U+01AAA ᪪ GC=Po SC=Tai_Tham TAI THAM SIGN SATKAAN
U+01AAB ᪫ GC=Po SC=Tai_Tham TAI THAM SIGN SATKAANKUU
U+01B5A ᭚ GC=Po SC=Balinese BALINESE PANTI
U+01B5B ᭛ GC=Po SC=Balinese BALINESE PAMADA
U+01B5E ᭞ GC=Po SC=Balinese BALINESE CARIK SIKI
U+01B5F ᭟ GC=Po SC=Balinese BALINESE CARIK PAREREN
U+01C3B ᰻ GC=Po SC=Lepcha LEPCHA PUNCTUATION TA-ROL
U+01C3C ᰼ GC=Po SC=Lepcha LEPCHA PUNCTUATION NYET THYOOM TA-ROL
U+01C7E ᱾ GC=Po SC=Ol_Chiki OL CHIKI PUNCTUATION MUCAAD
U+01C7F ᱿ GC=Po SC=Ol_Chiki OL CHIKI PUNCTUATION DOUBLE MUCAAD
U+0203C ‼ GC=Po SC=Common DOUBLE EXCLAMATION MARK
U+0203D ‽ GC=Po SC=Common INTERROBANG
U+02047 ⁇ GC=Po SC=Common DOUBLE QUESTION MARK
U+02048 ⁈ GC=Po SC=Common QUESTION EXCLAMATION MARK
U+02049 ⁉ GC=Po SC=Common EXCLAMATION QUESTION MARK
U+02E2E ⸮ GC=Po SC=Common REVERSED QUESTION MARK
U+03002 。 GC=Po SC=Common IDEOGRAPHIC FULL STOP
U+0A4FF ꓿ GC=Po SC=Lisu LISU PUNCTUATION FULL STOP
U+0A60E ꘎ GC=Po SC=Vai VAI FULL STOP
U+0A60F ꘏ GC=Po SC=Vai VAI QUESTION MARK
U+0A6F3 ꛳ GC=Po SC=Bamum BAMUM FULL STOP
U+0A6F7 ꛷ GC=Po SC=Bamum BAMUM QUESTION MARK
U+0A876 ꡶ GC=Po SC=Phags_Pa PHAGS-PA MARK SHAD
U+0A877 ꡷ GC=Po SC=Phags_Pa PHAGS-PA MARK DOUBLE SHAD
U+0A8CE ꣎ GC=Po SC=Saurashtra SAURASHTRA DANDA
U+0A8CF ꣏ GC=Po SC=Saurashtra SAURASHTRA DOUBLE DANDA
U+0A92F ꤯ GC=Po SC=Kayah_Li KAYAH LI SIGN SHYA
U+0A9C8 ꧈ GC=Po SC=Javanese JAVANESE PADA LINGSA
U+0A9C9 ꧉ GC=Po SC=Javanese JAVANESE PADA LUNGSI
U+0AA5D ꩝ GC=Po SC=Cham CHAM PUNCTUATION DANDA
U+0AA5E ꩞ GC=Po SC=Cham CHAM PUNCTUATION DOUBLE DANDA
U+0AA5F ꩟ GC=Po SC=Cham CHAM PUNCTUATION TRIPLE DANDA
U+0AAF0 ꫰ GC=Po SC=Meetei_Mayek MEETEI MAYEK CHEIKHAN
U+0AAF1 ꫱ GC=Po SC=Meetei_Mayek MEETEI MAYEK AHANG KHUDAM
U+0ABEB ꯫ GC=Po SC=Meetei_Mayek MEETEI MAYEK CHEIKHEI
U+0FE52 ﹒ GC=Po SC=Common SMALL FULL STOP
U+0FE56 ﹖ GC=Po SC=Common SMALL QUESTION MARK
U+0FE57 ﹗ GC=Po SC=Common SMALL EXCLAMATION MARK
U+0FF01 ! GC=Po SC=Common FULLWIDTH EXCLAMATION MARK
U+0FF0E . GC=Po SC=Common FULLWIDTH FULL STOP
U+0FF1F ? GC=Po SC=Common FULLWIDTH QUESTION MARK
U+0FF61 。 GC=Po SC=Common HALFWIDTH IDEOGRAPHIC FULL STOP
U+11047
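
As a rough illustration of how a list like the one above could be used, here is a minimal Python sketch that splits text on a hand-picked subset of those terminators. The names SENTENCE_TERMINATORS and split_sentences are made up for this example, and the subset is an arbitrary selection, not the full set. The standard re module does not understand \p{Sentence_Break=...}, so the code points are spelled out explicitly. This is a naive splitter: it breaks after abbreviations such as "Mr." and inside decimals such as "3.14", which the full sentence-boundary rules in Unicode Standard Annex #29 are designed to handle.

```python
import re

# Hypothetical subset of the terminators listed above; extend as needed.
SENTENCE_TERMINATORS = (
    "\u0021"  # EXCLAMATION MARK
    "\u002E"  # FULL STOP
    "\u003F"  # QUESTION MARK
    "\u0589"  # ARMENIAN FULL STOP
    "\u061F"  # ARABIC QUESTION MARK
    "\u06D4"  # ARABIC FULL STOP
    "\u0964"  # DEVANAGARI DANDA
    "\u0965"  # DEVANAGARI DOUBLE DANDA
    "\u1362"  # ETHIOPIC FULL STOP
    "\u203C"  # DOUBLE EXCLAMATION MARK
    "\u203D"  # INTERROBANG
    "\u3002"  # IDEOGRAPHIC FULL STOP
    "\uFF01"  # FULLWIDTH EXCLAMATION MARK
    "\uFF0E"  # FULLWIDTH FULL STOP
    "\uFF1F"  # FULLWIDTH QUESTION MARK
    "\uFF61"  # HALFWIDTH IDEOGRAPHIC FULL STOP
)

_ESCAPED = re.escape(SENTENCE_TERMINATORS)

# A run of non-terminators, followed by a run of terminators or end of text.
_SENTENCE_RE = re.compile("[^" + _ESCAPED + "]+(?:[" + _ESCAPED + "]+|$)")


def split_sentences(text):
    """Naively split text into sentences, keeping the closing punctuation."""
    pieces = (match.group().strip() for match in _SENTENCE_RE.finditer(text))
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    sample = "Hello world. How are you? \u4f60\u597d\u3002\u5143\u6c17\u3067\u3059\u304b\uff1f"
    for sentence in split_sentences(sample):
        print(sentence)
```

Consecutive terminators such as "..." or "?!" stay attached to the sentence they close, and trailing text without a terminator is returned as a final segment. Terminators at the very start of the input, with no preceding text, are dropped.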