Python Natural Language Processing Study Notes (18): 3.2 Strings: Text Processing at the Lowest Level

2023-11-18

When reposting, please credit the source, “一块努力的牛皮糖”: http://www.cnblogs.com/yuxc/

I am new to this, so please point out any inaccurate translations. Many thanks.

Update log

1st 2011.8.6

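3.2 Strings: Text Processing at the Lowest Level

PS: I personally think this part is very important. String processing is the most basic part of NLP, so newcomers should read it carefully; veterans can skip it...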

It’s time to study a fundamental data type that we’ve been studiously avoiding so far. In earlier chapters we focused on a text as a list of words. We didn’t look too closely at words and how they are handled in the programming language. By using NLTK’s corpus interface we were able to ignore the files that these texts had come from. The contents of a word, and of a file, are represented by programming languages as a fundamental data type known as a string. In this section, we explore strings in detail, and show the connection between strings, words, texts, and files.

Basic Operations with Strings

Strings are specified using single quotes ① or double quotes ②, as shown in the following code example. If a string contains a single quote, we must backslash-escape the quote ③ so Python knows a literal quote character is intended, or else put the string in double quotes ②. Otherwise, the quote inside the string ④ will be interpreted as a close quote, and the Python interpreter will report a syntax error:

    >>> monty = 'Monty Python' # ①
    >>> monty
    'Monty Python'
    >>> circus = "Monty Python's Flying Circus" # ②
    >>> circus
    "Monty Python's Flying Circus"
    >>> circus = 'Monty Python\'s Flying Circus' # ③
    >>> circus
    "Monty Python's Flying Circus"
    >>> circus = 'Monty Python's Flying Circus' # ④
      File "<stdin>", line 1
        circus = 'Monty Python's Flying Circus'
                               ^
    SyntaxError: invalid syntax
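As a quick check of the quoting rules above, here is a minimal sketch. It uses Python 3 syntax (an assumption; the transcripts in this section use Python 2), and the variable names double_quoted and escaped are made up for illustration. It shows that the two ways of writing the string produce the same value, and that the interpreter picks its own quote style when echoing it.

```python
# Two spellings of the same string: double quotes vs. a backslash-escaped quote.
double_quoted = "Monty Python's Flying Circus"
escaped = 'Monty Python\'s Flying Circus'

# The backslash is not stored in the string; both literals give the same value.
print(double_quoted == escaped)        # True
print(len(double_quoted), len(escaped))

# repr() shows the quoted form that the interpreter would echo back.
print(repr(escaped))                   # "Monty Python's Flying Circus"
```

The choice of quote style only affects how the literal is typed, not the string that results.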

Sometimes strings go over several lines. Python provides us with various ways of entering them. In the next example, a sequence of two strings is joined into a single string. We need to use backslash ① or parentheses ② so that the interpreter knows that the statement is not complete after the first line.

    >>> couplet = "Shall I compare thee to a Summer's day?"\
    ...           "Thou are more lovely and more temperate:" # ①
    >>> print couplet
    Shall I compare thee to a Summer's day?Thou are more lovely and more temperate:
    >>> couplet = ("Rough winds do shake the darling buds of May,"
    ...           "And Summer's lease hath all too short a date:") # ②
    >>> print couplet
    Rough winds do shake the darling buds of May,And Summer's lease hath all too short a date:

Unfortunately these methods do not give us a newline between the two lines of the sonnet. Instead, we can use a triple-quoted string as follows:

    >>> couplet = """Shall I compare thee to a Summer's day?
    ... Thou are more lovely and more temperate:"""
    >>> print couplet
    Shall I compare thee to a Summer's day?
    Thou are more lovely and more temperate:
    >>> couplet = '''Rough winds do shake the darling buds of May,
    ... And Summer's lease hath all too short a date:'''
    >>> print couplet
    Rough winds do shake the darling buds of May,
    And Summer's lease hath all too short a date:
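The difference between the joining methods and the triple-quoted form comes down to whether the string contains a newline character. The following sketch (Python 3 syntax, which is an assumption, and with illustrative variable names joined, triple and explicit) compares the three ways of writing the couplet.

```python
# Adjacent string literals inside parentheses are joined with nothing in between.
joined = ("Rough winds do shake the darling buds of May,"
          "And Summer's lease hath all too short a date:")

# A triple-quoted string keeps the line break as a '\n' character.
triple = """Rough winds do shake the darling buds of May,
And Summer's lease hath all too short a date:"""

# Writing '\n' explicitly in an ordinary string gives the same value.
explicit = ("Rough winds do shake the darling buds of May,\n"
            "And Summer's lease hath all too short a date:")

print('\n' in joined)        # False
print(triple == explicit)    # True
print(triple.splitlines())   # a list with the two lines of the couplet
```

An explicit '\n' is a reasonable alternative when a triple-quoted string would upset the indentation of the surrounding code.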

Now that we can define strings, we can try some simple operations on them. First let’s look at the + operation, known as concatenation ①. It produces a new string that is a copy of the two original strings pasted together end-to-end. Notice that concatenation doesn’t do anything clever like insert a space between the words. We can even multiply strings ②:

    >>> 'very' + 'very' + 'very' # ①
    'veryveryvery'
    >>> 'very' * 3 # ②
    'veryveryvery'

Your Turn: Try running the following code, then try to use your understanding of the string + and * operations to figure out how it works. Be careful to distinguish between the string ' ', which is a single whitespace character, and '', which is the empty string.

    >>> a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
    >>> b = [' ' * 2 * (7 - i) + 'very' * i for i in a]
    >>> for line in b:
    ...     print line

We’ve seen that the addition and multiplication operations apply to strings, not just numbers. However, note that we cannot use subtraction or division with strings:

    >>> 'very' - 'y'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unsupported operand type(s) for -: 'str' and 'str'
    >>> 'very' / 2
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unsupported operand type(s) for /: 'str' and 'int'

These error messages are another example of Python telling us that we have got our data types in a muddle. In the first case, we are told that the operation of subtraction (i.e., -) cannot apply to objects of type str (strings), while in the second, we are told that division cannot take str and int as its two operands.
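To see which operators strings support without stopping at the first traceback, one option is to catch the TypeError. This is only a sketch in Python 3 syntax (an assumption), and try_operation is a hypothetical helper written for this note, not something from NLTK or the book.

```python
import operator

def try_operation(op, left, right):
    """Return op(left, right), or the error message if the types do not support it."""
    try:
        return op(left, right)
    except TypeError as err:
        return 'TypeError: %s' % err

# Addition and multiplication are defined for strings.
print(try_operation(operator.add, 'very', 'very'))   # veryvery
print(try_operation(operator.mul, 'very', 3))        # veryveryvery

# Subtraction and division are not, so the helper reports the TypeError instead.
print(try_operation(operator.sub, 'very', 'y'))
print(try_operation(operator.truediv, 'very', 2))
```

The exact wording of the error message may differ between Python versions, so the sketch only relies on the exception type.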

Printing Strings

So far, when we have wanted to look at the contents of a variable or see the result of a calculation, we have just typed the variable name into the interpreter. We can also see the contents of a variable using the print statement:

    >>> print monty
    Monty Python

Notice that there are no quotation marks this time. When we inspect a variable by typing its name in the interpreter, the interpreter prints the Python representation of its value. Since it’s a string, the result is quoted. However, when we tell the interpreter to print the contents of the variable, we don’t see quotation characters, since there are none inside the string.

The print statement allows us to display more than one item on a line in various ways, as shown here:

    >>> grail = 'Holy Grail'
    >>> print monty + grail
    Monty PythonHoly Grail
    >>> print monty, grail
    Monty Python Holy Grail
    >>> print monty, "and the", grail    # a space is automatically inserted between the items
    Monty Python and the Holy Grail
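For readers on Python 3, where print is a function rather than a statement, the same behaviour can be sketched as follows. The use of Python 3 and of the sep keyword is an addition to the original text, shown here only for comparison.

```python
monty = 'Monty Python'
grail = 'Holy Grail'

# repr() gives the quoted form the interpreter echoes; str() gives what print shows.
print(repr(monty))                  # 'Monty Python'
print(str(monty))                   # Monty Python

# Concatenation adds nothing between the strings.
print(monty + grail)                # Monty PythonHoly Grail

# Separate arguments are joined with a single space by default.
print(monty, 'and the', grail)      # Monty Python and the Holy Grail

# The sep keyword changes the separator.
print(monty, grail, sep=' / ')      # Monty Python / Holy Grail
```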

Accessing Individual Characters

As we saw in Section 1.2 for lists, strings are indexed, starting from zero. When we index a string, we get one of its characters (or letters). A single character is nothing special; it’s just a string of length 1.

    >>> monty[0]
    'M'
    >>> monty[3]
    't'
    >>> monty[5]
    ' '

As with lists, if we try to access an index that is outside of the string, we get an error:

    >>> monty[20]
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    IndexError: string index out of range

Again as with lists, we can use negative indexes for strings, where -1 is the index of the last character ①. Positive and negative indexes give us two ways to refer to any position in a string. In this case, since the string has a length of 12, indexes 5 and -7 both refer to the same character (a space). (Notice that 5 = len(monty) - 7.)

    >>> monty[-1]    # ① note that monty = 'Monty Python'; at first I thought it had only 5 characters...
    'n'
    >>> monty[5]
    ' '
    >>> monty[-7]
    ' '
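The relationship between positive and negative indexes can be checked for every position at once. The sketch below assumes Python 3 syntax; the names length and pairs are illustrative only.

```python
monty = 'Monty Python'
length = len(monty)                      # 12

# Every position has a non-negative index i and a negative index i - length.
pairs = [(i, i - length, monty[i]) for i in range(length)]
print(pairs[5])                          # (5, -7, ' ')
print(pairs[-1])                         # (11, -1, 'n')

# Both indexes always select the same character.
print(all(monty[pos] == monty[neg] for pos, neg, _ in pairs))   # True

# Indexing past the end raises IndexError rather than returning an empty string.
try:
    monty[20]
except IndexError as err:
    print('IndexError:', err)
```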

We can write for loops to iterate over the characters in strings. This print statement ends with a trailing comma, which is how we tell Python not to print a newline at the end.

    >>> sent = 'colorless green ideas sleep furiously'
    >>> for char in sent:
    ...     print char,
    ...
    c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y

We can count individual characters as well. We should ignore the case distinction by normalizing everything to lowercase, and filter out non-alphabetic characters:

    >>> import nltk
    >>> from nltk.corpus import gutenberg
    >>> raw = gutenberg.raw('melville-moby_dick.txt')
    >>> fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
    >>> fdist.keys()
    ['e', 't', 'a', 'o', 'n', 'i', 's', 'h', 'r', 'l', 'd', 'u', 'm', 'c', 'w',
    'f', 'g', 'p', 'b', 'y', 'v', 'k', 'q', 'j', 'x', 'z']

This gives us the letters of the alphabet, with the most frequently occurring letters listed first (this is quite complicated and we’ll explain it more carefully later). You might like to visualize the distribution using fdist.plot(). The relative character frequencies of a text can be used in automatically identifying the language of the text.
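The same counting idea can be tried without downloading a corpus. The sketch below swaps in collections.Counter from the standard library in place of nltk.FreqDist, uses Python 3 syntax, and runs on a short made-up sample string; all three are assumptions for illustration, and letter_frequencies is a hypothetical helper name.

```python
from collections import Counter

def letter_frequencies(text):
    """Count alphabetic characters in text, ignoring case."""
    return Counter(ch.lower() for ch in text if ch.isalpha())

# A short stand-in for the corpus text used above.
sample = 'Colorless green ideas sleep furiously'
counts = letter_frequencies(sample)

# most_common() lists letters from most to least frequent.
print(counts.most_common(3))     # [('e', 6), ('s', 5), ('l', 4)]

# Relative frequencies, the kind of profile that language identification can compare.
total = sum(counts.values())
for letter, count in counts.most_common(3):
    print(letter, round(count / total, 3))
```

On a text as short as this the profile is not meaningful; it only demonstrates the normalizing and filtering steps.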

Accessing Substrings

A substring is any continuous section of a string that we want to pull out for further processing. We can easily access substrings using the same slice notation we used for lists (see Figure 3-2). For example, the following code accesses the substring starting at index 6, up to (but not including) index 10:

    >>> monty[6:10]
    'Pyth'

Figure 3-2. String slicing: The string Monty Python is shown along with its positive and negative indexes; two substrings are selected using “slice” notation. The slice [m:n] contains the characters from position m through n-1.

Here we see the characters are 'P', 'y', 't', and 'h', which correspond to monty[6] ... monty[9] but not monty[10]. This is because a slice starts at the first index but finishes one before the end index.

We can also slice with negative indexes; the same basic rule of starting from the start index and stopping one before the end index applies. Here we stop before the space character:

    >>> monty[-12:-7]
    'Monty'

As with list slices, if we omit the first value, the substring begins at the start of the string. If we omit the second value, the substring continues to the end of the string:

    >>> monty[:5]
    'Monty'
    >>> monty[6:]
    'Python'
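The slicing rules above can be gathered into one short sketch (Python 3 syntax, which is an assumption). It also checks two properties that follow from the "stop one before the end index" rule: a slice [m:n] has n - m characters, and cutting a string at any index and gluing the pieces back together loses nothing.

```python
monty = 'Monty Python'

print(monty[6:10])      # Pyth
print(monty[-12:-7])    # Monty
print(monty[:5])        # Monty
print(monty[6:])        # Python

# For indexes inside the string, the slice [m:n] has exactly n - m characters.
print(len(monty[6:10]) == 10 - 6)    # True

# Splitting at any index k and concatenating the halves restores the original.
print(all(monty[:k] + monty[k:] == monty for k in range(len(monty) + 1)))   # True
```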

We test if a string contains a particular substring using the in operator, as follows:

    >>> phrase = 'And now for something completely different'
    >>> if 'thing' in phrase:
    ...     print 'found "thing"'
    found "thing"

We can also find the position of a substring within a string, using find():

    >>> monty.find('Python')
    6
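The in operator and find() fit together naturally: one answers whether a substring occurs, the other says where. The following sketch uses Python 3 syntax (an assumption) and a made-up search word, 'parrot', to show what happens when the substring is missing.

```python
phrase = 'And now for something completely different'

# Membership test: does the substring occur anywhere?
print('thing' in phrase)                       # True

# find() returns the index of the first occurrence...
start = phrase.find('thing')
print(start)                                   # 16

# ...which can be fed straight into a slice to recover the match.
print(phrase[start:start + len('thing')])      # thing

# A missing substring gives -1 from find(), while index() raises ValueError.
print(phrase.find('parrot'))                   # -1
try:
    phrase.index('parrot')
except ValueError as err:
    print('ValueError:', err)
```

Because -1 is also a valid index (the last character), the result of find() should be checked before it is used in a slice.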

Your Turn: Make up a sentence and assign it to a variable, e.g., sent = 'my sentence...'. Now write slice expressions to pull out individual words. (This is obviously not a convenient way to process the words of a text!)

More Operations on Strings

Python has comprehensive support for processing strings. A summary, including some operations we haven’t seen yet, is shown in Table 3-2. For more information on strings, type help(str) at the Python prompt.

Table 3-2. Useful string methods: Operations on strings in addition to the string tests shown in Table 1-4; all methods produce a new string or list

    Method           Functionality
    s.find(t)        Index of first instance of string t inside s (-1 if not found)
    s.rfind(t)       Index of last instance of string t inside s (-1 if not found)
    s.index(t)       Like s.find(t), except it raises ValueError if not found
    s.rindex(t)      Like s.rfind(t), except it raises ValueError if not found
    s.join(text)     Combine the words of the text into a string using s as the glue
    s.split(t)       Split s into a list wherever a t is found (whitespace by default)
    s.splitlines()   Split s into a list of strings, one per line
    s.lower()        A lowercased version of the string s
    s.upper()        An uppercased version of the string s
    s.title()        A titlecased version of the string s
    s.strip()        A copy of s without leading or trailing whitespace
    s.replace(t, u)  Replace instances of t with u inside s
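A few of the methods in Table 3-2 can be exercised on one sample string. This is a sketch in Python 3 syntax (an assumption), and the sample string with its surrounding spaces is made up for the demonstration.

```python
s = '  Monty Python and the Holy Grail  '

# strip() returns a copy without the leading and trailing whitespace.
stripped = s.strip()
print(repr(stripped))                      # 'Monty Python and the Holy Grail'

# split() with no argument splits on whitespace; join() glues the pieces back.
words = stripped.split()
print(words)                               # ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
print('-'.join(words))                     # Monty-Python-and-the-Holy-Grail

# find() and rfind() report the first and last positions of a substring.
print(stripped.find('y'), stripped.rfind('y'))    # 4 24

# Case conversion and replacement each return a new string.
print(stripped.lower())
print(stripped.upper())
print('monty python'.title())              # Monty Python
print(stripped.replace('Holy', 'Wholly'))  # Monty Python and the Wholly Grail

# splitlines() breaks a multi-line string into one string per line.
print('line one\nline two'.splitlines())   # ['line one', 'line two']
```

None of these calls changes s itself; each one produces a new string or list.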

The Difference Between Lists and Strings

Strings and lists are both kinds of sequence. We can pull them apart by indexing and slicing them, and we can join them together by concatenating them. However, we cannot join strings and lists:

    >>> query = 'Who knows?'
    >>> beatles = ['John', 'Paul', 'George', 'Ringo']
    >>> query[2]
    'o'
    >>> beatles[2]
    'George'
    >>> query[:2]
    'Wh'
    >>> beatles[:2]
    ['John', 'Paul']
    >>> query + " I don't"
    "Who knows? I don't"
    >>> beatles + 'Brian'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: can only concatenate list (not "str") to list
    >>> beatles + ['Brian']
    ['John', 'Paul', 'George', 'Ringo', 'Brian']

When we open a file and read it into a Python program, we get a string corresponding to the contents of the whole file. If we use a for loop to process the elements of this string, all we can pick out are the individual characters; we don’t get to choose the granularity. By contrast, the elements of a list can be as big or small as we like: for example, they could be paragraphs, sentences, phrases, words, characters. So lists have the advantage that we can be flexible about the elements they contain, and correspondingly flexible about any downstream processing. Consequently, one of the first things we are likely to do in a piece of NLP code is tokenize a string into a list of strings (Section 3.7). Conversely, when we want to write our results to a file, or to a terminal, we will usually format them as a string (Section 3.9).

Lists and strings do not have exactly the same functionality. Lists have the added power that you can change their elements:

    >>> beatles[0] = "John Lennon"
    >>> del beatles[-1]
    >>> beatles
    ['John Lennon', 'Paul', 'George']

On the other hand, if we try to do that with a string (changing the 0th character in query to 'F'), we get:

    >>> query[0] = 'F'
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: object does not support item assignment

This is because strings are immutable: you can’t change a string once you have created it. However, lists are mutable, and their contents can be modified at any time. As a result, lists support operations that modify the original value rather than producing a new value.
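In practice, "changing" a string means building a new one. The sketch below (Python 3 syntax, an assumption; the replacement word 'Nobody' is made up) shows two common routes: assembling a new string from slices, and going through a list with split() and join(), which is the string-to-list-to-string round trip described earlier in this section.

```python
query = 'Who knows?'

# Item assignment on a string raises TypeError because strings are immutable.
try:
    query[0] = 'F'
except TypeError as err:
    print('TypeError:', err)

# Instead, build a new string from slices; the original is left untouched.
changed = 'F' + query[1:]
print(changed)          # Fho knows?
print(query)            # Who knows?

# Going through a list: split into words, modify the list in place, join back.
words = query.split()   # ['Who', 'knows?']
words[0] = 'Nobody'
rejoined = ' '.join(words)
print(rejoined)         # Nobody knows?
```

Simple whitespace splitting is only a stand-in for tokenization here; it leaves the question mark attached to the last word.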

Your Turn: Consolidate your knowledge of strings by trying some of the exercises on strings at the end of this chapter.
