我认为这个问题与以下想法有关
使用有限状态计算模型匹配字符串。
这个问题可以通过使用KMP字符串来解决
匹配算法。
KMP 算法尝试在模式的文本字符串中查找匹配项
字符串,通过考虑模式的前缀有多少
即使我们在某个时刻发现不匹配,仍然匹配。
用于确定“仍然可以匹配多少前缀” if
在匹配模式中的索引 i 后,我们遇到不匹配,
预先建立故障函数。 (请参考下面的代码
用于构建失效函数值)
该失效函数将告诉模式的每个索引 i,
即使这样,仍然可以匹配模式的多少前缀
我们在索引 i 之后遇到不匹配。
这个想法是找出模式的最长适当前缀的长度是多少
这也是由 1 到 i 表示的模式的每个子串的后缀
i 范围从 1 到 n 的索引。
我使用从 1 开始的字符串索引。
因此任何模式的第一个字符的失效函数值
是 0。(即到目前为止还没有匹配到任何字符)。
对于后续字符,对于每个索引 i = 2 到 n,我们看到什么
是最长的长度
模式 [1...i] 的子字符串的正确前缀,也是
模式 [1...i] 的子字符串的后缀。
假设我们的模式是“aac”,那么失败函数值为
索引1为0(尚未匹配),失败函数值
对于索引 2 为 1,(最长专有前缀的长度与
“aa”的最长正确后缀是 1)
对于模式“ababac”,索引 1 的失效函数值为 0,
索引 2 为 0,索引 3 为 1(因为第三个索引 'a' 与
第一个索引“a”),索引 4 为 2(因为索引 1 和 2 处的“ab”相同
作为索引 3 和 4 处的“ab”),索引 5 为 3(索引 [1...3] 处的“aba”
与索引 [3...5] 处的“aba”相同)。对于索引 6,失效函数值为 0。
这是构建失效函数和匹配的代码(C++)
使用它的文本(或流):
/* Assuming that string indices start from 1 for both pattern and text. */
/* Firstly build the failure function. */
int s = 1;
int t = 0;
/* n denotes the length of the pattern */
int *f = new int[n+1];
f[1] = 0;
for (s = 1; s < n; s++) {
while (t > 0 && pattern[t + 1] != pattern[s + 1]) {
t = f[t];
}
if (pattern[t + 1] == pattern[s + 1]) {
t++;
f[s + 1] = t;
}
else {
f[s + 1] = 0;
}
}
/* Now start reading characters from the stream */
int count = 0;
char current_char = stream.next_char();
/* s denotes the index of pattern upto which we have found match in text */
/* initially its 0 i.e. no character has been matched yet. */
s = 0;
while (current_char != '\0') {
/* IF previously, we had matched upto a certain number of
characters, and then failed, we return s to the point
which is the longest prefix that still might be matched.
(spaces between string are just for clarity)
For e.g., if pattern is "a b a b a a"
& its failure returning index is "0 0 1 2 3 1"
and we encounter
the text like : "x y z a b a b a b a a"
indices : 1 2 3 4 5 6 7 8 9 10 11
after matching the substring "a b a b a", starting at
index 4 of text, we are successful upto index 8 but we fail
at index 9, the next character at index 9 of text is 'b'
but in our pattern which should have been 'a'.Thus, the index
in pattern which has been matched till now is 5 ( a b a b a)
1 2 3 4 5
Now, we see that the failure returning index at index 5 of
pattern is 3, which means that the text is still matched upto
index 3 of pattern (a b a), not from the initial starting
index 4 of text, but starting from index 6 of text.
Thus we continue searching assuming that our next starting index
in text is 6, eventually finding the match starting from index 6
upto index 11.
*/
while (s > 0 && current_char != pattern[s + 1]) {
s = f[s];
}
if (current_char == pattern[s + 1]) s++; /* We test the next character after the currently
matched portion of pattern with the current
character of text , if it matches, we increase
the size of our matched portion by 1*/
if (s == n) {
count++;
}
current_char = stream.next_char();
}
printf("Count is %d\n", count);
`
注意:即使在重叠的模式出现中,此方法也有助于找出计数。例如,单词“aba”出现两次
在“ababa”流中。