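I'm glad you asked, because I solved exactly this problem in the Rust implementation of LibCodeJam. Specifically, reading raw tokens from a BufRead is handled by the TokensReader type (https://github.com/Lucretiel/LibCodeJam/blob/fcd6201e693082d3db334ad53116d2cc00ae1a17/rust/src/tokens.rs#L185-L227), along with a few small related helpers.
Here is the relevant excerpt. The basic idea is to scan the BufRead::fill_buf buffer for whitespace and copy the non-whitespace bytes into a local buffer, which is reused between token calls. Once a whitespace character is found, or the stream ends, the local buffer is interpreted as UTF-8 and returned as a &str.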
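use std::io;
use std::str::{from_utf8, Utf8Error};

#[derive(Debug)]
pub enum LoadError {
    /// The underlying reader reported an I/O error
    Io(io::Error),
    /// The token's bytes were not valid UTF-8
    Utf8Error(Utf8Error),
    /// The stream ended before another token was found
    OutOfTokens,
}
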
/// TokenBuffer is a reusable buffer into which tokens are
/// read, one by one. It is cleared but not deallocated
/// between tokens.
#[derive(Debug, Default)]
struct TokenBuffer(Vec<u8>);

impl TokenBuffer {
    /// Clear the buffer and start reading a new token
    fn lock(&mut self) -> TokenBufferLock<'_> {
        self.0.clear();
        TokenBufferLock(&mut self.0)
    }
}

/// TokenBufferLock is a helper type that helps manage the lifecycle
/// of reading a new token, then interpreting it as UTF-8.
// Note: Default cannot be derived here, because &mut Vec<u8> has no Default.
#[derive(Debug)]
struct TokenBufferLock<'a>(&'a mut Vec<u8>);

impl<'a> TokenBufferLock<'a> {
    /// Add some bytes to a token
    fn extend(&mut self, chunk: &[u8]) {
        self.0.extend(chunk)
    }

    /// Complete the token and attempt to interpret it as UTF-8
    fn complete(self) -> Result<&'a str, LoadError> {
        from_utf8(self.0).map_err(LoadError::Utf8Error)
    }
}

pub struct TokensReader<R: io::BufRead> {
    reader: R,
    token: TokenBuffer,
}

// The Tokens trait itself is not part of this excerpt.
impl<R: io::BufRead> Tokens for TokensReader<R> {
    fn next_raw(&mut self) -> Result<&str, LoadError> {
        use std::io::ErrorKind::Interrupted;

        // Clear leading whitespace
        loop {
            match self.reader.fill_buf() {
                Err(ref err) if err.kind() == Interrupted => continue,
                Err(err) => return Err(LoadError::Io(err)),
                // An empty buffer means end of stream
                Ok([]) => return Err(LoadError::OutOfTokens),
                // Got some content; scan for the next non-whitespace character
                Ok(buf) => match buf.iter().position(|byte| !byte.is_ascii_whitespace()) {
                    Some(i) => {
                        self.reader.consume(i);
                        break;
                    }
                    None => self.reader.consume(buf.len()),
                },
            };
        }

        // If we reach this point, there is definitely a non-empty token ready to be read.
        let mut token_buf = self.token.lock();

        loop {
            match self.reader.fill_buf() {
                Err(ref err) if err.kind() == Interrupted => continue,
                Err(err) => return Err(LoadError::Io(err)),
                // End of stream also terminates the current token
                Ok([]) => return token_buf.complete(),
                // Got some content; scan for the next whitespace character
                Ok(buf) => match buf.iter().position(u8::is_ascii_whitespace) {
                    Some(i) => {
                        token_buf.extend(&buf[..i]);
                        // Consume the token bytes plus the whitespace byte that ended it
                        self.reader.consume(i + 1);
                        return token_buf.complete();
                    }
                    None => {
                        // The token may continue in the next fill_buf chunk
                        token_buf.extend(buf);
                        self.reader.consume(buf.len());
                    }
                },
            }
        }
    }
}
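This implementation doesn't handle parsing the strings into FromStr types (that is handled separately), but it does correctly handle accumulating bytes, splitting them into whitespace-separated tokens, and interpreting those tokens as UTF-8. It does assume that only ASCII whitespace is used to separate tokens.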
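
To make the ASCII-only assumption concrete, here is a minimal sketch using the same u8::is_ascii_whitespace predicate. The helper split_ascii_tokens is a hypothetical name for illustration and is not part of LibCodeJam; it works on an in-memory slice rather than a BufRead.

```rust
/// Hypothetical helper: split a byte slice into tokens using the same
/// ASCII-only whitespace rule as the excerpt above.
fn split_ascii_tokens(input: &[u8]) -> Vec<&[u8]> {
    input
        .split(|byte| byte.is_ascii_whitespace())
        .filter(|token| !token.is_empty())
        .collect()
}

fn main() {
    // Space, tab, and newline are ASCII whitespace, so they separate tokens.
    let tokens = split_ascii_tokens(b"1 2\t3\n4");
    assert_eq!(tokens.len(), 4);

    // U+00A0 (no-break space) is Unicode whitespace, but it is encoded as the
    // bytes 0xC2 0xA0, neither of which is ASCII whitespace.
    assert!('\u{a0}'.is_whitespace());
    let tokens = split_ascii_tokens("a\u{a0}b".as_bytes());
    assert_eq!(tokens.len(), 1);
}
```

So input that separates tokens with non-ASCII whitespace would be read as a single token under this rule.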
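It's worth noting that FromStr can't be applied directly to the fill_buf buffer, because there is no guarantee that a token doesn't straddle the boundary between two fill_buf calls, and there is no way to force a BufRead to read more bytes until the existing buffer has been fully consumed. I assume it's fairly obvious that once you have an Ok(&str), you can apply FromStr to it at your leisure.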
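
As a rough illustration of the boundary problem, the sketch below uses a deliberately tiny BufReader capacity so that a token is split across two fill_buf calls, and then parses a complete token with FromStr. The helpers fill_buf_chunks and parse_token are hypothetical names introduced here, not LibCodeJam APIs, and the 4-byte capacity is chosen only to force the split.

```rust
use std::io::{self, BufRead, BufReader};
use std::str::FromStr;

/// Hypothetical helper: record the chunk returned by each fill_buf call.
fn fill_buf_chunks(input: &[u8], capacity: usize) -> io::Result<Vec<String>> {
    let mut reader = BufReader::with_capacity(capacity, input);
    let mut chunks = Vec::new();
    loop {
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            return Ok(chunks);
        }
        let len = buf.len();
        chunks.push(String::from_utf8_lossy(buf).into_owned());
        // fill_buf returns the same bytes again until they are consumed.
        reader.consume(len);
    }
}

/// Hypothetical helper: parse a complete token into any FromStr type.
fn parse_token<T: FromStr>(token: &str) -> Result<T, T::Err> {
    token.parse()
}

fn main() -> io::Result<()> {
    // With a 4-byte buffer, the token "12345" straddles two chunks.
    let chunks = fill_buf_chunks(b"12345 67", 4)?;
    assert_eq!(chunks, ["1234", "5 67"]);

    // Parsing the first chunk alone would silently give the wrong number.
    assert_eq!(parse_token::<u32>(&chunks[0]), Ok(1234));

    // Parsing is only safe once the whole token has been assembled.
    assert_eq!(parse_token::<u32>("12345"), Ok(12345));
    Ok(())
}
```

This is why the excerpt accumulates bytes into its own buffer first and leaves FromStr parsing to a later step.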
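This implementation is not zero-copy, but it is (amortized) zero-allocation, and it minimizes unnecessary copying and buffering. It uses a single persistent buffer that is resized only when it is too small for a single token, and it reuses that buffer between tokens. Bytes are copied into this buffer directly from the input BufRead buffer, with no additional intermediate copying.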
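
The "amortized zero-allocation" claim rests on the fact that Vec::clear keeps the allocated capacity. A small sketch of that behavior, using a hypothetical helper capacities_after_tokens and made-up token values:

```rust
/// Hypothetical helper: copy each token into `buffer` after clearing it,
/// and report the buffer's capacity after each token.
fn capacities_after_tokens(buffer: &mut Vec<u8>, tokens: &[&[u8]]) -> Vec<usize> {
    tokens
        .iter()
        .map(|token| {
            // clear() drops the contents but keeps the allocation.
            buffer.clear();
            buffer.extend_from_slice(token);
            buffer.capacity()
        })
        .collect()
}

fn main() {
    let mut buffer = Vec::new();
    let tokens: [&[u8]; 3] = [b"a-long-first-token", b"42", b"short"];
    let caps = capacities_after_tokens(&mut buffer, &tokens);

    // The first token forces an allocation of at least its own length.
    assert!(caps[0] >= tokens[0].len());

    // Later, shorter tokens fit in the existing allocation, so the
    // capacity does not change and no reallocation is needed.
    assert_eq!(caps[0], caps[1]);
    assert_eq!(caps[1], caps[2]);
}
```

The buffer only grows when a token is longer than any seen so far, so a long run of similarly sized tokens needs no further allocations.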