How can I extract/parse tabular data from a text file in Perl?


I am looking for something like HTML::TableExtract http://search.cpan.org/dist/HTML-TableExtract/, just not for HTML input, but for plain text input that contains "tables" formatted with indentation and spacing.


Here is some header text.

Column One       Column Two      Column Three
a                                           b
a                    b                      c

Some more text

Another Table     Another Column
abdbdbdb          aaaa

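I'm not aware of any packaged solution, but something not very flexible is fairly simple to do, assuming you can make two passes over the file (the following is a partially Perlish pseudocode example):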

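  • Assumption: the data may contain spaces and is NOT quoted CSV-style when it does - if that is not the case, just use Text::CSV(_XS).
  • Assumption: no tabs are used for formatting.
  • The logic defines a "column separator" as any consecutive set of vertical character columns that are 100% filled with spaces.
  • If by accident every row has a space that is part of the data at offset M characters, the logic will treat offset M as a column separator, since it has no way of knowing better. The ONLY way it can know better is if you require columns to be separated by at least X spaces, where X>1 - see the second code snippet for that.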
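To make the last point concrete, here is a minimal sketch of the "all-space offset" test applied to the second table from the question. The helper name `all_space_offsets` is hypothetical, and the table is rebuilt with `sprintf` (an assumption) so that the second column is guaranteed to start at offset 18.

```perl
use strict;
use warnings;

# Hypothetical helper: return the character offsets that hold a space
# (or nothing at all) in every one of the given lines.
sub all_space_offsets {
    my @lines = @_;
    my @non_spaces;    # true at $i if any line has a non-space at offset $i
    for my $line (@lines) {
        my @chars = split //, $line;
        for my $i (0 .. $#chars) {
            $non_spaces[$i] = 1 if $chars[$i] ne ' ';
        }
    }
    return grep { !$non_spaces[$_] } 0 .. $#non_spaces;
}

# The second table from the question; sprintf pads the first column to
# 18 characters (an assumption about the intended alignment).
my @table = (
    sprintf('%-18s%s', 'Another Table', 'Another Column'),
    sprintf('%-18s%s', 'abdbdbdb',      'aaaa'),
);

my @offsets = all_space_offsets(@table);
print "All-space offsets: @offsets\n";
```

Offsets 13 to 17 form the real separator, but offset 25 (the space inside "Another Column") is also reported, because the only other row is too short to have data there. With the single-space rule, that table would be split into three columns instead of two.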


use strict;
use warnings;

# Assumption: the input path is passed on the command line
my $file = shift @ARGV or die "usage: $0 FILE\n";

my $INFER_FROM_N_LINES = 10; # Infer columns from this # of lines
                             # 0 means from entire file
my $lines_scanned = 0;
my @non_spaces; # $non_spaces[$i] is true if some line has a non-space at offset $i
# First pass - find which character columns in the file have all spaces and which don't
open(my $fh, '<', $file) or die "Cannot open $file: $!";
while (my $line = <$fh>) {
    last if $INFER_FROM_N_LINES && $lines_scanned++ == $INFER_FROM_N_LINES;
    chomp $line; # The newline must not count as a non-space
    my @chars = split(//, $line);
    for (my $i = 0; $i < @chars; $i++) {
        $non_spaces[$i] = 1 if $chars[$i] ne " ";
    }
}
close $fh or die "Cannot close $file: $!";

# Find columns, defined as consecutive "non-spaces" slices.
my (@starts, @ends); # Index at which columns start and end
my $state = " "; # Not inside a column
for (my $i = 0; $i < @non_spaces; $i++) {
    next if $state eq " " && !$non_spaces[$i];
    next if $state eq "c" && $non_spaces[$i];
    if ($state eq " ") { # && $non_spaces[$i] of course => start column
        $state = "c";
        push @starts, $i;
    } else { # meaning $state eq "c" && !$non_spaces[$i] => end column
        $state = " ";
        push @ends, $i-1;
    }
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
    push @ends, $#non_spaces;
}

# Second pass - now split lines
open($fh, '<', $file) or die "Cannot open $file: $!";
my @rows = ();
while (my $line = <$fh>) {
    chomp $line;
    my @columns = ();
    push @rows, \@columns;
    for (my $col_num = 0; $col_num < @starts; $col_num++) {
        # A line that ends before this column starts yields an empty cell
        $columns[$col_num] = $starts[$col_num] < length($line)
            ? substr($line, $starts[$col_num], $ends[$col_num]-$starts[$col_num]+1)
            : "";
    }
}
close $fh or die "Cannot close $file: $!";
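Once the column boundaries are known, the per-line splitting can also be expressed as a single `unpack` template instead of a `substr` loop. The following is only a sketch of that alternative, not part of the approach above: the `@starts` and `@ends` values are hard-coded assumptions matching the sample rows, and the rows are built with `sprintf` so the alignment is certain.

```perl
use strict;
use warnings;

# Assumed output of the column-finding step for the sample rows below.
my @starts = (0, 18);
my @ends   = (12, 31);

# Build a template such as "@0 A13 @18 A14": "@N" jumps to absolute
# offset N, "A<len>" reads <len> characters and strips trailing spaces.
my $template = join ' ',
    map { sprintf '@%d A%d', $starts[$_], $ends[$_] - $starts[$_] + 1 } 0 .. $#starts;

my @lines = (
    sprintf('%-18s%s', 'Another Table', 'Another Column'),
    sprintf('%-18s%s', 'abdbdbdb',      'aaaa'),
);

my @rows;
for my $line (@lines) {
    # Pad short lines so that every "@N" jump stays inside the string.
    my $padded = sprintf '%-*s', $ends[-1] + 1, $line;
    push @rows, [ unpack $template, $padded ];
}

print join(' | ', @$_), "\n" for @rows;
```

A side effect of the `A` format is that trailing padding is removed from each cell, which the `substr` version leaves in place. Whether that is desirable depends on whether trailing spaces are meaningful in your data.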

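Now, if you require columns to be separated by at least X spaces, where X>1, that is also doable, but the code that finds the column positions needs to be a bit more complex: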

# Find columns, defined as consecutive "non-spaces" slices separated by at least 3 spaces.
# This replaces the "Find columns" step of the first snippet and reuses its @non_spaces.
my $min_col_separator_is_X_spaces = 3;
my (@starts, @ends); # Index at which columns start and end
my $state = "S"; # inside a separator
NEXT_CHAR: for (my $i = 0; $i < @non_spaces; $i++) {
    if ($state eq "S") { # done with last column, inside a separator
        if ($non_spaces[$i]) { # start a new column
            $state = "c";
            push @starts, $i;
        }
        next NEXT_CHAR;
    }
    # From here on $state eq "c": processing a column
    next NEXT_CHAR if $non_spaces[$i];
    # First space after non-space.
    # Could be beginning of separator? check next X chars!
    for (my $j = $i+1; $j < @non_spaces
                    && $j < $i+$min_col_separator_is_X_spaces; $j++) {
        if ($non_spaces[$j]) {
            $i = $j; # No need to re-scan; we are still inside the same column
            next NEXT_CHAR; # OUTER loop
        }
    }
    # If we reach here, next X chars are spaces! Column ended!
    push @ends, $i-1;
    $state = "S";
    $i += $min_col_separator_is_X_spaces - 1; # The loop's own $i++ skips the last separator char
}
push @ends, $#non_spaces if $state eq "c"; # Last column runs to the end
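To see what the minimum-width rule buys you, here is a small self-contained sketch that runs the column finder on the second table from the question with X=1 and X=3. The helper names `find_columns` and `describe` are hypothetical, the gap-counting loop is a simplified restatement of the state machine above rather than the same code, and the table is rebuilt with `sprintf` (an assumption) so the alignment is certain.

```perl
use strict;
use warnings;

# Hypothetical helper: turn per-offset "non-space seen" flags into
# [start, end] pairs, treating only runs of at least $min_sep all-space
# offsets as column separators.
sub find_columns {
    my ($non_spaces, $min_sep) = @_;
    my (@columns, $start, $last);
    my $gap = 0;    # length of the current run of all-space offsets
    for my $i (0 .. $#$non_spaces) {
        if ($non_spaces->[$i]) {
            $start = $i unless defined $start;    # first char of a new column
            $last  = $i;
            $gap   = 0;
        }
        elsif (defined $start && ++$gap == $min_sep) {
            push @columns, [ $start, $last ];     # the gap is wide enough
            undef $start;
        }
    }
    push @columns, [ $start, $last ] if defined $start;
    return @columns;
}

# Format column ranges as "start-end start-end ...".
sub describe { return join ' ', map { join '-', @$_ } @_ }

my @table = (
    sprintf('%-18s%s', 'Another Table', 'Another Column'),
    sprintf('%-18s%s', 'abdbdbdb',      'aaaa'),
);

# Mark every offset at which some line has a character other than a space.
my @non_spaces;
for my $line (@table) {
    $non_spaces[ pos($line) - 1 ] = 1 while $line =~ /[^ ]/g;
}

my @loose  = find_columns(\@non_spaces, 1);
my @strict = find_columns(\@non_spaces, 3);
print "X=1: ", describe(@loose),  "\n";
print "X=3: ", describe(@strict), "\n";
```

With X=1 the lone space inside "Another Column" splits that column in two (ranges 0-12, 18-24, 26-31), while X=3 keeps it whole (ranges 0-12, 18-31). The trade-off is that X must not exceed the narrowest real gap between columns in your files.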

