Perl Programming Language
Parsing: Help on ignoring quoted tokens.
Hi all,

I am writing a (hopefully) simple parser to parse the contents of a text file and turn it into some sort of html form. Here's a small example:

forms.txt contains something like:

    # Registration Form
    registration {
      numcols:2
      [heading: Account Details] [ ]
      [label:"User Name:"] [textbox:username:amcnab:mandatory]
      [label:"First Name:"] [textbox:first_name:Andy]
      [label:"Last Name:"] [textbox:last_name:McNab]
      [label:"Password:"] [passbox:passwd::mandatory]
    }

    # Error form
    error {
      numcols:2
      [heading:Explosion Error!][]
      [label:"Vent Gas?:"] [select:vent:yes|no:no]
    }

where:
    [.*] denotes an html table cell.

Then, later in my perl code, I want to be able to do show_form("registration") or show_form("error") and have it render the appropriate form layout.

Now, my question is: what is the best way to approach the parsing of this file? Perhaps more importantly, how can I structure the file to make the parsing as easy and practical as possible?

Also, can anyone please suggest how to ignore tokens (like ':') that occur within quoted strings?

Bonus points if your answer makes no reference to lex or yacc. :)

Thank you for any suggestions!

pakt.
On Jun 1, 6:30 am, paktsardi@gmail.com wrote:
> Also, can anyone please suggest how to ignore tokens (like ':') that
> occur within quoted strings?
This is very closely related to the FAQ "How can I split a [character] delimited string except when inside [character]?"
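That FAQ entry points to the core module Text::ParseWords, among other options. A minimal sketch of how it might apply here, assuming each cell's contents have already been extracted from the square brackets; the helper name split_cell is made up for illustration and is not part of the original post:

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Hypothetical helper (not from the thread): split one cell's contents
# on ':' while leaving colons inside double-quoted strings alone.
sub split_cell {
    my ($cell) = @_;
    # Second argument 0 asks quotewords to strip the quotes
    # from the fields it returns.
    return quotewords(':', 0, $cell);
}

for my $cell ('label:"User Name:"', 'textbox:username:amcnab:mandatory') {
    print join(' | ', split_cell($cell)), "\n";
}
```

One caveat: Text::ParseWords also treats single quotes and backslashes as special, so a cell with a bare apostrophe outside double quotes may not split the way you expect.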
On Fri, 01 Jun 2007 05:30:10 +0000, paktsardines wrote:
> Now, my question is: what is the best way to approach the parsing of
> this file?

Use Parse::RecDescent. A bit of a learning curve, but very, very powerful.

HTH,
M4
On Jun 1, 7:30 am, paktsardi@gmail.com wrote:
> I am writing a (hopefully) simple parser to parse the contents of a
> text file and turn it into some sort of html form. Here's a small
> example:
> forms.txt contains something like:
> # Registration Form
> registration {
>   numcols:2
>   [heading: Account Details] [ ]
>   [label:"User Name:"] [textbox:username:amcnab:mandatory]
>   [label:"First Name:"] [textbox:first_name:Andy]
>   [label:"Last Name:"] [textbox:last_name:McNab]
>   [label:"Password:"] [passbox:passwd::mandatory]
> }
> # Error form
> error {
>   numcols:2
>   [heading:Explosion Error!][]
>   [label:"Vent Gas?:"] [select:vent:yes|no:no]
> }
> where:
> [.*] denotes an html table cell.

[...snip...]

> Now, my question is: what is the best way to approach the parsing of
> this file?
If you say "parse a text file", you are usually dealing with brackets and/or nested { ... } constructs, and I can clearly see the "registration { ... }" and "error { ... }" structure in your file. I strongly recommend first reading perlfaq4: "How do I find matching/nesting anything?"

However, in order to keep this simple, I would suggest making a few assumptions about the structure of your file, thereby effectively eliminating the inherent nested structure. Those assumptions would be, for example:

- there are no nested { ... } constructs.
- each { ... } construct begins with a single line of the format /^\w+\s*\{$/ and ends with a single line /^}$/
- inside a { ... } construct, each line begins with the format /^\s+/ and is of the form /\s*\[.*?\]/g
- the first line inside a { ... } construct would be of the form /^\s+\[heading:.*?\]\s+\[\s*\]$/

This would let you process the file line by line using only regexes, while still producing valid html code. At first, this solution seems to be oversimplified, but as long as you can keep away from nested structures, you can easily add, remove or modify regexes in a trial-and-error approach as you develop your Perl program from the bottom up.

Here is how I would start the bottom-up approach with your test file:

==============================
use strict;
use warnings;

my $inputfile = 'forms.txt';
open my $inp, '<', $inputfile
    or die "Error 0010: open < '$inputfile': $!";

my $comment = '';    # text of the most recent "# ..." line, if any

while (<$inp>) {
    chomp;

    # remember a comment line; it becomes part of the next heading
    if (m{^\#\s*(.*)$}xms) {
        $comment = $1;
    }

    # an indented line holding one or more [ ... ] cells
    if (m{^\s+\[}xms) {
        my @td = m{\[(.*?)\]}gxms;

        # the first row after a comment must be the heading row
        if ($comment ne '') {
            if (@td != 2 or $td[0] !~ m{^heading:(.*)$}xms) {
                die "Error 0020: unexpected '$_'";
            }
            print "<h2>$1 ($comment)</h2>\n";
            print "<table>\n";
            $comment = '';
            next;
        }

        # every other row becomes one <tr> with one <td> per cell
        print " <tr>\n";
        for my $element (@td) {
            if ($element =~ m{^\s*$}xms) {
                print " <td> </td>\n";
            }
            else {
                print " <td>$element</td>\n";
            }
        }
        print " </tr>\n";
        next;
    }

    # a closing brace in column one ends the table
    if (/^}/xms) {
        print "</table>\n";
        $comment = '';
        next;
    }
}

close $inp;
==============================

This approach is very flexible and extremely scalable. I have already used it successfully to transform a plain old schema listing of a mainframe database from basic ASCII format into HTML.

> Bonus points if your answer makes no reference to lex or yacc. :)

Thanks for the bonus points :-)

--
Klaus
On Jun 1, 1:30 am, paktsardi@gmail.com wrote:
> Hi all,
> I am writing a (hopefully) simple parser to parse the contents of a
> text file and turn it into some sort of html form. Here's a small
> example:
> forms.txt contains something like:
> # Registration Form
> registration {
>   numcols:2
>   [heading: Account Details] [ ]
>   [label:"User Name:"] [textbox:username:amcnab:mandatory]
>   [label:"First Name:"] [textbox:first_name:Andy]
>   [label:"Last Name:"] [textbox:last_name:McNab]
>   [label:"Password:"] [passbox:passwd::mandatory]
> }
> # Error form
> error {
>   numcols:2
>   [heading:Explosion Error!][]
>   [label:"Vent Gas?:"] [select:vent:yes|no:no]
> }
> where:
> [.*] denotes an html table cell.
> Then, later in my perl code I want to be able to do:
> show_form("registration"), or show_form("error") and have it render
> the appropriate form layout.
> Now, my question is: what is the best way to approach the parsing of
> this file? Perhaps more importantly, how can I structure the file to
> make the parsing as easy and practical as possible?
I think your input data format is just fine, so:

1) Use paragraph mode to separate the tables, and make sure there is no empty line within a single table block.
   * Specify the number of columns and the optional table caption, each on a single line (no embedded newline). (You could spread the caption over multiple lines, though. :))
   * Keep each table row on a single line, with each column enclosed in square brackets.
   * If you have embedded square brackets, make a rule and leave that to Perl regex. :-)
   I guess you've done all of the above. :-)

2) Then you need a data structure, or probably a database. For a data structure, I would use a hash to organize the tables and then an array of arrays to define each table. Here is a sample:

#!/usr/local/bin/perl
use warnings;
use strict;

my %tables = ();
local $/ = "\n\n";

# build the data structure
while (my $tbl = <DATA>) {
    # find table name
    next if not $tbl =~ /^(\w+)\s*\{\s*$/m;
    my $table = $1;

    # get the number of columns
    # (list assignment avoids the unreliable "my $x = ... if ..." idiom)
    my ($numCol) = $tbl =~ /^\s*numcols:(\d+)/m;

    # find caption if there is any (note: it parses only the first line)
    my ($caption) = $tbl =~ /^#(.*)/m;
    push @{$tables{$table}}, $caption if defined $caption;

    # check each line and find table rows
    foreach my $row (split "\n", $tbl) {
        # adjust the following regex if you have embedded square brackets
        my @cols = ($row =~ /\[([^][]*)\]/g);
        push @{$tables{$table}}, [ @cols ]
            if defined $numCol and scalar @cols == $numCol;
    }
}

print "Check registration form\n";
show_form('registration');
print "\n\nCheck error form\n";
show_form('error');

##### subroutines #####
sub show_form {
    my $tbl = shift;
    my @form = @{$tables{$tbl}};
    print "<table>\n";

    # a leading plain string (not an array ref) is the caption
    if (not ref $form[0]) {
        print " <caption>$form[0]</caption>\n";
        shift @form;
    }

    foreach my $row (@form) {
        print " <tr>\n";
        foreach my $col (@{$row}) {
            $col = ' ' if $col =~ /^\s*$/;
            my $var = mkCol($col);
            print " <th>$var</th>\n" if $row->[0] =~ /^heading:/;
            print " <td>$var</td>\n" if $row->[0] =~ /^label:/;
        }
        print " </tr>\n";
    }
    print "</table>\n";
}

##### subroutine to parse table cell #####
sub mkCol {
    my $col = shift;
    return $1 if $col =~ /^label:"([^"]*?):?"$/;
    return $1 if $col =~ /^heading:\s*(.*)/;
    return $col;
}

__DATA__
# Registration Form
registration {
  numcols:2
  [heading: Account Details] [ ]
  [label:"User Name:"] [textbox:username:amcnab:mandatory]
  [label:"First Name:"] [textbox:first_name:Andy]
  [label:"Last Name:"] [textbox:last_name:McNab]
  [label:"Password:"] [passbox:passwd::mandatory]
}

# Error form
error {
  numcols:2
  [heading:Explosion Error!][]
  [label:"Vent Gas?:"] [select:vent:yes|no:no]
}
(You will need to do more testing yourself, though.)

> Also, can anyone please suggest how to ignore tokens (like ':') that
> occur within quoted strings?

I don't know your final goal, but you can probably leave that to the handling of each cell (i.e. the subroutine mkCol() in my test code).

Good luck,
Xicheng
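As a rough illustration of per-cell handling, here is a regex-only sketch that splits a cell on ':' unless the colon sits inside a double-quoted string. The helper name parse_cell and the returned hash layout (type, args) are assumptions made for this example and do not come from the posts above; it also assumes quotes are always balanced within a cell.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): turn one cell such as
# 'label:"User Name:"' into its type and its remaining fields.
sub parse_cell {
    my ($cell) = @_;
    $cell =~ s/\A\s+|\s+\z//g;    # trim surrounding whitespace

    # Split on a ':' only when an even number of double quotes follows
    # it, i.e. when the colon is not inside a quoted string. The -1
    # limit keeps empty fields such as the one in 'passwd::mandatory'.
    my @fields = split /:(?=(?:[^"]*"[^"]*")*[^"]*\z)/, $cell, -1;

    s/\A"(.*)"\z/$1/s for @fields;    # drop surrounding quotes

    my $type = @fields ? shift @fields : '';
    return { type => $type, args => [@fields] };
}

my $cell = parse_cell('label:"User Name:"');
print "$cell->{type}: @{ $cell->{args} }\n";
```

A function like this could be called from mkCol() (or replace its regexes) once you decide what each field of a textbox, passbox or select cell should mean.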
On Jun 1, 11:07 am, Xicheng Jia <xich@gmail.com> wrote:
> On Jun 1, 1:30 am, paktsardi@gmail.com wrote:
> > Hi all,
> > I am writing a (hopefully) simple parser to parse the contents of a
> > text file and turn it into some sort of html form. Here's a small
> > example:
> > forms.txt contains something like:
> > # Registration Form
> > registration {
> >   numcols:2
> >   [heading: Account Details] [ ]
> >   [label:"User Name:"] [textbox:username:amcnab:mandatory]
> >   [label:"First Name:"] [textbox:first_name:Andy]
> >   [label:"Last Name:"] [textbox:last_name:McNab]
> >   [label:"Password:"] [passbox:passwd::mandatory]
> > }
> > # Error form
> > error {
> >   numcols:2
> >   [heading:Explosion Error!][]
> >   [label:"Vent Gas?:"] [select:vent:yes|no:no]
> > }
> > where:
> > [.*] denotes an html table cell.
> > Then, later in my perl code I want to be able to do:
> > show_form("registration"), or show_form("error") and have it render
> > the appropriate form layout.
> > Now, my question is: what is the best way to approach the parsing of
> > this file? Perhaps more importantly, how can I structure the file to
> > make the parsing as easy and practical as possible?
> I think your input data format is just fine, so:
> 1) use paragraph-mode to separate between tables, make sure no empty
> line within a single table block.
[..snip..]
> local $/ = "\n\n";
In fact, there is no need to use paragraph mode to read your data; just set

    $/ = "\n}";

I guess this should work for you, as long as the closing curly bracket of every table block is the first character on its line. :-)

[..snip..]
> ##### subroutines #####
> sub show_form {
>     my $tbl = shift;

Add at least this block:

    if (not exists $tables{$tbl}) {
        print "table '$tbl' does not exist\n";
        return;
    }

>     my @form = @{$tables{$tbl}};
>     print "<table>\n";
>     if (not ref $form[0]) {
>         print " <caption>$form[0]</caption>\n";
>         shift @form;
>     }
>     foreach my $row (@form) {
>         print " <tr>\n";
>         foreach my $col (@{$row}) {
>             $col = ' ' if $col =~ /^\s*$/;
>             my $var = mkCol($col);
>             print " <th>$var</th>\n" if $row->[0] =~ /^heading:/;
>             print " <td>$var</td>\n" if $row->[0] =~ /^label:/;
>         }
>         print " </tr>\n";
>     }
>     print "</table>\n";
> }

[..cut..]

Good luck,
Xicheng
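For anyone trying the record-separator idea from this post, the following sketch shows one way it might look. The helper name read_blocks and the in-memory filehandle are assumptions made for the example; the real program would read from forms.txt or DATA instead.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split the form text into
# one record per "name { ... }" block and index the records by name.
sub read_blocks {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

    local $/ = "\n}";    # a record ends at a '}' that starts a line
    my %block;
    while (my $record = <$fh>) {
        # The text after the last '}' is read as one more record;
        # skip anything that does not contain a "name {" line.
        next unless $record =~ /^(\w+)\s*\{\s*$/m;
        $block{$1} = $record;
    }
    close $fh;
    return %block;
}

my %block = read_blocks(<<'END');
# Registration Form
registration {
  numcols:2
  [heading: Account Details] [ ]
}
# Error form
error {
  numcols:2
  [heading:Explosion Error!][]
}
END

print join(', ', sort keys %block), "\n";
```

With this separator the blank line between forms is no longer required, but a '}' at the start of a line inside a block would end the record early.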