Parsing: Help on ignoring quoted tokens.


Hi all,

   I am writing a (hopefully) simple parser to parse the contents of a
text file and turn it into some sort of html form.  Here's a small
example:

forms.txt contains something like:

# Registration Form
registration {
    numcols:2
    [heading: Account Details] [ ]
    [label:"User Name:"] [textbox:username:amcnab:mandatory]
    [label:"First Name:"] [textbox:first_name:Andy]
    [label:"Last Name:"] [textbox:last_name:McNab]
    [label:"Password:"]   [passbox:passwd::mandatory]

}

# Error form
error {
    numcols:2
    [heading:Explosion Error!][]
    [label:"Vent Gas?:"] [select:vent:yes|no:no]

}

where:
   [.*] denotes an html table cell.

Then, later in my perl code I want to be able to do:

show_form("registration"), or show_form("error") and have it render
the appropriate form layout.

Now, my question is: what is the best way to approach the parsing of
this file?  Perhaps more importantly, how can I structure the file to
make the parsing as easy and practical as possible?

Also, can anyone please suggest how to ignore tokens (like ':') that
occur within quoted strings?

Bonus points if your answer makes no reference to lex or yacc.  :)

Thank you for any suggestions!

pakt.

On Jun 1, 6:30 am, paktsardi@gmail.com wrote:

> Also, can anyone please suggest how to ignore tokens (like ':') that
> occur within quoted strings?

This is very closely related to the FAQ "How can I split a [character]
delimited string except when inside [character]?"
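
For the cell syntax in the question, here is a minimal sketch of that idea, assuming double quotes are the only quoting character and that quotes are never escaped or nested. split_fields is a hypothetical helper name, not something from the FAQ; for the general case the FAQ answer points to modules such as Text::ParseWords.

```perl
use strict;
use warnings;

# Split the text of one cell on ':' while treating "..." as opaque.
# split_fields is a hypothetical helper, not part of the original thread.
sub split_fields {
    my ($cell) = @_;
    my @fields;
    # A field is any run of complete quoted strings and characters that
    # are neither a quote nor a colon; it ends at a colon or end of string.
    while ($cell =~ /\G((?:"[^"]*"|[^":])*)(:|\z)/g) {
        my ($field, $end) = ($1, $2);
        $field =~ s/^"([^"]*)"$/$1/;    # drop quotes around a fully quoted field
        push @fields, $field;
        last if $end eq '';             # reached the end of the string
    }
    return @fields;
}

for my $cell ('label:"User Name:"', 'passbox:passwd::mandatory') {
    print join(' | ', map { "<$_>" } split_fields($cell)), "\n";
}
# prints:
# <label> | <User Name:>
# <passbox> | <passwd> | <> | <mandatory>
```

An unbalanced quote makes the loop stop early, so a real parser would want to check that the whole cell was consumed.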

On Fri, 01 Jun 2007 05:30:10 +0000, paktsardines wrote:
> Now, my question is: what is the best way to approach the parsing of
> this file?

Use Parse::RecDescent. A bit of a learning curve, but very, very powerful.

HTH,
M4

On Jun 1, 7:30 am, paktsardi@gmail.com wrote:

[...snip...]

> Now, my question is: what is the best way to approach the parsing of
> this file?

If you say "parse a text file", you are usually dealing with brackets
and/or nested { ... } constructs and I can clearly see the
"registration { ... }" -  and "error { ... }" - structure in your
file.

I strongly recommend first reading perlfaq4: "How do I find
matching/nesting anything?"

However, in order to keep this simple, I would suggest making a few
assumptions about the structure of your file, thereby effectively
eliminating the inherent nested structure.

Those assumptions would be, for example:
- there are no nested { ... } constructs.
- each { ... } construct begins with a single line of the format
  /^\w+\s*\{$/ and ends with a single line /^}$/
- inside a { ... } construct, each line begins with format /^\s+/
  and it is of the form /\s*\[.*?\]/g
- the first line inside a { ... } construct would be of the form
  /^\s+\[heading:.*?\]\s+\[\s*\]$/
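
As a rough sketch of how these assumptions turn into code, each line can be classified with one regex per line type. classify_line and its return values are hypothetical names chosen for illustration. Two details differ from the list above and are my own assumptions: the literal "{" is escaped so that newer perls accept the pattern, and the heading pattern allows zero spaces between the two cells, because the sample error form writes [heading:Explosion Error!][] without a gap.

```perl
use strict;
use warnings;

# classify_line is a hypothetical helper; the patterns follow the
# assumptions listed above, with the two adjustments noted in the text.
sub classify_line {
    my ($line) = @_;
    return 'comment' if $line =~ /^\#/;
    return 'open'    if $line =~ /^\w+\s*\{$/;
    return 'close'   if $line =~ /^\}$/;
    return 'heading' if $line =~ /^\s+\[heading:.*?\]\s*\[\s*\]$/;
    return 'row'     if $line =~ /^\s+\[.*\]\s*$/;
    return 'other';    # e.g. "numcols:2" or an empty line
}

my @sample = (
    '# Error form',
    'error {',
    '    numcols:2',
    '    [heading:Explosion Error!][]',
    '    [label:"Vent Gas?:"] [select:vent:yes|no:no]',
    '}',
);
printf "%-8s %s\n", classify_line($_), $_ for @sample;
```

Once every line has a type, the main loop only has to decide what to print for each type, and adding a new line type means adding one more pattern.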

This would allow you to process the file line by line using only
regexes, while still producing valid HTML code. At first, this solution
seems to be oversimplified, but as long as you can keep away from nested
structures, you can easily add, remove or modify regexes in a
trial-and-error approach as you develop your Perl program from the
bottom up.

Here is how I would start the bottom-up approach with your test-file:

==============================
use strict;
use warnings;

my $inputfile = 'forms.txt';
open my $inp, '<', $inputfile
  or die "Error 0010: open < '$inputfile': $!";

my $comment = '';
while (<$inp>) {
    chomp;
    if (m{^\#\s*(.*)$}xms) {
        $comment = $1;
    }
    if (m{^\s+\[}xms) {
        my @td = m{\[(.*?)\]}gxms;
        if ($comment ne '') {
            if (@td != 2
            or $td[0] !~ m{^heading:(.*)$}xms) {
                die "Error 0020: unexpected '$_'";
            }
            print "<h2>$1 ($comment)</h2>\n";
            print "<table>\n";
            $comment = '';
            next;
        }
        print "  <tr>\n";
        for my $element (@td) {
            if ($element =~ m{^\s*$}xms) {
                print "    <td>&nbsp;</td>\n";
            }
            else {
                print "    <td>$element</td>\n";
            }
        }
        print "  </tr>\n";
        next;
    }
    if (/^}/xms) {
        print "</table>\n";
        $comment = '';
        next;
    }

}

close $inp;
==============================

This approach is very flexible and scales well: I have already used it
successfully to transform a plain old schema listing of a mainframe
database from basic ASCII format into HTML.

> Bonus points if your answer makes no reference to lex or yacc.  :)

Thanks for the bonus points :-)

--
Klaus

On Jun 1, 1:30 am, paktsardi@gmail.com wrote:

I think your input data format is just fine, so:

1) Use paragraph mode to separate the tables, and make sure there is no
empty line within a single table block.
  * Specify the number of columns and an optional table caption, each on
its own line (no embedded newline). (You could spread the caption over
multiple lines, though. :))
  * Put each table row on a single line, with each column enclosed in
square brackets.
  * If you have embedded square brackets, make a rule and leave that
to a Perl regex. :-)

I guess you've already done all of the above. :-)

2) Then you need a data structure, or possibly a database. For a data
structure, I would use a hash to organize the tables and an array of
arrays to define each table.
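
To make that layout concrete, here is a small sketch of what such a hash might hold for one form, and how to walk it. The contents are hypothetical and typed in by hand for illustration; the convention that an optional plain-string caption comes first, followed by one array reference per row, is the one used by the sample code in this post.

```perl
use strict;
use warnings;

# Hypothetical contents: form name => [ optional caption string,
# then one array reference per table row ].
my %tables = (
    error => [
        'Error form',
        [ 'heading:Explosion Error!', '' ],
        [ 'label:"Vent Gas?:"', 'select:vent:yes|no:no' ],
    ],
);

for my $name (sort keys %tables) {
    my @form = @{ $tables{$name} };    # copy, so %tables stays untouched
    # A leading element that is not a reference is the caption.
    my $caption = ref $form[0] ? '(none)' : shift @form;
    print "$name: caption=$caption, rows=", scalar @form, "\n";
    print "  row $_ starts with $form[$_][0]\n" for 0 .. $#form;
}
```

Keeping the rows as plain arrays of cell strings postpones the question of what is inside a cell until rendering time.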

Here is a sample:

#!/usr/local/bin/perl
use warnings;
use strict;

my %tables = ();
# Records are separated by one blank line; $/ = "" would be true
# paragraph mode, which also swallows runs of several blank lines.
local $/ = "\n\n";

# build the data structure
while (my $tbl = <DATA>)
{
    # find table name
    next if not $tbl =~ /^(\w+)\s*\{\s*$/m;
    my $table = $1;
    # get the number of columns (skip the block if it is missing)
    my ($numCol) = $tbl =~ /^\s*numcols:(\d+)/m;
    next if not defined $numCol;
    # find caption if there is any (note: it parses only the first line)
    my ($caption) = $tbl =~ /^#(.*)/m;
    push @{$tables{$table}}, $caption if defined $caption;
    # check each line and find table rows
    foreach my $row (split "\n", $tbl) {
        # adjust the following regex if you have embedded square brackets
        my @cols = ($row =~ /\[([^][]*)\]/g);
        push @{$tables{$table}}, [ @cols ] if scalar @cols == $numCol;
    }
}

print "Check registration form\n";
show_form('registration');

print "\n\nCheck error form\n";
show_form('error');

##### subroutines #####
sub show_form {
    my $tbl = shift;
    my @form = @{$tables{$tbl}};
    print "<table>\n";
    if (not ref $form[0]) {
        print "  <caption>$form[0]</caption>\n";
        shift @form;
    }
    foreach my $row (@form) {
        print "  <tr>\n";
        foreach my $col (@{$row}) {
            $col = '&nbsp;' if $col =~ /^\s*$/;
            my $var = mkCol($col);
            print "    <th>$var</th>\n" if $row->[0] =~ /^heading:/;
            print "    <td>$var</td>\n" if $row->[0] =~ /^label:/;
        }
        print "  </tr>\n";
    }
    print "</table>\n";

}

##### subroutine to parse table cell #####
sub mkCol {
    my $col = shift;
    return $1 if $col =~ /^label:"([^"]*?):?"$/;
    return $1 if $col =~ /^heading:\s*(.*)/;
    return $col;

}

__DATA__
# Registration Form
registration {
    numcols:2
    [heading: Account Details] [ ]
    [label:"User Name:"] [textbox:username:amcnab:mandatory]
    [label:"First Name:"] [textbox:first_name:Andy]
    [label:"Last Name:"] [textbox:last_name:McNab]
    [label:"Password:"]   [passbox:passwd::mandatory]

}

# Error form
error {
    numcols:2
    [heading:Explosion Error!][]
    [label:"Vent Gas?:"] [select:vent:yes|no:no]

}

(You will need to do more testing yourself, though.)

> Also, can anyone please suggest how to ignore tokens (like ':') that
> occur within quoted strings?

I don't know your final goal, but you can probably leave that to the
code that handles each cell (i.e. the subroutine mkCol() in my test code).

Good luck,
Xicheng

On Jun 1, 11:07 am, Xicheng Jia <xich@gmail.com> wrote:

In fact, there is no need to use paragraph mode to read your data; just
set $/ = "\n}"; instead. I guess this should work for you, as long as the
closing curly bracket of every table block is the first character on a
line. :-)
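
A minimal sketch of that record separator in action, assuming a shortened two-form input held in a string (the in-memory filehandle and the @names array are only there to keep the example self-contained):

```perl
use strict;
use warnings;

# Hypothetical, shortened input; a real script would read forms.txt.
my $text = <<'END';
# Registration Form
registration {
    numcols:2
    [label:"User Name:"] [textbox:username:amcnab:mandatory]
}

# Error form
error {
    numcols:2
    [label:"Vent Gas?:"] [select:vent:yes|no:no]
}
END

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @names;
{
    local $/ = "\n}";    # a record ends at a "}" that starts a line
    while (my $block = <$fh>) {
        # Whatever follows the last "}" is only a newline; skip it.
        next unless $block =~ /^(\w+)\s*\{\s*$/m;
        push @names, $1;
    }
}
close $fh;

print "Found forms: @names\n";    # prints "Found forms: registration error"
```

With this separator, blank lines inside a block no longer split it, which paragraph mode cannot guarantee.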

[..snip..]

> ##### subroutines #####
> sub show_form {
>     my $tbl = shift;

add at least this block:

    if (not exists $tables{$tbl}) {
        print "table '$tbl' does not exist\n";
        return;
    }

[..cut..]

Good luck,
Xicheng
