Looking for suggestions on fast string comparison


I need to diff two text files line by line. Does anybody know what the
best (fastest) way of doing it? I can use "string match" or "$line1 ==
$line2", but am not sure if those are the fastest comparison.

Thanks,
Jim

At 2007-05-22 02:13PM, "jimwu88NOOOS@yahoo.com" wrote:

>  I need to diff two text files line by line. Does anybody know what the
>  best (fastest) way of doing it? I can use "string match" or "$line1 ==
>  $line2", but am not sure if those are the fastest comparison.

I would guess [string compare] would be what you want.

Or perhaps [exec diff old_file new_file]

--
Glenn Jackman
"You can only be young once. But you can always be immature." -- Dave Barry

Glenn Jackman wrote:
> At 2007-05-22 02:13PM, "jimwu88NOOOS@yahoo.com" wrote:
>>  I need to diff two text files line by line. Does anybody know what the
>>  best (fastest) way of doing it? I can use "string match" or "$line1 ==
>>  $line2", but am not sure if those are the fastest comparison.

> I would guess [string compare] would be what you want.

> Or perhaps [exec diff old_file new_file]

In pure Tcl I would guess that

if { $string1 ne $string2 } {
  not_equal_something
}

is the fastest comparison of two strings, because "ne" is direct string
comparison implemented in C and not equal (ne) should be faster than equal
(eq).

For additional speedup you should read both files in big chunks with [read]
instead of [gets] and then split into lines with [split] or compare chunks
directly, if exact line-numbering is not necessary.

Regards
Stephan
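
Stephan's suggestion to read each file in one go and then split it into lines could look roughly like the sketch below. The proc name readLines is made up for illustration, and the sketch assumes the file fits comfortably in memory.

```tcl
# Hypothetical helper (not from the thread): return the lines of a file as a list.
proc readLines {path} {
    set chan [open $path r]
    # One [read] for the whole file instead of one [gets] per line;
    # -nonewline drops the final newline so [split] yields no empty last element.
    set data [read -nonewline $chan]
    close $chan
    return [split $data \n]
}
```

The resulting lists can then be compared element by element, or the two raw strings can be compared directly when line numbers do not matter.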

jimwu88NOOOS@yahoo.com wrote:
> I need to diff two text files line by line. Does anybody know what the
> best (fastest) way of doing it? I can use "string match" or "$line1 ==
> $line2", but am not sure if those are the fastest comparison.

If you need something like the diff program:
Take a look at http://wiki.tcl.tk/3108 or the tcllib struct::list
package, which has the core of the code included.

If you simply need to say: line1 and line2 differ, but do not look for
insertions etc. a simple foreach loop with two indices might be the
fastest way:

foreach line1 $file1 line2 $file2 {
        if {$line1 ne $line2} { puts "Different: $line1 $line2" }
}

Basically you use the same bytecodes if you use string equal/string
compare or the Tcl 8.4 eq/ne operators.

But how fast do you need the compare and for what size/kind of files?

Michael

Stephan Kuhagen wrote:
> In pure Tcl I would guess that
>   if { $string1 ne $string2 } {
>      not_equal_something
>   }
> is the fastest comparison of two strings, because "ne" is direct string
> comparison implemented in C and not equal (ne) should be faster than equal
> (eq).

It really makes no difference; it just inverts the sense of a test. On
the other hand, the eq/ne operators are very fast since they check for
obvious stuff first and do things that modern processors like to do
anyway. The gripping hand is that [string equal] is normally compiled to
the same bytecode anyway; there's no penalty for verbosity.

Now [string compare] is slower since it has to work out which string is
the lesser. That makes a whole bunch of short-cuts invalid.

Donal.
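
Anyone who wants to check the relative cost of the three comparisons on their own Tcl build can measure it with [time]. This is only a sketch: the proc names and the test strings are made up, and the numbers will vary with Tcl version, string length and where the strings first differ.

```tcl
# Made-up test data: two long strings that differ only in the last character.
set a [string repeat x 1000]y
set b [string repeat x 1000]z

# Wrap each comparison in a proc so that it is bytecode-compiled.
proc viaNe {a b}      { expr {$a ne $b} }
proc viaEqual {a b}   { expr {![string equal $a $b]} }
proc viaCompare {a b} { expr {[string compare $a $b] != 0} }

# [time] reports the average microseconds per iteration.
foreach cmd {viaNe viaEqual viaCompare} {
    puts "$cmd: [time {$cmd $a $b} 10000]"
}
```

All three procs return 1 for differing strings and 0 for equal ones, so they are interchangeable when only equality matters.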

On May 22, 3:44 pm, Michael Schlenker <schl@uni-oldenburg.de>
wrote:

Thanks to everybody who replied.

I don't have a target speed. I just wanted to run as fast as I could.
Each file is about 20MB with ~300K lines. I need to skip two lines out
of every ~100 lines (one packet). Those two lines have timestamps, so
they will be different depending on when the packet is generated. All
other lines should be the same if everything works as expected.
Otherwise, the script flags an error. I don't need to know at which
position the lines are different.

Thanks,
Jim

jimwu88NOOOS@yahoo.com wrote:
> I don't have a target speed. I just wanted to run as fast as I could.
> Each file is about 20MB with ~300K lines. I need to skip two lines out
> of every ~100 lines (one packet). Those two lines have timestamps, so
> they will be different depending on when the packet is generated. All
> other lines should be the same if everything works as expected.
> Otherwise, the script flags an error. I don't need to know at which
> position the lines are different.

are you on a unixy platform?
# diff exits with status 1 when the files differ, which makes [exec]
# raise an error, so catch it and keep the output
catch { exec diff $filea $fileb } diffres
foreach line [ split $diffres \n ] {
        switch -glob -- $line {
                {[0-9]*} {
                        # handle linepos, e.g. "12c12"
                        scan $line %d%c%d leftlineno what rightlineno
                }
                <* {
                        puts "leftline ( $leftlineno ):$line"
                }
                >* {
                        puts "rightline( $rightlineno ):$line"
                }
                ---* {
                        puts .
                }
        }
}

uwe

On May 23, 2:49 am, Uwe Klein <uwe_klein_habertw@t-online.de>
wrote:

> are you on a unixy platform?

That same technique will work on windows and other systems with a
command line - just fetch the appropriate diff command.

Larry W. Virden wrote:
> On May 23, 2:49 am, Uwe Klein <uwe_klein_habertw@t-online.de>
> wrote:

>>are you on a unixy platform?

> That same technique will work on windows and other systems with a
> command line - just fetch the appropriate diff command.

Hey larry,
I am a bit slow on occasion.

but not that slow that I would need 4 reminders ;-))

uwe

apropos: with unix it is in the box.
          with win every useful prog is extra hassle to get.
        Thus I can understand that people write
        their utilities in tcl on windows.

jimwu88NOOOS@yahoo.com wrote:

...

> I don't have a target speed. I just wanted to run as fast as I could.
> Each file is about 20MB with ~300K lines. I need to skip two lines out
> of every ~100 lines (one packet). Those two lines have timestamps, so
> they will be different depending on when the packet is generated. All
> other lines should be the same if everything works as expected.
> Otherwise, the script flags an error. I don't need to know at which
> position the lines are different.

A simple foreach loop and [string equal] test will do then. For 2 20MB
files you may as well just slurp the whole lot into memory using [read]
and then use [split] and [foreach]. Should be plenty fast enough.

-- Neil
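
Putting Neil's suggestion together with Jim's requirement might look like the sketch below. The thread never says how the two timestamp lines in each packet can be recognised, so the glob pattern "Timestamp:*" is purely an assumption, as are the proc names readLines and sameExceptTimestamps.

```tcl
# Hypothetical helper: slurp a file with [read] and split it into lines.
proc readLines {path} {
    set chan [open $path r]
    set data [read -nonewline $chan]
    close $chan
    return [split $data \n]
}

# Return 1 if both files agree on every line that is not a timestamp line,
# 0 otherwise. skipPattern is an assumed glob pattern for timestamp lines.
proc sameExceptTimestamps {pathA pathB {skipPattern "Timestamp:*"}} {
    set linesA [readLines $pathA]
    set linesB [readLines $pathB]
    if {[llength $linesA] != [llength $linesB]} {
        return 0
    }
    foreach a $linesA b $linesB {
        # Cheap equality test first; only differing lines get pattern-matched.
        if {$a ne $b} {
            if {![string match $skipPattern $a] || ![string match $skipPattern $b]} {
                return 0
            }
        }
    }
    return 1
}
```

A call such as [sameExceptTimestamps run1.log run2.log] (file names made up) returns 0 as soon as a non-timestamp line differs, which matches the "flag an error, position not needed" requirement.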

On May 23, 9:32 am, Uwe Klein <uwe_klein_habertw@t-online.de>
wrote:

> Hey larry,
> I am a bit slow on occasion.

> but not that slow that I would need 4 reminders ;-))

Sigh - google groups kept saying "I'm sorry, but your posting failed;
try again later." I finally gave up hoping that the item would post.

I've deleted the duplicate postings (which, in itself, was painful...)

Larry W. Virden wrote:
> On May 23, 9:32 am, Uwe Klein <uwe_klein_habertw@t-online.de>
> wrote:

>> Hey larry,
>> I am a bit slow on occasion.

>> but not that slow that I would need 4 reminders ;-))

> Sigh - google groups kept saying "I'm sorry, but your posting failed;
> try again later." I finally gave up hoping that the item would post.

> I've deleted the duplicate postings (which, in itself, was painful...)

you're not the only one, I've seen multiple instances of multiple posts
today.

bruce

Donal K. Fellows wrote:
>> comparison implemented in C and not equal (ne) should be faster than
>> equal (eq).

> It really makes no difference;

You are obviously right. My brain must have gone to bed some time earlier
than me yesterday...

Stephan

In article <T%I4i.38712$Ug.33@fe1.news.blueyonder.co.uk>,
Donal K. Fellows <donal.k.fell@manchester.ac.uk> wrote:
>Stephan Kuhagen wrote:
>> In pure Tcl I would guess that
>>   if { $string1 ne $string2 } {
>>      not_equal_something
>>   }
>> is the fastest comparison of two strings, because "ne" is direct string
>> comparison implemented in C and not equal (ne) should be faster than equal
>> (eq).

>It really makes no difference; it just inverts the sense of a test. On
>the other hand, the eq/ne operators are very fast since they check for
>obvious stuff first and do things that modern processors like to do
>anyway. The gripping hand is that [string equal] is normally compiled to

         ^^^^^^^^^^^^^^^^^

>the same bytecode anyway; there's no penalty for verbosity.

More of the watchmaker's work, no doubt? :-)

MH

MH wrote:
> Donal K. Fellows wrote:
>> It really makes no difference; it just inverts the sense of a test. On
>> the other hand, the eq/ne operators are very fast since they check for
>> obvious stuff first and do things that modern processors like to do
>> anyway. The gripping hand is that [string equal] is normally compiled to
>          ^^^^^^^^^^^^^^^^^
>> the same bytecode anyway; there's no penalty for verbosity.

> More of the watchmaker's work, no doubt? :-)

It's been motied about that that might be the case.

Donal.
