Tcl (Tool Command Language) Scripting
Looking for suggestions on fast string comparison
I need to diff two text files line by line. Does anybody know the best (fastest) way of doing it? I can use "string match" or "$line1 == $line2", but I am not sure whether those are the fastest comparisons. Thanks, Jim
At 2007-05-22 02:13PM, "jimwu88NOOOS@yahoo.com" wrote: > I need to diff two text files line by line. Does anybody know what the > best (fastest) way of doing it? I can use "string match" or "$line1 == > $line2", but am not sure if those are the fastest comparison.
I would guess [string compare] would be what you want. Or perhaps [exec diff old_file new_file] -- Glenn Jackman "You can only be young once. But you can always be immature." -- Dave Barry
Glenn Jackman wrote: > At 2007-05-22 02:13PM, "jimwu88NOOOS @yahoo.com" wrote: >> I need to diff two text files line by line. Does anybody know what the >> best (fastest) way of doing it? I can use "string match" or "$line1 == >> $line2", but am not sure if those are the fastest comparison. > I would guess [string compare] would be what you want. > Or perhaps [exec diff old_file new_file]
In pure Tcl I would guess that
    if { $string1 ne $string2 } {
        not_equal_something
    }
is the fastest comparison of two strings, because "ne" is direct string comparison implemented in C and not equal (ne) should be faster than equal (eq). For additional speedup you should read both files in big chunks with [read] instead of [gets] and then split into lines with [split] or compare chunks directly, if exact line-numbering is not necessary. Regards Stephan
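The chunk-wise variant suggested above could look roughly like the following. This is only a sketch: the proc name files_identical and the 64 KiB default chunk size are illustrative assumptions, and it only answers "identical or not" without any line information.

```tcl
# Sketch: compare two files chunk by chunk.
# Returns 1 if the files are byte-for-byte identical, 0 otherwise.
# files_identical and the default chunk size are hypothetical choices.
proc files_identical {pathA pathB {chunkSize 65536}} {
    set fa [open $pathA r]
    set fb [open $pathB r]
    # Binary mode skips encoding and end-of-line translation.
    fconfigure $fa -translation binary
    fconfigure $fb -translation binary
    set same 1
    while {1} {
        set a [read $fa $chunkSize]
        set b [read $fb $chunkSize]
        if {$a ne $b} {
            set same 0
            break
        }
        # Both chunks are equal here; an empty chunk means both files ended.
        if {$a eq ""} { break }
    }
    close $fa
    close $fb
    return $same
}
```

Because it compares raw bytes, this approach cannot skip individual lines; it only fits when the files are expected to match exactly.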
jimwu88NOOOS@yahoo.com wrote: > I need to diff two text files line by line. Does anybody know what the > best (fastest) way of doing it? I can use "string match" or "$line1 == > $line2", but am not sure if those are the fastest comparison.
If you need something like the diff program: take a look at http://wiki.tcl.tk/3108 or the tcllib struct::list package, which has the core of the code included.
If you simply need to say that line1 and line2 differ, without looking for insertions etc., a simple foreach loop that walks both line lists in parallel might be the fastest way:
    foreach line1 $file1 line2 $file2 {
        if {$line1 ne $line2} { puts "Different: $line1 $line2" }
    }
Basically you use the same bytecodes if you use string equal/string compare or the Tcl 8.4 eq/ne operators. But how fast do you need the compare and for what size/kind of files? Michael
Stephan Kuhagen wrote: > In pure Tcl I would guess that > if { $string1 ne $string2 } { > not_equal_something > } > is the fastest comparison of two strings, because "ne" is direct string > comparison implemented in C and not equal (ne) should be faster than equal > (eq).
It really makes no difference; it just inverts the sense of a test. On the other hand, the eq/ne operators are very fast since they check for obvious stuff first and do things that modern processors like to do anyway. The gripping hand is that [string equal] is normally compiled to the same bytecode anyway; there's no penalty for verbosity. Now [string compare] is slower since it has to work out which string is the lesser. That makes a whole bunch of short-cuts invalid. Donal.
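For anyone who wants to check these claims on their own installation, a rough timing sketch using the built-in [time] command might look like this. The proc names and the test strings are illustrative assumptions; the comparisons are wrapped in procs so that they get bytecode-compiled, and the proc call overhead may well dominate the numbers.

```tcl
# Sketch: time three ways of testing two strings for inequality.
# cmp_ne, cmp_equal and cmp_compare are hypothetical helper names.
proc cmp_ne      {a b} { expr {$a ne $b} }
proc cmp_equal   {a b} { expr {![string equal $a $b]} }
proc cmp_compare {a b} { expr {[string compare $a $b] != 0} }

# Two 200-character strings that differ only in the last character.
set a [string repeat x 200]
set b "[string repeat x 199]y"

foreach cmd {cmp_ne cmp_equal cmp_compare} {
    puts [format "%-12s %s" $cmd [time {$cmd $a $b} 100000]]
}
```

The absolute figures depend on the Tcl version, the machine and the string lengths, so treat any difference as a hint, not as a benchmark.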
On May 22, 3:44 pm, Michael Schlenker <schl@uni-oldenburg.de> wrote:
> jimwu88NOOOS @yahoo.com wrote:
> > I need to diff two text files line by line. Does anybody know what the
> > best (fastest) way of doing it? I can use "string match" or "$line1 ==
> > $line2", but am not sure if those are the fastest comparison.
> If you need something like the diff program:
> Take a look at http://wiki.tcl.tk/3108 or the tcllib struct::list
> package, which has the core of the code included.
> If you simply need to say: line1 and line2 differ, but do not look for
> insertions etc. a simple foreach loop with two indices might be the
> fastest way:
> foreach line1 $file1 line2 $file2 {
> if {$line1 ne $line2} { puts "Different: $line1 $line2" }
> }
> Basically you use the same bytecodes if you use string equal/string
> compare or the Tcl 8.4 eq/ne operators.
> But how fast do you need the compare and for what size/kind of files?
> Michael
Thanks to everybody who replied. I don't have a target speed. I just wanted to run as fast as I could. Each file is about 20MB with ~300K lines. I need to skip two lines out of every ~100 lines (one packet). Those two lines have timestamps, so they will be different depending on when the packet is generated. All other lines should be the same if everything works as expected. Otherwise, the script flags an error. I don't need to know at which position the lines are different. Thanks, Jim
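A comparison that tolerates the timestamp lines could be sketched as follows. How the timestamp lines are recognised is not stated in the thread, so the glob pattern "Timestamp:*" is purely a placeholder assumption, as is the proc name lines_match.

```tcl
# Sketch: compare two lists of lines, ignoring differences on timestamp lines.
# Returns 1 if everything else matches, 0 otherwise.
# lines_match and the default skipPattern are hypothetical; adjust the
# pattern to whatever actually marks the timestamp lines in the files.
proc lines_match {linesA linesB {skipPattern "Timestamp:*"}} {
    if {[llength $linesA] != [llength $linesB]} { return 0 }
    foreach a $linesA b $linesB {
        # Fast path: identical lines need no further checks.
        if {$a eq $b} { continue }
        # The lines differ: acceptable only if both are timestamp lines.
        if {[string match $skipPattern $a] && [string match $skipPattern $b]} {
            continue
        }
        return 0
    }
    return 1
}
```

Testing for equality first means the pattern match only runs on the few lines that actually differ, which should keep the common case cheap.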
jimwu88NOOOS @yahoo.com wrote:
> I don't have a target speed. I just wanted to run as fast as I could.
> Each file is about 20MB with ~300K lines. I need to skip two lines out
> of every ~100 lines (one packet). Those two lines have timestamps, so
> they will be different depending on when the packet is generated. All
> other lines should be the same if everything works as expected.
> Otherwise, the script flags an error. I don't need to know at which
> position the lines are different.
are you on a unixy platform?
    # exec raises an error when diff exits with status 1 (files differ),
    # so catch it and keep the output in diffres.
    catch { exec diff $filea $fileb } diffres
    set diffres [ split $diffres \n ]
    foreach line $diffres {
        switch -glob -- $line \
            \[0-9]* {
                # handle linepos
                scan $line %d%c%d leftlineno what rightlineno
            } <* {
                # "<" marks lines from the first (left) file
                puts "leftline ( $leftlineno ):$line"
            } >* {
                # ">" marks lines from the second (right) file
                puts "rightline( $rightlineno ):$line"
            } ---* {
                puts .
            }
    }
uwe
On May 23, 2:49 am, Uwe Klein <uwe_klein_habertw@t-online.de> wrote: > are you on a unixy platform?
That same technique will work on windows and other systems with a command line - just fetch the appropriate diff command.
Larry W. Virden wrote: > On May 23, 2:49 am, Uwe Klein <uwe_klein_habertw @t-online.de> > wrote: >>are you on a unixy platform? > That same technique will work on windows and other systems with a > command line - just fetch the appropriate diff command.
Hey larry, I am a bit slow on occasion. but not that slow that I would need 4 reminders ;-)) uwe apropos: with unix it is in the box. with win every useful prog is extra hassle to get. Thus I can understand that people write their utilities in tcl on windows.
jimwu88NOOOS @yahoo.com wrote: ... > I don't have a target speed. I just wanted to run as fast as I could. > Each file is about 20MB with ~300K lines. I need to skip two lines out > of every ~100 lines (one packet). Those two lines have timestamps, so > they will be different depending on when the packet is generated. All > other lines should be the same if everything works as expected. > Otherwise, the script flags an error. I don't need to know at which > position the lines are different.
A simple foreach loop and [string equal] test will do then. For two 20MB files you may as well just slurp the whole lot into memory using [read] and then use [split] and [foreach]. Should be plenty fast enough. -- Neil
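Put together, the slurp-and-compare approach might look something like this. It is a minimal sketch; read_lines and files_differ are hypothetical names, and the explicit length check is an addition here because [foreach] pads the shorter list with empty strings.

```tcl
# Sketch: read a whole file into memory and return its lines as a list.
# read_lines is a hypothetical helper name.
proc read_lines {path} {
    set f [open $path r]
    set data [read $f]
    close $f
    return [split $data \n]
}

# Returns 1 as soon as a differing line is found, 0 if all lines match.
proc files_differ {pathA pathB} {
    set la [read_lines $pathA]
    set lb [read_lines $pathB]
    # foreach pads the shorter list with "", so compare lengths first.
    if {[llength $la] != [llength $lb]} { return 1 }
    foreach a $la b $lb {
        if {![string equal $a $b]} { return 1 }
    }
    return 0
}
```

For files of about 20MB each, holding both line lists in memory at once should be unproblematic on most machines.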
On May 23, 9:32 am, Uwe Klein <uwe_klein_habertw@t-online.de> wrote: > Hey larry, > I am a bit slow on occasion. > but not that slow that I would need 4 reminders ;-))
Sigh - google groups kept saying "I'm sorry, but your posting failed; try again later." I finally gave up hoping that the item would post. I've deleted the duplicate postings (which, in itself, was painful...)
Larry W. Virden wrote: > On May 23, 9:32 am, Uwe Klein <uwe_klein_habertw @t-online.de> > wrote: >> Hey larry, >> I am a bit slow on occasion. >> but not that slow that I would need 4 reminders ;-)) > Sigh - google groups kept saying "I'm sorry, but your posting failed; > try again later." I finally gave up hoping that the item would post. > I've deleted the duplicate postings (which, in itself, was painful...)
you're not the only one, I've seen multiple instances of multiple posts today. bruce
Donal K. Fellows wrote: >> comparison implemented in C and not equal (ne) should be faster than >> equal (eq). > It really makes no difference;
You are obviously right. My brain must have gone to bed some time earlier than me yesterday... Stephan
In article <T%I4i.38712$Ug.33@fe1.news.blueyonder.co.uk>, Donal K. Fellows <donal.k.fell@manchester.ac.uk> wrote:
>Stephan Kuhagen wrote:
>> In pure Tcl I would guess that
>> if { $string1 ne $string2 } {
>> not_equal_something
>> }
>> is the fastest comparison of two strings, because "ne" is direct string
>> comparison implemented in C and not equal (ne) should be faster than equal
>> (eq).
>It really makes no difference; it just inverts the sense of a test. On
>the other hand, the eq/ne operators are very fast since they check for
>obvious stuff first and do things that modern processors like to do
>anyway. The gripping hand is that [string equal] is normally compiled to
         ^^^^^^^^^^^^^^^^^
>the same bytecode anyway; there's no penalty for verbosity.
More of the watchmaker's work, no doubt? :-) MH
MH wrote:
> Donal K. Fellows wrote:
>> It really makes no difference; it just inverts the sense of a test. On
>> the other hand, the eq/ne operators are very fast since they check for
>> obvious stuff first and do things that modern processors like to do
>> anyway. The gripping hand is that [string equal] is normally compiled to
>          ^^^^^^^^^^^^^^^^^
>> the same bytecode anyway; there's no penalty for verbosity.
> More of the watchmaker's work, no doubt? :-)
It's been motied about that that might be the case. Donal.