|
|
 |
 |
 |
 |
_efficiently_ obtain n-th word from a std::string
Hi! I need a routine like: std::string nth_word(const std::string &s, unsigned int n) { // return n-th word from the string, n is 0-based // if 's' contains too few words, return "" // 'words' are any sequences of non-whitespace characters // leading, trailing and multiple whitespace characters // should be ignored. // eg. "These are four\t\twords\t\t". }
I am currenlty using something like: std::string nth_word(const std::string& source, unsigned int n) { // the addition of " " allows for the extraction of the last // word, after which ss would go eof() below if not for the space stringstream ss(source+" "); string s; for(unsigned int k=0;k<=n;k++) { ss >> s; if(!ss.good()) return ""; // eof } return s; }
which is fine, except it performs poorly. Before I'm flamed with accusations of premature optimization, let me tell you that I profiled my code and over 50% of time is spent in this routine. This does not surprise me -- I am extracting words from text files in the order of GB and it takes annoyingly long... I'm thinking of a combination of find_first_not_of and find_first_of, but before I code it, perhaps somebody can comment on this? I have a gut feeling that some nasty strtok hack would be even faster, would it? Or is there perhaps some other, performance-oriented way like traversing s.c_str() with a pointer and memcpying out the relevant part? TIA, - J.
Jacek Dziedzic wrote: > Hi! > I need a routine like: > std::string nth_word(const std::string &s, unsigned int n) { > // return n-th word from the string, n is 0-based > // if 's' contains too few words, return "" > // 'words' are any sequences of non-whitespace characters > // leading, trailing and multiple whitespace characters > // should be ignored. > // eg. "These are four\t\twords\t\t". > } > I am currenlty using something like: > std::string nth_word(const std::string& source, unsigned int n) { > // the addition of " " allows for the extraction of the last > // word, after which ss would go eof() below if not for the space > stringstream ss(source+" "); > string s; > for(unsigned int k=0;k<=n;k++) { > ss >> s; > if(!ss.good()) return ""; // eof > } > return s; > } > which is fine, except it performs poorly. Before I'm flamed > with accusations of premature optimization, let me tell you > that I profiled my code and over 50% of time is spent in this > routine. This does not surprise me -- I am extracting words > from text files in the order of GB and it takes annoyingly > long... > I'm thinking of a combination of find_first_not_of and > find_first_of, but before I code it, perhaps somebody can > comment on this? I have a gut feeling that some nasty > strtok hack would be even faster, would it? Or is there > perhaps some other, performance-oriented way like traversing > s.c_str() with a pointer and memcpying out the relevant part?
Nah. If efficiency is critical, I have found the istream interface a bad solution. Nothing like a state machine. // warning - brain dump - not compiled ever - not tested - use at your own peril #include <cctype> std::string nth_word(const std::string& source, unsigned int n) { enum state_type { inspace, inword }; std::string::const_iterator l_iter = source.begin(); const std::string::const_iterator l_end = source.end(); if ( l_end == l_iter ) { return ""; } std::string::const_iterator l_beg_word = l_iter; state_type state = ( std::isspace( * l_iter ) ) ? inspace : inword; int word_count = state == inspace ? 0 : 1; ++ l_iter; while ( l_end != l_iter ) { switch ( state ) { case inspace : if ( std::isspace( * l_iter ) ) break; state = inword; l_beg_word = l_iter; ++ word_count; break; case inword : if ( ! std::isspace( * l_iter ) ) break; state = inspace; if ( n == word_count ) { return std::string( l_beg_word, l_iter ); } break; } ++ l_iter; } switch ( state ) { case inspace : return ""; case inword : if ( n == word_count ) { return std::string( l_beg_word, l_iter ); } return ""; } return "";
}
Gianni Mariani wrote: > Jacek Dziedzic wrote: ... >> which is fine, except it performs poorly. Before I'm flamed
... oh - I forgot to mention, if you extend the state machine a little, you can apply the state machine to an entire file. Then you can mmap the entire file and never copy the file into and out of buffers.
Hi
Jacek Dziedzic wrote: > I need a routine like: > std::string nth_word(const std::string &s, unsigned int n) { > // return n-th word from the string, n is 0-based > // if 's' contains too few words, return "" > // 'words' are any sequences of non-whitespace characters > // leading, trailing and multiple whitespace characters > // should be ignored. > // eg. "These are four\t\twords\t\t". > } > I am currenlty using something like: > std::string nth_word(const std::string& source, unsigned int n) { > // the addition of " " allows for the extraction of the last > // word, after which ss would go eof() below if not for the space > stringstream ss(source+" "); > string s; > for(unsigned int k=0;k<=n;k++) { > ss >> s; > if(!ss.good()) return ""; // eof > } > return s; > } > which is fine, except it performs poorly. Before I'm flamed > with accusations of premature optimization, let me tell you > that I profiled my code and over 50% of time is spent in this > routine. This does not surprise me -- I am extracting words > from text files in the order of GB and it takes annoyingly > long...
I'm afraid that your efforts will not help much. If your n-th word starts at position i in your text file, you will have to inspect all preceding positions to know that (well, not entirely true, in some circumstances you can skip a character, but that doesn't really matter). If your text contains billions of characters, that might take very long. The single overhead that could be avoided is copying strings that you don't need at all. It would of course be enough to iterate over the string and count the number of words until you have found the n-th word. If you go architecture dependent, mmx or other streaming instructions might be able to help you (I really don't know... mmx has comparison instructions for 8 packed bytes, but I wonder if this will provide any speed-up, as you would have to check results for each byte) > I'm thinking of a combination of find_first_not_of and > find_first_of, but before I code it, perhaps somebody can > comment on this? I have a gut feeling that some nasty > strtok hack would be even faster, would it? Or is there > perhaps some other, performance-oriented way like traversing > s.c_str() with a pointer and memcpying out the relevant part?
I would not recommend any of those... Use a string::iterator. Inspect the individual characters. Have a current state (word, non-word), update the state according to what you read. Count toggles. Construct a string from the n-th word. Markus
Markus Moll wrote: > Hi > Jacek Dziedzic wrote: >> I need a routine like: >> std::string nth_word(const std::string &s, unsigned int n) { >> // return n-th word from the string, n is 0-based >> // if 's' contains too few words, return "" >> // 'words' are any sequences of non-whitespace characters >> // leading, trailing and multiple whitespace characters >> // should be ignored. >> // eg. "These are four\t\twords\t\t". >> } >> I am currenlty using something like: >> std::string nth_word(const std::string& source, unsigned int n) { >> // the addition of " " allows for the extraction of the last >> // word, after which ss would go eof() below if not for the space >> stringstream ss(source+" "); >> string s; >> for(unsigned int k=0;k<=n;k++) { >> ss >> s; >> if(!ss.good()) return ""; // eof >> } >> return s; >> } >> which is fine, except it performs poorly. Before I'm flamed >> with accusations of premature optimization, let me tell you >> that I profiled my code and over 50% of time is spent in this >> routine. This does not surprise me -- I am extracting words >> from text files in the order of GB and it takes annoyingly >> long... > I'm afraid that your efforts will not help much. > If your n-th word starts at position i in your text file, you will have > to inspect all preceding positions to know that (well, not entirely > true, in some circumstances you can skip a character, but that doesn't > really matter). If your text contains billions of characters, that might > take very long.
Fortunately I'm only splitting individual lines into words. I don't need to search through the file or anything like that, only something like for all lines in the file { depending on word 0 in the line, extract some other word from the line convert this word to a double, operate on it, output result } > The single overhead that could be avoided is copying strings that you > don't need at all. It would of course be enough to iterate over the > string and count the number of words until you have found the n-th word. > If you go architecture dependent, mmx or other streaming instructions > might be able to help you (I really don't know... mmx has comparison > instructions for 8 packed bytes, but I wonder if this will provide any > speed-up, as you would have to check results for each byte)
I'd like that to be portable. >> I'm thinking of a combination of find_first_not_of and >> find_first_of, but before I code it, perhaps somebody can >> comment on this? I have a gut feeling that some nasty >> strtok hack would be even faster, would it? Or is there >> perhaps some other, performance-oriented way like traversing >> s.c_str() with a pointer and memcpying out the relevant part? > I would not recommend any of those... > Use a string::iterator. Inspect the individual characters. Have a > current state (word, non-word), update the state according to what you > read. Count toggles. Construct a string from the n-th word.
So, Gianni's approach. I will try that, thanks. - J.
Gianni Mariani wrote: > Gianni Mariani wrote: >> Jacek Dziedzic wrote: > ... >>> which is fine, except it performs poorly. Before I'm flamed > ... > oh - I forgot to mention, if you extend the state machine a little, you > can apply the state machine to an entire file. Then you can mmap the > entire file and never copy the file into and out of buffers.
thanks, I'll try that, - J.
Jacek Dziedzic wrote: > Markus Moll wrote: .... >> I'm afraid that your efforts will not help much. >> If your n-th word starts at position i in your text file, you will >> have to inspect all preceding positions to know that (well, not >> entirely true, in some circumstances you can skip a character, but >> that doesn't really matter). If your text contains billions of >> characters, that might take very long. > Fortunately I'm only splitting individual lines into words. > I don't need to search through the file or anything like that, > only something like > for all lines in the file { > depending on word 0 in the line, extract some other word from the line > convert this word to a double, operate on it, output result > }
sounds like a job for a database. select v4 where v0 = "THIS ONE"; Index on v0 and you can find all your values very quickly. Try dumping your gigabytes of data into Postgresql or maybe sqllite. I've used postgresql with a few hunded million rows and a few "appropriate" indexes and the answers pop out rather quickly.
Gianni Mariani wrote: > Gianni Mariani wrote: >> Jacek Dziedzic wrote: > ... >>> which is fine, except it performs poorly. Before I'm flamed > ... > oh - I forgot to mention, if you extend the state machine a little, you > can apply the state machine to an entire file. Then you can mmap the > entire file and never copy the file into and out of buffers.
And i the performances are really the problem, this would be much better, even if the program structure can suffer a little bit. The string is totally useless, and each new string is a "new", end each "new" is waste of time. Try using an istream_iterator to directly iterate through your input file in order to implement your state machine. Regards, Zeppe
Gianni Mariani wrote: > Jacek Dziedzic wrote: >> Markus Moll wrote: > .... >>> I'm afraid that your efforts will not help much. >>> If your n-th word starts at position i in your text file, you will >>> have to inspect all preceding positions to know that (well, not >>> entirely true, in some circumstances you can skip a character, but >>> that doesn't really matter). If your text contains billions of >>> characters, that might take very long. >> Fortunately I'm only splitting individual lines into words. >> I don't need to search through the file or anything like that, >> only something like >> for all lines in the file { >> depending on word 0 in the line, extract some other word from the line >> convert this word to a double, operate on it, output result >> } > sounds like a job for a database. > select v4 where v0 = "THIS ONE"; > Index on v0 and you can find all your values very quickly. > Try dumping your gigabytes of data into Postgresql or maybe sqllite. > I've used postgresql with a few hunded million rows and a few > "appropriate" indexes and the answers pop out rather quickly.
Thanks for the advice, but that would be way overkill. What I do is perform postprocessing on raw results of computer simulations. I usually have to process several tens of files, each approx 1GB in size. These files are only processed _once_ (if I don't mess up) so storing them inside a database would not really help. Anyway, people around me would give me weird looks if I wanted to put all this GBs of numbers inside a database, I am a part of a small minority of non-Fortran users here :), co C++ is weird enough :/. thanks, - J.
On May 9, 1:18 pm, Jacek Dziedzic
<jacek.dziedzic.n.o.s.p. @gmail.com> wrote: > I need a routine like: > std::string nth_word(const std::string &s, unsigned int n) { > // return n-th word from the string, n is 0-based > // if 's' contains too few words, return "" > // 'words' are any sequences of non-whitespace characters > // leading, trailing and multiple whitespace characters > // should be ignored. > // eg. "These are four\t\twords\t\t". > } > I am currenlty using something like: > std::string nth_word(const std::string& source, unsigned int n) { > // the addition of " " allows for the extraction of the last > // word, after which ss would go eof() below if not for the space > stringstream ss(source+" "); > string s; > for(unsigned int k=0;k<=n;k++) { > ss >> s; > if(!ss.good()) return ""; // eof Just a detail, but good() may be false after reading the last word correctly. What you want here is: if ( ! ss ) { return "" ; } (If you do this, you don't have to add the extra space at the end of the initializer of ss.) Even better would be to move the condition up into the loop condition. Something like: std::istringstream ss( source ) ; std::string s ; while ( n > 0 && ss >> s ) { -- n ; } return ss ? s : std::string() ; > } > return s; > } > which is fine, except it performs poorly. Before I'm flamed > with accusations of premature optimization, let me tell you > that I profiled my code and over 50% of time is spent in this > routine. This does not surprise me -- I am extracting words > from text files in the order of GB and it takes annoyingly > long... > I'm thinking of a combination of find_first_not_of and > find_first_of, but before I code it, perhaps somebody can > comment on this? I have a gut feeling that some nasty > strtok hack would be even faster, would it? Or is there > perhaps some other, performance-oriented way like traversing > s.c_str() with a pointer and memcpying out the relevant part?
For starters, you say you're processing a text file. Do you offen call nth_word on the same string, with different values of n? If so, you're doing a lot of duplicate work; if extract every word, you've basically changed an O(n) algorithm into an O(n^2). If performance is an issue, that's the first thing I'd consider. Other than that, std::istringstream does a lot of copying. I've occasionally used something like: template< std::ctype_base::mask mask > class CTypeIs : public std::unary_function< char, bool > { public: typedef std::ctype< char > CType ; CTypeIs( std::locale const& l = std::locale() ) : myCType( &std::use_facet< CType >( l ) ) { } bool operator()( char ch ) const { return myCType->is( mask, ch ) ; } private: CType const* myCType ; } ; void split( std::vector< std::string >& dest, std::string const& source ) { static CTypeIs< std::ctype_base::space > const isBlank ; dest.clear() ; std::string::const_iterator end = source.end() ; std::string::const_iterator current = std::find_if( source.begin(), end, std::not1( isBlank ) ) ; while ( current != end ) { std::string::const_iterator start = current ; current = std::find_if( current, end, isBlank ) ; dest.push_back( std::string( start, current ) ) ; current = std::find_if( current, end, std::not1( isBlank ) ) ; } } to break up a line into fields. CTypeIs is, of course, in my usual tools library, so I don't have to write it every time. Note too that it does nothing to guarantee the lifetime of the facet it is using---this is up to the caller, but in practice is almost never a problem. Also, the actual object is a local variable, and so won't be constructed until the function is actually called. Thus giving the user time to set the global locale. Depending on what you are doing with the words, building the vector may or may not be necessary as well. In the end, if you can work directly with the two iterators which define the word, you can avoid any copy what so ever. -- James Kanze (GABI Software) email:james.ka@gmail.com Conseils en informatique oriente objet/ Beratung in objektorientierter Datenverarbeitung 9 place Smard, 78210 St.-Cyr-l'cole, France, +33 (0)1 30 23 00 34
On May 9, 3:07 pm, Gianni Mariani <gi3nos@mariani.ws> wrote: [...] > state_type state = ( std::isspace( * l_iter ) ) ? inspace : inword;
Just a reminder. The call to isspace here is undefined behavior, and in practice, doesn't give correct results on most machines. It's generally preferrable to use <locale>, but at the very least, you have to cast: std::isspace( static_cast< unsigned char >( *l_iter ) ) ; -- James Kanze (GABI Software) email:james.ka@gmail.com Conseils en informatique oriente objet/ Beratung in objektorientierter Datenverarbeitung 9 place Smard, 78210 St.-Cyr-l'cole, France, +33 (0)1 30 23 00 34
James Kanze wrote:
> [...] thank you, - J.
|
 |
 |
 |
 |
|