
Python Programming Language

RDFXML Parser For "qualified Dublin Core" Database File


Well, have you tried xml.etree.cElementTree, a part of the standard
library since 2.5? Well worth a go, as it seems to outperform many XML
libraries.

The iterparse function is your best bet, allowing you to iterate over
the events as you parse the source, thus avoiding the need to build a
huge in-memory data structure just to get the parsing done.

The following program took about four minutes to run on my not-terribly
up-to-date Windows laptop with 1.5 GB of memory with the pure Python
version of ElementTree:

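import xml.etree.ElementTree as ET

# Stream parse events instead of loading the whole document first.
events = ET.iterparse(open("catalog.rdf", "rb"))
count = 0
for event, elem in events:
    count += 1
    if count % 100000 == 0:
        print(count)  # progress marker every 100,000 events
print("%d total events" % count)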

Here's the output after I switched to the C extension module. Note that,
by default, only the end-element events are reported. I think you'll be
impressed by the timing. The only change was to the import statement,
which now reads

import xml.etree.cElementTree as ET

sholden@bigboy ~/Projects/Python
$ time python test19.py
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
1100000
1200000
1300000
1400000
1469971 total events

real    0m11.145s
user    0m10.124s
sys     0m0.580s
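
To make the remark about end-element events concrete, here is a minimal
sketch. The tiny SAMPLE document and the event_names() helper are made up
for illustration and are not part of the test program above. It is written
for current Python 3, where xml.etree.ElementTree picks up the C
accelerator automatically (the separate cElementTree module was removed
in 3.9).

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in document; the real catalog.rdf is far larger.
SAMPLE = b"<root><item>a</item><item>b</item></root>"

def event_names(events=None):
    """Return the (event, tag) pairs iterparse reports for SAMPLE."""
    source = io.BytesIO(SAMPLE)
    return [(event, elem.tag)
            for event, elem in ET.iterparse(source, events=events)]

# With no events argument, only "end" events are reported.
default_events = event_names()

# Asking for both kinds reports each element twice: once opened, once closed.
both_events = event_names(("start", "end"))

print(default_events)
print(both_events)
```

An element is only guaranteed to be complete (text and children included)
when its "end" event arrives, which is why the default is usually what
you want.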

Good luck!

regards
  Steve
--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC/Ltd           http://www.holdenweb.com
Skype: holdenweb      http://del.icio.us/steve.holden
------------------ Asciimercial ---------------------
Get on the web: Blog, lens and tag your way to fame!!
holdenweb.blogspot.com        squidoo.com/pythonology
tagged items:         del.icio.us/steve.holden/python
All these services currently offer free registration!
-------------- Thank You for Reading ----------------

Brandon McGinty wrote:

[actually he top-posted, but I have moved his comments down
because he probably doesn't realise it's naughty]

 > Hi,
 > Thanks for the info. The speed is fantastic. 58 mb in under 15 sec, just as
 > shown.
 > I did notice that in this rdf file, I can't do any sort of find or findall.
 > I haven't been able to find any documentation on how to search. For
 > instance, in perl, one can do a search for "/rdf:RDF/pgterms:etext", and
 > when done in python, with many many variations, find and findall return no
 > results, either when searching from root or children of root.
 >
 >
You should keep the list in the loop, as I am not an ElementTree expert,
simply someone who knew of its capabilities on the basis of some earlier
casual use. I am glad it is working well for you. Memory usage is also
pretty low with the extension module.

You may well find that other readers can advise you about the selective
aspects of processing. I'd have thought find() and findall() would have
been what you want. The more you explain about your requirements the
more help you are likely to get.
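
One likely cause, offered as a guess rather than a diagnosis: ElementTree
stores namespaced tags as "{uri}name", so a bare or prefixed name will not
match unless the namespace URI is supplied. The sketch below uses a made-up
two-record SAMPLE. The pgterms URI is an assumption, so check the
xmlns:pgterms declaration at the top of your own file. Also note that paths
are relative to the element you search from, so the root (rdf:RDF) is not
part of the path.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PG_NS = "http://www.gutenberg.org/rdfterms/"  # assumed; verify in your file

# Hypothetical miniature of the catalog, just enough to search.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:pgterms="http://www.gutenberg.org/rdfterms/">
  <pgterms:etext rdf:ID="etext1"/>
  <pgterms:etext rdf:ID="etext2"/>
</rdf:RDF>"""

root = ET.fromstring(SAMPLE)
print(root.tag)  # the root's tag carries the full namespace URI

# An unqualified name matches nothing, because no element is called "etext".
plain = root.findall("etext")

# Spelling out the URI in braces matches the children of the root.
qualified = root.findall("{%s}etext" % PG_NS)

# Current Python versions also accept a prefix-to-URI mapping.
mapped = root.findall("pgterms:etext", {"pgterms": PG_NS})

# Namespaced attributes use the same "{uri}name" form.
ids = [elem.get("{%s}ID" % RDF_NS) for elem in qualified]
print(plain, len(qualified), len(mapped), ids)
```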

regards
  Steve

Brandon McGinty wrote:
> Hi,
> Thanks for the info. The speed is fantastic. 58 mb in under 15 sec, just as
> shown.
> I did notice that in this rdf file, I can't do any sort of find or findall.
> I haven't been able to find any documentation on how to search. For
> instance, in perl, one can do a search for "/rdf:RDF/pgterms:etext", and
> when done in python, with many many variations, find and findall return no
> results, either when searching from root or children of root.

... one thought comes to mind: you will have to actually build the parse
tree if you want to be able to search it that way. With the technique I
used, you would instead need to process the end events yourself to get
the data you seem to need!
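
A rough sketch of what processing the end events might look like follows.
The SAMPLE data, the iter_titles() helper, and the choice of dc:title as
the field of interest are all invented for illustration, and the pgterms
URI is again an assumption to verify against the real file. Each record is
handled when its closing tag is seen and then cleared, so the full tree
never has to stay populated.

```python
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
PG_NS = "http://www.gutenberg.org/rdfterms/"  # assumed; verify in your file

# Hypothetical miniature of the catalog.
SAMPLE = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:pgterms="http://www.gutenberg.org/rdfterms/">
  <pgterms:etext rdf:ID="etext1"><dc:title>First</dc:title></pgterms:etext>
  <pgterms:etext rdf:ID="etext2"><dc:title>Second</dc:title></pgterms:etext>
</rdf:RDF>"""

def iter_titles(source):
    """Yield (id, title) for each pgterms:etext record as its end tag is seen."""
    etext_tag = "{%s}etext" % PG_NS
    id_attr = "{%s}ID" % RDF_NS
    title_tag = "{%s}title" % DC_NS
    for event, elem in ET.iterparse(source):  # default: "end" events only
        if elem.tag == etext_tag:
            yield elem.get(id_attr), elem.findtext(title_tag)
            elem.clear()  # drop the record's children and attributes after use

titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

For the real file you would pass open("catalog.rdf", "rb") instead of the
BytesIO object. The emptied elements do stay attached to the root, but
they are small compared with the full records.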

regards
  Steve
