> -----Original Message-----
> From: python-list-bounces+brandon.mcginty=gmail.@python.org
> [mailto:python-list-bounces+brandon.mcginty=gmail.@python.org] On Behalf
> Of Steve Holden
> Sent: Tuesday, May 29, 2007 11:30 PM
> To: python-l@python.org
> Subject: Re: RDFXML Parser For "qualified Dublin Core" Database File
> Brandon McGinty wrote:
>> Hi All,
>> My goal is to be able to read the www.gutenberg.org
>> <http://www.gutenberg.org/> rdf catalog, parse it into a python
>> structure, and pull out data for each record.
>> The catalog is a Dublin core RDF/XML catalog, divided into sections
>> for each book and details for that book.
>> I have done a very large amount of research on this problem.
>> I've tried tools such as pyrple, sax/dom/minidom, and some others both
>> standard and nonstandard to a python installation.
>> None of the tools has been able to read this file successfully, and
>> those that can even see the data can take up to half an hour to load
>> with 2 gb of ram.
>> So you all know what I'm talking about, the file is located at:
>> http://www.gutenberg.org/feeds/catalog.rdf.bz2
>> Does anyone have suggestions for a parser or converter, so I'd be able
>> to view this file, and extract data?
>> Any help is appreciated.
> Well, have you tried xml.etree.cElementTree, a part of the standard library
> since 2.5? Well worth a go, as it seems to outperform many XML libraries.
> The iterparse function is your best bet, allowing you to iterate over the
> events as you parse the source, thus avoiding the need to build a huge
> in-memory data structure just to get the parsing done.
> The following program took about four minutes to run on my not-terribly
> up-to-date Windows laptop with 1.5 GB of memory with the pure Python version
> of ElementTree:
> import xml.etree.ElementTree as ET
> events = ET.iterparse(open("catalog.rdf"))
> count = 0
> for e in events:
>     count += 1
>     if count % 100000 == 0:
>         print count
> print count, "total events"
> Here's an example output after I changed to using the extension module - by
> default, only the end-element events are reported. I think you'll be
> impressed by the timing. The only change was to the import statement, which
> now reads
> import xml.etree.cElementTree as ET
> sholden@bigboy ~/Projects/Python
> $ time python test19.py
> 100000
> 200000
> 300000
> 400000
> 500000
> 600000
> 700000
> 800000
> 900000
> 1000000
> 1100000
> 1200000
> 1300000
> 1400000
> 1469971 total events
> real 0m11.145s
> user 0m10.124s
> sys 0m0.580s
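Counting events shows the parse is fast, but the original goal was to pull out data for each record. Here is a minimal sketch of how that might look with iterparse, written for current Python 3, where xml.etree.ElementTree picks up the C accelerator on its own and xml.etree.cElementTree no longer exists (removed in 3.9). The sample document, the "ex" namespace, the "etext" record element, and the iter_records helper are all assumptions made up for illustration; the element names in the real catalog would need to be checked and substituted.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the real catalog; the "ex" namespace
# and the "etext" element name are assumptions for illustration only.
SAMPLE = b"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ex="http://example.org/terms/">
  <ex:etext rdf:ID="etext1">
    <dc:title>First Book</dc:title>
    <dc:creator>Author One</dc:creator>
  </ex:etext>
  <ex:etext rdf:ID="etext2">
    <dc:title>Second Book</dc:title>
  </ex:etext>
</rdf:RDF>
"""

RECORD_TAG = "{http://example.org/terms/}etext"  # assumed record element
DC = "{http://purl.org/dc/elements/1.1/}"        # Dublin Core namespace


def iter_records(source, record_tag=RECORD_TAG):
    """Yield one dict of Dublin Core fields per record element."""
    # Only "end" events are requested, so each element is complete
    # (text and children parsed) by the time we see it.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        record = {}
        for child in elem:
            if child.tag.startswith(DC):
                # Strip the namespace prefix, e.g. "{...}title" -> "title".
                record[child.tag[len(DC):]] = (child.text or "").strip()
        yield record
        # Free the record's children and text once it has been consumed;
        # the emptied element itself stays attached to the root.
        elem.clear()


records = list(iter_records(io.BytesIO(SAMPLE)))
for record in records:
    print(record)
```

For the real file, the same function could be handed an open file object (for example one returned by bz2.open on the downloaded archive) instead of the in-memory sample. Repeated fields such as multiple creators would overwrite each other in this sketch, so a list per key may be the better choice there.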
> Good luck!
> ------- End Original Message --------
> Hi,