Python Programming Language

Parsing HTML


I am trying to parse a webpage and extract information from it, using
pyparsing. Here is what I have:

from pyparsing import *
import urllib

# define basic text pattern
spanStart = Literal('<span class=\"hpPageText\">')

spanEnd = Literal('</span></td>')

printCount = spanStart + SkipTo(spanEnd) + spanEnd

# get printer addresses
printerURL = "http://printer.mydomain.com/hp/device/this.LCDispatcher?nav=hp.Usage"
printerListPage = urllib.urlopen(printerURL)
printerListHTML = printerListPage.read()
printerListPage.close()

for srvrtokens,startloc,endloc in printCount.scanString(printerListHTML):
    print srvrtokens

print printCount

I have the last print statement to check what is being sent because I
am getting nothing back. What it sends is:
{"<span class="hpPageText">" SkipTo:("</span></td>") "</span></td>"}

If I pull out the "hpPageText" I get results back, but more than what
I want. I know it has something to do with escaping the quotation
marks, but I am puzzled as to how to do it.

Thanks,

Mike

On Feb 8, 2:38 pm, "mtuller" <mitul@gmail.com> wrote:

> I am trying to parse a webpage and extract information.

BeautifulSoup is a great Python module for this purpose:

    http://www.crummy.com/software/BeautifulSoup/

Here's an article on screen scraping using it:

    http://iwiwdsmi.blogspot.com/2007/01/how-to-use-python-and-beautiful-...

I was asking how to escape the quotation marks. I have everything
working in pyparsing except for that. I don't want to drop everything
and go to a different parser.

Can someone else help?

On Feb 8, 4:15 pm, "mtuller" <mitul@gmail.com> wrote:
> I was asking how to escape the quotation marks. I have everything
> working in pyparsing except for that. I don't want to drop everything
> and go to a different parser.

> Can someone else help?

Mike -

pyparsing includes a helper for constructing HTML tags called
makeHTMLTags.  This helper does more than just wrap the given tag text
within <>'s: it also handles attributes, upper/lower case, and
various styles of quoted strings.  To use it, replace your Literal
definitions for spanStart and spanEnd with:

spanStart, spanEnd = makeHTMLTags('span')
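
As a rough illustration, here is a minimal sketch of that replacement run against a small made-up HTML fragment (the sample markup and the "text" results name are assumptions for the example, not taken from the real printer page). It is written for Python 3 and a pyparsing release that still provides the camelCase names used in this thread. It also checks that the backslashes in the original Literal are no-ops, which suggests the escaping itself is probably not what prevents a match; small differences between the literal and the page's actual markup are a more likely cause.

```python
from pyparsing import makeHTMLTags, SkipTo

# Inside a single-quoted Python string, \" is just ", so these two
# literals are the same string.
escaped = '<span class=\"hpPageText\">'
plain = '<span class="hpPageText">'
same_literal = (escaped == plain)

# Hypothetical sample markup, standing in for the real printer page.
sample_html = """
<table>
<tr><td><span class="hpPageText">12345</span></td></tr>
<tr><td><SPAN class='hpPageText'>678</SPAN></td></tr>
<tr><td><span class="other">ignore me</span></td></tr>
</table>
"""

# makeHTMLTags returns a (start tag, end tag) pair of expressions.
spanStart, spanEnd = makeHTMLTags('span')

# Name the skipped-over body so it can be pulled out of each match.
printCount = spanStart + SkipTo(spanEnd)("text") + spanEnd

found = [tokens["text"]
         for tokens, start, end in printCount.scanString(sample_html)]

print(same_literal)
print(found)
```

Without any further filtering this picks up every <span>, whatever its class, case, or quoting style.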

If you don't want to match just *any* <span> tag, but only those
with class="hpPageText", then add this parse action to spanStart:

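def onlyAcceptWithTagAttr(attrname,attrval):
    # build a parse action that rejects tags lacking attrname=attrval
    def action(tagAttrs):
        if not(attrname in tagAttrs and tagAttrs[attrname]==attrval):
            # raising ParseException makes this particular match fail
            raise ParseException("",0,"")
    return action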

spanStart.setParseAction(onlyAcceptWithTagAttr("class","hpPageText"))
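
Putting the pieces together, a minimal end-to-end sketch might look like the following. The sample HTML and the "text" results name are made up for illustration, and the code assumes Python 3 with a pyparsing release that still provides the camelCase names.

```python
from pyparsing import makeHTMLTags, SkipTo, ParseException

def onlyAcceptWithTagAttr(attrname, attrval):
    # Build a parse action that rejects tags lacking attrname=attrval.
    def action(tagAttrs):
        if not (attrname in tagAttrs and tagAttrs[attrname] == attrval):
            # Failing here makes scanString move on to the next candidate.
            raise ParseException("", 0, "")
    return action

# Hypothetical sample markup, standing in for the real printer page.
sample_html = """
<table>
<tr><td><span class="hpPageText">12345</span></td></tr>
<tr><td><span class="other">ignore me</span></td></tr>
<tr><td><SPAN class='hpPageText'>678</SPAN></td></tr>
</table>
"""

spanStart, spanEnd = makeHTMLTags('span')
spanStart.setParseAction(onlyAcceptWithTagAttr("class", "hpPageText"))

# Name the skipped-over body so it can be pulled out of each match.
printCount = spanStart + SkipTo(spanEnd)("text") + spanEnd

counts = [tokens["text"]
          for tokens, start, end in printCount.scanString(sample_html)]
print(counts)
```

Only the spans whose class attribute equals "hpPageText" should come back. Later pyparsing releases also ship a withAttribute helper that serves the same purpose as the hand-written parse action.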

-- Paul
