BeautifulSoup::BeautifulStoneSoup Class Reference

Inheritance diagram for BeautifulSoup::BeautifulStoneSoup, showing these classes:
BeautifulSoup::PageElement, BeautifulSoup::Tag, BeautifulSoup::BeautifulSoup, BeautifulSoup::BeautifulSOAP, BeautifulSoup::RobustXMLParser, BeautifulSoup::SimplifyingSOAPParser, BeautifulSoup::ICantBelieveItsBeautifulSoup, BeautifulSoup::MinimalSoup, BeautifulSoup::RobustHTMLParser

Public Member Functions

def __init__
def endData
def extractCharsetFromMeta
def handle_data
def isSelfClosingTag
def popTag
def pushTag
def reset
def unknown_endtag
def unknown_starttag

Public Attributes

 builder
 convertEntities
 convertHTMLEntities
 convertXMLEntities
 currentData
 currentTag
 declaredHTMLEncoding
 escapeUnrecognizedEntities
 fromEncoding
 hidden
 instanceSelfClosingTags
 literal
 markup
 markupMassage
 originalEncoding
 parseOnlyThese
 previous
 quoteStack
 smartQuotesTo
 tagStack

Static Public Attributes

 ALL_ENTITIES = XHTML_ENTITIES
string HTML_ENTITIES = "html"
list MARKUP_MASSAGE
dictionary NESTABLE_TAGS = {}
list PRESERVE_WHITESPACE_TAGS = []
dictionary QUOTE_TAGS = {}
dictionary RESET_NESTING_TAGS = {}
string ROOT_TAG_NAME = u'[document]'
dictionary SELF_CLOSING_TAGS = {}
dictionary STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, }
string XHTML_ENTITIES = "xhtml"
string XML_ENTITIES = "xml"

Private Member Functions

def _feed
def _popToTag
def _smartPop

Detailed Description

This class contains the basic parser and search code. It defines
a parser that knows nothing about tag behavior except for the
following:

  You can't close a tag without closing all the tags it encloses.
  That is, "<foo><bar></foo>" actually means
  "<foo><bar></bar></foo>".

[Another possible explanation is "<foo><bar /></foo>", but since
this class defines no SELF_CLOSING_TAGS, it will never use that
explanation.]

This class is useful for parsing XML or made-up markup languages,
or when BeautifulSoup makes an assumption counter to what you were
expecting.

Definition at line 1120 of file BeautifulSoup.py.

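The nesting rule above can be illustrated with a small standalone sketch. It does not use the BeautifulSoup module; `TOKEN` and `normalize` are hypothetical names introduced here, and the tokenizer only understands bare `<name>` and `</name>` tags. It assumes an end tag closes everything opened since the most recent matching start tag, and that an end tag with no matching start tag is ignored.

```python
import re

# Hypothetical tokenizer: bare start tags, bare end tags, or runs of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w-]*)>|([^<]+)")


def normalize(markup):
    """Return markup in which every enclosed tag is closed before its parent."""
    out, stack = [], []
    for closing, name, text in TOKEN.findall(markup):
        if text:
            out.append(text)
        elif not closing:
            stack.append(name)
            out.append("<%s>" % name)
        elif name in stack:
            # Pop up to and including the most recent matching open tag.
            while True:
                top = stack.pop()
                out.append("</%s>" % top)
                if top == name:
                    break
    # End of input: close whatever is still open.
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)


print(normalize("<foo><bar></foo>"))
```

Because the sketch has no notion of self-closing tags, it always chooses the "<foo><bar></bar></foo>" reading, which matches the behavior described for this class.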

Constructor & Destructor Documentation

def BeautifulSoup::BeautifulStoneSoup::__init__ (   self,
  markup = "",
  parseOnlyThese = None,
  fromEncoding = None,
  markupMassage = True,
  smartQuotesTo = XML_ENTITIES,
  convertEntities = None,
  selfClosingTags = None,
  isHTML = False,
  builder = HTMLParserBuilder 
)
The Soup object is initialized as the 'root tag', and the
provided markup (which can be a string or a file-like object)
is fed into the underlying parser.

HTMLParser will process most bad HTML, and the BeautifulSoup
class has some tricks for dealing with some HTML that kills
HTMLParser, but Beautiful Soup can nonetheless choke or lose data
if your data uses self-closing tags or declarations
incorrectly.

By default, Beautiful Soup uses regexes to sanitize input,
avoiding the vast majority of these problems. If the problems
don't apply to you, pass in False for markupMassage, and
you'll get better performance.

The default parser massage techniques fix the two most common
instances of invalid HTML that choke HTMLParser:

 <br/> (No space between name of closing tag and tag close)
 <! --Comment--> (Extraneous whitespace in declaration)

You can pass in a custom list of (RE object, replace method)
tuples to get Beautiful Soup to scrub your input the way you
want.

Definition at line 1164 of file BeautifulSoup.py.

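The massage step described above can be sketched without the library. The two patterns below are illustrative assumptions written to fix the two cases listed (they are not copied from MARKUP_MASSAGE), and `MASSAGE_RULES` and `massage` are hypothetical names. The loop mirrors the `fix.sub(m, markup)` call shown in `_feed`.

```python
import re

# Assumed rules: (compiled regex, replacement) pairs, applied in order.
MASSAGE_RULES = [
    # "<br/>" -> "<br />": put a space before the closing slash.
    (re.compile(r"(<[^<>]*)/>"), lambda m: m.group(1) + " />"),
    # "<! --Comment-->" -> "<!--Comment-->": drop whitespace after "<!".
    (re.compile(r"<!\s+([^<>]*)>"), lambda m: "<!" + m.group(1) + ">"),
]


def massage(markup, rules=MASSAGE_RULES):
    """Apply each (regex, replacement) pair to the markup in turn."""
    for pattern, replacement in rules:
        markup = pattern.sub(replacement, markup)
    return markup


print(massage("<br/>"))
print(massage("<! --Comment-->"))

# A custom rule list has the same shape, for example newline normalization.
custom_rules = [(re.compile(r"\r\n?"), "\n")]
print(repr(massage("a\r\nb", custom_rules)))
```

A list of this shape is what the markupMassage argument accepts in place of True; passing False skips the step entirely.
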
    def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None,
                 markupMassage=True, smartQuotesTo=XML_ENTITIES,
                 convertEntities=None, selfClosingTags=None, isHTML=False,
                 builder=HTMLParserBuilder):
        """The Soup object is initialized as the 'root tag', and the
        provided markup (which can be a string or a file-like object)
        is fed into the underlying parser.

        HTMLParser will process most bad HTML, and the BeautifulSoup
        class has some tricks for dealing with some HTML that kills
        HTMLParser, but Beautiful Soup can nonetheless choke or lose data
        if your data uses self-closing tags or declarations
        incorrectly.

        By default, Beautiful Soup uses regexes to sanitize input,
        avoiding the vast majority of these problems. If the problems
        don't apply to you, pass in False for markupMassage, and
        you'll get better performance.

        The default parser massage techniques fix the two most common
        instances of invalid HTML that choke HTMLParser:

         <br/> (No space between name of closing tag and tag close)
         <! --Comment--> (Extraneous whitespace in declaration)

        You can pass in a custom list of (RE object, replace method)
        tuples to get Beautiful Soup to scrub your input the way you
        want."""

        self.parseOnlyThese = parseOnlyThese
        self.fromEncoding = fromEncoding
        self.smartQuotesTo = smartQuotesTo
        self.convertEntities = convertEntities
        # Set the rules for how we'll deal with the entities we
        # encounter
        if self.convertEntities:
            # It doesn't make sense to convert encoded characters to
            # entities even while you're converting entities to Unicode.
            # Just convert it all to Unicode.
            self.smartQuotesTo = None
            if convertEntities == self.HTML_ENTITIES:
                self.convertXMLEntities = False
                self.convertHTMLEntities = True
                self.escapeUnrecognizedEntities = True
            elif convertEntities == self.XHTML_ENTITIES:
                self.convertXMLEntities = True
                self.convertHTMLEntities = True
                self.escapeUnrecognizedEntities = False
            elif convertEntities == self.XML_ENTITIES:
                self.convertXMLEntities = True
                self.convertHTMLEntities = False
                self.escapeUnrecognizedEntities = False
        else:
            self.convertXMLEntities = False
            self.convertHTMLEntities = False
            self.escapeUnrecognizedEntities = False

        self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
        self.builder = builder(self)
        self.reset()

        if hasattr(markup, 'read'):        # It's a file-type object.
            markup = markup.read()
        self.markup = markup
        self.markupMassage = markupMassage
        try:
            self._feed(isHTML=isHTML)
        except StopParsing:
            pass
        self.markup = None                 # The markup can now be GCed.
        self.builder = None                # So can the builder.

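The entity-handling branch of the constructor amounts to a lookup from the convertEntities value to three flags. The sketch below restates it as a table; `entity_flags` is a hypothetical helper, and the string values come from the static attributes listed on this page. One labeled difference: for a truthy value that is not one of the three constants, the constructor leaves the flags unset, whereas this sketch returns all False.

```python
HTML_ENTITIES = "html"
XHTML_ENTITIES = "xhtml"
XML_ENTITIES = "xml"

# (convertXMLEntities, convertHTMLEntities, escapeUnrecognizedEntities)
_FLAGS = {
    HTML_ENTITIES: (False, True, True),
    XHTML_ENTITIES: (True, True, False),
    XML_ENTITIES: (True, False, False),
}


def entity_flags(convert_entities):
    """Map a convertEntities value to the three flags set in __init__.

    Assumption of this sketch: unknown values fall back to all False.
    """
    return _FLAGS.get(convert_entities, (False, False, False))


print(entity_flags(XHTML_ENTITIES))
print(entity_flags(None))
```

Note that whenever convertEntities is truthy the constructor also resets smartQuotesTo to None, which this table does not model.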

Member Function Documentation

def BeautifulSoup::BeautifulStoneSoup::_feed (   self,
  inDocumentEncoding = None,
  isHTML = False 
) [private]

Definition at line 1236 of file BeautifulSoup.py.

    def _feed(self, inDocumentEncoding=None, isHTML=False):
        # Convert the document to Unicode.
        markup = self.markup
        if isinstance(markup, unicode):
            if not hasattr(self, 'originalEncoding'):
                self.originalEncoding = None
        else:
            dammit = UnicodeDammit\
                     (markup, [self.fromEncoding, inDocumentEncoding],
                      smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
            markup = dammit.unicode
            self.originalEncoding = dammit.originalEncoding
            self.declaredHTMLEncoding = dammit.declaredHTMLEncoding
        if markup:
            if self.markupMassage:
                if not isList(self.markupMassage):
                    self.markupMassage = self.MARKUP_MASSAGE
                for fix, m in self.markupMassage:
                    markup = fix.sub(m, markup)
                # TODO: We get rid of markupMassage so that the
                # soup object can be deepcopied later on. Some
                # Python installations can't copy regexes. If anyone
                # was relying on the existence of markupMassage, this
                # might cause problems.
                del(self.markupMassage)
        self.builder.reset()

        self.builder.feed(markup)
        # Close out any unfinished strings and close all the open tags.
        self.endData()
        while self.currentTag.name != self.ROOT_TAG_NAME:
            self.popTag()

def BeautifulSoup::BeautifulStoneSoup::_popToTag (   self,
  name,
  inclusivePop = True 
) [private]
Pops the tag stack up to and including the most recent
instance of the given tag. If inclusivePop is false, pops the tag
stack up to but *not* including the most recent instance of
the given tag.

Definition at line 1329 of file BeautifulSoup.py.

    def _popToTag(self, name, inclusivePop=True):
        """Pops the tag stack up to and including the most recent
        instance of the given tag. If inclusivePop is false, pops the tag
        stack up to but *not* including the most recent instance of
        the given tag."""
        #print "Popping to %s" % name
        if name == self.ROOT_TAG_NAME:
            return

        numPops = 0
        mostRecentTag = None
        for i in range(len(self.tagStack)-1, 0, -1):
            if name == self.tagStack[i].name:
                numPops = len(self.tagStack)-i
                break
        if not inclusivePop:
            numPops = numPops - 1

        for i in range(0, numPops):
            mostRecentTag = self.popTag()
        return mostRecentTag

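The pop-count arithmetic can be tried on a plain list of tag names. This is a standalone sketch, not the library method: `pop_to_tag` is a hypothetical function that works on strings rather than Tag objects and returns the remaining stack instead of the most recently popped tag.

```python
def pop_to_tag(tag_stack, name, inclusive_pop=True, root="[document]"):
    """Pop names off tag_stack down to the most recent occurrence of name."""
    if name == root:
        return tag_stack

    num_pops = 0
    # Search from the top of the stack; index 0 (the root) is never examined.
    for i in range(len(tag_stack) - 1, 0, -1):
        if tag_stack[i] == name:
            num_pops = len(tag_stack) - i
            break
    if not inclusive_pop:
        num_pops -= 1

    # A non-positive count (tag not found) pops nothing.
    for _ in range(num_pops):
        tag_stack.pop()
    return tag_stack


print(pop_to_tag(["[document]", "foo", "bar"], "foo"))
print(pop_to_tag(["[document]", "foo", "bar"], "foo", inclusive_pop=False))
```

With inclusive_pop left at True, the matching tag itself is removed; with False it stays on top of the stack.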
def BeautifulSoup::BeautifulStoneSoup::_smartPop (   self,
  name 
) [private]
We need to pop up to the previous tag of this type, unless
one of this tag's nesting reset triggers comes between this
tag and the previous tag of this type, OR unless this tag is a
generic nesting trigger and another generic nesting trigger
comes between this tag and the previous tag of this type.

Examples:
 <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
 <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
 <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.

 <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
 <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
 <td><tr><td> *<td>* should pop to 'tr', not the first 'td'

Definition at line 1351 of file BeautifulSoup.py.

    def _smartPop(self, name):

        """We need to pop up to the previous tag of this type, unless
        one of this tag's nesting reset triggers comes between this
        tag and the previous tag of this type, OR unless this tag is a
        generic nesting trigger and another generic nesting trigger
        comes between this tag and the previous tag of this type.

        Examples:
         <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
         <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
         <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.

         <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
         <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
         <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
        """

        nestingResetTriggers = self.NESTABLE_TAGS.get(name)
        isNestable = nestingResetTriggers != None
        isResetNesting = self.RESET_NESTING_TAGS.has_key(name)
        popTo = None
        inclusive = True
        for i in range(len(self.tagStack)-1, 0, -1):
            p = self.tagStack[i]
            if (not p or p.name == name) and not isNestable:
                #Non-nestable tags get popped to the top or to their
                #last occurrence.
                popTo = name
                break
            if (nestingResetTriggers != None
                and p.name in nestingResetTriggers) \
                or (nestingResetTriggers == None and isResetNesting
                    and self.RESET_NESTING_TAGS.has_key(p.name)):

                #If we encounter one of the nesting reset triggers
                #peculiar to this tag, or we encounter another tag
                #that causes nesting to reset, pop up to but not
                #including that tag.
                popTo = p.name
                inclusive = False
                break
            p = p.parent
        if popTo:
            self._popToTag(popTo, inclusive)

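The docstring examples can be checked against a standalone sketch of the decision logic. In BeautifulStoneSoup itself both NESTABLE_TAGS and RESET_NESTING_TAGS are empty, so the tables below are hypothetical values chosen only to reproduce the examples; `smart_pop_target` is likewise an invented name. It returns the `(popTo, inclusive)` pair that would be handed to `_popToTag`, or None when nothing should be popped.

```python
# Hypothetical tables, chosen to match the docstring examples.
NESTABLE_TAGS = {
    "li": ["ul", "ol"],
    "tr": ["table", "tbody", "tfoot", "thead"],
    "td": ["tr"],
}
RESET_NESTING_TAGS = {"p", "table", "tr", "td", "ul", "ol", "li"}


def smart_pop_target(tag_stack, name):
    """Decide how far to pop before opening a new tag called name."""
    triggers = NESTABLE_TAGS.get(name)
    is_nestable = triggers is not None
    is_reset_nesting = name in RESET_NESTING_TAGS
    # Walk from the innermost open tag outward, skipping the root at index 0.
    for i in range(len(tag_stack) - 1, 0, -1):
        open_name = tag_stack[i]
        if open_name == name and not is_nestable:
            # Non-nestable tag: close the previous tag of the same type.
            return name, True
        if (is_nestable and open_name in triggers) or (
            not is_nestable
            and is_reset_nesting
            and open_name in RESET_NESTING_TAGS
        ):
            # A reset trigger intervenes: pop up to but not including it.
            return open_name, False
    return None


print(smart_pop_target(["[document]", "p", "b"], "p"))
print(smart_pop_target(["[document]", "li", "ul", "li"], "li"))
```

The first call models `<p>Foo<b>Bar` followed by a new `<p>`, and the second models `<li><ul><li>` followed by a new `<li>`.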
def BeautifulSoup::BeautifulStoneSoup::endData (   self,
  containerClass = NavigableString 
)

Definition at line 1306 of file BeautifulSoup.py.

    def endData(self, containerClass=NavigableString):
        if self.currentData:
            currentData = u''.join(self.currentData)
            if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
                not set([tag.name for tag in self.tagStack]).intersection(
                    self.PRESERVE_WHITESPACE_TAGS)):
                if '\n' in currentData:
                    currentData = '\n'
                else:
                    currentData = ' '
            self.currentData = []
            if self.parseOnlyThese and len(self.tagStack) <= 1 and \
                   (not self.parseOnlyThese.text or \
                    not self.parseOnlyThese.search(currentData)):
                return
            o = containerClass(currentData)
            o.setup(self.currentTag, self.previous)
            if self.previous:
                self.previous.next = o
            self.previous = o
            self.currentTag.contents.append(o)

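The whitespace handling in endData can be isolated as a small function. This is a sketch in Python 3 rather than the library code: `collapse_whitespace` is a hypothetical name, tags are represented by their names, and the caller is assumed to skip empty data, as the `if self.currentData:` guard does. The translation table is the STRIP_ASCII_SPACES value listed among the static attributes.

```python
STRIP_ASCII_SPACES = {9: None, 10: None, 12: None, 13: None, 32: None}


def collapse_whitespace(chunks, open_tag_names, preserve_tags=()):
    """Join buffered text; shrink whitespace-only text to one character."""
    data = "".join(chunks)
    only_spaces = data.translate(STRIP_ASCII_SPACES) == ""
    inside_preserving_tag = set(open_tag_names).intersection(preserve_tags)
    if only_spaces and not inside_preserving_tag:
        # Keep a newline if there was one, otherwise a single space.
        return "\n" if "\n" in data else " "
    return data


print(repr(collapse_whitespace(["  ", "\n\t"], ["[document]", "p"])))
print(repr(collapse_whitespace(["  \n "], ["pre"], preserve_tags=("pre",))))
```

Text that contains any non-whitespace character is returned unchanged, and so is whitespace found inside a tag named in preserve_tags.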
def BeautifulSoup::BeautifulStoneSoup::extractCharsetFromMeta (   self,
  attrs 
)

Reimplemented in BeautifulSoup::BeautifulSoup.

Definition at line 1443 of file BeautifulSoup.py.

    def extractCharsetFromMeta(self, attrs):
        self.unknown_starttag('meta', attrs)

def BeautifulSoup::BeautifulStoneSoup::handle_data (   self,
  data 
)

Definition at line 1440 of file BeautifulSoup.py.

    def handle_data(self, data):
        self.currentData.append(data)

def BeautifulSoup::BeautifulStoneSoup::isSelfClosingTag (   self,
  name 
)
Returns true iff the given string is the name of a
self-closing tag according to this parser.

Definition at line 1269 of file BeautifulSoup.py.

    def isSelfClosingTag(self, name):
        """Returns true iff the given string is the name of a
        self-closing tag according to this parser."""
        return self.SELF_CLOSING_TAGS.has_key(name) \
               or self.instanceSelfClosingTags.has_key(name)

def BeautifulSoup::BeautifulStoneSoup::popTag (   self)

Reimplemented in BeautifulSoup::BeautifulSOAP.

Definition at line 1285 of file BeautifulSoup.py.

    def popTag(self):
        tag = self.tagStack.pop()
        # Tags with just one string-owning child get the child as a
        # 'string' property, so that soup.tag.string is shorthand for
        # soup.tag.contents[0]
        if len(self.currentTag.contents) == 1 and \
           isinstance(self.currentTag.contents[0], NavigableString):
            self.currentTag.string = self.currentTag.contents[0]

        #print "Pop", tag.name
        if self.tagStack:
            self.currentTag = self.tagStack[-1]
        return self.currentTag

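The `string` shorthand set in popTag can be seen in a stripped-down model of the tag stack. `Node` and `TagStack` are hypothetical stand-ins for Tag and the parser, and plain `str` values stand in for NavigableString; the sketch only mirrors the push and pop bookkeeping shown on this page.

```python
class Node:
    """Hypothetical stand-in for a Tag: a name, children, and a shortcut."""

    def __init__(self, name):
        self.name = name
        self.contents = []
        self.string = None


class TagStack:
    """Hypothetical stand-in for the parser's tagStack/currentTag pair."""

    def __init__(self):
        self.root = Node("[document]")
        self.tag_stack = [self.root]
        self.current = self.root

    def push(self, node):
        self.current.contents.append(node)
        self.tag_stack.append(node)
        self.current = node

    def pop(self):
        self.tag_stack.pop()
        # A tag whose only child is a string exposes it as .string.
        closed = self.current
        if len(closed.contents) == 1 and isinstance(closed.contents[0], str):
            closed.string = closed.contents[0]
        if self.tag_stack:
            self.current = self.tag_stack[-1]
        return self.current


stack = TagStack()
title = Node("title")
stack.push(title)
title.contents.append("Hello")
parent = stack.pop()
print(title.string, parent.name)
```

After the pop, the closed tag carries its single string child as `string`, and the returned node is the new current tag.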
def BeautifulSoup::BeautifulStoneSoup::pushTag (   self,
  tag 
)

Definition at line 1299 of file BeautifulSoup.py.

    def pushTag(self, tag):
        #print "Push", tag.name
        if self.currentTag:
            self.currentTag.contents.append(tag)
        self.tagStack.append(tag)
        self.currentTag = self.tagStack[-1]

def BeautifulSoup::BeautifulStoneSoup::reset (   self)

Definition at line 1275 of file BeautifulSoup.py.

01276                    :
01277         Tag.__init__(self, self, self.ROOT_TAG_NAME)
01278         self.hidden = 1
01279         self.builder.reset()
01280         self.currentData = []
01281         self.currentTag = None
01282         self.tagStack = []
01283         self.quoteStack = []
01284         self.pushTag(self)
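
The listings for popTag, pushTag and reset together maintain a stack of open tags whose bottom entry is the document root. The following is a minimal Python 3 sketch of that bookkeeping, not the library's code: `Node` and `StackBuilder` are hypothetical stand-ins for `Tag` and `BeautifulStoneSoup`, and a plain `str` stands in for `NavigableString`.

```python
class Node:
    """Hypothetical stand-in for Tag: a name, child list, optional string."""

    def __init__(self, name):
        self.name = name
        self.contents = []
        self.string = None


class StackBuilder:
    """Hypothetical reduction of the reset / pushTag / popTag trio."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Start over with an empty stack, then push the root so that
        # the stack always has the document at its bottom.
        self.root = Node("[document]")
        self.current = None
        self.stack = []
        self.push(self.root)

    def push(self, node):
        # Attach the new node to the currently open node, then open it.
        if self.current is not None:
            self.current.contents.append(node)
        self.stack.append(node)
        self.current = node

    def pop(self):
        self.stack.pop()
        # A node whose only child is a string exposes it as .string.
        only_child_is_text = (
            len(self.current.contents) == 1
            and isinstance(self.current.contents[0], str)
        )
        if only_child_is_text:
            self.current.string = self.current.contents[0]
        if self.stack:
            self.current = self.stack[-1]
        return self.current


builder = StackBuilder()
title = Node("title")
builder.push(title)
title.contents.append("Hello")  # in the real parser, text arrives via endData
parent = builder.pop()
```

After the pop, `title.string` holds the lone text child and the root is the current node again, which mirrors the `soup.tag.string` shorthand described in the popTag comments.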

def BeautifulSoup::BeautifulStoneSoup::unknown_endtag (self, name)

Definition at line 1427 of file BeautifulSoup.py.

01428     def unknown_endtag(self, name):
01429         #print "End tag %s" % name
01430         if self.quoteStack and self.quoteStack[-1] != name:
01431             #This is not a real end tag.
01432             #print "</%s> is not real!" % name
01433             self.handle_data('</%s>' % name)
01434             return
01435         self.endData()
01436         self._popToTag(name)
01437         if self.quoteStack and self.quoteStack[-1] == name:
01438             self.quoteStack.pop()
01439             self.literal = (len(self.quoteStack) > 0)

def BeautifulSoup::BeautifulStoneSoup::unknown_starttag (self, name, attrs, selfClosing = 0)

Definition at line 1397 of file BeautifulSoup.py.

01398     def unknown_starttag(self, name, attrs, selfClosing=0):
01399         #print "Start tag %s: %s" % (name, attrs)
01400         if self.quoteStack:
01401             #This is not a real tag.
01402             #print "<%s> is not real!" % name
01403             attrs = ''.join(map(lambda(x, y): ' %s="%s"' % (x, y), attrs))
01404             self.handle_data('<%s%s>' % (name, attrs))
01405             return
01406         self.endData()
01407 
01408         if not self.isSelfClosingTag(name) and not selfClosing:
01409             self._smartPop(name)
01410 
01411         if self.parseOnlyThese and len(self.tagStack) <= 1 \
01412                and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
01413             return
01414 
01415         tag = Tag(self, name, attrs, self.currentTag, self.previous)
01416         if self.previous:
01417             self.previous.next = tag
01418         self.previous = tag
01419         self.pushTag(tag)
01420         if selfClosing or self.isSelfClosingTag(name):
01421             self.popTag()
01422         if name in self.QUOTE_TAGS:
01423             #print "Beginning quote (%s)" % name
01424             self.quoteStack.append(name)
01425             self.literal = 1
01426         return tag
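
The quoteStack checks in unknown_starttag and unknown_endtag mean that, once a tag listed in QUOTE_TAGS is open, any other start or end tag is passed on as plain text until the matching end tag arrives. The Python 3 sketch below isolates just that rule. `QuoteAwareHandler` and its `events` list are hypothetical, attributes are left out, and the `script` entry in `QUOTE_TAGS` is an assumption made for the demonstration (the base class declares an empty dictionary).

```python
class QuoteAwareHandler:
    """Hypothetical handler that records events instead of building a tree."""

    # Assumption for illustration only; BeautifulStoneSoup itself uses {}.
    QUOTE_TAGS = {"script": None}

    def __init__(self):
        self.quote_stack = []
        self.literal = False
        self.events = []

    def start(self, name):
        if self.quote_stack:
            # Inside a quote tag: this is not a real tag, keep it as text.
            self.events.append(("data", "<%s>" % name))
            return
        self.events.append(("start", name))
        if name in self.QUOTE_TAGS:
            self.quote_stack.append(name)
            self.literal = True

    def end(self, name):
        if self.quote_stack and self.quote_stack[-1] != name:
            # Only the matching end tag closes a quote tag.
            self.events.append(("data", "</%s>" % name))
            return
        self.events.append(("end", name))
        if self.quote_stack and self.quote_stack[-1] == name:
            self.quote_stack.pop()
            self.literal = len(self.quote_stack) > 0


handler = QuoteAwareHandler()
handler.start("script")
handler.start("b")     # recorded as text, not as a tag
handler.end("b")       # recorded as text, not as a tag
handler.end("script")  # closes the quote tag and leaves literal mode
```

Only the outer `script` pair is recorded as real tags; the inner `<b>` and `</b>` end up as data events.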


Member Data Documentation

Definition at line 1156 of file BeautifulSoup.py.

Definition at line 1187 of file BeautifulSoup.py.

Definition at line 1187 of file BeautifulSoup.py.

Definition at line 1187 of file BeautifulSoup.py.

Definition at line 1187 of file BeautifulSoup.py.

Definition at line 1275 of file BeautifulSoup.py.

Definition at line 1275 of file BeautifulSoup.py.

Reimplemented in BeautifulSoup::BeautifulSoup.

Definition at line 1236 of file BeautifulSoup.py.

Definition at line 1187 of file BeautifulSoup.py.

Definition at line 1187 of file BeautifulSoup.py.

Definition at line 1275 of file BeautifulSoup.py.

Definition at line 1152 of file BeautifulSoup.py.

Definition at line 1187 of file BeautifulSoup.py.

Definition at line 1397 of file BeautifulSoup.py.

Definition at line 1187 of file BeautifulSoup.py.

list BeautifulSoup::BeautifulStoneSoup::MARKUP_MASSAGE [static]

Initial value:
[(re.compile('(<[^<>]*)/>'),
  lambda x: x.group(1) + ' />'),
 (re.compile('<!\s+([^<>]*)>'),
  lambda x: '<!' + x.group(1) + '>')
 ]

Definition at line 1144 of file BeautifulSoup.py.
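
Each entry in the initial value above pairs a compiled pattern with a replacement function. A plausible way to apply such a list, shown here as a Python 3 sketch rather than the library's own code path, is to run every pair over the markup in order with `re.sub`. The helper name `massage` is hypothetical, and the patterns are written as raw strings to suit Python 3.

```python
import re

# Same two rules as the initial value above, written as raw strings.
MARKUP_MASSAGE = [
    # Put a space before the slash of a self-closing tag: <br/> becomes <br />
    (re.compile(r'(<[^<>]*)/>'), lambda m: m.group(1) + ' />'),
    # Drop whitespace between "<!" and a declaration name
    (re.compile(r'<!\s+([^<>]*)>'), lambda m: '<!' + m.group(1) + '>'),
]


def massage(markup, rules=MARKUP_MASSAGE):
    """Apply each (pattern, replacement) pair to the markup in order."""
    for pattern, fix in rules:
        markup = pattern.sub(fix, markup)
    return markup


result = massage('<br/><! DOCTYPE html>')
print(result)  # <br /><!DOCTYPE html>
```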

Definition at line 1187 of file BeautifulSoup.py.

Reimplemented in BeautifulSoup::BeautifulSoup.

Definition at line 1236 of file BeautifulSoup.py.

Definition at line 1187 of file BeautifulSoup.py.

Reimplemented in BeautifulSoup::BeautifulSoup.

Definition at line 1142 of file BeautifulSoup.py.

Reimplemented from BeautifulSoup::PageElement.

Definition at line 1306 of file BeautifulSoup.py.

Reimplemented in BeautifulSoup::BeautifulSoup.

Definition at line 1141 of file BeautifulSoup.py.

Definition at line 1275 of file BeautifulSoup.py.

Reimplemented in BeautifulSoup::BeautifulSoup, and BeautifulSoup::MinimalSoup.

Definition at line 1140 of file BeautifulSoup.py.

string BeautifulSoup::BeautifulStoneSoup::ROOT_TAG_NAME = u'[document]' [static]

Definition at line 1150 of file BeautifulSoup.py.

Reimplemented in BeautifulSoup::BeautifulSoup.

Definition at line 1138 of file BeautifulSoup.py.

Definition at line 1187 of file BeautifulSoup.py.

dictionary BeautifulSoup::BeautifulStoneSoup::STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, } [static]

Definition at line 1162 of file BeautifulSoup.py.
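
The keys of STRIP_ASCII_SPACES are the code points of tab, line feed, form feed, carriage return and space, and mapping a code point to None tells `translate` to delete that character. The short Python 3 sketch below shows the effect. Using the result as a whitespace-only test is an assumption about how the parser uses the table, and the helper name `is_ascii_blank` is hypothetical.

```python
# Same table as above: code point -> None means "delete this character".
STRIP_ASCII_SPACES = {9: None, 10: None, 12: None, 13: None, 32: None}

stripped = " a\tb\nc\r\x0cd ".translate(STRIP_ASCII_SPACES)


def is_ascii_blank(text):
    """True if text holds nothing but ASCII whitespace (hypothetical helper)."""
    return text.translate(STRIP_ASCII_SPACES) == ""


print(stripped)                 # abcd
print(is_ascii_blank(" \n\t"))  # True
```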

Definition at line 1275 of file BeautifulSoup.py.

Definition at line 1154 of file BeautifulSoup.py.

Definition at line 1153 of file BeautifulSoup.py.