BeautifulSoup::BeautifulSoup Class Reference

Inheritance diagram for BeautifulSoup::BeautifulSoup:
[Inheritance diagram. Classes shown: BeautifulSoup::BeautifulStoneSoup, BeautifulSoup::Tag, BeautifulSoup::PageElement, BeautifulSoup::ICantBelieveItsBeautifulSoup, BeautifulSoup::MinimalSoup, BeautifulSoup::RobustHTMLParser, BeautifulSoup::RobustWackAssHTMLParser, BeautifulSoup::RobustInsanelyWackAssHTMLParser]

List of all members.

Public Member Functions

def __init__
def extractCharsetFromMeta

Public Attributes

 declaredHTMLEncoding
 originalEncoding

Static Public Attributes

tuple CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)
list NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
list NESTABLE_INLINE_TAGS
dictionary NESTABLE_LIST_TAGS
dictionary NESTABLE_TABLE_TAGS
tuple NESTABLE_TAGS
list NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre']
tuple PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
dictionary QUOTE_TAGS = {'script' : None, 'textarea' : None}
tuple RESET_NESTING_TAGS
tuple SELF_CLOSING_TAGS

Detailed Description

This parser knows the following facts about HTML:

* Some tags have no closing tag and should be interpreted as being
  closed as soon as they are encountered.

* The text inside some tags (e.g. 'script') may contain tags which
  are not really part of the document and which should be parsed
  as text, not tags. If you want to parse the text as tags, you can
  always fetch it and parse it explicitly.

* Tag nesting rules:

  Most tags can't be nested at all. For instance, the occurrence of
  a <p> tag should implicitly close the previous <p> tag.

   <p>Para1<p>Para2
    should be transformed into:
   <p>Para1</p><p>Para2

  Some tags can be nested arbitrarily. For instance, the occurrence
  of a <blockquote> tag should _not_ implicitly close the previous
  <blockquote> tag.

   Alice said: <blockquote>Bob said: <blockquote>Blah
    should NOT be transformed into:
   Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

  Some tags can be nested, but the nesting is reset by the
  interposition of other tags. For instance, a <tr> tag should
  implicitly close the previous <tr> tag within the same <table>,
  but not close a <tr> tag in another table.

   <table><tr>Blah<tr>Blah
    should be transformed into:
   <table><tr>Blah</tr><tr>Blah
    but,
   <tr>Blah<table><tr>Blah
    should NOT be transformed into:
   <tr>Blah<table></tr><tr>Blah

Differing assumptions about tag nesting rules are a major source
of problems with the BeautifulSoup class. If BeautifulSoup is not
treating as nestable a tag your page author treats as nestable,
try ICantBelieveItsBeautifulSoup, MinimalSoup, or
BeautifulStoneSoup before writing your own subclass.
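
The three nesting rules described above can be tried out with a toy normalizer. This is a minimal sketch, not Beautiful Soup's actual algorithm: the names `NESTABLE`, `RESET_BY`, `TOKEN_RE` and `close_implicitly` are hypothetical, the rule tables cover only the tags used in the examples, and tag attributes are not handled.

```python
import re

# Hypothetical, heavily reduced rule tables for this sketch only.
NESTABLE = {"blockquote", "div"}            # may nest arbitrarily
RESET_BY = {"tr": {"table"}, "td": {"tr"}}  # tags that reset nesting for the key
TOKEN_RE = re.compile(r"<(/?)(\w+)>|([^<]+)")


def close_implicitly(markup):
    """Insert the closing tags implied by the nesting rules (toy version)."""
    out, stack = [], []
    for m in TOKEN_RE.finditer(markup):
        closing, name, text = m.group(1), m.group(2), m.group(3)
        if text is not None:
            out.append(text)
        elif closing:
            # Explicit end tag: close everything down to the matching open tag.
            if name in stack:
                while stack:
                    top = stack.pop()
                    out.append("</%s>" % top)
                    if top == name:
                        break
        else:
            if name not in NESTABLE:
                # Walk up the open tags looking for an earlier tag of the
                # same name, but stop at a tag that resets the nesting.
                for depth in range(len(stack) - 1, -1, -1):
                    if stack[depth] in RESET_BY.get(name, ()):
                        break
                    if stack[depth] == name:
                        while len(stack) > depth:
                            out.append("</%s>" % stack.pop())
                        break
            stack.append(name)
            out.append("<%s>" % name)
    return "".join(out)


print(close_implicitly("<p>Para1<p>Para2"))
print(close_implicitly("<table><tr>Blah<tr>Blah"))
print(close_implicitly("<tr>Blah<table><tr>Blah"))
```

The first two inputs gain an implicit `</p>` and `</tr>` respectively; the third is left unchanged because the intervening `<table>` resets the nesting of `<tr>`.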

Definition at line 1447 of file BeautifulSoup.py.


Constructor & Destructor Documentation

def BeautifulSoup::BeautifulSoup::__init__ (   self,
  args,
  kwargs 
)

Definition at line 1495 of file BeautifulSoup.py.

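def __init__(self, *args, **kwargs):
    # Default smart-quote conversion to HTML entities unless the caller chose.
    if not kwargs.has_key('smartQuotesTo'):
        kwargs['smartQuotesTo'] = self.HTML_ENTITIES
    # This subclass always parses its input as HTML.
    kwargs['isHTML'] = True
    BeautifulStoneSoup.__init__(self, *args, **kwargs)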
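
The constructor relies on `dict.has_key`, which exists only in Python 2. As a hedged illustration of the same defaulting pattern on Python 3, the sketch below uses `dict.setdefault`. `StoneSoupSketch` and `SoupSketch` are hypothetical stand-ins that only record their options, and the `HTML_ENTITIES` value is a placeholder; none of this is the library's code.

```python
class StoneSoupSketch(object):
    """Hypothetical stand-in for the base class; it only records its options."""
    HTML_ENTITIES = "html"  # placeholder value for this sketch

    def __init__(self, *args, **kwargs):
        self.options = kwargs


class SoupSketch(StoneSoupSketch):
    def __init__(self, *args, **kwargs):
        # Same effect as the has_key() test: fill in the key only if missing.
        kwargs.setdefault('smartQuotesTo', self.HTML_ENTITIES)
        # Always forced on, even if the caller passed isHTML=False.
        kwargs['isHTML'] = True
        StoneSoupSketch.__init__(self, *args, **kwargs)


print(SoupSketch().options)
print(SoupSketch(smartQuotesTo=None).options)
```

A caller-supplied `smartQuotesTo` (even `None`) is preserved, while `isHTML` is overwritten unconditionally.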


Member Function Documentation

def BeautifulSoup::BeautifulSoup::extractCharsetFromMeta (   self,
  attrs 
)
Beautiful Soup can detect a charset included in a META tag,
try to convert the document to that charset, and re-parse the
document from the beginning.

Reimplemented from BeautifulSoup::BeautifulStoneSoup.

Definition at line 1553 of file BeautifulSoup.py.

def extractCharsetFromMeta(self, attrs):
    """Beautiful Soup can detect a charset included in a META tag,
    try to convert the document to that charset, and re-parse the
    document from the beginning."""
    httpEquiv = None
    contentType = None
    contentTypeIndex = None
    tagNeedsEncodingSubstitution = False

    for i in range(0, len(attrs)):
        key, value = attrs[i]
        key = key.lower()
        if key == 'http-equiv':
            httpEquiv = value
        elif key == 'content':
            contentType = value
            contentTypeIndex = i

    if httpEquiv and contentType: # It's an interesting meta tag.
        match = self.CHARSET_RE.search(contentType)
        if match:
            if (self.declaredHTMLEncoding is not None or
                self.originalEncoding == self.fromEncoding):
                # An HTML encoding was sniffed while converting
                # the document to Unicode, or an HTML encoding was
                # sniffed during a previous pass through the
                # document, or an encoding was specified
                # explicitly and it worked. Rewrite the meta tag.
                def rewrite(match):
                    return match.group(1) + "%SOUP-ENCODING%"
                newAttr = self.CHARSET_RE.sub(rewrite, contentType)
                attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
                                           newAttr)
                tagNeedsEncodingSubstitution = True
            else:
                # This is our first pass through the document.
                # Go through it again with the encoding information.
                newCharset = match.group(3)
                if newCharset and newCharset != self.originalEncoding:
                    self.declaredHTMLEncoding = newCharset
                    self._feed(self.declaredHTMLEncoding)
                    raise StopParsing
                pass
    tag = self.unknown_starttag("meta", attrs)
    if tag and tagNeedsEncodingSubstitution:
        tag.containsSubstitutions = True
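
The branch structure of the method above can be summarised as a small decision function. This is a sketch only: `meta_charset_action` and its return labels are hypothetical names, and the function mirrors just the conditions visible in the listing, not the re-feeding or tag creation.

```python
def meta_charset_action(declared, original, from_encoding, new_charset):
    """Classify what the listing does with a meta tag declaring new_charset.

    Returns 'rewrite' (replace the charset with the %SOUP-ENCODING%
    placeholder), 'reparse' (record the charset and start over), or
    'keep' (leave the tag as it is).
    """
    if declared is not None or original == from_encoding:
        return "rewrite"
    if new_charset and new_charset != original:
        return "reparse"
    return "keep"


# First pass: nothing declared yet, and the meta tag disagrees with the guess.
print(meta_charset_action(None, "utf-8", None, "iso-8859-1"))
# Second pass: the declared encoding is now known, so the tag is rewritten.
print(meta_charset_action("iso-8859-1", "iso-8859-1", None, "iso-8859-1"))
```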


Member Data Documentation

tuple BeautifulSoup::BeautifulSoup::CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M) [static]

Definition at line 1551 of file BeautifulSoup.py.
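
To show what the three groups of this pattern capture, here is a short standalone sketch using only the standard `re` module. The sample `content` value is an assumption chosen for illustration; the substitution mirrors the `rewrite` helper in `extractCharsetFromMeta`.

```python
import re

# Same pattern as CHARSET_RE, written as a raw string.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)

content = "text/html; charset=ISO-8859-1"  # sample META content value

match = CHARSET_RE.search(content)
prefix = match.group(1)    # everything up to and including 'charset='
declared = match.group(3)  # the charset name itself

# Replace the charset name with the placeholder, keeping the prefix.
rewritten = CHARSET_RE.sub(lambda m: m.group(1) + "%SOUP-ENCODING%", content)

print(repr(prefix), repr(declared))
print(rewritten)
```

Group 1 is kept during substitution so that only the charset value changes and any leading separator and whitespace survive.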

BeautifulSoup::BeautifulSoup::declaredHTMLEncoding

Reimplemented from BeautifulSoup::BeautifulStoneSoup.

Definition at line 1555 of file BeautifulSoup.py.

list BeautifulSoup::BeautifulSoup::NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del'] [static]

Definition at line 1518 of file BeautifulSoup.py.

list BeautifulSoup::BeautifulSoup::NESTABLE_INLINE_TAGS [static]

Initial value:
['span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
 'center']

Definition at line 1512 of file BeautifulSoup.py.

dictionary BeautifulSoup::BeautifulSoup::NESTABLE_LIST_TAGS [static]

Initial value:
{ 'ol' : [],
  'ul' : [],
  'li' : ['ul', 'ol'],
  'dl' : [],
  'dd' : ['dl'],
  'dt' : ['dl'] }

Definition at line 1521 of file BeautifulSoup.py.

dictionary BeautifulSoup::BeautifulSoup::NESTABLE_TABLE_TAGS [static]

Initial value:
{'table' : [],
 'tr' : ['table', 'tbody', 'tfoot', 'thead'],
 'td' : ['tr'],
 'th' : ['tr'],
 'thead' : ['table'],
 'tbody' : ['table'],
 'tfoot' : ['table'],
 }

Definition at line 1529 of file BeautifulSoup.py.

tuple BeautifulSoup::BeautifulSoup::NESTABLE_TAGS [static]

Initial value:
buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
            NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)

Reimplemented from BeautifulSoup::BeautifulStoneSoup.

Reimplemented in BeautifulSoup::ICantBelieveItsBeautifulSoup, and BeautifulSoup::MinimalSoup.

Definition at line 1547 of file BeautifulSoup.py.
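
The source of `buildTagMap` is not shown on this page. The sketch below assumes it merges its arguments into a single dictionary, copying dictionary entries as they are and mapping every name from a list (or a single string) to the given default. `build_tag_map` is a hypothetical stand-in written under that assumption, not the library function.

```python
def build_tag_map(default, *portions):
    """Merge dicts, lists and single names into one {tag: value} map.

    Assumed behaviour (not taken from the library source):
    - dict entries are copied unchanged,
    - each name in a list/tuple/set maps to `default`,
    - a bare string is treated as one tag name mapping to `default`.
    """
    built = {}
    for portion in portions:
        if isinstance(portion, dict):
            built.update(portion)
        elif isinstance(portion, (list, tuple, set)):
            for name in portion:
                built[name] = default
        else:
            built[portion] = default
    return built


inline = ['span', 'font']
lists = {'li': ['ul', 'ol'], 'ul': []}
tag_map = build_tag_map([], inline, lists)
print(tag_map)
```

Under this reading, a tag mapped to an empty list can nest anywhere, while a tag mapped to a list of parents has its nesting reset by those parents.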

list BeautifulSoup::BeautifulSoup::NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre'] [static]

Definition at line 1538 of file BeautifulSoup.py.

BeautifulSoup::BeautifulSoup::originalEncoding

Reimplemented from BeautifulSoup::BeautifulStoneSoup.

Definition at line 1555 of file BeautifulSoup.py.

tuple BeautifulSoup::BeautifulSoup::PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea']) [static]

Reimplemented from BeautifulSoup::BeautifulStoneSoup.

Definition at line 1505 of file BeautifulSoup.py.

dictionary BeautifulSoup::BeautifulSoup::QUOTE_TAGS = {'script' : None, 'textarea' : None} [static]

Reimplemented from BeautifulSoup::BeautifulStoneSoup.

Definition at line 1507 of file BeautifulSoup.py.

tuple BeautifulSoup::BeautifulSoup::RESET_NESTING_TAGS [static]

Initial value:
buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
            NON_NESTABLE_BLOCK_TAGS,
            NESTABLE_LIST_TAGS,
            NESTABLE_TABLE_TAGS)

Reimplemented from BeautifulSoup::BeautifulStoneSoup.

Reimplemented in BeautifulSoup::MinimalSoup.

Definition at line 1542 of file BeautifulSoup.py.

tuple BeautifulSoup::BeautifulSoup::SELF_CLOSING_TAGS [static]

Initial value:
buildTagMap(None,
            ['br' , 'hr', 'input', 'img', 'meta',
             'spacer', 'link', 'frame', 'base'])

Reimplemented from BeautifulSoup::BeautifulStoneSoup.

Definition at line 1501 of file BeautifulSoup.py.