app/htmlsanitizer/BeautifulSoup.py
author Daniel Diniz <ajaksu@gmail.com>
Wed, 08 Jul 2009 10:40:46 +0200
changeset 2570 851640749319
parent 2323 b3daada52dd3
permissions -rw-r--r--
Several Survey UI fixes. Fixes: Too narrow fieldsets in new question/option dialogs. Survey submit (on take view) and save/export/etc. buttons at weird places, instead of at bottom. Weird placement of radio buttons in Opera. Too narrow selects in IE. Broken images in edit view in IE and Opera. Reviewed by: Lennard de Rijk (Only on IE)
"""Beautiful Soup
Elixir and Tonic
"The Screen-Scraper's Friend"
http://www.crummy.com/software/BeautifulSoup/

Beautiful Soup parses a (possibly invalid) XML or HTML document into a
tree representation. It provides methods and Pythonic idioms that make
it easy to navigate, search, and modify the tree.

A well-formed XML/HTML document yields a well-formed data
structure. An ill-formed XML/HTML document yields a correspondingly
ill-formed data structure. If your document is only locally
well-formed, you can use this library to find and process the
well-formed part of it.

Beautiful Soup works with Python 2.2 and up. It has no external
dependencies, but you'll have more success at converting data to UTF-8
if you also install these three packages:

* chardet, for auto-detecting character encodings
  http://chardet.feedparser.org/
* cjkcodecs and iconv_codec, which add more encodings to the ones supported
  by stock Python.
  http://cjkpython.i18n.org/

Beautiful Soup defines classes for two main parsing strategies:

 * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific
   language that kind of looks like XML.

 * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid
   or invalid. This class has web browser-like heuristics for
   obtaining a sensible parse tree in the face of common HTML errors.

Beautiful Soup also defines a class (UnicodeDammit) for autodetecting
the encoding of an HTML or XML document, and converting it to
Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser.

For more than you ever wanted to know about Beautiful Soup, see the
documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html

Here, have some legalese:

Copyright (c) 2004-2009, Leonard Richardson

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.

  * Redistributions in binary form must reproduce the above
    copyright notice, this list of conditions and the following
    disclaimer in the documentation and/or other materials provided
    with the distribution.

  * Neither the name of the the Beautiful Soup Consortium and All
    Night Kosher Bakery nor the names of its contributors may be
    used to endorse or promote products derived from this software
    without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT.

"""
from __future__ import generators

__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "3.1.0.1"
__copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
__license__ = "New-style BSD"

import codecs
import markupbase
import types
import re
from HTMLParser import HTMLParser, HTMLParseError
try:
    from htmlentitydefs import name2codepoint
except ImportError:
    name2codepoint = {}
try:
    set
except NameError:
    from sets import Set as set

#These hacks make Beautiful Soup able to parse XML with namespaces
markupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match

DEFAULT_OUTPUT_ENCODING = "utf-8"

# First, the classes that represent markup elements.

def sob(unicode, encoding):
    """Returns either the given Unicode string or its encoding."""
    if encoding is None:
        return unicode
    else:
        return unicode.encode(encoding)
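
# Illustrative sketch, not part of the upstream module: the two cases that
# the sob() docstring above describes. The helper name _sob_sketch and the
# sample text are assumptions made for this example; the body mirrors sob()
# so that the sketch also runs on its own.
def _sob_sketch(text, encoding):
    """Return text unchanged when encoding is None, else its encoded form."""
    if encoding is None:
        return text
    return text.encode(encoding)

# No encoding requested: the Unicode string comes back as-is.
_sob_unchanged = _sob_sketch(u"caf\xe9", None)
# Encoding requested: the result is a byte string in that encoding.
_sob_encoded = _sob_sketch(u"caf\xe9", "utf-8")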
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   113
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   114
class PageElement:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   115
    """Contains the navigational information for some part of the page
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   116
    (either a tag or a piece of text)"""
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   117
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   118
    def setup(self, parent=None, previous=None):
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   119
        """Sets up the initial relations between this element and
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   120
        other elements."""
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   121
        self.parent = parent
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   122
        self.previous = previous
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   123
        self.next = None
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   124
        self.previousSibling = None
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   125
        self.nextSibling = None
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   126
        if self.parent and self.parent.contents:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   127
            self.previousSibling = self.parent.contents[-1]
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   128
            self.previousSibling.nextSibling = self
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   129
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   130
    def replaceWith(self, replaceWith):
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   131
        oldParent = self.parent
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   132
        myIndex = self.parent.contents.index(self)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   133
        if hasattr(replaceWith, 'parent') and replaceWith.parent == self.parent:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   134
            # We're replacing this element with one of its siblings.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   135
            index = self.parent.contents.index(replaceWith)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   136
            if index and index < myIndex:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   137
                # Furthermore, it comes before this element. That
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   138
                # means that when we extract it, the index of this
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   139
                # element will change.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   140
                myIndex = myIndex - 1
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   141
        self.extract()
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   142
        oldParent.insert(myIndex, replaceWith)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   143
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   144
    def extract(self):
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   145
        """Destructively rips this element out of the tree."""
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   146
        if self.parent:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   147
            try:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   148
                self.parent.contents.remove(self)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   149
            except ValueError:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   150
                pass
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   151
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   152
        #Find the two elements that would be next to each other if
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   153
        #this element (and any children) hadn't been parsed. Connect
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   154
        #the two.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   155
        lastChild = self._lastRecursiveChild()
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   156
        nextElement = lastChild.next
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   157
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   158
        if self.previous:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   159
            self.previous.next = nextElement
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   160
        if nextElement:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
            nextElement.previous = self.previous
        self.previous = None
        lastChild.next = None

        self.parent = None
        if self.previousSibling:
            self.previousSibling.nextSibling = self.nextSibling
        if self.nextSibling:
            self.nextSibling.previousSibling = self.previousSibling
        self.previousSibling = self.nextSibling = None
        return self

    def _lastRecursiveChild(self):
        "Finds the last element beneath this object to be parsed."
        lastChild = self
        while hasattr(lastChild, 'contents') and lastChild.contents:
            lastChild = lastChild.contents[-1]
        return lastChild

    def insert(self, position, newChild):
        # Wrap plain strings so they carry the tree-navigation attributes.
        # unicode is a subclass of basestring, so one check covers both.
        if (isinstance(newChild, basestring)
            and not isinstance(newChild, NavigableString)):
            newChild = NavigableString(newChild)

        position = min(position, len(self.contents))
        if hasattr(newChild, 'parent') and newChild.parent is not None:
            # We're 'inserting' an element that's already one
            # of this object's children.
            if newChild.parent is self:
                # Locate newChild by identity; find() searches the tree
                # for matching tags and does not return a list index.
                index = None
                for i, child in enumerate(self.contents):
                    if child is newChild:
                        index = i
                        break
                if index is not None and index < position:
                    # Furthermore we're moving it further down the
                    # list of this object's children. That means that
                    # when we extract this element, our target index
                    # will jump down one.
                    position = position - 1
            newChild.extract()

        newChild.parent = self
        previousChild = None
        if position == 0:
            newChild.previousSibling = None
            newChild.previous = self
        else:
            previousChild = self.contents[position-1]
            newChild.previousSibling = previousChild
            newChild.previousSibling.nextSibling = newChild
            newChild.previous = previousChild._lastRecursiveChild()
        if newChild.previous:
            newChild.previous.next = newChild

        newChildsLastElement = newChild._lastRecursiveChild()

        if position >= len(self.contents):
            newChild.nextSibling = None

            parent = self
            parentsNextSibling = None
            while not parentsNextSibling:
                parentsNextSibling = parent.nextSibling
                parent = parent.parent
                if not parent: # This is the last element in the document.
                    break
            if parentsNextSibling:
                newChildsLastElement.next = parentsNextSibling
            else:
                newChildsLastElement.next = None
        else:
            nextChild = self.contents[position]
            newChild.nextSibling = nextChild
            if newChild.nextSibling:
                newChild.nextSibling.previousSibling = newChild
            newChildsLastElement.next = nextChild

        if newChildsLastElement.next:
            newChildsLastElement.next.previous = newChildsLastElement
        self.contents.insert(position, newChild)

    def append(self, tag):
        """Appends the given tag to the contents of this tag."""
        self.insert(len(self.contents), tag)

    def findNext(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the first item that matches the given criteria and
        appears after this Tag in the document."""
        return self._findOne(self.findAllNext, name, attrs, text, **kwargs)

    def findAllNext(self, name=None, attrs={}, text=None, limit=None,
                    **kwargs):
        """Returns all items that match the given criteria and appear
        after this Tag in the document."""
        return self._findAll(name, attrs, text, limit, self.nextGenerator,
                             **kwargs)

    def findNextSibling(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the closest sibling to this Tag that matches the
        given criteria and appears after this Tag in the document."""
        return self._findOne(self.findNextSiblings, name, attrs, text,
                             **kwargs)

    def findNextSiblings(self, name=None, attrs={}, text=None, limit=None,
                         **kwargs):
        """Returns the siblings of this Tag that match the given
        criteria and appear after this Tag in the document."""
        return self._findAll(name, attrs, text, limit,
                             self.nextSiblingGenerator, **kwargs)
    fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x

    def findPrevious(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the first item that matches the given criteria and
        appears before this Tag in the document."""
        return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs)

    def findAllPrevious(self, name=None, attrs={}, text=None, limit=None,
                        **kwargs):
        """Returns all items that match the given criteria and appear
        before this Tag in the document."""
        return self._findAll(name, attrs, text, limit, self.previousGenerator,
                             **kwargs)
    fetchPrevious = findAllPrevious # Compatibility with pre-3.x

    def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the closest sibling to this Tag that matches the
        given criteria and appears before this Tag in the document."""
        return self._findOne(self.findPreviousSiblings, name, attrs, text,
                             **kwargs)

    def findPreviousSiblings(self, name=None, attrs={}, text=None,
                             limit=None, **kwargs):
        """Returns the siblings of this Tag that match the given
        criteria and appear before this Tag in the document."""
        return self._findAll(name, attrs, text, limit,
                             self.previousSiblingGenerator, **kwargs)
    fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x

    def findParent(self, name=None, attrs={}, **kwargs):
        """Returns the closest parent of this Tag that matches the given
        criteria."""
        # NOTE: We can't use _findOne because findParents takes a different
        # set of arguments.
        r = None
        # Forward keyword criteria too; otherwise they are silently ignored.
        l = self.findParents(name, attrs, 1, **kwargs)
        if l:
            r = l[0]
        return r

    def findParents(self, name=None, attrs={}, limit=None, **kwargs):
        """Returns the parents of this Tag that match the given
        criteria."""

        return self._findAll(name, attrs, None, limit, self.parentGenerator,
                             **kwargs)
    fetchParents = findParents # Compatibility with pre-3.x

    # These methods do the real heavy lifting.

    def _findOne(self, method, name, attrs, text, **kwargs):
        r = None
        l = method(name, attrs, text, 1, **kwargs)
        if l:
            r = l[0]
        return r

b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   325
    def _findAll(self, name, attrs, text, limit, generator, **kwargs):
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   326
        "Iterates over a generator looking for things that match."
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   327
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   328
        if isinstance(name, SoupStrainer):
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   329
            strainer = name
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   330
        else:
            # Build a SoupStrainer
            strainer = SoupStrainer(name, attrs, text, **kwargs)
        results = ResultSet(strainer)
        g = generator()
        while True:
            try:
                i = g.next()
            except StopIteration:
                break
            if i:
                found = strainer.search(i)
                if found:
                    results.append(found)
                    if limit and len(results) >= limit:
                        break
        return results
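
The search loop above is a filter with an early exit: falsy items from the generator are skipped, each remaining item is passed to the strainer, and collection stops once `limit` matches are found. A minimal Python 3 sketch of that control flow follows. It is not the library's code: `find_all`, `candidates`, and `matches` are hypothetical names standing in for the method, the generator, and `strainer.search`.

```python
def find_all(candidates, matches, limit=None):
    """Collect truthy results of matches(item), stopping at `limit` hits.

    `candidates` is any iterable (hypothetical stand-in for the generator);
    `matches` returns a truthy value on a hit (stand-in for strainer.search).
    """
    results = []
    for item in candidates:
        if not item:
            # Navigation generators may yield None at the end; skip it.
            continue
        found = matches(item)
        if found:
            results.append(found)
            if limit and len(results) >= limit:
                break
    return results


def even_or_none(n):
    # Toy matcher: return the number itself when it is even, else None.
    return n if n % 2 == 0 else None


all_hits = find_all([1, None, 2, 3, 4, None], even_or_none)
first_hit = find_all([1, None, 2, 3, 4, None], even_or_none, limit=1)
```

A `for` loop replaces the explicit `g.next()` / `StopIteration` handling here; the two forms are equivalent for this purpose.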

    # These generators can be used to navigate starting from both
    # NavigableStrings and Tags. Each one advances before yielding, so
    # the last value it yields is None.
    def nextGenerator(self):
        i = self
        while i:
            i = i.next
            yield i

    def nextSiblingGenerator(self):
        i = self
        while i:
            i = i.nextSibling
            yield i

    def previousGenerator(self):
        i = self
        while i:
            i = i.previous
            yield i

    def previousSiblingGenerator(self):
        i = self
        while i:
            i = i.previousSibling
            yield i

    def parentGenerator(self):
        i = self
        while i:
            i = i.parent
            yield i
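
All five generators share one shape: start at `self`, step along a single link attribute, and yield the node reached. The Python 3 sketch below reproduces that shape on a toy linked structure to show the trailing `None` it produces. `Node` and `labels_after` are hypothetical names used only for illustration.

```python
class Node(object):
    """Hypothetical stand-in for a PageElement with only a `next` link."""

    def __init__(self, label):
        self.label = label
        self.next = None

    def nextGenerator(self):
        # Same shape as the methods above: advance first, then yield.
        i = self
        while i:
            i = i.next
            yield i


def labels_after(node):
    """Return labels of every following node, dropping the trailing None."""
    return [n.label for n in node.nextGenerator() if n is not None]


a, b, c = Node("a"), Node("b"), Node("c")
a.next, b.next = b, c

walked = list(a.nextGenerator())  # the last element is None
following = labels_after(a)
```

Because the starting node itself is never yielded and the final value is `None`, callers need a truthiness check like the one in the search loop.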

    # Utility methods
    def substituteEncoding(self, str, encoding=None):
        encoding = encoding or "utf-8"
        return str.replace("%SOUP-ENCODING%", encoding)

    def toEncoding(self, s, encoding=None):
        """Encodes an object to a string in some encoding, or to Unicode."""
        if isinstance(s, unicode):
            if encoding:
                s = s.encode(encoding)
        elif isinstance(s, str):
            if encoding:
                s = s.encode(encoding)
            else:
                s = unicode(s)
        else:
            if encoding:
                s = self.toEncoding(str(s), encoding)
            else:
                s = unicode(s)
        return s

class NavigableString(unicode, PageElement):

    def __new__(cls, value):
        """Create a new NavigableString.

        When unpickling a NavigableString, this method is called with
        the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
        passed in to the superclass's __new__ or the superclass won't know
        how to handle non-ASCII characters.
        """
        if isinstance(value, unicode):
            return unicode.__new__(cls, value)
        return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)

    def __getnewargs__(self):
        return (unicode(self),)

    def __getattr__(self, attr):
        """text.string gives you text. This is for backwards
        compatibility for Navigable*String, but for CData* it lets you
        get the string without the CData wrapper."""
        if attr == 'string':
            return self
        else:
            raise AttributeError("'%s' object has no attribute '%s'" % (self.__class__.__name__, attr))

    def encode(self, encoding=DEFAULT_OUTPUT_ENCODING):
        return self.decode().encode(encoding)

    def decodeGivenEventualEncoding(self, eventualEncoding):
        return self

class CData(NavigableString):

    def decodeGivenEventualEncoding(self, eventualEncoding):
        return u'<![CDATA[' + self + u']]>'

class ProcessingInstruction(NavigableString):

    def decodeGivenEventualEncoding(self, eventualEncoding):
        output = self
        if u'%SOUP-ENCODING%' in output:
            output = self.substituteEncoding(output, eventualEncoding)
        return u'<?' + output + u'?>'

class Comment(NavigableString):
    def decodeGivenEventualEncoding(self, eventualEncoding):
        return u'<!--' + self + u'-->'

class Declaration(NavigableString):
    def decodeGivenEventualEncoding(self, eventualEncoding):
        return u'<!' + self + u'>'
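
The four subclasses above differ only in the delimiters they put around the stored text when it is rendered; the stored string itself never contains the wrapper. A rough Python 3 sketch of that idea follows. `Text`, `CDataText`, `CommentText`, `PIText`, and `render` are hypothetical names, with `render` playing the role of `decodeGivenEventualEncoding`.

```python
class Text(str):
    """Hypothetical analogue of NavigableString: renders as itself."""

    def render(self, eventual_encoding="utf-8"):
        return str(self)


class CDataText(Text):
    def render(self, eventual_encoding="utf-8"):
        # The wrapper is added only at output time.
        return "<![CDATA[" + self + "]]>"


class CommentText(Text):
    def render(self, eventual_encoding="utf-8"):
        return "<!--" + self + "-->"


class PIText(Text):
    def render(self, eventual_encoding="utf-8"):
        # Fill in the encoding placeholder, as substituteEncoding does.
        output = self.replace("%SOUP-ENCODING%", eventual_encoding)
        return "<?" + output + "?>"


cdata = CDataText("x < y").render()
comment = CommentText(" note ").render()
pi = PIText('xml version="1.0" encoding="%SOUP-ENCODING%"').render("latin-1")
```

Keeping the raw text in the string and the markup in the render step lets the same object be compared and searched as plain text.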

class Tag(PageElement):

    """Represents a found HTML tag with its attributes and contents."""

    def _invert(h):
        "Cheap function to invert a hash."
        i = {}
        for k, v in h.items():
            i[v] = k
        return i

    XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'",
                                      "quot" : '"',
                                      "amp" : "&",
                                      "lt" : "<",
                                      "gt" : ">" }

    XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS)

    def _convertEntities(self, match):
        """Used in a call to re.sub to replace HTML, XML, and numeric
        entities with the appropriate Unicode characters. If HTML
        entities are being converted, any unrecognized entities are
        escaped."""
        x = match.group(1)
        if self.convertHTMLEntities and x in name2codepoint:
            return unichr(name2codepoint[x])
        elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS:
            if self.convertXMLEntities:
                return self.XML_ENTITIES_TO_SPECIAL_CHARS[x]
            else:
                return u'&%s;' % x
        elif len(x) > 0 and x[0] == '#':
            # Handle numeric entities
            if len(x) > 1 and x[1] == 'x':
                return unichr(int(x[2:], 16))
            else:
                return unichr(int(x[1:]))

        elif self.escapeUnrecognizedEntities:
            return u'&amp;%s;' % x
        else:
            return u'&%s;' % x
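
The method above is written as a replacement callback for `re.sub`: the pattern captures the entity name without `&` and `;`, and the callback decides what to emit. The sketch below shows the same branching as a standalone Python 3 function. `convert_entities` and its keyword arguments are hypothetical. It uses `html.entities.name2codepoint` and `chr`, on the assumption that they are the Python 3 counterparts of the `name2codepoint` table and `unichr` used above.

```python
import re
from html.entities import name2codepoint

# Same pattern shape as the one Tag.__init__ passes to re.sub.
ENTITY_RE = re.compile(r"&(#\d+|#x[0-9a-fA-F]+|\w+);")
XML_ENTITIES = {"apos": "'", "quot": '"', "amp": "&", "lt": "<", "gt": ">"}


def convert_entities(text, convert_html=True, convert_xml=True,
                     escape_unrecognized=False):
    """Replace entities in `text`, mirroring the branches shown above."""

    def replace(match):
        x = match.group(1)
        if convert_html and x in name2codepoint:
            return chr(name2codepoint[x])
        elif x in XML_ENTITIES:
            return XML_ENTITIES[x] if convert_xml else "&%s;" % x
        elif x.startswith("#"):
            # Numeric entity: hexadecimal if it starts with "#x".
            if x[1:2] == "x":
                return chr(int(x[2:], 16))
            return chr(int(x[1:]))
        elif escape_unrecognized:
            return "&amp;%s;" % x
        return "&%s;" % x

    return ENTITY_RE.sub(replace, text)


converted = convert_entities("caf&eacute; &#65;&#x42; &apos;")
escaped = convert_entities("&bogus;", escape_unrecognized=True)
```

With both conversion flags off, known XML entities pass through unchanged, which is the behavior of the inner `else` branch in the method.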

    def __init__(self, parser, name, attrs=None, parent=None,
                 previous=None):
        "Basic constructor."

        # We don't actually store the parser object: that lets extracted
        # chunks be garbage-collected.
        self.parserClass = parser.__class__
        self.isSelfClosing = parser.isSelfClosingTag(name)
        self.name = name
        if attrs is None:
            attrs = []
        self.attrs = attrs
        self.contents = []
        self.setup(parent, previous)
        self.hidden = False
        self.containsSubstitutions = False
        self.convertHTMLEntities = parser.convertHTMLEntities
        self.convertXMLEntities = parser.convertXMLEntities
        self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities

        def convert(kval):
            "Converts HTML, XML and numeric entities in the attribute value."
            k, val = kval
            if val is None:
                return kval
            return (k, re.sub(r"&(#\d+|#x[0-9a-fA-F]+|\w+);",
                              self._convertEntities, val))
        self.attrs = map(convert, self.attrs)

    def get(self, key, default=None):
        """Returns the value of the 'key' attribute for the tag, or
        the value given for 'default' if it doesn't have that
        attribute."""
        return self._getAttrMap().get(key, default)

    def has_key(self, key):
        return key in self._getAttrMap()

    def __getitem__(self, key):
        """tag[key] returns the value of the 'key' attribute for the tag,
        and raises a KeyError if it's not there."""
        return self._getAttrMap()[key]

    def __iter__(self):
        "Iterating over a tag iterates over its contents."
        return iter(self.contents)

    def __len__(self):
        "The length of a tag is the length of its list of contents."
        return len(self.contents)

    def __contains__(self, x):
        return x in self.contents

    def __nonzero__(self):
        "A tag is non-None even if it has no contents."
        return True

    def __setitem__(self, key, value):
        """Setting tag[key] sets the value of the 'key' attribute for the
        tag."""
        self._getAttrMap()
        self.attrMap[key] = value
        found = False
        for i in range(0, len(self.attrs)):
            if self.attrs[i][0] == key:
                self.attrs[i] = (key, value)
                found = True
        if not found:
            self.attrs.append((key, value))
        self._getAttrMap()[key] = value

    def __delitem__(self, key):
        "Deleting tag[key] deletes all 'key' attributes for the tag."
        # Rebuild the list in place instead of calling remove() while
        # iterating, which skips the element after each removal. Every
        # match is dropped because bad HTML can define the same
        # attribute multiple times.
        self.attrs[:] = [item for item in self.attrs if item[0] != key]
        self._getAttrMap()
        if key in self.attrMap:
            del self.attrMap[key]

    def __call__(self, *args, **kwargs):
        """Calling a tag like a function is the same as calling its
        findAll() method. E.g. tag('a') returns a list of all the A tags
        found within this tag."""
        return self.findAll(*args, **kwargs)

    def __getattr__(self, tag):
        #print "Getattr %s.%s" % (self.__class__, tag)
        if len(tag) > 3 and tag.endswith('Tag'):
            return self.find(tag[:-3])
        elif not tag.startswith('__'):
            return self.find(tag)
        raise AttributeError("'%s' object has no attribute '%s'" % (self.__class__, tag))

    def __eq__(self, other):
        """Returns true iff this tag has the same name, the same attributes,
        and the same contents (recursively) as the given tag.

        NOTE: right now this will return false if two tags have the
        same attributes in a different order. Should this be fixed?"""
        if (not hasattr(other, 'name')
            or not hasattr(other, 'attrs')
            or not hasattr(other, 'contents')
            or self.name != other.name
            or self.attrs != other.attrs
            or len(self) != len(other)):
            return False
        for i in range(0, len(self.contents)):
            if self.contents[i] != other.contents[i]:
                return False
        return True

    def __ne__(self, other):
        """Returns true iff this tag is not identical to the other tag,
        as defined in __eq__."""
        return not self == other

    def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        """Renders this tag as a string."""
        return self.decode(eventualEncoding=encoding)

    BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
                                           + r"&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)"
                                           + ")")

    def _sub_entity(self, x):
        """Used with a regular expression to substitute the
        appropriate XML entity for an XML special character."""
        return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";"

    def __unicode__(self):
        return self.decode()

    def __str__(self):
        return self.encode()

    def encode(self, encoding=DEFAULT_OUTPUT_ENCODING,
               prettyPrint=False, indentLevel=0):
        return self.decode(prettyPrint, indentLevel, encoding).encode(encoding)

    def decode(self, prettyPrint=False, indentLevel=0,
               eventualEncoding=DEFAULT_OUTPUT_ENCODING):
        """Returns a string or Unicode representation of this tag and
        its contents. To get Unicode, pass None for encoding."""

        attrs = []
        if self.attrs:
            for key, val in self.attrs:
                fmt = '%s="%s"'
                if isString(val):
                    if (self.containsSubstitutions
                        and eventualEncoding is not None
                        and '%SOUP-ENCODING%' in val):
                        val = self.substituteEncoding(val, eventualEncoding)

                    # The attribute value either:
                    #
                    # * Contains no embedded double quotes or single quotes.
                    #   No problem: we enclose it in double quotes.
                    # * Contains embedded single quotes. No problem:
                    #   double quotes work here too.
                    # * Contains embedded double quotes. No problem:
                    #   we enclose it in single quotes.
                    # * Embeds both single _and_ double quotes. This
                    #   can't happen naturally, but it can happen if
                    #   you modify an attribute value after parsing
                    #   the document. Now we have a bit of a
                    #   problem. We solve it by enclosing the
                    #   attribute in single quotes, and escaping any
                    #   embedded single quotes.
                    if '"' in val:
                        fmt = "%s='%s'"
                        if "'" in val:
                            # "&squot;" is not a defined entity, and
                            # "&apos;" is not defined in HTML 4, so use
                            # the numeric character reference instead.
                            val = val.replace("'", "&#39;")

                    # Now we're okay w/r/t quotes. But the attribute
                    # value might also contain angle brackets, or
                    # ampersands that aren't part of entities. We need
                    # to escape those to XML entities too.
                    val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val)
                if val is None:
                    # Handle boolean attributes.
                    decoded = key
                else:
                    decoded = fmt % (key, val)
                attrs.append(decoded)
        close = ''
        closeTag = ''
        if self.isSelfClosing:
            close = ' /'
        else:
            closeTag = '</%s>' % self.name

        indentTag, indentContents = 0, 0
        if prettyPrint:
            indentTag = indentLevel
            space = (' ' * (indentTag-1))
            indentContents = indentTag + 1
        contents = self.decodeContents(prettyPrint, indentContents,
                                       eventualEncoding)
        if self.hidden:
            s = contents
        else:
            s = []
            attributeString = ''
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   705
            if attrs:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   706
                attributeString = ' ' + ' '.join(attrs)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   707
            if prettyPrint:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   708
                s.append(space)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   709
            s.append('<%s%s%s>' % (self.name, attributeString, close))
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   710
            if prettyPrint:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   711
                s.append("\n")
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   712
            s.append(contents)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   713
            if prettyPrint and contents and contents[-1] != "\n":
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   714
                s.append("\n")
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   715
            if prettyPrint and closeTag:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   716
                s.append(space)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   717
            s.append(closeTag)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   718
            if prettyPrint and closeTag and self.nextSibling:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   719
                s.append("\n")
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   720
            s = ''.join(s)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   721
        return s

    def decompose(self):
        """Recursively destroys the contents of this tree."""
        contents = [i for i in self.contents]
        for i in contents:
            if isinstance(i, Tag):
                i.decompose()
            else:
                i.extract()
        self.extract()

    def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING):
        return self.encode(encoding, True)

    def encodeContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                       prettyPrint=False, indentLevel=0):
        # Forward the target encoding as eventualEncoding; otherwise the
        # children are rendered assuming DEFAULT_OUTPUT_ENCODING.
        return self.decodeContents(prettyPrint, indentLevel,
                                   encoding).encode(encoding)

    def decodeContents(self, prettyPrint=False, indentLevel=0,
                       eventualEncoding=DEFAULT_OUTPUT_ENCODING):
        """Renders the contents of this tag as a Unicode string.
        eventualEncoding is the encoding the caller intends to encode
        the result in; it is passed down to each child."""
        s = []
        for c in self:
            text = None
            if isinstance(c, NavigableString):
                text = c.decodeGivenEventualEncoding(eventualEncoding)
            elif isinstance(c, Tag):
                s.append(c.decode(prettyPrint, indentLevel, eventualEncoding))
            if text and prettyPrint:
                text = text.strip()
            if text:
                if prettyPrint:
                    s.append(" " * (indentLevel-1))
                s.append(text)
                if prettyPrint:
                    s.append("\n")
        return ''.join(s)

    #Soup methods

    def find(self, name=None, attrs={}, recursive=True, text=None,
             **kwargs):
        """Return only the first child of this Tag matching the given
        criteria."""
        r = None
        l = self.findAll(name, attrs, recursive, text, 1, **kwargs)
        if l:
            r = l[0]
        return r
    findChild = find

    def findAll(self, name=None, attrs={}, recursive=True, text=None,
                limit=None, **kwargs):
        """Extracts a list of Tag objects that match the given
        criteria.  You can specify the name of the Tag and any
        attributes you want the Tag to have.

        The value of a key-value pair in the 'attrs' map can be a
        string, a list of strings, a regular expression object, or a
        callable that takes a string and returns whether or not the
        string matches for some custom definition of 'matches'. The
        same is true of the tag name."""
        generator = self.recursiveChildGenerator
        if not recursive:
            generator = self.childGenerator
        return self._findAll(name, attrs, text, limit, generator, **kwargs)
    findChildren = findAll

    # Pre-3.x compatibility methods. Will go away in 4.0.
    first = find
    fetch = findAll

    def fetchText(self, text=None, recursive=True, limit=None):
        return self.findAll(text=text, recursive=recursive, limit=limit)

    def firstText(self, text=None, recursive=True):
        return self.find(text=text, recursive=recursive)

    # 3.x compatibility methods. Will go away in 4.0.
    def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                       prettyPrint=False, indentLevel=0):
        if encoding is None:
            return self.decodeContents(prettyPrint, indentLevel, encoding)
        else:
            return self.encodeContents(encoding, prettyPrint, indentLevel)


    #Private methods

    def _getAttrMap(self):
        """Initializes a map representation of this tag's attributes,
        if not already initialized."""
        if not getattr(self, 'attrMap'):
            self.attrMap = {}
            for (key, value) in self.attrs:
                self.attrMap[key] = value
        return self.attrMap

    #Generator methods
    # A bare return ends a generator. Raising StopIteration inside a
    # generator body is turned into a RuntimeError under PEP 479.
    def recursiveChildGenerator(self):
        if not len(self.contents):
            return
        stopNode = self._lastRecursiveChild().next
        current = self.contents[0]
        while current is not stopNode:
            yield current
            current = current.next

    def childGenerator(self):
        if not len(self.contents):
            return
        current = self.contents[0]
        while current:
            yield current
            current = current.nextSibling
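
# Illustrative sketch, not part of BeautifulSoup: a toy model of the two
# traversals used by the generator methods above. The names Node,
# link_document_order, recursive_children and children are hypothetical,
# and the sketch assumes Python 3. It suggests why following `next` up to
# the node after the last descendant visits a whole subtree, while
# following the sibling link visits direct children only.

```python
class Node:
    """Minimal stand-in for a parse-tree node (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.contents = []
        self.next = None          # successor in document order
        self.next_sibling = None  # next child of the same parent

    def append(self, child):
        if self.contents:
            self.contents[-1].next_sibling = child
        self.contents.append(child)
        return child


def link_document_order(root):
    """Set `next` on every node to its successor in a pre-order walk."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(node.contents))
    for first, second in zip(order, order[1:]):
        first.next = second


def last_recursive_child(node):
    """Return the last node of the subtree rooted at `node`."""
    while node.contents:
        node = node.contents[-1]
    return node


def recursive_children(node):
    """Yield every descendant of `node` in document order."""
    if not node.contents:
        return  # a bare return simply ends the generator
    stop = last_recursive_child(node).next
    current = node.contents[0]
    while current is not stop:
        yield current
        current = current.next


def children(node):
    """Yield only the direct children of `node`."""
    current = node.contents[0] if node.contents else None
    while current is not None:
        yield current
        current = current.next_sibling


root = Node("root")
a = root.append(Node("a"))
a.append(Node("a1"))
a.append(Node("a2"))
b = root.append(Node("b"))
link_document_order(root)

print([n.name for n in recursive_children(a)])  # ['a1', 'a2']
print([n.name for n in children(root)])         # ['a', 'b']
```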
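
# Illustrative sketch, not part of BeautifulSoup: the four kinds of
# criterion named in the findAll docstring (string, list of strings,
# regular expression object, callable) could be dispatched roughly as
# below. `matches` is a hypothetical helper written for Python 3; the
# library's own _matches method may treat further cases differently.

```python
import re


def matches(value, criterion):
    """Return True if `value` satisfies `criterion` (simplified sketch)."""
    if callable(criterion):
        # Custom definition of "matches" supplied by the caller.
        return bool(criterion(value))
    if hasattr(criterion, "search"):
        # Compiled regular expression object.
        return value is not None and criterion.search(value) is not None
    if isinstance(criterion, (list, tuple)):
        # Any one of several acceptable strings.
        return value in criterion
    # Plain string: exact comparison.
    return value == criterion


print(matches("a", "a"))                             # True
print(matches("b", ["a", "b"]))                      # True
print(matches("external", re.compile("^ext")))       # True
print(matches("h2", lambda s: s.startswith("h")))    # True
print(matches("internal", re.compile("^ext")))       # False
```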

# Next, a couple classes to represent queries and their results.
class SoupStrainer:
    """Encapsulates a number of ways of matching a markup element (tag or
    text)."""

    def __init__(self, name=None, attrs={}, text=None, **kwargs):
        self.name = name
        if isString(attrs):
            kwargs['class'] = attrs
            attrs = None
        if kwargs:
            if attrs:
                attrs = attrs.copy()
                attrs.update(kwargs)
            else:
                attrs = kwargs
        self.attrs = attrs
        self.text = text
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   858
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   859
    def __str__(self):
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   860
        if self.text:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   861
            return self.text
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   862
        else:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   863
            return "%s|%s" % (self.name, self.attrs)

    def searchTag(self, markupName=None, markupAttrs={}):
        found = None
        markup = None
        if isinstance(markupName, Tag):
            markup = markupName
            markupAttrs = markup
        callFunctionWithTagData = callable(self.name) \
                                and not isinstance(markupName, Tag)

        if (not self.name) \
               or callFunctionWithTagData \
               or (markup and self._matches(markup, self.name)) \
               or (not markup and self._matches(markupName, self.name)):
            if callFunctionWithTagData:
                match = self.name(markupName, markupAttrs)
            else:
                match = True
                markupAttrMap = None
                for attr, matchAgainst in self.attrs.items():
                    if not markupAttrMap:
                        if hasattr(markupAttrs, 'get'):
                            markupAttrMap = markupAttrs
                        else:
                            markupAttrMap = {}
                            for k,v in markupAttrs:
                                markupAttrMap[k] = v
                    attrValue = markupAttrMap.get(attr)
                    if not self._matches(attrValue, matchAgainst):
                        match = False
                        break
            if match:
                if markup:
                    found = markup
                else:
                    found = markupName
        return found

    def search(self, markup):
        #print 'looking for %s in %s' % (self, markup)
        found = None
        # If given a list of items, scan it for a text element that
        # matches.
        if isList(markup) and not isinstance(markup, Tag):
            for element in markup:
                if isinstance(element, NavigableString) \
                       and self.search(element):
                    found = element
                    break
        # If it's a Tag, make sure its name or attributes match.
        # Don't bother with Tags if we're searching for text.
        elif isinstance(markup, Tag):
            if not self.text:
                found = self.searchTag(markup)
        # If it's text, make sure the text matches.
        elif isinstance(markup, NavigableString) or \
                 isString(markup):
            if self._matches(markup, self.text):
                found = markup
        else:
            raise Exception("I don't know how to match against a %s"
                            % markup.__class__)
        return found

    def _matches(self, markup, matchAgainst):
        #print "Matching %s against %s" % (markup, matchAgainst)
        result = False
        if matchAgainst == True and type(matchAgainst) == types.BooleanType:
            result = markup is not None
        elif callable(matchAgainst):
            result = matchAgainst(markup)
        else:
            # Custom match methods take the tag as an argument, but all
            # other ways of matching match the tag name as a string.
            if isinstance(markup, Tag):
                markup = markup.name
            if markup is not None and not isString(markup):
                markup = unicode(markup)
            # Now we know that markup is either a string, or None.
            if hasattr(matchAgainst, 'match'):
                # It's a regexp object.
                result = markup and matchAgainst.search(markup)
            elif (isList(matchAgainst)
                  and (markup is not None or not isString(matchAgainst))):
                result = markup in matchAgainst
            elif hasattr(matchAgainst, 'items'):
                result = markup.has_key(matchAgainst)
            elif matchAgainst and isString(markup):
                if isinstance(markup, unicode):
                    matchAgainst = unicode(matchAgainst)
                else:
                    matchAgainst = str(matchAgainst)

            if not result:
                result = matchAgainst == markup
        return result
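
# A minimal standalone sketch of the matching rules applied by _matches,
# written for Python 3 so that it can run on its own (the module itself
# targets Python 2). The function name `matches` is hypothetical, and the
# string coercion step of the original is omitted for brevity.

```python
import re


def matches(value, criterion):
    """Hypothetical stand-in for SoupStrainer._matches on plain strings."""
    if criterion is True:
        # True matches any value that is present at all.
        return value is not None
    if callable(criterion):
        # A custom predicate decides for itself.
        return bool(criterion(value))
    if hasattr(criterion, 'search'):
        # Compiled regular expression: searched, not anchored.
        return value is not None and criterion.search(value) is not None
    if isinstance(criterion, (list, tuple, set)):
        # A collection matches if it contains the value.
        return value in criterion
    # Anything else falls back to plain equality.
    return criterion == value


print(matches('table', re.compile('^t')))
print(matches('b', ['a', 'b']))
print(matches(None, True))
```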

class ResultSet(list):
    """A ResultSet is just a list that keeps track of the SoupStrainer
    that created it."""
    def __init__(self, source):
        list.__init__(self, [])
        self.source = source
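
# A small sketch of the same idea as ResultSet: a list subclass that
# remembers where its contents came from. It is written for Python 3, and
# the class name `TaggedList` and the sample source value are hypothetical.

```python
class TaggedList(list):
    """A list that also records the object that produced it."""

    def __init__(self, source, items=()):
        # Initialise the underlying list on this instance.
        super().__init__(items)
        self.source = source


results = TaggedList('strainer-for-links')
results.append('<a href="/">home</a>')
print(len(results), results.source)
```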

# Now, some helper functions.

def isList(l):
    """Convenience method that works with all 2.x versions of Python
    to determine whether or not something is listlike."""
    return ((hasattr(l, '__iter__') and not isString(l))
            or (type(l) in (types.ListType, types.TupleType)))

def isString(s):
    """Convenience method that works with all 2.x versions of Python
    to determine whether or not something is stringlike."""
    try:
        return isinstance(s, unicode) or isinstance(s, basestring)
    except NameError:
        return isinstance(s, str)
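
# A rough Python 3 equivalent of the two helpers isList and isString, shown
# only to illustrate why strings must be excluded from the list-like test.
# The names `is_string` and `is_list` are hypothetical. Note that any
# iterable, including a dict or a generator, counts as list-like here.

```python
def is_string(value):
    """True for text strings."""
    return isinstance(value, str)


def is_list(value):
    """True for iterables other than strings."""
    # In Python 3 a str has __iter__, so it has to be ruled out explicitly.
    return hasattr(value, '__iter__') and not is_string(value)


for candidate in (['a', 'b'], ('a',), 'ab', 5):
    print(repr(candidate), is_list(candidate))
```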

def buildTagMap(default, *args):
    """Turns a list of maps, lists, or scalars into a single map.
    Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and
    NESTING_RESET_TAGS maps out of lists and partial maps."""
    built = {}
    for portion in args:
        if hasattr(portion, 'items'):
            #It's a map. Merge it.
            for k,v in portion.items():
                built[k] = v
        elif isList(portion) and not isString(portion):
            #It's a list. Map each item to the default.
            for k in portion:
                built[k] = default
        else:
            #It's a scalar. Map it to the default.
            built[portion] = default
    return built
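
# A usage sketch for the merging behaviour of buildTagMap, ported to
# Python 3 so that it runs standalone. The name `build_tag_map` and the
# sample tag names are hypothetical; maps are merged as-is, while list items
# and scalars are mapped to the default.

```python
def build_tag_map(default, *args):
    """Merge maps, lists and scalars into one dict."""
    built = {}
    for portion in args:
        if hasattr(portion, 'items'):
            # A map: copy its entries unchanged.
            built.update(portion)
        elif isinstance(portion, (list, tuple, set)):
            # A list: every item gets the default value.
            for key in portion:
                built[key] = default
        else:
            # A scalar: it gets the default value.
            built[portion] = default
    return built


tag_map = build_tag_map(None, ['br', 'hr'], {'table': ['tr']}, 'img')
print(tag_map)
```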

# Now, the parser classes.

class HTMLParserBuilder(HTMLParser):

    def __init__(self, soup):
        HTMLParser.__init__(self)
        self.soup = soup

    # We inherit feed() and reset().

    def handle_starttag(self, name, attrs):
        if name == 'meta':
            self.soup.extractCharsetFromMeta(attrs)
        else:
            self.soup.unknown_starttag(name, attrs)

    def handle_endtag(self, name):
        self.soup.unknown_endtag(name)

    def handle_data(self, content):
        self.soup.handle_data(content)

    def _toStringSubclass(self, text, subclass):
        """Adds a certain piece of text to the tree as a NavigableString
        subclass."""
        self.soup.endData()
        self.handle_data(text)
        self.soup.endData(subclass)

    def handle_pi(self, text):
        """Handle a processing instruction as a ProcessingInstruction
        object, possibly one with a %SOUP-ENCODING% slot into which an
        encoding will be plugged later."""
        if text[:3] == "xml":
            text = u"xml version='1.0' encoding='%SOUP-ENCODING%'"
        self._toStringSubclass(text, ProcessingInstruction)

    def handle_comment(self, text):
        "Handle comments as Comment objects."
        self._toStringSubclass(text, Comment)

    def handle_charref(self, ref):
        "Handle character references as data."
        if self.soup.convertEntities:
            # HTMLParser passes hexadecimal references as "x41" or "X41";
            # int(ref) alone would raise ValueError on those.
            if ref[:1] in ('x', 'X'):
                data = unichr(int(ref[1:], 16))
            else:
                data = unichr(int(ref))
        else:
            data = '&#%s;' % ref
        self.handle_data(data)
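
The character-reference handling above can be tried in isolation. The following is a minimal sketch, assuming Python 3 (where `chr` replaces `unichr`); the function name `charref_to_text` and its `convert` flag are hypothetical stand-ins for the method and for `self.soup.convertEntities`. It accepts both the decimal form (`233`) and the hexadecimal form (`xE9`).

```python
def charref_to_text(ref, convert=True):
    """Turn the body of a numeric character reference into text.

    `ref` is what follows "&#" and precedes ";", e.g. "233" or "xE9".
    `convert` is a hypothetical stand-in for soup.convertEntities.
    """
    if not convert:
        # Leave the reference untouched, re-adding the delimiters.
        return '&#%s;' % ref
    if ref[:1] in ('x', 'X'):
        # Hexadecimal reference: strip the marker and parse base 16.
        return chr(int(ref[1:], 16))
    # Decimal reference.
    return chr(int(ref))


if __name__ == "__main__":
    print(charref_to_text("233"))
    print(charref_to_text("xE9"))
    print(charref_to_text("233", convert=False))
```

With conversion switched off, the reference is passed through unchanged, so the output markup still contains `&#233;`.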

    def handle_entityref(self, ref):
        """Handle entity references as data, possibly converting known
        HTML and/or XML entity references to the corresponding Unicode
        characters."""
        data = None
        if self.soup.convertHTMLEntities:
            try:
                data = unichr(name2codepoint[ref])
            except KeyError:
                pass

        if not data and self.soup.convertXMLEntities:
            data = self.soup.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)

        if (not data and self.soup.convertHTMLEntities and
                not self.soup.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)):
            # TODO: We've got a problem here. We're told this is
            # an entity reference, but it's not an XML entity
            # reference or an HTML entity reference. Nonetheless,
            # the logical thing to do is to pass it through as an
            # unrecognized entity reference.
            #
            # Except: when the input is "&carol;" this function
            # will be called with input "carol". When the input is
            # "AT&T", this function will be called with input
            # "T". We have no way of knowing whether a semicolon
            # was present originally, so we don't know whether
            # this is an unknown entity or just a misplaced
            # ampersand.
            #
            # The more common case is a misplaced ampersand, so I
            # escape the ampersand and omit the trailing semicolon.
            data = "&amp;%s" % ref
        if not data:
            # This case is different from the one above, because we
            # haven't already gone through a supposedly comprehensive
            # mapping of entities to Unicode characters. We might not
            # have gone through any mapping at all. So the chances are
            # very high that this is a real entity, and not a
            # misplaced ampersand.
            data = "&%s;" % ref
        self.handle_data(data)
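
The fallback order in the method above (HTML table, then XML table, then escaped ampersand, then pass-through) is easier to see as a standalone function. This is a sketch only, assuming Python 3, where `name2codepoint` lives in `html.entities`. The name `resolve_entity`, its keyword flags, and the contents of the `XML_ENTITIES` table are assumptions made for illustration, not the library's own definitions.

```python
from html.entities import name2codepoint

# Assumed contents; stands in for soup.XML_ENTITIES_TO_SPECIAL_CHARS.
XML_ENTITIES = {"apos": "'", "quot": '"', "amp": "&", "lt": "<", "gt": ">"}


def resolve_entity(ref, convert_html=False, convert_xml=False):
    """Return the text to emit for the entity name `ref` (no "&" or ";")."""
    data = None
    if convert_html:
        codepoint = name2codepoint.get(ref)
        if codepoint is not None:
            data = chr(codepoint)
    if not data and convert_xml:
        data = XML_ENTITIES.get(ref)
    if not data and convert_html and ref not in XML_ENTITIES:
        # Unknown name after a full HTML lookup: most likely a stray
        # ampersand ("AT&T"), so escape it and drop the semicolon.
        data = "&amp;%s" % ref
    if not data:
        # No comprehensive lookup was done: assume a real entity.
        data = "&%s;" % ref
    return data


if __name__ == "__main__":
    print(resolve_entity("eacute", convert_html=True))
    print(resolve_entity("T", convert_html=True))
    print(resolve_entity("carol"))
```

The "AT&T" case from the comments corresponds to `resolve_entity("T", convert_html=True)`, which yields `&amp;T`. With no conversion requested, `carol` comes back as `&carol;`.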

    def handle_decl(self, data):
        "Handle DOCTYPEs and the like as Declaration objects."
        self._toStringSubclass(data, Declaration)

    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data. Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
            k = self.rawdata.find(']]>', i)
            if k == -1:
                k = len(self.rawdata)
            data = self.rawdata[i+9:k]
            j = k+3
            self._toStringSubclass(data, CData)
        else:
            try:
                j = HTMLParser.parse_declaration(self, i)
            except HTMLParseError:
                toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)
        return j
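
The CDATA branch of `parse_declaration` is plain index arithmetic on the raw buffer, so it can be exercised without a parser. A minimal sketch, assuming Python 3; the helper name `split_cdata` and its tuple return value are hypothetical and chosen only for illustration.

```python
def split_cdata(rawdata, i):
    """Extract a CDATA section starting at index `i`.

    Returns (content, next_index), or (None, i) when no CDATA section
    starts at `i`. An unterminated section runs to the end of the buffer.
    """
    opener = '<![CDATA['
    if rawdata[i:i + len(opener)] != opener:
        return None, i
    k = rawdata.find(']]>', i)
    if k == -1:
        # No terminator: treat the rest of the buffer as content.
        k = len(rawdata)
    content = rawdata[i + len(opener):k]
    # Skip past the three-character terminator, as the method above does.
    return content, k + 3


if __name__ == "__main__":
    text = "a<![CDATA[x<y]]>b"
    content, j = split_cdata(text, 1)
    print(repr(content), repr(text[j:]))
```

As in the method above, the returned index is `k + 3` even when the terminator is missing, so for an unterminated section it points past the end of the buffer.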


class BeautifulStoneSoup(Tag):

    """This class contains the basic parser and search code. It defines
    a parser that knows nothing about tag behavior except for the
    following:

      You can't close a tag without closing all the tags it encloses.
      That is, "<foo><bar></foo>" actually means
      "<foo><bar></bar></foo>".

    [Another possible explanation is "<foo><bar /></foo>", but since
    this class defines no SELF_CLOSING_TAGS, it will never use that
    explanation.]

    This class is useful for parsing XML or made-up markup languages,
    or when BeautifulSoup makes an assumption counter to what you were
    expecting."""

    SELF_CLOSING_TAGS = {}
    NESTABLE_TAGS = {}
    RESET_NESTING_TAGS = {}
    QUOTE_TAGS = {}
    PRESERVE_WHITESPACE_TAGS = []

    MARKUP_MASSAGE = [(re.compile(r'(<[^<>]*)/>'),
                       lambda x: x.group(1) + ' />'),
                      (re.compile(r'<!\s+([^<>]*)>'),
                       lambda x: '<!' + x.group(1) + '>')
                      ]
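
To see what the two MARKUP_MASSAGE rules do to input, they can be applied outside the class. This is a sketch, assuming each (pattern, replacement) pair is applied in order with `re.sub`; the helper name `massage` is hypothetical, and the rule list is copied from the definition above.

```python
import re

# Same two rules as MARKUP_MASSAGE above.
MARKUP_MASSAGE = [
    # "<br/>" -> "<br />": put a space before the self-closing slash.
    (re.compile(r'(<[^<>]*)/>'), lambda m: m.group(1) + ' />'),
    # "<! --Comment-->" -> "<!--Comment-->": drop whitespace after "<!".
    (re.compile(r'<!\s+([^<>]*)>'), lambda m: '<!' + m.group(1) + '>'),
]


def massage(markup, rules=MARKUP_MASSAGE):
    """Apply each (compiled pattern, replacement function) pair in order."""
    for pattern, fix in rules:
        markup = pattern.sub(fix, markup)
    return markup


if __name__ == "__main__":
    print(massage("<p>one<br/>two</p>"))
    print(massage("<! --Comment-->"))
```

A custom rule list has the same shape: each entry pairs a compiled pattern with a function that takes the match object and returns the replacement string.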

    ROOT_TAG_NAME = u'[document]'

    HTML_ENTITIES = "html"
    XML_ENTITIES = "xml"
    XHTML_ENTITIES = "xhtml"
    # TODO: This only exists for backwards-compatibility
    ALL_ENTITIES = XHTML_ENTITIES

    # Used when determining whether a text node is all whitespace and
    # can be replaced with a single space. A text node that contains
    # fancy Unicode spaces (usually non-breaking) should be left
    # alone.
    STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, }

    def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None,
                 markupMassage=True, smartQuotesTo=XML_ENTITIES,
                 convertEntities=None, selfClosingTags=None, isHTML=False,
                 builder=HTMLParserBuilder):
        """The Soup object is initialized as the 'root tag', and the
        provided markup (which can be a string or a file-like object)
        is fed into the underlying parser.

        HTMLParser will process most bad HTML, and the BeautifulSoup
        class has some tricks for dealing with some HTML that kills
        HTMLParser, but Beautiful Soup can nonetheless choke or lose data
        if your data uses self-closing tags or declarations
        incorrectly.

        By default, Beautiful Soup uses regexes to sanitize input,
        avoiding the vast majority of these problems. If the problems
        don't apply to you, pass in False for markupMassage, and
        you'll get better performance.

        The default parser massage techniques fix the two most common
        instances of invalid HTML that choke HTMLParser:

         <br/> (No space between the tag name and the closing "/>")
         <! --Comment--> (Extraneous whitespace in declaration)

        You can pass in a custom list of (RE object, replace method)
        tuples to get Beautiful Soup to scrub your input the way you
        want."""

        self.parseOnlyThese = parseOnlyThese
        self.fromEncoding = fromEncoding
        self.smartQuotesTo = smartQuotesTo
        self.convertEntities = convertEntities
        # Set the rules for how we'll deal with the entities we
        # encounter
        if self.convertEntities:
            # It doesn't make sense to convert encoded characters to
            # entities even while you're converting entities to Unicode.
            # Just convert it all to Unicode.
            self.smartQuotesTo = None
            if convertEntities == self.HTML_ENTITIES:
                self.convertXMLEntities = False
                self.convertHTMLEntities = True
                self.escapeUnrecognizedEntities = True
            elif convertEntities == self.XHTML_ENTITIES:
                self.convertXMLEntities = True
                self.convertHTMLEntities = True
                self.escapeUnrecognizedEntities = False
            elif convertEntities == self.XML_ENTITIES:
                self.convertXMLEntities = True
                self.convertHTMLEntities = False
                self.escapeUnrecognizedEntities = False
        else:
            self.convertXMLEntities = False
            self.convertHTMLEntities = False
            self.escapeUnrecognizedEntities = False

        self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
        self.builder = builder(self)
        self.reset()

        if hasattr(markup, 'read'):        # It's a file-type object.
            markup = markup.read()
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1227
        self.markup = markup
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1228
        self.markupMassage = markupMassage
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1229
        try:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1230
            self._feed(isHTML=isHTML)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1231
        except StopParsing:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1232
            pass
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1233
        self.markup = None                 # The markup can now be GCed.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1234
        self.builder = None                # So can the builder.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1235
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1236
    def _feed(self, inDocumentEncoding=None, isHTML=False):
        # Convert the document to Unicode.
        markup = self.markup
        if isinstance(markup, unicode):
            if not hasattr(self, 'originalEncoding'):
                self.originalEncoding = None
        else:
            dammit = UnicodeDammit\
                     (markup, [self.fromEncoding, inDocumentEncoding],
                      smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
            markup = dammit.unicode
            self.originalEncoding = dammit.originalEncoding
            self.declaredHTMLEncoding = dammit.declaredHTMLEncoding
        if markup:
            if self.markupMassage:
                if not isList(self.markupMassage):
                    self.markupMassage = self.MARKUP_MASSAGE
                for fix, m in self.markupMassage:
                    markup = fix.sub(m, markup)
                # TODO: We get rid of markupMassage so that the
                # soup object can be deepcopied later on. Some
                # Python installations can't copy regexes. If anyone
                # was relying on the existence of markupMassage, this
                # might cause problems.
                del(self.markupMassage)
        self.builder.reset()

        self.builder.feed(markup)
        # Close out any unfinished strings and close all the open tags.
        self.endData()
        while self.currentTag.name != self.ROOT_TAG_NAME:
            self.popTag()

    def isSelfClosingTag(self, name):
        """Returns true iff the given string is the name of a
        self-closing tag according to this parser."""
        return self.SELF_CLOSING_TAGS.has_key(name) \
               or self.instanceSelfClosingTags.has_key(name)

    def reset(self):
        Tag.__init__(self, self, self.ROOT_TAG_NAME)
        self.hidden = 1
        self.builder.reset()
        self.currentData = []
        self.currentTag = None
        self.tagStack = []
        self.quoteStack = []
        self.pushTag(self)

    def popTag(self):
        tag = self.tagStack.pop()
        # Tags with just one string-owning child get the child as a
        # 'string' property, so that soup.tag.string is shorthand for
        # soup.tag.contents[0]
        if len(self.currentTag.contents) == 1 and \
           isinstance(self.currentTag.contents[0], NavigableString):
            self.currentTag.string = self.currentTag.contents[0]

        #print "Pop", tag.name
        if self.tagStack:
            self.currentTag = self.tagStack[-1]
        return self.currentTag

    def pushTag(self, tag):
        #print "Push", tag.name
        if self.currentTag:
            self.currentTag.contents.append(tag)
        self.tagStack.append(tag)
        self.currentTag = self.tagStack[-1]

    def endData(self, containerClass=NavigableString):
        if self.currentData:
            currentData = u''.join(self.currentData)
            if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
                not set([tag.name for tag in self.tagStack]).intersection(
                    self.PRESERVE_WHITESPACE_TAGS)):
                if '\n' in currentData:
                    currentData = '\n'
                else:
                    currentData = ' '
            self.currentData = []
            if self.parseOnlyThese and len(self.tagStack) <= 1 and \
                   (not self.parseOnlyThese.text or \
                    not self.parseOnlyThese.search(currentData)):
                return
            o = containerClass(currentData)
            o.setup(self.currentTag, self.previous)
            if self.previous:
                self.previous.next = o
            self.previous = o
            self.currentTag.contents.append(o)


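# Illustrative sketch, not part of Beautiful Soup: the whitespace rule that
# endData() applies to buffered text, restated as a standalone Python 3
# function. The names collapse_whitespace, ASCII_SPACES and PRESERVE are
# hypothetical, and the set of whitespace-preserving tags is an assumption
# made for the example rather than a value taken from this file.

```python
ASCII_SPACES = " \t\n\x0c\r"                # tab, LF, FF, CR and space
PRESERVE = frozenset(["pre", "textarea"])   # assumed whitespace-preserving tags


def collapse_whitespace(chunks, open_tag_names, preserve=PRESERVE):
    """Join buffered text chunks into one string.

    Text made up only of ASCII whitespace is reduced to a single newline
    (if it contained one) or a single space, unless one of the currently
    open tags asks for whitespace to be preserved.
    """
    text = "".join(chunks)
    whitespace_only = text.strip(ASCII_SPACES) == ""
    inside_preserved = bool(set(open_tag_names) & set(preserve))
    if whitespace_only and not inside_preserved:
        return "\n" if "\n" in text else " "
    return text


if __name__ == "__main__":
    print(repr(collapse_whitespace(["  ", "\n", "\t"], ["p"])))
    print(repr(collapse_whitespace(["  \n "], ["pre"])))
```

# The sketch leaves out the parseOnlyThese filter and the linking of the new
# string into the tree, which the method above does after this step.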
    def _popToTag(self, name, inclusivePop=True):
        """Pops the tag stack up to and including the most recent
        instance of the given tag. If inclusivePop is false, pops the tag
        stack up to but *not* including the most recent instance of
        the given tag."""
        #print "Popping to %s" % name
        if name == self.ROOT_TAG_NAME:
            return

        numPops = 0
        mostRecentTag = None
        for i in range(len(self.tagStack)-1, 0, -1):
            if name == self.tagStack[i].name:
                numPops = len(self.tagStack)-i
                break
        if not inclusivePop:
            numPops = numPops - 1

        for i in range(0, numPops):
            mostRecentTag = self.popTag()
        return mostRecentTag

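# Illustrative sketch, not part of Beautiful Soup: the pop-count arithmetic
# of _popToTag() applied to a plain list of tag names instead of Tag objects.
# The function name pop_to_tag and the root placeholder value are
# assumptions made for the example. Written for Python 3.

```python
ROOT_TAG_NAME = "[document]"  # assumed placeholder for the root entry


def pop_to_tag(stack, name, inclusive=True):
    """Return a copy of `stack` popped back to the most recent `name`.

    With inclusive=False the matching entry itself stays on the stack.
    If `name` is not on the stack, nothing is popped.
    """
    stack = list(stack)
    if name == ROOT_TAG_NAME:
        return stack
    num_pops = 0
    # Walk from the top of the stack down to, but not including, index 0,
    # so the root entry is never a candidate.
    for i in range(len(stack) - 1, 0, -1):
        if stack[i] == name:
            num_pops = len(stack) - i
            break
    if not inclusive:
        num_pops -= 1
    for _ in range(num_pops):  # a negative count yields no iterations
        stack.pop()
    return stack


if __name__ == "__main__":
    open_tags = [ROOT_TAG_NAME, "html", "body", "p", "b"]
    print(pop_to_tag(open_tags, "p"))
    print(pop_to_tag(open_tags, "p", inclusive=False))
```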
    def _smartPop(self, name):
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1352
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1353
        """We need to pop up to the previous tag of this type, unless
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1354
        one of this tag's nesting reset triggers comes between this
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1355
        tag and the previous tag of this type, OR unless this tag is a
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1356
        generic nesting trigger and another generic nesting trigger
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1357
        comes between this tag and the previous tag of this type.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1358
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1359
        Examples:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1360
         <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1361
         <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1362
         <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1363
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1364
         <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1365
         <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1366
         <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1367
        """
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1368
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1369
        nestingResetTriggers = self.NESTABLE_TAGS.get(name)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1370
        isNestable = nestingResetTriggers != None
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1371
        isResetNesting = self.RESET_NESTING_TAGS.has_key(name)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1372
        popTo = None
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1373
        inclusive = True
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1374
        for i in range(len(self.tagStack)-1, 0, -1):
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1375
            p = self.tagStack[i]
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1376
            if (not p or p.name == name) and not isNestable:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
                #Non-nestable tags get popped to the top or to their
                #last occurrence.
                popTo = name
                break
            if (nestingResetTriggers is not None
                and p.name in nestingResetTriggers) \
                or (nestingResetTriggers is None and isResetNesting
                    and p.name in self.RESET_NESTING_TAGS):

                #If we encounter one of the nesting reset triggers
                #peculiar to this tag, or we encounter another tag
                #that causes nesting to reset, pop up to but not
                #including that tag.
                popTo = p.name
                inclusive = False
                break
            p = p.parent
        if popTo:
            self._popToTag(popTo, inclusive)

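# The popping rules above can be sketched on a plain list of open tag
# names. This is a simplified illustration, not the parser's own code:
# smart_pop, NESTABLE and RESET_NESTING are hypothetical names, the two
# tables are cut-down stand-ins for NESTABLE_TAGS and RESET_NESTING_TAGS,
# and the sketch scans the whole list rather than skipping the root entry.

```python
# Cut-down, assumed tables: tag -> tags that reset its nesting.
NESTABLE = {
    'blockquote': [],
    'table': [],
    'tr': ['table', 'tbody', 'tfoot', 'thead'],
    'td': ['tr'],
}
RESET_NESTING = set(['blockquote', 'table', 'tr', 'td', 'p'])


def smart_pop(open_tags, name):
    """Return the open-tag list as it should look before opening `name`."""
    triggers = NESTABLE.get(name)
    is_nestable = triggers is not None
    is_reset = name in RESET_NESTING
    # Walk from the innermost open tag outwards.
    for i in range(len(open_tags) - 1, -1, -1):
        parent = open_tags[i]
        if parent == name and not is_nestable:
            # Non-nestable tag: close everything up to and including
            # the previous occurrence.
            return open_tags[:i]
        if (triggers is not None and parent in triggers) or \
           (triggers is None and is_reset and parent in RESET_NESTING):
            # Nesting is reset here: close up to but not including it.
            return open_tags[:i + 1]
    # Nothing to close.
    return list(open_tags)
```

# For example, smart_pop(['table', 'tr'], 'tr') closes the open <tr>,
# while smart_pop(['tr', 'table'], 'tr') leaves both tags open.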
    def unknown_starttag(self, name, attrs, selfClosing=0):
        #print "Start tag %s: %s" % (name, attrs)
        if self.quoteStack:
            #This is not a real tag.
            #print "<%s> is not real!" % name
            attrs = ''.join([' %s="%s"' % (x, y) for x, y in attrs])
            self.handle_data('<%s%s>' % (name, attrs))
            return
        self.endData()

        if not self.isSelfClosingTag(name) and not selfClosing:
            self._smartPop(name)

        if self.parseOnlyThese and len(self.tagStack) <= 1 \
               and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
            return

        tag = Tag(self, name, attrs, self.currentTag, self.previous)
        if self.previous:
            self.previous.next = tag
        self.previous = tag
        self.pushTag(tag)
        if selfClosing or self.isSelfClosingTag(name):
            self.popTag()
        if name in self.QUOTE_TAGS:
            #print "Beginning quote (%s)" % name
            self.quoteStack.append(name)
            self.literal = 1
        return tag

    def unknown_endtag(self, name):
        #print "End tag %s" % name
        if self.quoteStack and self.quoteStack[-1] != name:
            #This is not a real end tag.
            #print "</%s> is not real!" % name
            self.handle_data('</%s>' % name)
            return
        self.endData()
        self._popToTag(name)
        if self.quoteStack and self.quoteStack[-1] == name:
            self.quoteStack.pop()
            self.literal = (len(self.quoteStack) > 0)

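# A minimal sketch of the quote-stack idea used by unknown_starttag and
# unknown_endtag: while a quoting tag such as <script> is open, any other
# start or end tag is recorded as text rather than as markup. The class
# QuoteAwareHandler and the helper demo_events are hypothetical names made
# up for this illustration; they are not part of Beautiful Soup.

```python
class QuoteAwareHandler(object):
    """Toy event recorder; an assumed stand-in, not the real parser."""

    QUOTE_TAGS = {'script': None, 'textarea': None}

    def __init__(self):
        self.quote_stack = []
        self.events = []

    def start(self, name):
        if self.quote_stack:
            # Inside a quoting tag: treat the start tag as literal text.
            self.events.append(('text', '<%s>' % name))
            return
        self.events.append(('start', name))
        if name in self.QUOTE_TAGS:
            self.quote_stack.append(name)

    def end(self, name):
        if self.quote_stack and self.quote_stack[-1] != name:
            # Not the end of the quoting tag: literal text again.
            self.events.append(('text', '</%s>' % name))
            return
        self.events.append(('end', name))
        if self.quote_stack and self.quote_stack[-1] == name:
            self.quote_stack.pop()


def demo_events():
    """Feed <script><b></b></script><b> through the toy handler."""
    handler = QuoteAwareHandler()
    handler.start('script')
    handler.start('b')
    handler.end('b')
    handler.end('script')
    handler.start('b')
    return handler.events
```

# In demo_events, the <b> inside <script> is recorded as text, while the
# <b> after </script> is recorded as a real start tag.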
    def handle_data(self, data):
        self.currentData.append(data)

    def extractCharsetFromMeta(self, attrs):
        self.unknown_starttag('meta', attrs)


class BeautifulSoup(BeautifulStoneSoup):

    """This parser knows the following facts about HTML:

    * Some tags have no closing tag and should be interpreted as being
      closed as soon as they are encountered.

    * The text inside some tags (e.g. 'script') may contain tags which
      are not really part of the document and which should be parsed
      as text, not tags. If you want to parse the text as tags, you can
      always fetch it and parse it explicitly.

    * Tag nesting rules:

      Most tags can't be nested at all. For instance, the occurrence of
      a <p> tag should implicitly close the previous <p> tag.

       <p>Para1<p>Para2
        should be transformed into:
       <p>Para1</p><p>Para2

      Some tags can be nested arbitrarily. For instance, the occurrence
      of a <blockquote> tag should _not_ implicitly close the previous
      <blockquote> tag.

       Alice said: <blockquote>Bob said: <blockquote>Blah
        should NOT be transformed into:
       Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

      Some tags can be nested, but the nesting is reset by the
      interposition of other tags. For instance, a <tr> tag should
      implicitly close the previous <tr> tag within the same <table>,
      but not close a <tr> tag in another table.

       <table><tr>Blah<tr>Blah
        should be transformed into:
       <table><tr>Blah</tr><tr>Blah
        but,
       <tr>Blah<table><tr>Blah
        should NOT be transformed into
       <tr>Blah<table></tr><tr>Blah

    Differing assumptions about tag nesting rules are a major source
    of problems with the BeautifulSoup class. If BeautifulSoup is not
    treating as nestable a tag your page author treats as nestable,
    try ICantBelieveItsBeautifulSoup, MinimalSoup, or
    BeautifulStoneSoup before writing your own subclass."""

    def __init__(self, *args, **kwargs):
        if 'smartQuotesTo' not in kwargs:
            kwargs['smartQuotesTo'] = self.HTML_ENTITIES
        kwargs['isHTML'] = True
        BeautifulStoneSoup.__init__(self, *args, **kwargs)

    SELF_CLOSING_TAGS = buildTagMap(None,
                                    ['br' , 'hr', 'input', 'img', 'meta',
                                    'spacer', 'link', 'frame', 'base'])

    PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])

    QUOTE_TAGS = {'script' : None, 'textarea' : None}

    #According to the HTML standard, each of these inline tags can
    #contain another tag of the same type. Furthermore, it's common
    #to actually use these tags this way.
    NESTABLE_INLINE_TAGS = ['span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
                            'center']

    #According to the HTML standard, these block tags can contain
    #another tag of the same type. Furthermore, it's common
    #to actually use these tags this way.
    NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']

    #Lists can contain other lists, but there are restrictions.
    NESTABLE_LIST_TAGS = { 'ol' : [],
                           'ul' : [],
                           'li' : ['ul', 'ol'],
                           'dl' : [],
                           'dd' : ['dl'],
                           'dt' : ['dl'] }

    #Tables can contain other tables, but there are restrictions.
    NESTABLE_TABLE_TAGS = {'table' : [],
                           'tr' : ['table', 'tbody', 'tfoot', 'thead'],
                           'td' : ['tr'],
                           'th' : ['tr'],
                           'thead' : ['table'],
                           'tbody' : ['table'],
                           'tfoot' : ['table'],
                           }

    NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre']

    #If one of these tags is encountered, all tags up to the next tag of
    #this type are popped.
    RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
                                     NON_NESTABLE_BLOCK_TAGS,
                                     NESTABLE_LIST_TAGS,
                                     NESTABLE_TABLE_TAGS)

    NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
                                NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)

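# buildTagMap is defined elsewhere in this file and is not shown in this
# section. Judging only from the calls above, it appears to merge lists,
# dictionaries and single tag names into one dictionary keyed by tag name.
# The function below is a guess at that behavior under that assumption;
# build_tag_map is a hypothetical name and may differ from the real helper.

```python
def build_tag_map(default, *args):
    """Merge tag collections into one dict (assumed behavior).

    Dictionaries contribute their own key/value pairs; lists, tuples and
    sets contribute each item mapped to `default`; anything else is taken
    to be a single tag name mapped to `default`.
    """
    built = {}
    for portion in args:
        if hasattr(portion, 'items'):
            for key, value in portion.items():
                built[key] = value
        elif isinstance(portion, (list, tuple, set)):
            for key in portion:
                built[key] = default
        else:
            built[portion] = default
    return built
```

# Under this reading, a list entry such as 'p' ends up mapped to the
# default, while a dictionary entry such as 'tr' keeps its own list of
# nesting reset triggers.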
    # Used to detect the charset in a META tag; see extractCharsetFromMeta
    CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)

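# To show what CHARSET_RE matches, the sketch below applies the same
# pattern to the content attribute of a META tag. The helper names
# find_charset and replace_charset are hypothetical and exist only for
# this illustration; they are not methods of the parser.

```python
import re

# Same pattern as CHARSET_RE above: group 1 is the "charset=" prefix
# (with any leading ';' and whitespace), group 3 is the charset value.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)


def find_charset(content):
    """Return the declared charset, or None if there is none."""
    match = CHARSET_RE.search(content)
    if match is None:
        return None
    return match.group(3)


def replace_charset(content, new_charset):
    """Swap the declared charset while keeping the prefix intact."""
    return CHARSET_RE.sub(lambda m: m.group(1) + new_charset, content)
```

# For a value such as "text/html; charset=ISO-8859-1", find_charset picks
# out the part after "charset=", and replace_charset rewrites only that
# part.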
    def extractCharsetFromMeta(self, attrs):
        """Beautiful Soup can detect a charset included in a META tag,
        try to convert the document to that charset, and re-parse the
        document from the beginning."""
        httpEquiv = None
        contentType = None
        contentTypeIndex = None
        tagNeedsEncodingSubstitution = False

        for i in range(0, len(attrs)):
            key, value = attrs[i]
            key = key.lower()
            if key == 'http-equiv':
                httpEquiv = value
            elif key == 'content':
                contentType = value
                contentTypeIndex = i

        if httpEquiv and contentType: # It's an interesting meta tag.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1572
            match = self.CHARSET_RE.search(contentType)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1573
            if match:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1574
                if (self.declaredHTMLEncoding is not None or
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1575
                    self.originalEncoding == self.fromEncoding):
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1576
                    # An HTML encoding was sniffed while converting
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1577
                    # the document to Unicode, or an HTML encoding was
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1578
                    # sniffed during a previous pass through the
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1579
                    # document, or an encoding was specified
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1580
                    # explicitly and it worked. Rewrite the meta tag.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1581
                    def rewrite(match):
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1582
                        return match.group(1) + "%SOUP-ENCODING%"
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1583
                    newAttr = self.CHARSET_RE.sub(rewrite, contentType)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1584
                    attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1585
                                               newAttr)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1586
                    tagNeedsEncodingSubstitution = True
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1587
                else:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1588
                    # This is our first pass through the document.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1589
                    # Go through it again with the encoding information.
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1590
                    newCharset = match.group(3)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1591
                    if newCharset and newCharset != self.originalEncoding:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1592
                        self.declaredHTMLEncoding = newCharset
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1593
                        self._feed(self.declaredHTMLEncoding)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1594
                        raise StopParsing
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1595
                    pass
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1596
        tag = self.unknown_starttag("meta", attrs)
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1597
        if tag and tagNeedsEncodingSubstitution:
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1598
            tag.containsSubstitutions = True
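
# Illustration only, not part of the module: a minimal standalone sketch
# of what the CHARSET_RE pattern above captures and how the placeholder
# rewrite works. The helper names find_charset and mask_charset are
# hypothetical and exist only for this example.

```python
import re

# Same pattern as CHARSET_RE above, written as a raw string.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)

def find_charset(content):
    """Return the charset named in a Content-Type value, or None."""
    match = CHARSET_RE.search(content)
    if match is None:
        return None
    # Group 3 is everything after "charset=" up to the next ";".
    return match.group(3)

def mask_charset(content):
    """Swap the declared charset for the %SOUP-ENCODING% placeholder."""
    # Group 1 keeps the leading ";", whitespace and "charset=" intact.
    return CHARSET_RE.sub(lambda m: m.group(1) + "%SOUP-ENCODING%", content)

content = "text/html; charset=ISO-8859-1"
print(find_charset(content))   # ISO-8859-1
print(mask_charset(content))   # text/html; charset=%SOUP-ENCODING%
```

# Keeping group 1 in the replacement means only the charset value
# changes, so the rest of the content attribute is left as written.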


class StopParsing(Exception):
    pass

class ICantBelieveItsBeautifulSoup(BeautifulSoup):

    """The BeautifulSoup class is oriented towards skipping over
    common HTML errors like unclosed tags. However, sometimes it makes
    errors of its own. For instance, consider this fragment:

     <b>Foo<b>Bar</b></b>

    This is perfectly valid (if bizarre) HTML. However, the
    BeautifulSoup class will implicitly close the first 'b' tag when it
    encounters the second 'b'. It will think the author wrote
    "<b>Foo<b>Bar", and didn't close the first 'b' tag, because
    there's no real-world reason to bold something that's already
    bold. When it encounters '</b></b>' it will close two more 'b'
    tags, for a grand total of three tags closed instead of two. This
    can throw off the rest of your document structure. The same is
    true of a number of other tags, listed below.

    It's much more common for someone to forget to close a 'b' tag
    than to actually use nested 'b' tags, and the BeautifulSoup class
    handles the common case. This class handles the not-so-common
    case: where you can't believe someone wrote what they did, but
    it's valid HTML and BeautifulSoup screwed up by assuming it
    wouldn't be."""

    I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = \
     ['em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym', 'strong',
      'cite', 'code', 'dfn', 'kbd', 'samp', 'var', 'b']

    I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ['noscript']

    NESTABLE_TAGS = buildTagMap([], BeautifulSoup.NESTABLE_TAGS,
                                I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS,
                                I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS)
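
# Illustration only, not part of the module: a toy model of the nesting
# decision described in the docstring above. It is not the parser's real
# tag-popping logic; close_innermost and place_text are hypothetical
# helpers that track only a stack of open tag names, assuming a repeated
# non-nestable tag is treated as a forgotten close tag.

```python
def close_innermost(stack, name):
    """Pop the innermost open `name` tag and everything nested in it."""
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] == name:
            del stack[i:]
            return

def place_text(tokens, nestable):
    """Return (text, open_tags) pairs, one for each text token."""
    stack, placed = [], []
    for tok in tokens:
        if tok.startswith("</"):
            # A close tag with no matching open tag is simply ignored.
            close_innermost(stack, tok[2:-1])
        elif tok.startswith("<"):
            name = tok[1:-1]
            if name not in nestable:
                # Treat a repeated tag as a forgotten close tag.
                close_innermost(stack, name)
            stack.append(name)
        else:
            placed.append((tok, list(stack)))
    return placed

tokens = ["<b>", "Foo", "<b>", "Bar", "</b>", "</b>"]
print(place_text(tokens, nestable=set()))   # [('Foo', ['b']), ('Bar', ['b'])]
print(place_text(tokens, nestable={"b"}))   # [('Foo', ['b']), ('Bar', ['b', 'b'])]
```

# With 'b' marked nestable, "Bar" stays inside both 'b' tags, which is
# the behaviour ICantBelieveItsBeautifulSoup opts into for the tags in
# its two lists.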

class MinimalSoup(BeautifulSoup):
    """The MinimalSoup class is for parsing HTML that contains
    pathologically bad markup. It makes no assumptions about tag
    nesting, but it does know which tags are self-closing, that
    <script> tags contain JavaScript and should not be parsed, that
    META tags may contain encoding information, and so on.

    This also makes it better for subclassing than BeautifulStoneSoup
    or BeautifulSoup."""

    RESET_NESTING_TAGS = buildTagMap('noscript')
    NESTABLE_TAGS = {}

class BeautifulSOAP(BeautifulStoneSoup):
    """This class will push a tag with only a single string child into
    the tag's parent as an attribute. The attribute's name is the tag
    name, and the value is the string child. An example should give
    the flavor of the change:

    <foo><bar>baz</bar></foo>
     =>
    <foo bar="baz"><bar>baz</bar></foo>

    You can then access fooTag['bar'] instead of fooTag.barTag.string.

    This is, of course, useful for scraping structures that tend to
    use subelements instead of attributes, such as SOAP messages. Note
    that it modifies its input, so don't print the modified version
    out.

    I'm not sure how many people really want to use this class; let me
    know if you do. Mainly I like the name."""

    def popTag(self):
        if len(self.tagStack) > 1:
            tag = self.tagStack[-1]
            parent = self.tagStack[-2]
            parent._getAttrMap()
            if (isinstance(tag, Tag) and len(tag.contents) == 1 and
                isinstance(tag.contents[0], NavigableString) and
                tag.name not in parent.attrMap):
                parent[tag.name] = tag.contents[0]
        BeautifulStoneSoup.popTag(self)
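
# Illustration only, not part of the module: a rough sketch of the same
# "promote a text-only child to a parent attribute" idea, written against
# the standard library's xml.etree.ElementTree rather than Beautiful Soup
# so that it runs on its own. promote_text_children is a hypothetical
# helper name.

```python
import xml.etree.ElementTree as ET

def promote_text_children(element):
    """Copy each text-only child into an attribute of its parent."""
    for child in element:
        promote_text_children(child)
        has_only_text = len(child) == 0 and child.text is not None
        # Like popTag above, never overwrite an attribute that already exists.
        if has_only_text and child.tag not in element.attrib:
            element.set(child.tag, child.text)

root = ET.fromstring("<foo><bar>baz</bar></foo>")
promote_text_children(root)
print(root.get("bar"))        # baz
print(root.find("bar").text)  # baz
```

# As in BeautifulSOAP, the child element is kept, so the value is
# reachable both as an attribute and as a subelement.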

# Enterprise class names! It has come to our attention that some people
# think the names of the Beautiful Soup parser classes are too silly
# and "unprofessional" for use in enterprise screen-scraping. We feel
# your pain! For such-minded folk, the Beautiful Soup Consortium And
# All-Night Kosher Bakery recommends renaming this file to
# "RobustParser.py" (or, in cases of extreme enterprisiness,
# "RobustParserBeanInterface.class") and using the following
# enterprise-friendly class aliases:
class RobustXMLParser(BeautifulStoneSoup):
    pass
class RobustHTMLParser(BeautifulSoup):
    pass
class RobustWackAssHTMLParser(ICantBelieveItsBeautifulSoup):
    pass
class RobustInsanelyWackAssHTMLParser(MinimalSoup):
    pass
class SimplifyingSOAPParser(BeautifulSOAP):
    pass

######################################################
#
# Bonus library: Unicode, Dammit
#
# This class forces XML data into a standard format (usually to UTF-8
# or Unicode).  It is heavily based on code from Mark Pilgrim's
# Universal Feed Parser. It does not rewrite the XML or HTML to
# reflect a new encoding: that happens in BeautifulStoneSoup.handle_pi
# (XML) and BeautifulSoup.start_meta (HTML).

# Autodetects character encodings.
# Download from http://chardet.feedparser.org/
try:
    import chardet
#    import chardet.constants
#    chardet.constants._debug = 1
except ImportError:
    chardet = None
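
# Illustration only, not part of the module: a minimal sketch of how an
# optional chardet import like the one above can be used with a safe
# fallback. The to_unicode helper and its candidate order are assumptions
# made for this example and do not reproduce the order UnicodeDammit uses.

```python
try:
    import chardet
except ImportError:
    chardet = None

def to_unicode(data, candidates=("utf-8",)):
    """Decode bytes, trying explicit candidates before guessing."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    if chardet is not None:
        # chardet.detect returns a dict whose "encoding" value may be None.
        guess = chardet.detect(data).get("encoding")
        if guess:
            try:
                return data.decode(guess), guess
            except (UnicodeDecodeError, LookupError):
                pass
    # windows-1252 leaves a few bytes undefined, so replace rather than fail.
    return data.decode("windows-1252", "replace"), "windows-1252"

text, used = to_unicode(u"caf\u00e9".encode("utf-8"))
print(used)  # utf-8
```

# Because the guessed encoding is only tried after the explicit
# candidates, the result for valid UTF-8 input is the same whether or
# not chardet is installed.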

# cjkcodecs and iconv_codec make Python know about more character encodings.
# Both are available from http://cjkpython.i18n.org/
# They're built in if you use Python 2.4.
try:
    import cjkcodecs.aliases
except ImportError:
    pass
try:
    import iconv_codec
except ImportError:
    pass

class UnicodeDammit:
    """A class for detecting the encoding of a *ML document and
    converting it to a Unicode string. If the source encoding is
    windows-1252, it can replace MS smart quotes with their HTML or XML
    equivalents."""

    # This dictionary maps commonly seen values for "charset" in HTML
    # meta tags to the corresponding Python codec names. It only covers
    # values that aren't in Python's aliases and can't be determined
    # by the heuristics in find_codec.
    CHARSET_ALIASES = { "macintosh" : "mac-roman",
                        "x-sjis" : "shift-jis" }

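The alias table above maps charset labels seen in HTML meta tags to names Python's codec registry accepts. The `find_codec` method it mentions is not shown in this part of the file, so the following is only a minimal Python 3 sketch of how such a table might be consulted. `resolve_codec` is a hypothetical helper, not part of this module, and it relies only on the standard `codecs.lookup` call.

```python
import codecs

# Same two entries as the CHARSET_ALIASES table above.
CHARSET_ALIASES = {"macintosh": "mac-roman", "x-sjis": "shift-jis"}


def resolve_codec(charset):
    """Return a codec name Python can use for charset, or None.

    Hypothetical helper for illustration only.
    """
    if not charset:
        return None
    # Translate known web-only labels first, then ask the codec registry.
    name = CHARSET_ALIASES.get(charset.lower(), charset)
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None  # the registry does not know this label
```

Calling `resolve_codec("x-sjis")` yields a usable codec name, while an unknown label such as `"no-such-charset"` yields `None`, so a caller can move on to the next candidate encoding.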
    def __init__(self, markup, overrideEncodings=[],
                 smartQuotesTo='xml', isHTML=False):
        self.declaredHTMLEncoding = None
        self.markup, documentEncoding, sniffedEncoding = \
                     self._detectEncoding(markup, isHTML)
        self.smartQuotesTo = smartQuotesTo
        self.triedEncodings = []
        if markup == '' or isinstance(markup, unicode):
            self.originalEncoding = None
            self.unicode = unicode(markup)
            return

        u = None
        for proposedEncoding in overrideEncodings:
            u = self._convertFrom(proposedEncoding)
            if u: break
        if not u:
            for proposedEncoding in (documentEncoding, sniffedEncoding):
                u = self._convertFrom(proposedEncoding)
                if u: break

        # If that fails and the chardet auto-detection library is
        # available, try it:
        if not u and chardet and not isinstance(self.markup, unicode):
            u = self._convertFrom(chardet.detect(self.markup)['encoding'])

        # As a last resort, try utf-8 and windows-1252:
        if not u:
            for proposed_encoding in ("utf-8", "windows-1252"):
                u = self._convertFrom(proposed_encoding)
                if u: break

        self.unicode = u
        if not u: self.originalEncoding = None
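The constructor above tries encodings in a fixed order: caller overrides, then the declared and sniffed encodings, then chardet if it is installed, and finally utf-8 and windows-1252. The following is a minimal Python 3 sketch of that fallback idea under simplifying assumptions. `decode_with_fallbacks` is a hypothetical helper, not part of this module, and it leaves out chardet and the smart-quote handling.

```python
def decode_with_fallbacks(data, candidates=()):
    """Return (text, encoding) for the first encoding that decodes data.

    Hypothetical helper for illustration only. Candidates are tried in
    order, followed by utf-8 and windows-1252 as a last resort.
    Returns (None, None) if nothing works.
    """
    tried = []
    for encoding in list(candidates) + ["utf-8", "windows-1252"]:
        if not encoding or encoding in tried:
            continue  # skip missing guesses and anything already attempted
        tried.append(encoding)
        try:
            return data.decode(encoding), encoding
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name, or bytes invalid for it
    return None, None
```

Keeping a list of already tried names mirrors the role of `triedEncodings`: a guess that has failed once is not retried when it shows up again from another source.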

    def _subMSChar(self, match):
        """Changes an MS smart quote character to an XML or HTML
        entity."""
        orig = match.group(1)
        sub = self.MS_CHARS.get(orig)
        if type(sub) == types.TupleType:
            if self.smartQuotesTo == 'xml':
                sub = '&#x'.encode() + sub[1].encode() + ';'.encode()
            else:
                sub = '&'.encode() + sub[0].encode() + ';'.encode()
        else:
            sub = sub.encode()
        return sub
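The method above turns one matched windows-1252 byte into either a numeric XML character reference or a named HTML entity. Here is a small Python 3 sketch of the same substitution idea. The `SMART_CHARS` table is an illustrative four-entry subset written for this example (the module's own `MS_CHARS` table is defined elsewhere), and `replace_smart_quotes` is a hypothetical name. Unlike the method above, unmapped bytes are simply left in place.

```python
import re

# Illustrative subset: windows-1252 byte -> (HTML entity name, hex code point).
SMART_CHARS = {
    b"\x91": ("lsquo", "2018"),
    b"\x92": ("rsquo", "2019"),
    b"\x93": ("ldquo", "201C"),
    b"\x94": ("rdquo", "201D"),
}


def replace_smart_quotes(data, to="xml"):
    """Replace smart-quote bytes in data with XML or HTML entities."""
    def substitute(match):
        entry = SMART_CHARS.get(match.group(1))
        if entry is None:
            return match.group(1)  # leave unmapped bytes untouched
        name, codepoint = entry
        if to == "xml":
            return b"&#x" + codepoint.encode("ascii") + b";"
        return b"&" + name.encode("ascii") + b";"

    # Same byte range the module scans: 0x80 through 0x9f.
    return re.sub(rb"([\x80-\x9f])", substitute, data)
```

The replacement runs on the raw bytes before decoding, so the resulting entities are plain ASCII and survive whichever codec is tried next.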

    def _convertFrom(self, proposed):
        proposed = self.find_codec(proposed)
        if not proposed or proposed in self.triedEncodings:
            return None
        self.triedEncodings.append(proposed)
        markup = self.markup

        # Convert smart quotes to HTML if coming from an encoding
        # that might have them.
        if self.smartQuotesTo and proposed.lower() in ("windows-1252",
                                                       "iso-8859-1",
                                                       "iso-8859-2"):
            smart_quotes_re = "([\x80-\x9f])"
            smart_quotes_compiled = re.compile(smart_quotes_re)
            markup = smart_quotes_compiled.sub(self._subMSChar, markup)

        try:
            # print "Trying to convert document to %s" % proposed
            u = self._toUnicode(markup, proposed)
            self.markup = u
            self.originalEncoding = proposed
        except Exception, e:
            # print "That didn't work!"
            # print e
            return None
        #print "Correct encoding: %s" % proposed
        return self.markup

    def _toUnicode(self, data, encoding):
        '''Given a string and its encoding, decodes the string into Unicode.
        %encoding is a string recognized by encodings.aliases'''

        # strip Byte Order Mark (if present)
        if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
               and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16be'
            data = data[2:]
        elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
                 and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16le'
            data = data[2:]
        elif data[:3] == '\xef\xbb\xbf':
            encoding = 'utf-8'
            data = data[3:]
        elif data[:4] == '\x00\x00\xfe\xff':
            encoding = 'utf-32be'
            data = data[4:]
        elif data[:4] == '\xff\xfe\x00\x00':
            encoding = 'utf-32le'
            data = data[4:]
        newdata = unicode(data, encoding)
        return newdata
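The method above lets a byte order mark override the proposed encoding and strips it before decoding. A compact Python 3 sketch of that step, using the BOM constants from the standard `codecs` module, could look like the following. `strip_bom` and `to_text` are hypothetical names, and checking the longer UTF-32 marks first is a substitute for the explicit length and null-byte tests in the method above, not a copy of them.

```python
import codecs

# Longer BOMs come first: the UTF-32LE mark begins with the UTF-16LE mark.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]


def strip_bom(data, default_encoding):
    """Return (data without BOM, encoding implied by the BOM or the default)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):], encoding
    return data, default_encoding


def to_text(data, encoding):
    """Decode bytes, letting a leading BOM override the proposed encoding."""
    data, encoding = strip_bom(data, encoding)
    return data.decode(encoding)
```

For example, `to_text(b"\xef\xbb\xbfhi", "ascii")` decodes as UTF-8 despite the proposed `"ascii"`, because the mark at the start of the data takes precedence.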

    def _detectEncoding(self, xml_data, isHTML=False):
        """Given a document, tries to detect its XML encoding."""
        xml_encoding = sniffed_xml_encoding = None
        try:
            if xml_data[:4] == '\x4c\x6f\xa7\x94':
                # EBCDIC
                xml_data = self._ebcdic_to_ascii(xml_data)
            elif xml_data[:4] == '\x00\x3c\x00\x3f':
                # UTF-16BE
                sniffed_xml_encoding = 'utf-16be'
                xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \
                     and (xml_data[2:4] != '\x00\x00'):
                # UTF-16BE with BOM
                sniffed_xml_encoding = 'utf-16be'
                xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
            elif xml_data[:4] == '\x3c\x00\x3f\x00':
                # UTF-16LE
                sniffed_xml_encoding = 'utf-16le'
                xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \
                     (xml_data[2:4] != '\x00\x00'):
                # UTF-16LE with BOM
                sniffed_xml_encoding = 'utf-16le'
                xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
            elif xml_data[:4] == '\x00\x00\x00\x3c':
                # UTF-32BE
                sniffed_xml_encoding = 'utf-32be'
                xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
            elif xml_data[:4] == '\x3c\x00\x00\x00':
                # UTF-32LE
                sniffed_xml_encoding = 'utf-32le'
                xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
            elif xml_data[:4] == '\x00\x00\xfe\xff':
                # UTF-32BE with BOM
                sniffed_xml_encoding = 'utf-32be'
                xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
            elif xml_data[:4] == '\xff\xfe\x00\x00':
                # UTF-32LE with BOM
                sniffed_xml_encoding = 'utf-32le'
                xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
            elif xml_data[:3] == '\xef\xbb\xbf':
                # UTF-8 with BOM
                sniffed_xml_encoding = 'utf-8'
                xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
            else:
                sniffed_xml_encoding = 'ascii'
                pass
        except:
            xml_encoding_match = None
        xml_encoding_re = '^<\?.*encoding=[\'"](.*?)[\'"].*\?>'.encode()
        xml_encoding_match = re.compile(xml_encoding_re).match(xml_data)
        if not xml_encoding_match and isHTML:
            meta_re = '<\s*meta[^>]+charset=([^>]*?)[;\'">]'.encode()
            regexp = re.compile(meta_re, re.I)
            xml_encoding_match = regexp.search(xml_data)
        if xml_encoding_match is not None:
            xml_encoding = xml_encoding_match.groups()[0].decode(
                'ascii').lower()
            if isHTML:
                self.declaredHTMLEncoding = xml_encoding
            if sniffed_xml_encoding and \
               (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
                                 'iso-10646-ucs-4', 'ucs-4', 'csucs4',
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1912
                                 'utf-16', 'utf-32', 'utf_16', 'utf_32',
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1913
                                 'utf16', 'u16')):
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1914
                xml_encoding = sniffed_xml_encoding
b3daada52dd3 Add BeautifulSoup Python HTML/XML parser to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
  1915
        return xml_data, xml_encoding, sniffed_xml_encoding
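
The declaration-matching step above can be tried in isolation. The following is a minimal Python 3 sketch, not Beautiful Soup's own code: it reuses the two regular expressions as bytes patterns, and `declared_encoding` is a hypothetical helper name introduced only for illustration.

```python
import re

# The two patterns from the method above, written as bytes literals so
# they can be applied to undecoded input under Python 3.
XML_DECL_RE = re.compile(rb'^<\?.*encoding=[\'"](.*?)[\'"].*\?>')
META_RE = re.compile(rb'<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I)


def declared_encoding(data, is_html=False):
    """Return the lower-cased encoding a document declares, or None.

    Hypothetical helper: it only mirrors the matching order used above
    (XML declaration first, then an HTML <meta> charset as a fallback).
    """
    match = XML_DECL_RE.match(data)
    if match is None and is_html:
        match = META_RE.search(data)
    if match is None:
        return None
    return match.group(1).decode('ascii').lower()
```

The `<meta>` fallback is only consulted when the caller says the input is HTML, which matches the `isHTML` check in the method.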


    def find_codec(self, charset):
        return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
               or (charset and self._codec(charset.replace("-", ""))) \
               or (charset and self._codec(charset.replace("-", "_"))) \
               or charset

    def _codec(self, charset):
        if not charset: return charset
        codec = None
        try:
            codecs.lookup(charset)
            codec = charset
        except (LookupError, ValueError):
            pass
        return codec

    EBCDIC_TO_ASCII_MAP = None
    def _ebcdic_to_ascii(self, s):
        c = self.__class__
        if not c.EBCDIC_TO_ASCII_MAP:
            emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
                    16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
                    128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
                    144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
                    32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
                    38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
                    45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
                    186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
                    195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
                    201,202,106,107,108,109,110,111,112,113,114,203,204,205,
                    206,207,208,209,126,115,116,117,118,119,120,121,122,210,
                    211,212,213,214,215,216,217,218,219,220,221,222,223,224,
                    225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
                    73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
                    82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
                    90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
                    250,251,252,253,254,255)
            import string
            c.EBCDIC_TO_ASCII_MAP = string.maketrans( \
            ''.join(map(chr, range(256))), ''.join(map(chr, emap)))
        return s.translate(c.EBCDIC_TO_ASCII_MAP)
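
`string.maketrans` exists only in Python 2; under Python 3 the equivalent is `bytes.maketrans`. The sketch below shows the same build-once, cache-on-the-class pattern with a deliberately tiny table. The `Translator` class and its three-entry mapping are hypothetical; the three pairs are taken from the `emap` tuple above (positions 193 to 195 hold 65 to 67).

```python
class Translator:
    # Built on first use and then shared by every instance, like
    # EBCDIC_TO_ASCII_MAP above.
    TABLE = None

    def to_ascii(self, data):
        cls = self.__class__
        if cls.TABLE is None:
            # bytes.maketrans returns a 256-byte table; bytes not listed
            # here map to themselves. Only EBCDIC 0xC1-0xC3 ("ABC") are
            # remapped in this illustration.
            cls.TABLE = bytes.maketrans(b'\xc1\xc2\xc3', b'ABC')
        return data.translate(cls.TABLE)
```

Storing the table on the class rather than the instance means the cost of building it is paid once per process.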

    MS_CHARS = { '\x80' : ('euro', '20AC'),
                 '\x81' : ' ',
                 '\x82' : ('sbquo', '201A'),
                 '\x83' : ('fnof', '192'),
                 '\x84' : ('bdquo', '201E'),
                 '\x85' : ('hellip', '2026'),
                 '\x86' : ('dagger', '2020'),
                 '\x87' : ('Dagger', '2021'),
                 '\x88' : ('circ', '2C6'),
                 '\x89' : ('permil', '2030'),
                 '\x8A' : ('Scaron', '160'),
                 '\x8B' : ('lsaquo', '2039'),
                 '\x8C' : ('OElig', '152'),
                 '\x8D' : '?',
                 '\x8E' : ('#x17D', '17D'),
                 '\x8F' : '?',
                 '\x90' : '?',
                 '\x91' : ('lsquo', '2018'),
                 '\x92' : ('rsquo', '2019'),
                 '\x93' : ('ldquo', '201C'),
                 '\x94' : ('rdquo', '201D'),
                 '\x95' : ('bull', '2022'),
                 '\x96' : ('ndash', '2013'),
                 '\x97' : ('mdash', '2014'),
                 '\x98' : ('tilde', '2DC'),
                 '\x99' : ('trade', '2122'),
                 '\x9a' : ('scaron', '161'),
                 '\x9b' : ('rsaquo', '203A'),
                 '\x9c' : ('oelig', '153'),
                 '\x9d' : '?',
                 '\x9e' : ('#x17E', '17E'),
                 '\x9f' : ('Yuml', '178'),}
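
Each tuple in `MS_CHARS` pairs an HTML entity name with a hexadecimal code point, so one table can serve both HTML and XML output. The code that consumes the table is outside this section, so the following Python 3 sketch is an assumption about how such a table might be applied, not the library's method. `sub_ms_chars` and its `smart_quotes_to` parameter are hypothetical names, and only a few entries are copied over.

```python
import re

# Small excerpt of the table above, keyed by single bytes for Python 3.
MS_CHARS = {
    b'\x85': ('hellip', '2026'),
    b'\x8d': '?',
    b'\x91': ('lsquo', '2018'),
    b'\x92': ('rsquo', '2019'),
}


def sub_ms_chars(data, smart_quotes_to='html'):
    """Replace Windows-1252 bytes in 0x80-0x9f with entity references."""
    def replace(match):
        byte = match.group(1)
        sub = MS_CHARS.get(byte, byte)  # bytes missing here pass through
        if isinstance(sub, tuple):
            if smart_quotes_to == 'xml':
                sub = '&#x%s;' % sub[1]   # numeric character reference
            else:
                sub = '&%s;' % sub[0]     # named HTML entity
        if isinstance(sub, str):
            sub = sub.encode('ascii')
        return sub
    return re.sub(rb'([\x80-\x9f])', replace, data)
```

Plain-string values such as `'?'` are substituted as-is, which is how the table marks bytes that have no defined character in Windows-1252.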

#######################################################################


#By default, act as an HTML pretty-printer.
if __name__ == '__main__':
    import sys
    soup = BeautifulSoup(sys.stdin)
    print soup.prettify()