app/htmlsanitizer/HtmlSanitizer.py
author Daniel Hans <Daniel.M.Hans@gmail.com>
Mon, 02 Nov 2009 23:38:43 +0100
changeset 3074 ebda36efbd61
parent 2555 b7f14c803619
permissions -rw-r--r--
HtmlSanitizer becomes Python 2.6 compatible. The Cleaner class must not have any arguments when calling __init__ function for the object class, because in this case Python 2.6 raises TypeError (while previous versions just ignored them).
2324
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
# -*- coding: UTF-8 -*-
"""
Some input filters for regularising the HTML fragments from screen scraping and
browser-based editors into some semblance of sanity.

TODO: turn the messy setting[method_name]=True filter syntax into a list of cleaning methods to invoke, so that they can be invoked in a specific order and multiple times.

AUTHORS:
Dan MacKinlay - https://launchpad.net/~dan-possumpalace
Collin Grady - http://launchpad.net/~collin-collingrady
Andreas Gustafsson - https://bugs.launchpad.net/~gson
Håkan W - https://launchpad.net/~hwaara-gmail
"""

import BeautifulSoup
import re
import sys

# Python 2.4 compatibility
try:
    any
except NameError:
    def any(iterable):
        for element in iterable:
            if element:
                return True
        return False

"""
html5lib compatibility. Basically, we need to know that this still works whether html5lib
is imported or not. Should run complete suites of tests for both possible configs -
or test in virtual environments, but for now a basic sanity check will do.
>>> if html5:
...     c=Cleaner(html5=False)
...     c(u'<p>foo</p>')
u'<p>foo</p>'
"""
try:
    import html5lib
    from html5lib import sanitizer, treebuilders
    parser = html5lib.HTMLParser(
        tree=treebuilders.getTreeBuilder("beautifulsoup"),
        tokenizer=sanitizer.HTMLSanitizer
    )
    html5 = True
except ImportError:
    html5 = False

ANTI_JS_RE = re.compile(r'j\s*a\s*v\s*a\s*s\s*c\s*r\s*i\s*p\s*t\s*:', re.IGNORECASE)
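The pattern above allows optional whitespace between every letter of `javascript:` and ignores case. A minimal sketch of what that catches follows; the sample strings are made up for illustration, and how the module applies the pattern to attribute values is not shown in this section.

```python
import re

# Same pattern as ANTI_JS_RE above, reproduced so the sketch runs on its own.
ANTI_JS_RE = re.compile(r'j\s*a\s*v\s*a\s*s\s*c\s*r\s*i\s*p\s*t\s*:', re.IGNORECASE)

# Illustrative inputs (assumptions, not taken from the module's tests).
samples = [
    'javascript:alert(1)',       # plain form
    'JaVaScRiPt:alert(1)',       # mixed case, covered by re.IGNORECASE
    'java\tscript :alert(1)',    # whitespace inside the scheme, covered by \s*
    'http://example.com/',       # harmless URL
]

flagged = [bool(ANTI_JS_RE.search(s)) for s in samples]
print(flagged)
```

Using `search` rather than `match` means the scheme is flagged wherever it appears in the value, not only at the start.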
#These tags and attrs are sufficiently liberal to let microformats through...
#it ruthlessly culls all the rdf, dublin core metadata and so on.
valid_tags = dict.fromkeys('p i em strong b u a h1 h2 h3 pre abbr br img dd dt ol ul li span sub sup ins del blockquote table tr td th address cite'.split()) #div?
valid_attrs = dict.fromkeys('href src rel title'.split())
valid_schemes = dict.fromkeys('http https'.split())
elem_map = {'b' : 'strong', 'i': 'em'}
attrs_considered_links = dict.fromkeys("src href".split()) #should include
#courtesy http://developer.mozilla.org/en/docs/HTML:Block-level_elements
block_elements = dict.fromkeys(["p", "h1","h2", "h3", "h4", "h5", "h6", "ol", "ul", "pre", "address", "blockquote", "dl", "div", "fieldset", "form", "hr", "noscript", "table"])
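The whitelists above are plain dictionaries used as sets. As a rough illustration of how a scheme whitelist like `valid_schemes` could be consulted, here is a minimal sketch. The helper `is_allowed_link` is hypothetical, and the module's actual `strip_schemes` filter is not shown in this section and may work differently.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

# Local copies of two of the whitelists defined above.
valid_schemes = dict.fromkeys('http https'.split())
attrs_considered_links = dict.fromkeys("src href".split())

def is_allowed_link(attr_name, value):
    """Hypothetical helper: True if the attribute may keep its value."""
    if attr_name not in attrs_considered_links:
        return True  # not a link attribute, so there is no scheme to check
    scheme = urlparse(value.strip()).scheme.lower()
    # An empty scheme means a relative URL, which this sketch lets through.
    return scheme == '' or scheme in valid_schemes

print(is_allowed_link('href', 'http://example.com/'))
print(is_allowed_link('href', 'javascript:alert(1)'))
```

This sketch does not handle obfuscated schemes such as a `javascript:` prefix with embedded whitespace; a pattern like `ANTI_JS_RE` would be needed for those.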

#convenient default filter lists.
paranoid_filters = ["strip_comments", "strip_tags", "strip_attrs",
  "strip_schemes", "rename_tags", "wrap_string", "strip_empty_tags", "strip_empty_tags", ]
complete_filters = ["strip_comments", "rename_tags", "strip_tags", "strip_attrs",
    "strip_cdata", "strip_schemes",  "wrap_string", "strip_empty_tags", "rebase_links", "reparse"]

#set some conservative default string processings
default_settings = {
    "filters" : paranoid_filters,
    "block_elements" : block_elements, #xml or None for a more liberal version
    "convert_entities" : "html", #xml or None for a more liberal version
    "valid_tags" : valid_tags,
    "valid_attrs" : valid_attrs,
    "valid_schemes" : valid_schemes,
    "attrs_considered_links" : attrs_considered_links,
    "elem_map" : elem_map,
    "wrapping_element" : "p",
    "auto_clean" : False,
    "original_url" : "",
    "new_url" : "",
    "html5" : html5
}
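`Cleaner.__init__` copies `default_settings`, overlays keyword arguments, and lets positional arguments replace the filter list. The sketch below isolates that copy-then-update pattern. The function `build_settings` and the trimmed-down defaults are assumptions made for illustration; they are not part of the module.

```python
# Stand-in for the module-level defaults; only two keys are reproduced here.
default_settings = {
    "filters": ["strip_comments", "strip_tags"],
    "wrapping_element": "p",
}

def build_settings(*args, **kwargs):
    """Hypothetical helper mirroring how Cleaner.__init__ assembles settings."""
    settings = default_settings.copy()   # shallow copy: top-level keys are independent
    settings.update(kwargs)              # keyword arguments override the defaults
    if args:
        settings["filters"] = args       # positional arguments replace the filter list
    return settings

custom = build_settings("strip_tags", wrapping_element="div")
print(custom["filters"], custom["wrapping_element"])
```

Because `dict.copy()` is shallow, the copied dictionary still shares the default filter list object. Rebinding `settings["filters"]` is safe, while mutating that list in place would also change the shared default.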
#processes I'd like but haven't implemented
#"encode_xml_specials", "ensure complete xhtml doc", "ensure_xhtml_fragment_only"
# and some handling of permitted namespaces for tags. for RDF, say. maybe.

XML_ENTITIES = { u"'" : u"&apos;",
                 u'"' : u"&quot;",
                 u"&" : u"&amp;",
                 u"<" : u"&lt;",
                 u">" : u"&gt;"
               }
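A character-to-entity table like `XML_ENTITIES` is typically applied one character at a time. The sketch below shows one way to do that; `escape_xml` is a hypothetical helper, and this section does not show where the module actually uses the table.

```python
# Same mapping as XML_ENTITIES above, reproduced so the sketch runs on its own.
XML_ENTITIES = {u"'": u"&apos;",
                u'"': u"&quot;",
                u"&": u"&amp;",
                u"<": u"&lt;",
                u">": u"&gt;"}

def escape_xml(text):
    """Hypothetical helper: escape the five XML special characters."""
    # A single pass over the input characters, so an '&' produced by one
    # substitution is never escaped a second time.
    return u"".join(XML_ENTITIES.get(ch, ch) for ch in text)

escaped = escape_xml(u'<a href="x">Tom & Jerry\'s</a>')
print(escaped)
```

Chained `str.replace` calls would need `&` handled first to avoid double escaping; the per-character lookup avoids that ordering concern.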
LINE_EXTRACTION_RE = re.compile(".+", re.MULTILINE)
BR_EXTRACTION_RE = re.compile("</?br ?/?>", re.MULTILINE)
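To make the behaviour of the two patterns above concrete, here is a small sketch with made-up input strings. It only exercises the regular expressions themselves and makes no claim about how the module's filters call them.

```python
import re

# Same patterns as above, reproduced so the sketch runs on its own.
LINE_EXTRACTION_RE = re.compile(".+", re.MULTILINE)
BR_EXTRACTION_RE = re.compile("</?br ?/?>", re.MULTILINE)

# ".+" needs at least one character and "." does not match a newline,
# so findall returns the non-empty lines only.
text = "first line\n\nsecond line\n"
lines = LINE_EXTRACTION_RE.findall(text)

# The br pattern accepts <br>, <br/>, <br /> and </br>. It is case-sensitive
# and allows at most one space before the optional slash.
html = "one<br>two<br/>three<br />four</br>five"
pieces = BR_EXTRACTION_RE.split(html)

print(lines)
print(pieces)
```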

class Stop:
    """
    handy class that we use as a stop input for our state machine in lieu of falling
    off the end of lists
    """
    pass


class Cleaner(object):
    r"""
    Powerful and slow arbitrary HTML sanitisation. Can deal (I hope) with most XSS
    vectors and layout-breaking badness.
    Probably overkill for content from trusted sources; defaults are accordingly
    set to be paranoid.
    >>> bad_html = '<p style="forbidden markup"><!-- XSS attack -->content</p'
    >>> good_html = u'<p>content</p>'
    >>> c = Cleaner()
    >>> c.string = bad_html
    >>> c.clean()
    >>> c.string == good_html
    True

    Also supports shorthand syntax:
    >>> c = Cleaner()
    >>> c(bad_html) == c(good_html)
    True
    """

    def __init__(self, string_or_soup="", *args,  **kwargs):
        self.settings=default_settings.copy()
        self.settings.update(kwargs)
        if args :
            self.settings['filters'] = args
        # object.__init__ must be called without arguments: Python 2.6 raises
        # TypeError if any are passed (previous versions just ignored them).
        super(Cleaner, self).__init__()
        self.string = string_or_soup

    def __call__(self, string = None, **kwargs):
        """
        Convenience method allowing one-step calling of an instance and returning
        a cleaned string.

        TODO: make this method preserve internal state, perhaps by creating a new
        instance.

        >>> s = 'input string'
        >>> c1 = Cleaner(s, auto_clean=True)
        >>> c2 = Cleaner("")
        >>> c1.string == c2(s)
        True

        """
        self.settings.update(kwargs)
        if string is not None:
            self.string = string
        self.clean()
        return self.string

    def _set_contents(self, string_or_soup):
        if isinstance(string_or_soup, BeautifulSoup.BeautifulSoup) :
            self._set_soup(string_or_soup)
        else :
            self._set_string(string_or_soup)

    def _set_string(self, html_fragment_string):
        if self.settings['html5']:
            s = parser.parse(html_fragment_string).body
        else:
            s = BeautifulSoup.BeautifulSoup(
                    html_fragment_string,
                    convertEntities=self.settings['convert_entities'])
        self._set_soup(s)

    def _set_soup(self, soup):
        """
        Does all the work of set_string, but bypasses a potential autoclean to avoid
        loops upon internal string setting ops.
        """
        self._soup = BeautifulSoup.BeautifulSoup(
            '<rootrootroot></rootrootroot>'
        )
        self.root=self._soup.contents[0]

        if len(soup.contents) :
            backwards_soup = [i for i in soup.contents]
            backwards_soup.reverse()
        else :
            backwards_soup = []
        for i in backwards_soup :
            i.extract()
            self.root.insert(0, i)

    def set_string(self, string) :
        ur"""
        Sets the string to process and handles the necessary input encoding.
        Intended to be invoked through the `string` property.

        Note the godawful rootrootroot element, which we need because the
        BeautifulSoup object has all the same methods as a Tag but behaves
        differently, silently failing on some inserts and appends.

        >>> c = Cleaner(convert_entities="html")
        >>> c.string = '&eacute;'
        >>> c.string
        u'\xe9'
        >>> c = Cleaner(convert_entities="xml")
        >>> c.string = u'&eacute;'
        >>> c.string
        u'&eacute;'
        """
        self._set_string(string)
        if len(string) and self.settings['auto_clean'] : self.clean()

    def get_string(self):
        # Changeset 2555 (b7f14c803619): "Fix HtmlSanitizer to return cleaned
        # string in proper encoding."
        return self.root.renderContents().decode('utf-8')

    string = property(get_string, set_string)

    def clean(self):
        """
        Invoke all cleaning processes stipulated in the settings.
        """
        for method in self.settings['filters'] :
            try :
                getattr(self, method)()
            except NotImplementedError :
                sys.stderr.write('Warning, called unimplemented method %s' % method + '\n')
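
The dispatch loop in `clean` can be tried in isolation. The following is a minimal sketch, not the module's own code: `MiniCleaner`, its `log` list and the filter name `encode_xml_specials` are hypothetical stand-ins, and only the name-based lookup with `getattr` and the `NotImplementedError` warning are modelled.

```python
import sys


class MiniCleaner(object):
    """Hypothetical stand-in for Cleaner; only the dispatch loop is modelled."""

    def __init__(self, *filters):
        # Filters are stored as method names, in the order they should run.
        self.settings = {'filters': list(filters)}
        self.log = []

    def strip_comments(self):
        # A filter that is implemented: record that it ran.
        self.log.append('strip_comments')

    def encode_xml_specials(self):
        # Hypothetical filter that has not been written yet.
        raise NotImplementedError

    def clean(self):
        for method in self.settings['filters']:
            try:
                # Look the method up by name and call it.
                getattr(self, method)()
            except NotImplementedError:
                sys.stderr.write(
                    'Warning, called unimplemented method %s\n' % method)


cleaner = MiniCleaner('strip_comments', 'encode_xml_specials', 'strip_comments')
cleaner.clean()
print(cleaner.log)
```

Because filters are looked up by name, the same filter can appear more than once and the order of the list is the order of execution. A misspelled name is not caught by this loop: `getattr` raises `AttributeError`, which propagates to the caller.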

    def strip_comments(self):
        r"""
        XHTML comments are used as an XSS attack vector. They must die.

        >>> c = Cleaner("", "strip_comments")
        >>> c('<p>text<!-- comment --> More text</p>')
        u'<p>text More text</p>'
        """
        for comment in self.root.findAll(
            text = lambda text: isinstance(text, BeautifulSoup.Comment)):
            comment.extract()

    def strip_cdata(self):
        """
        Remove all CDATA sections from the soup.
        """
        for cdata in self.root.findAll(
            text = lambda text: isinstance(text, BeautifulSoup.CData)):
            cdata.extract()

    def strip_tags(self):
        r"""
        Ill-considered tags break our layout. They must die.

        >>> c = Cleaner("", "strip_tags", auto_clean=True)
        >>> c.string = '<div>A <strong>B C</strong></div>'
        >>> c.string
        u'A <strong>B C</strong>'
        >>> c.string = '<div>A <div>B C</div></div>'
        >>> c.string
        u'A B C'
        >>> c.string = '<div>A <br /><div>B C</div></div>'
        >>> c.string
        u'A <br />B C'
        >>> c.string = '<p>A <div>B C</div></p>'
        >>> c.string
        u'<p>A B C</p>'
        >>> c.string = 'A<div>B<div>C<div>D</div>E</div>F</div>G'
        >>> c.string
        u'ABCDEFG'
        >>> c.string = '<div>B<div>C<div>D</div>E</div>F</div>'
        >>> c.string
        u'BCDEF'
        """
        # Beautiful Soup doesn't support dynamic .findAll results when the
        # tree is modified in place, and going backwards doesn't seem to help,
        # so find one bad tag at a time.
        while True :
            next_bad_tag = self.root.find(
              lambda tag : tag.name not in self.settings['valid_tags']
            )
            if next_bad_tag :
                self.disgorge_elem(next_bad_tag)
            else:
                break
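
The "find one at a time" loop above can be illustrated without Beautiful Soup. This is a minimal sketch on a hypothetical toy tree: `Node`, `find_first`, `disgorge` and `render` are illustrative names, and `disgorge` only assumes that `disgorge_elem` (defined elsewhere in the module) replaces an element with its own children, which is what the doctests suggest.

```python
class Node(object):
    """Toy element: a tag name plus a list of child Nodes and text strings."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])
        self.parent = None
        for child in self.children:
            if isinstance(child, Node):
                child.parent = self


def find_first(node, predicate):
    """Depth-first search for the first descendant Node matching predicate."""
    for child in node.children:
        if isinstance(child, Node):
            if predicate(child):
                return child
            found = find_first(child, predicate)
            if found is not None:
                return found
    return None


def disgorge(node):
    """Replace node with its own children, in place."""
    parent = node.parent
    index = parent.children.index(node)
    parent.children[index:index + 1] = node.children
    for child in node.children:
        if isinstance(child, Node):
            child.parent = parent


def strip_tags(root, valid_tags):
    # Re-run the search after every modification instead of iterating
    # over a result list that the modification would invalidate.
    while True:
        bad = find_first(root, lambda n: n.name not in valid_tags)
        if bad is None:
            break
        disgorge(bad)


def render(node):
    parts = []
    for child in node.children:
        if isinstance(child, Node):
            parts.append('<%s>%s</%s>' % (child.name, render(child), child.name))
        else:
            parts.append(child)
    return ''.join(parts)


root = Node('root', ['A', Node('div', ['B', Node('div', ['C', Node('div', ['D']), 'E']), 'F']), 'G'])
strip_tags(root, valid_tags=('strong',))
print(render(root))
```

Each pass restarts the search from the root, so the cost grows with the number of bad tags, but no stale result list is ever consulted.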

    def strip_attrs(self):
        """
        Preserve only those attributes we need in the soup.

        >>> c = Cleaner("", "strip_attrs")
        >>> c('<div title="v" bad="v">A <strong title="v" bad="v">B C</strong></div>')
        u'<div title="v">A <strong title="v">B C</strong></div>'
        """
        for tag in self.root.findAll(True):
            tag.attrs = [(attr, val) for attr, val in tag.attrs
                         if attr in self.settings['valid_attrs']]

    def _all_links(self):
        """
        Finds all tags with link attributes sequentially. Safe against
        in-place modification of said attributes.
        """
        start = self.root
        while True:
            # Pass the generator straight to any(). Wrapping it in a list
            # yields a one-element list holding a generator object, which is
            # always true, so every tag would match.
            tag = start.findNext(
              lambda tag : any(
                tag.get(i) for i in self.settings['attrs_considered_links']
              ))
            if tag:
                start = tag
                yield tag
            else :
                break
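
One subtlety in a predicate of this shape is that `any()` must receive the generator itself. The sketch below models a tag's attributes as a plain dict; `link_attrs` and `attrs` are hypothetical sample values, not the module's settings.

```python
link_attrs = ('href', 'src')          # hypothetical link attribute names
attrs = {'title': 'no links here'}    # a tag's attributes, modelled as a dict

# A list whose single element is a generator object: the generator is never
# consumed, and a generator object is always truthy.
wrapped = any([(attrs.get(name) for name in link_attrs)])

# The generator is consumed by any(), so each attribute value is tested.
direct = any(attrs.get(name) for name in link_attrs)

print(wrapped)  # True
print(direct)   # False
```

With the wrapped form every tag is reported as a link, so callers still work but visit every tag in the tree.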

    def strip_schemes(self):
        """
        Remove link attributes whose URL scheme is not in valid_schemes.

        >>> c = Cleaner("", "strip_schemes")
        >>> c('<img src="javascript:alert();" />')
        u'<img />'
        >>> c('<a href="javascript:alert();">foo</a>')
        u'<a>foo</a>'
        """
        for tag in self._all_links() :
            for key in self.settings['attrs_considered_links'] :
                scheme_bits = tag.get(key, u"").split(u':',1)
                if len(scheme_bits) == 1 :
                    pass #relative link
                else:
                    if scheme_bits[0] not in self.settings['valid_schemes'] :
                        del tag[key]
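
The scheme test used above can be exercised on its own. This is a minimal sketch under stated assumptions: `has_valid_scheme` and the `VALID` tuple are hypothetical, and only the `split(':', 1)` check from `strip_schemes` is reproduced.

```python
def has_valid_scheme(url, valid_schemes):
    """Return True if url is relative or its scheme is in valid_schemes."""
    scheme_bits = url.split(':', 1)
    if len(scheme_bits) == 1:
        # No colon at all: treated as a relative link and kept.
        return True
    return scheme_bits[0] in valid_schemes


VALID = ('http', 'https', 'mailto')  # hypothetical whitelist

print(has_valid_scheme('javascript:alert();', VALID))
print(has_valid_scheme('/img/logo.png', VALID))
print(has_valid_scheme('http://example.com/', VALID))
```

The comparison in this sketch is case-sensitive, so a value such as `HTTP://example.com/` is rejected unless the whitelist also lists the upper-case form.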

    def br_to_p(self):
        """
        Convert <br /> line breaks into separate <p> paragraphs.

        >>> c = Cleaner("", "br_to_p")
        >>> c('<p>A<br />B</p>')
        u'<p>A</p><p>B</p>'
        >>> c('A<br />B')
        u'<p>A</p><p>B</p>'
        """
        # Treat <br> and <p> as block elements while wrapping. No copy is
        # made here, so both keys are also added to the settings entry.
        block_elems = self.settings['block_elements']
        block_elems['br'] = None
        block_elems['p'] = None

        while True :
            next_br = self.root.find('br')
            if not next_br: break
            parent = next_br.parent
            self.wrap_string('p', start_at=parent, block_elems = block_elems)
            # Remove the now-redundant <br> tags directly under the parent.
            while True:
                useless_br=parent.find('br', recursive=False)
                if not useless_br: break
                useless_br.extract()
            if parent.name == 'p':
                self.disgorge_elem(parent)

    def rename_tags(self):
        """
        Rename tags according to the elem_map setting.

        >>> c = Cleaner("", "rename_tags", elem_map={'i': 'em'})
        >>> c('<b>A<i>B</i></b>')
        u'<b>A<em>B</em></b>'
        """
        for tag in self.root.findAll(self.settings['elem_map']) :
            tag.name = self.settings['elem_map'][tag.name]

    def wrap_string(self, wrapping_element = None, start_at=None, block_elems=None):
        """
        Takes an html fragment, which may or may not have a single containing
        element, and guarantees the tag name of its topmost elements.
        TODO: is there some simpler way than a state machine to do this simple thing?

        >>> c = Cleaner("", "wrap_string")
        >>> c('A <strong>B C</strong>D')
        u'<p>A <strong>B C</strong>D</p>'
        >>> c('A <p>B C</p>D')
        u'<p>A </p><p>B C</p><p>D</p>'
        """
        if not start_at : start_at = self.root
        if not block_elems : block_elems = self.settings['block_elements']
        e = (wrapping_element or self.settings['wrapping_element'])
        paragraph_list = []
        children = [elem for elem in start_at.contents]
        children.append(Stop())

        last_state = 'block'
        paragraph = BeautifulSoup.Tag(self._soup, e)

        for node in children :
            if isinstance(node, Stop) :
                state = 'end'
            elif hasattr(node, 'name') and node.name in block_elems:
                state = 'block'
            else:
                state = 'inline'
diff changeset
   379
                
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   380
            if last_state == 'block' and state == 'inline':
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   381
                #collate inline elements
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   382
                paragraph = BeautifulSoup.Tag(self._soup, e)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   383
                
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   384
            if state == 'inline' :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   385
                paragraph.append(node)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   386
                
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   387
            if ((state <> 'inline') and last_state == 'inline') :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   388
                paragraph_list.append(paragraph)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   389
                
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   390
            if state == 'block' :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   391
                paragraph_list.append(node)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   392
            
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   393
            last_state = state
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   394
        
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   395
        #can't use append since it doesn't work on empty elements...
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   396
        paragraph_list.reverse()
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   397
        for paragraph in paragraph_list:
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   398
            start_at.insert(0, paragraph)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   399
        
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   400
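# Illustration only, not part of the module: a minimal sketch of the
# block/inline grouping that wrap_string performs, assuming each node has
# already been classified as 'block' or 'inline'. The (kind, value) pairs
# and the name group_inline_runs are hypothetical stand-ins for the
# BeautifulSoup nodes and the Stop() sentinel used above.

```python
def group_inline_runs(nodes, wrapper="p"):
    """Wrap each consecutive run of inline values in one wrapper element."""
    result = []
    run = []  # inline values collected since the last block
    for kind, value in nodes:
        if kind == "inline":
            run.append(value)
            continue
        # a block element closes any open run of inline nodes
        if run:
            result.append("<%s>%s</%s>" % (wrapper, "".join(run), wrapper))
            run = []
        result.append(value)
    # flush the trailing run (wrap_string relies on its sentinel for this)
    if run:
        result.append("<%s>%s</%s>" % (wrapper, "".join(run), wrapper))
    return "".join(result)


nodes = [("inline", "A "), ("block", "<p>B C</p>"), ("inline", "D")]
print(group_inline_runs(nodes))
```

# The second doctest of wrap_string corresponds to the node list shown here.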
    def strip_empty_tags(self):
        """
        strip out all empty tags
        TODO: depth-first search
        >>> c = Cleaner("", "strip_empty_tags")
        >>> c('<p>A</p><p></p><p>B</p><p></p>')
        u'<p>A</p><p>B</p>'
        >>> c('<p><a></a></p>')
        u'<p></p>'
        """
        tag = self.root
        while True:
            next_tag = tag.findNext(True)
            if not next_tag: break
            if next_tag.contents or next_tag.attrs:
                tag = next_tag
                continue
            next_tag.extract()

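# Illustration only, not part of the module: one way the "depth-first
# search" TODO of strip_empty_tags could look. This sketch swaps in the
# standard library's xml.etree.ElementTree for BeautifulSoup, so it needs
# well-formed input with a single root; strip_empty and
# _remove_keeping_tail are hypothetical names. Children are cleaned before
# their parent is judged, so a tag that only held empty tags goes too.

```python
import xml.etree.ElementTree as ET


def _remove_keeping_tail(parent, child):
    """Remove child but keep the text that followed it in the document."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def strip_empty(elem):
    """Drop descendants with no children, no text and no attributes."""
    for child in list(elem):  # copy: elem is modified while we walk it
        strip_empty(child)    # deepest tags first
        if len(child) == 0 and not child.text and not child.attrib:
            _remove_keeping_tail(elem, child)


root = ET.fromstring("<div><p>A</p><p><a></a></p><p>B</p></div>")
strip_empty(root)
print(ET.tostring(root, encoding="unicode"))
```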
    def rebase_links(self, original_url="", new_url ="") :
        if not original_url : original_url = self.settings.get('original_url', '')
        if not new_url : new_url = self.settings.get('new_url', '')
        raise NotImplementedError

    # Because of its internal character set handling,
    # the following will not work in Beautiful soup and is hopefully redundant.
    # def encode_xml_specials(self, original_url="", new_url ="") :
    #     """
    #     BeautifulSoup will let some dangerous xml entities hang around
    #     in the navigable strings. destroy all monsters.
    #     >>> c = Cleaner(auto_clean=True, encode_xml_specials=True)
    #     >>> c('<<<<<')
    #     u'&lt;&lt;&lt;&lt;'
    #     """
    #     for string in self.root.findAll(text=True) :
    #         sys.stderr.write("root" +"\n")
    #         sys.stderr.write(str(self.root) +"\n")
    #         sys.stderr.write("parent" +"\n")
    #         sys.stderr.write(str(string.parent) +"\n")
    #         new_string = unicode(string)
    #         sys.stderr.write(string +"\n")
    #         for special_char in XML_ENTITIES.keys() :
    #             sys.stderr.write(special_char +"\n")
    #         string.replaceWith(
    #           new_string.replace(special_char, XML_ENTITIES[special_char])
    #         )


    def disgorge_elem(self, elem):
        """
        removes the given element from the soup and replaces it with its own contents
        actually tricky, since you can't replace an element with a list of elements
        using replaceWith
        >>> disgorgeable_string = '<body>A <em>B</em> C</body>'
        >>> c = Cleaner()
        >>> c.string = disgorgeable_string
        >>> elem = c._soup.find('em')
        >>> c.disgorge_elem(elem)
        >>> c.string
        u'<body>A B C</body>'
        >>> c.string = disgorgeable_string
        >>> elem = c._soup.find('body')
        >>> c.disgorge_elem(elem)
        >>> c.string
        u'A <em>B</em> C'
        >>> c.string = '<div>A <div id="inner">B C</div></div>'
        >>> elem = c._soup.find(id="inner")
        >>> c.disgorge_elem(elem)
        >>> c.string
        u'<div>A B C</div>'
        """
        if elem == self.root :
            raise AttributeError("Can't disgorge root")

        # With in-place modification, BeautifulSoup occasionally can return
        # elements that think they are orphans
        # this lib is full of workarounds, but it's worth checking
        parent = elem.parent
        if parent is None:
            raise AttributeError("AAAAAAAAGH! NO PARENTS! DEATH!")

        # find the position of elem among its siblings
        index = None
        for i in range(len(parent.contents)) :
            if parent.contents[i] == elem :
                index = i
                break
        if index is None :
            # fail before touching the tree instead of hitting an
            # undefined name further down
            raise AttributeError("Element not found in its parent's contents")

        elem.extract()

        #the following approach breaks horribly, sporadically.
        # for i in range(len(elem.contents)) :
        #     elem.contents[i].extract()
        #     parent.contents.insert(index+i, elem.contents[i])
        # return
        self._safe_inject(parent, index, elem.contents)

    def _safe_inject(self, dest, dest_index, node_list):
        #BeautifulSoup result sets look like lists but don't behave right
        # i.e. empty ones are still True,
        if not len(node_list) : return
        node_list = [i for i in node_list]
        node_list.reverse()
        for i in node_list :
            dest.insert(dest_index, i)


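# Illustration only, not part of the module: a small sketch of why
# _safe_inject copies and reverses its node list before inserting. Plain
# Python lists stand in for BeautifulSoup contents, and inject is a
# hypothetical name. Inserting repeatedly at one fixed index pushes earlier
# insertions to the right, so walking the items backwards leaves them in
# their original order. The copy matters in the real method, on the
# assumption that moving a node into a new parent shrinks the list it came
# from.

```python
def inject(dest, dest_index, items):
    """Insert items into dest at dest_index, preserving their order."""
    # copy first, so changes to the source cannot disturb the iteration
    for item in reversed(list(items)):
        # each insert shifts the previously inserted item one slot right
        dest.insert(dest_index, item)


contents = ["A ", " C"]
inject(contents, 1, ["B1", "B2"])
print(contents)
```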
class Htmlator(object) :
    """
    converts a string into a series of html paragraphs
    """
    settings = {
        "encode_xml_specials" : True,
        "is_plaintext" : True,
        "convert_newlines" : False,
        "make_links" : True,
        "auto_convert" : False,
        "valid_schemes" : valid_schemes,
    }
    def __init__(self, string = "",  **kwargs):
        # copy the class-level defaults so that per-instance options
        # do not leak into other instances
        self.settings = self.settings.copy()
        self.settings.update(kwargs)
        # object.__init__ takes no arguments; Python 2.6 raises TypeError
        # if any are passed
        super(Htmlator, self).__init__()
        self.string = string

    def _set_string(self, string):
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   524
        self.string = string
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   525
        if self.settings['auto_convert'] : self.convert()
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   526
        
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   527
    def _get_string(self):
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   528
        return unicode(self._soup)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   529
    
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   530
    string = property(_get_string, _set_string)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   531
    
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   532
    def __call__(self, string):
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   533
        """
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   534
        convenience method supporting one-step calling of an instance
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   535
        as a string cleaning function
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   536
        """
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   537
        self.string = string
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   538
        self.convert()
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   539
        return self.string
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   540
        
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   541
    def convert(self):
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   542
        for method in ["encode_xml_specials", "convert_newlines",
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   543
          "make_links"] :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   544
            if self.settings(method) :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   545
                getattr(self, method)()
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   546
    
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   547
    def encode_xml_specials(self) :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   548
        for char in XML_ENTITIES.keys() :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   549
            self.string.replace(char, XML_ENTITIES[char])
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   550
        
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   551
    def make_links(self):
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   552
        raise NotImplementedError
        
    def convert_newlines(self) :
        self.string = ''.join([
            '<p>' + line + '</p>' for line in LINE_EXTRACTION_RE.findall(self.string)
        ])
        
def _test():
    import doctest
    doctest.testmod()

if __name__ == "__main__":
    _test()


# def cast_input_to_soup(fn):
#     """
#     Decorate function to handle strings as BeautifulSoups transparently
#     """
#     def stringy_version(input, *args, **kwargs) :
#         if not isinstance(input,BeautifulSoup) :
#             input=BeautifulSoup(input)
#         return fn(input, *args, **kwargs)
#     return stringy_version