app/htmlsanitizer/HtmlSanitizer.py
author Lennard de Rijk <ljvderijk@gmail.com>
Fri, 24 Jul 2009 21:00:04 +0200
changeset 2678 a525a55833f1
parent 2555 b7f14c803619
child 3074 ebda36efbd61
permissions -rw-r--r--
Send out a Notification upon creation of a new Request entity. The receivers are specified by the corresponding Role logic this Role Request is for. Currently Organization and Club Administrators will receive "new request" messages about respectively Mentor and Club Membership requests. Fixes Issue 442.
2324
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
# -*- coding: UTF-8 -*-
"""
some input filters, for regularising the html fragments from screen scraping and
browser-based editors into some semblance of sanity

TODO: turn the messy setting[method_name]=True filter syntax into a list of cleaning methods to invoke, so that they can be invoked in a specific order and multiple times.

AUTHORS:
Dan MacKinlay - https://launchpad.net/~dan-possumpalace
Collin Grady - http://launchpad.net/~collin-collingrady
Andreas Gustafsson - https://bugs.launchpad.net/~gson
Håkan W - https://launchpad.net/~hwaara-gmail
"""

import BeautifulSoup
import re
import sys

# Python 2.4 compatibility
try: any
except NameError:
    def any(iterable):
        for element in iterable:
            if element:
                return True
        return False

"""
html5lib compatibility. Basically, we need to know that this still works whether html5lib
is imported or not. Should run complete suites of tests for both possible configs -
or test in virtual environments, but for now a basic sanity check will do.
>>> if html5:
...     c=Cleaner(html5=False)
...     c(u'<p>foo</p>')
u'<p>foo</p>'
"""
try:
    import html5lib
    from html5lib import sanitizer, treebuilders
    parser = html5lib.HTMLParser(
        tree=treebuilders.getTreeBuilder("beautifulsoup"),
        tokenizer=sanitizer.HTMLSanitizer
    )
    html5 = True
except ImportError:
    html5 = False

ANTI_JS_RE=re.compile(r'j\s*a\s*v\s*a\s*s\s*c\s*r\s*i\s*p\s*t\s*:', re.IGNORECASE)
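# A minimal sketch of what the ANTI_JS_RE pattern above matches. The pattern
# is repeated under a local name so the sketch runs on its own;
# looks_like_js_url is a hypothetical helper for illustration and is not part
# of this module. How the module itself applies ANTI_JS_RE is not shown here.

```python
import re

# Same pattern as ANTI_JS_RE, copied so this sketch is self-contained.
anti_js_re = re.compile(r'j\s*a\s*v\s*a\s*s\s*c\s*r\s*i\s*p\s*t\s*:', re.IGNORECASE)

def looks_like_js_url(value):
    """Hypothetical helper: True if value contains a javascript: scheme,
    even when whitespace is sprinkled between the letters."""
    return anti_js_re.search(value) is not None

samples = [
    "javascript:alert(1)",      # plain form
    "JaVa\tScRiPt :alert(1)",   # mixed case, tab and space inserted
    "http://example.com/",      # harmless URL
]
flags = [looks_like_js_url(s) for s in samples]
```

# The \s* between letters is what catches attempts to hide the scheme by
# splitting it with tabs, spaces or newlines.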
#These tags and attrs are sufficiently liberal to let microformats through...
#it ruthlessly culls all the rdf, dublin core metadata and so on.
valid_tags = dict.fromkeys('p i em strong b u a h1 h2 h3 pre abbr br img dd dt ol ul li span sub sup ins del blockquote table tr td th address cite'.split()) #div?
valid_attrs = dict.fromkeys('href src rel title'.split())
valid_schemes = dict.fromkeys('http https'.split())
elem_map = {'b' : 'strong', 'i': 'em'}
attrs_considered_links = dict.fromkeys("src href".split()) #should include
#courtesy http://developer.mozilla.org/en/docs/HTML:Block-level_elements
block_elements = dict.fromkeys(["p", "h1","h2", "h3", "h4", "h5", "h6", "ol", "ul", "pre", "address", "blockquote", "dl", "div", "fieldset", "form", "hr", "noscript", "table"])

#convenient default filter lists.
paranoid_filters = ["strip_comments", "strip_tags", "strip_attrs",
    "strip_schemes", "rename_tags", "wrap_string", "strip_empty_tags", "strip_empty_tags", ]
complete_filters = ["strip_comments", "rename_tags", "strip_tags", "strip_attrs",
    "strip_cdata", "strip_schemes",  "wrap_string", "strip_empty_tags", "rebase_links", "reparse"]

#set some conservative default string processings
default_settings = {
    "filters" : paranoid_filters,
    "block_elements" : block_elements, #xml or None for a more liberal version
    "convert_entities" : "html", #xml or None for a more liberal version
    "valid_tags" : valid_tags,
    "valid_attrs" : valid_attrs,
    "valid_schemes" : valid_schemes,
    "attrs_considered_links" : attrs_considered_links,
    "elem_map" : elem_map,
    "wrapping_element" : "p",
    "auto_clean" : False,
    "original_url" : "",
    "new_url" : "",
    "html5" : html5
}
#processes I'd like but haven't implemented
#"encode_xml_specials", "ensure complete xhtml doc", "ensure_xhtml_fragment_only"
# and some handling of permitted namespaces for tags. for RDF, say. maybe.

XML_ENTITIES = { u"'" : u"&apos;",
                 u'"' : u"&quot;",
                 u"&" : u"&amp;",
                 u"<" : u"&lt;",
                 u">" : u"&gt;"
               }
LINE_EXTRACTION_RE = re.compile(".+", re.MULTILINE)
BR_EXTRACTION_RE = re.compile("</?br ?/?>", re.MULTILINE)
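# A small sketch of how the two extraction patterns above behave on sample
# input. The patterns are copied under local names so the sketch is
# self-contained; the sample strings are made up for illustration, and the
# way the module itself uses these patterns is not shown in this part of
# the file.

```python
import re

# Same patterns as LINE_EXTRACTION_RE and BR_EXTRACTION_RE.
line_re = re.compile(".+", re.MULTILINE)
br_re = re.compile("</?br ?/?>", re.MULTILINE)

# ".+" never matches a newline and needs at least one character,
# so findall returns the non-empty lines only.
text = "first line\n\nsecond line"
lines = line_re.findall(text)

# The br pattern accepts <br>, <br/>, <br /> and the malformed </br>.
markup = "one<br>two<br />three</br>four"
pieces = br_re.split(markup)

# No re.IGNORECASE here, so an upper-case <BR> is not recognised.
upper_matches = br_re.search("<BR>") is not None
```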

class Stop:
    """
    handy class that we use as a stop input for our state machine in lieu of falling
    off the end of lists
    """
    pass
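# A minimal sketch of the sentinel idea described in the Stop docstring,
# assuming a toy state machine. group_runs and its input are hypothetical;
# the state machine this module actually feeds Stop into is not shown in
# this part of the file.

```python
from itertools import chain

class Stop:
    """Local stand-in for the Stop class above, so the sketch runs alone."""
    pass

def group_runs(items):
    """Hypothetical toy state machine: collect runs of equal items.

    Stop is appended as one final input, so the last run is flushed by the
    same branch that handles every other transition, with no special
    end-of-list code after the loop.
    """
    runs, current = [], []
    for item in chain(items, [Stop]):
        # Flush the current run when the input changes or the stream ends.
        if current and (item is Stop or item != current[-1]):
            runs.append(current)
            current = []
        if item is not Stop:
            current.append(item)
    return runs

runs = group_runs(["a", "a", "b", "c", "c"])
```

# Using the class object itself as the sentinel keeps it distinct from any
# real input value, including None and the empty string.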


class Cleaner(object):
    r"""
    powerful and slow arbitrary HTML sanitisation. can deal (i hope) with most XSS
    vectors and layout-breaking badness.
    Probably overkill for content from trusted sources; defaults are accordingly
    set to be paranoid.
    >>> bad_html = '<p style="forbidden markup"><!-- XSS attack -->content</p'
    >>> good_html = u'<p>content</p>'
    >>> c = Cleaner()
    >>> c.string = bad_html
    >>> c.clean()
    >>> c.string == good_html
    True

    Also supports shorthand syntax:
    >>> c = Cleaner()
    >>> c(bad_html) == c(good_html)
    True
    """

    def __init__(self, string_or_soup="", *args,  **kwargs):
        self.settings=default_settings.copy()
        self.settings.update(kwargs)
        if args :
            self.settings['filters'] = args
        # object.__init__() takes no extra arguments, so none are forwarded.
        super(Cleaner, self).__init__()
        self.string = string_or_soup

    def __call__(self, string = None, **kwargs):
        """
        convenience method allowing one-step calling of an instance and returning
        a cleaned string.

        TODO: make this method preserve internal state, perhaps by creating a new
        instance.

        >>> s = 'input string'
        >>> c1 = Cleaner(s, auto_clean=True)
        >>> c2 = Cleaner("")
        >>> c1.string == c2(s)
        True

        """
        self.settings.update(kwargs)
        if string is not None:
            self.string = string
        self.clean()
        return self.string

    def _set_contents(self, string_or_soup):
        if isinstance(string_or_soup, BeautifulSoup.BeautifulSoup) :
            self._set_soup(string_or_soup)
        else :
            self._set_string(string_or_soup)

    def _set_string(self, html_fragment_string):
        if self.settings['html5']:
            s = parser.parse(html_fragment_string).body
        else:
            s = BeautifulSoup.BeautifulSoup(
                    html_fragment_string,
                    convertEntities=self.settings['convert_entities'])
        self._set_soup(s)

    def _set_soup(self, soup):
        """
        Does all the work of set_string, but bypasses a potential autoclean to avoid
        loops upon internal string setting ops.
        """
        self._soup = BeautifulSoup.BeautifulSoup(
            '<rootrootroot></rootrootroot>'
        )
        self.root=self._soup.contents[0]

        # Move the children across in reverse, inserting each at index 0,
        # so that they end up under the root in their original order.
        if len(soup.contents) :
            backwards_soup = [i for i in soup.contents]
            backwards_soup.reverse()
        else :
            backwards_soup = []
        for i in backwards_soup :
            i.extract()
            self.root.insert(0, i)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   184
    
    def set_string(self, string) :
        ur"""
        sets the string to process and does the necessary input encoding too.
        really intended to be invoked as a property.
        note the godawful rootrootroot element, which we need because the
        BeautifulSoup object has all the same methods as a Tag but
        behaves differently, silently failing on some inserts and appends.

        >>> c = Cleaner(convert_entities="html")
        >>> c.string = '&eacute;'
        >>> c.string
        u'\xe9'
        >>> c = Cleaner(convert_entities="xml")
        >>> c.string = u'&eacute;'
        >>> c.string
        u'&eacute;'
        """
        self._set_string(string)
        if len(string) and self.settings['auto_clean'] : self.clean()

    def get_string(self):
        # changeset 2555 (b7f14c803619): Fix HtmlSanitizer to return cleaned string in proper encoding.
        return self.root.renderContents().decode('utf-8')

    string = property(get_string, set_string)

    def clean(self):
        """
        invoke all cleaning processes stipulated in the settings
        """
        for method in self.settings['filters'] :
            try :
                getattr(self, method)()
            except NotImplementedError :
                sys.stderr.write('Warning, called unimplemented method %s\n' % method)
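
# Illustrative aside, not part of HtmlSanitizer: a minimal standalone sketch of
# the name-based dispatch that clean() performs. MiniCleaner and its filter
# names are hypothetical; only the getattr() lookup and the handling of
# NotImplementedError mirror the method above.

```python
import sys


class MiniCleaner(object):
    """Hypothetical stand-in for Cleaner: runs named filter methods in order."""

    def __init__(self, text, filters):
        self.text = text
        self.filters = list(filters)

    def strip_spaces(self):
        self.text = self.text.strip()

    def upper(self):
        self.text = self.text.upper()

    def not_done_yet(self):
        # A filter that exists by name but has no implementation.
        raise NotImplementedError

    def clean(self):
        for method in self.filters:
            try:
                # Look the filter up by its name and call it.
                getattr(self, method)()
            except NotImplementedError:
                sys.stderr.write(
                    'Warning, called unimplemented method %s\n' % method)
        return self.text


result = MiniCleaner('  hello ', ['strip_spaces', 'not_done_yet', 'upper']).clean()
print(result)
```

# Because the filters are held in a sequence, they run in the order given, and
# an unimplemented filter only produces a warning on stderr.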

    def strip_comments(self):
        r"""
        XHTML comments are used as an XSS attack vector. they must die.

        >>> c = Cleaner("", "strip_comments")
        >>> c('<p>text<!-- comment --> More text</p>')
        u'<p>text More text</p>'
        """
        for comment in self.root.findAll(
            text = lambda text: isinstance(text, BeautifulSoup.Comment)):
            comment.extract()

    def strip_cdata(self):
        for cdata in self.root.findAll(
            text = lambda text: isinstance(text, BeautifulSoup.CData)):
            cdata.extract()

    def strip_tags(self):
        r"""
        ill-considered tags break our layout. they must die.
        >>> c = Cleaner("", "strip_tags", auto_clean=True)
        >>> c.string = '<div>A <strong>B C</strong></div>'
        >>> c.string
        u'A <strong>B C</strong>'
        >>> c.string = '<div>A <div>B C</div></div>'
        >>> c.string
        u'A B C'
        >>> c.string = '<div>A <br /><div>B C</div></div>'
        >>> c.string
        u'A <br />B C'
        >>> c.string = '<p>A <div>B C</div></p>'
        >>> c.string
        u'<p>A B C</p>'
        >>> c.string = 'A<div>B<div>C<div>D</div>E</div>F</div>G'
        >>> c.string
        u'ABCDEFG'
        >>> c.string = '<div>B<div>C<div>D</div>E</div>F</div>'
        >>> c.string
        u'BCDEF'
        """
        # Beautiful Soup doesn't support dynamic .findAll results when the tree is
        # modified in place, and going backwards doesn't seem to help,
        # so find and remove one bad tag at a time.
        while True :
            next_bad_tag = self.root.find(
              lambda tag : tag.name not in (self.settings['valid_tags'])
            )
            if next_bad_tag :
                self.disgorge_elem(next_bad_tag)
            else:
                break

    def strip_attrs(self):
        """
        preserve only those attributes we need in the soup
        >>> c = Cleaner("", "strip_attrs")
        >>> c('<div title="v" bad="v">A <strong title="v" bad="v">B C</strong></div>')
        u'<div title="v">A <strong title="v">B C</strong></div>'
        """
        for tag in self.root.findAll(True):
            tag.attrs = [(attr, val) for attr, val in tag.attrs
                         if attr in self.settings['valid_attrs']]

    def _all_links(self):
        """
        finds all tags with link attributes sequentially. safe against modification
        of said attributes in-place.
        """
        start = self.root
        while True:
            # test the attribute values themselves; a list wrapping a generator
            # would always be truthy and match every tag
            tag = start.findNext(
              lambda tag : any(
                [tag.get(i) for i in self.settings['attrs_considered_links']]
              ))
            if tag:
                start = tag
                yield tag
            else :
                break
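
# Illustrative aside, not part of HtmlSanitizer: any() applied to a list that
# wraps a generator expression is true whenever the list is non-empty, because
# its single element is the generator object rather than the looked-up values.
# The predicate in _all_links is meant to test the attribute values themselves.
# The dict and key names below are hypothetical.

```python
# Hypothetical attribute dict for a tag that carries no link attributes.
attrs = {'title': 'no link here'}
link_keys = ['href', 'src']

# A list whose only element is a generator object: truthy regardless of attrs.
wrapped = any([(attrs.get(k) for k in link_keys)])

# A list of the looked-up values: true only if some link attribute is set.
direct = any([attrs.get(k) for k in link_keys])

print(wrapped, direct)
```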

    def strip_schemes(self):
        """
        >>> c = Cleaner("", "strip_schemes")
        >>> c('<img src="javascript:alert();" />')
        u'<img />'
        >>> c('<a href="javascript:alert();">foo</a>')
        u'<a>foo</a>'
        """
        for tag in self._all_links() :
            for key in self.settings['attrs_considered_links'] :
                scheme_bits = tag.get(key, u"").split(u':',1)
                if len(scheme_bits) == 1 :
                    pass #relative link
                else:
                    if scheme_bits[0] not in self.settings['valid_schemes'] :
                        del(tag[key])
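
# Illustrative aside, not part of HtmlSanitizer: a standalone sketch of the
# scheme check that strip_schemes applies to each link attribute. The function
# name and the VALID_SCHEMES tuple are assumptions made for this example; the
# real whitelist comes from self.settings['valid_schemes'].

```python
# Assumed whitelist, for illustration only.
VALID_SCHEMES = ('http', 'https', 'mailto')


def has_allowed_scheme(url, valid_schemes=VALID_SCHEMES):
    """Return True if url is relative or its scheme is whitelisted."""
    scheme_bits = url.split(':', 1)
    if len(scheme_bits) == 1:
        # No colon at all, so the value is treated as a relative link.
        return True
    # The comparison is exact, so differently cased schemes are rejected too.
    return scheme_bits[0] in valid_schemes


for candidate in ('http://example.com/', '/relative/path',
                  'javascript:alert();', 'JavaScript:alert();'):
    print(candidate, has_allowed_scheme(candidate))
```

# A whitelist rejects anything it does not recognise, which is why unusual
# spellings of a scheme are dropped rather than let through.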

    def br_to_p(self):
        """
        >>> c = Cleaner("", "br_to_p")
        >>> c('<p>A<br />B</p>')
        u'<p>A</p><p>B</p>'
        >>> c('A<br />B')
        u'<p>A</p><p>B</p>'
        """
        # note: block_elems is the settings object itself, not a copy, so the
        # two assignments below also change self.settings['block_elements']
        block_elems = self.settings['block_elements']
        block_elems['br'] = None
        block_elems['p'] = None

        while True :
            next_br = self.root.find('br')
            if not next_br: break
            parent = next_br.parent
            self.wrap_string('p', start_at=parent, block_elems = block_elems)
            while True:
                useless_br=parent.find('br', recursive=False)
                if not useless_br: break
                useless_br.extract()
            if parent.name == 'p':
                self.disgorge_elem(parent)

    def rename_tags(self):
        """
        >>> c = Cleaner("", "rename_tags", elem_map={'i': 'em'})
        >>> c('<b>A<i>B</i></b>')
        u'<b>A<em>B</em></b>'
        """
        for tag in self.root.findAll(self.settings['elem_map']) :
            tag.name = self.settings['elem_map'][tag.name]

    def wrap_string(self, wrapping_element = None, start_at=None, block_elems=None):
        """
        takes an html fragment, which may or may not have a single containing element,
        and guarantees what the tag names of the topmost elements are.
        TODO: is there some simpler way than a state machine to do this simple thing?
        >>> c = Cleaner("", "wrap_string")
        >>> c('A <strong>B C</strong>D')
        u'<p>A <strong>B C</strong>D</p>'
        >>> c('A <p>B C</p>D')
        u'<p>A </p><p>B C</p><p>D</p>'
        """
        if not start_at : start_at = self.root
        if not block_elems : block_elems = self.settings['block_elements']
        e = (wrapping_element or self.settings['wrapping_element'])
        paragraph_list = []
        children = [elem for elem in start_at.contents]
        children.append(Stop())

        last_state = 'block'
        paragraph = BeautifulSoup.Tag(self._soup, e)

        for node in children :
            if isinstance(node, Stop) :
                state = 'end'
            elif hasattr(node, 'name') and node.name in block_elems:
                state = 'block'
            else:
                state = 'inline'
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   379
                
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   380
            if last_state == 'block' and state == 'inline':
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   381
                #collate inline elements
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   382
                paragraph = BeautifulSoup.Tag(self._soup, e)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   383
                
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   384
            if state == 'inline' :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   385
                paragraph.append(node)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   386
                
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   387
            if ((state <> 'inline') and last_state == 'inline') :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   388
                paragraph_list.append(paragraph)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   389
                
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   390
            if state == 'block' :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   391
                paragraph_list.append(node)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   392
            
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   393
            last_state = state
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   394
        
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   395
        #can't use append since it doesn't work on empty elements...
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   396
        paragraph_list.reverse()
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   397
        for paragraph in paragraph_list:
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   398
            start_at.insert(0, paragraph)
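
The block/inline grouping performed by wrap_string can be hard to see through the BeautifulSoup calls. The following is a minimal sketch of the same idea on plain tuples, not the module's own code: the function name `group_inline` and the `(name, text)` node model (with `name` set to `None` for a text node) are assumptions made for illustration only.

```python
def group_inline(nodes, block_names=("p", "div"), wrapper="p"):
    """Wrap each run of consecutive inline nodes; pass block nodes through.

    `nodes` is a list of (name, text) tuples. This is a hypothetical model,
    not BeautifulSoup's node types.
    """
    result = []
    run = []  # inline nodes collected since the last block node
    for name, text in nodes:
        if name in block_names:
            # A block node ends the current inline run, if there is one.
            if run:
                result.append((wrapper, run))
                run = []
            result.append((name, text))
        else:
            run.append((name, text))
    # Flush the trailing run; wrap_string uses a Stop() sentinel for this.
    if run:
        result.append((wrapper, run))
    return result


# Mirrors the doctest input 'A <p>B C</p>D'.
grouped = group_inline([(None, "A "), ("p", "B C"), (None, "D")])
```

Here the explicit flush after the loop plays the role of the Stop() sentinel appended to `children` in wrap_string.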

    def strip_empty_tags(self):
        """
        strip out all empty tags
        TODO: depth-first search
        >>> c = Cleaner("", "strip_empty_tags")
        >>> c('<p>A</p><p></p><p>B</p><p></p>')
        u'<p>A</p><p>B</p>'
        >>> c('<p><a></a></p>')
        u'<p></p>'
        """
        tag = self.root
        while True:
            next_tag = tag.findNext(True)
            if not next_tag: break
            if next_tag.contents or next_tag.attrs:
                tag = next_tag
                continue
            next_tag.extract()

    def rebase_links(self, original_url="", new_url ="") :
        if not original_url : original_url = self.settings.get('original_url', '')
        if not new_url : new_url = self.settings.get('new_url', '')
        raise NotImplementedError
    
    # Because of its internal character set handling,
    # the following will not work in Beautiful soup and is hopefully redundant.
    # def encode_xml_specials(self, original_url="", new_url ="") :
    #     """
    #     BeautifulSoup will let some dangerous xml entities hang around
    #     in the navigable strings. destroy all monsters.
    #     >>> c = Cleaner(auto_clean=True, encode_xml_specials=True)
    #     >>> c('<<<<<')
    #     u'&lt;&lt;&lt;&lt;'
    #     """
    #     for string in self.root.findAll(text=True) :
    #         sys.stderr.write("root" +"\n")
    #         sys.stderr.write(str(self.root) +"\n")
    #         sys.stderr.write("parent" +"\n")
    #         sys.stderr.write(str(string.parent) +"\n")
    #         new_string = unicode(string)
    #         sys.stderr.write(string +"\n")
    #         for special_char in XML_ENTITIES.keys() :
    #             sys.stderr.write(special_char +"\n")
    #         string.replaceWith(
    #           new_string.replace(special_char, XML_ENTITIES[special_char])
    #         )
        
        
    def disgorge_elem(self, elem):
        """
        Remove the given element from the soup and replace it with its own
        contents. This is actually tricky, since you can't replace an element
        with a list of elements using replaceWith.
        >>> disgorgeable_string = '<body>A <em>B</em> C</body>'
        >>> c = Cleaner()
        >>> c.string = disgorgeable_string
        >>> elem = c._soup.find('em')
        >>> c.disgorge_elem(elem)
        >>> c.string
        u'<body>A B C</body>'
        >>> c.string = disgorgeable_string
        >>> elem = c._soup.find('body')
        >>> c.disgorge_elem(elem)
        >>> c.string
        u'A <em>B</em> C'
        >>> c.string = '<div>A <div id="inner">B C</div></div>'
        >>> elem = c._soup.find(id="inner")
        >>> c.disgorge_elem(elem)
        >>> c.string
        u'<div>A B C</div>'
        """
        if elem == self.root :
            raise AttributeError("Can't disgorge root")
                      
        # With in-place modification, BeautifulSoup can occasionally return
        # elements that think they are orphans.
        # This lib is full of workarounds, but it's worth checking.
        parent = elem.parent
        if parent is None:
            raise AttributeError("AAAAAAAAGH! NO PARENTS! DEATH!")
        
        # Find the position of elem among its siblings. Start from None so
        # that a failed search is reported instead of raising a NameError.
        index = None
        for i in range(len(parent.contents)) :
            if parent.contents[i] == elem :
                index = i
                break
        if index is None:
            raise AttributeError("Element not found in its parent's contents")
                
        elem.extract()
        
        # The following approach breaks horribly, sporadically.
        # for i in range(len(elem.contents)) :
        #     elem.contents[i].extract()
        #     parent.contents.insert(index+i, elem.contents[i])
        # return
        self._safe_inject(parent, index, elem.contents)
        
    def _safe_inject(self, dest, dest_index, node_list):
        #BeautifulSoup result sets look like lists but don't behave right
        # i.e. empty ones are still True,
        if not len(node_list) : return
        node_list = [i for i in node_list]
        node_list.reverse()
        for i in node_list :
            dest.insert(dest_index, i)
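
The reverse-then-insert idiom used by _safe_inject is easiest to check on an ordinary list. This is a small sketch under that assumption: the name `inject` and the use of plain Python lists in place of BeautifulSoup nodes are illustrative, not part of the module.

```python
def inject(dest, dest_index, node_list):
    """Insert every item of node_list into dest, starting at dest_index.

    Inserting the items in reverse order at the same index leaves them in
    their original order, because each insert pushes the earlier ones right.
    """
    nodes = list(node_list)  # copy first, so the source is not mutated
    if not nodes:
        return
    for node in reversed(nodes):
        dest.insert(dest_index, node)


parent = ["A ", " C"]
inject(parent, 1, ["B1", "B2"])
# parent is now ["A ", "B1", "B2", " C"]
```

Copying the input before iterating matters in the real method, since inserting a BeautifulSoup node elsewhere removes it from the list it came from.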
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   504
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   505
        
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   506
class Htmlator(object) :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   507
    """
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   508
    converts a string into a series of html paragraphs
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   509
    """
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   510
    settings = {
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   511
        "encode_xml_specials" : True,
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   512
        "is_plaintext" : True,
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   513
        "convert_newlines" : False,
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   514
        "make_links" : True,
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   515
        "auto_convert" : False,
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   516
        "valid_schemes" : valid_schemes,
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   517
    }
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   518
    def __init__(self, string = "",  **kwargs):
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   519
        self.settings.update(kwargs)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   520
        super(Htmlator, self).__init__(string, **kwargs)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   521
        self.string = string
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   522
    
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   523
    def _set_string(self, string):
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   524
        self.string = string
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   525
        if self.settings['auto_convert'] : self.convert()
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   526
        
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   527
    def _get_string(self):
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   528
        return unicode(self._soup)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   529
    
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   530
    string = property(_get_string, _set_string)
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   531
    
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   532
    def __call__(self, string):
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   533
        """
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   534
        convenience method supporting one-step calling of an instance
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   535
        as a string cleaning function
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   536
        """
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   537
        self.string = string
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   538
        self.convert()
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   539
        return self.string
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   540
        
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   541
    def convert(self):
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   542
        for method in ["encode_xml_specials", "convert_newlines",
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   543
          "make_links"] :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   544
            if self.settings(method) :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   545
                getattr(self, method)()
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   546
    
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   547
    def encode_xml_specials(self) :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   548
        for char in XML_ENTITIES.keys() :
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   549
            self.string.replace(char, XML_ENTITIES[char])
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   550
        
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   551
    def make_links(self):
9698749e2375 Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff changeset
   552
        raise NotImplementedError
        
    def convert_newlines(self) :
        self.string = ''.join([
            '<p>' + line + '</p>' for line in LINE_EXTRACTION_RE.findall(self.string)
        ])
        
def _test():
    import doctest
    doctest.testmod()
if __name__ == "__main__":
    _test()
# def cast_input_to_soup(fn):
#     """
#     Decorate function to handle strings as BeautifulSoups transparently
#     """
#     def stringy_version(input, *args, **kwargs) :
#         if not isinstance(input,BeautifulSoup) :
#             input=BeautifulSoup(input)
#         return fn(input, *args, **kwargs)
#     return stringy_version