author: Lennard de Rijk <ljvderijk@gmail.com>
date: Thu, 04 Jun 2009 21:58:05 +0200
changeset 2384: 71780864a5ed
parent 2324: 9698749e2375
child 2555: b7f14c803619
permissions: -rw-r--r--
2324
9698749e2375
Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
# -*- coding: UTF-8 -*-
"""
some input filters, for regularising the html fragments from screen scraping and
browser-based editors into some semblance of sanity

TODO: turn the messy setting[method_name]=True filter syntax into a list of cleaning methods to invoke, so that they can be invoked in a specific order and multiple times.

AUTHORS:
Dan MacKinlay - https://launchpad.net/~dan-possumpalace
Collin Grady - http://launchpad.net/~collin-collingrady
Andreas Gustafsson - https://bugs.launchpad.net/~gson
Håkan W - https://launchpad.net/~hwaara-gmail
"""

import BeautifulSoup
import re
import sys

# Python 2.4 compatibility
try: any
except NameError:
    def any(iterable):
        for element in iterable:
            if element:
                return True
        return False

"""
html5lib compatibility. Basically, we need to know that this still works whether html5lib
is imported or not. Should run complete suites of tests for both possible configs -
or test in virtual environments, but for now a basic sanity check will do.
>>> if html5:
...     c=Cleaner(html5=False)
...     c(u'<p>foo</p>')
u'<p>foo</p>'
"""
try:
    import html5lib
    from html5lib import sanitizer, treebuilders
    parser = html5lib.HTMLParser(
        tree=treebuilders.getTreeBuilder("beautifulsoup"),
        tokenizer=sanitizer.HTMLSanitizer
    )
    html5 = True
except ImportError:
    html5 = False
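
# Illustrative sketch only, not part of HtmlSanitizer: the try/except above is
# the usual optional-dependency pattern, where a module-level flag records
# whether the import succeeded. The helper name demo_optional_import and the
# module names below are assumptions made for this example.

```python
def demo_optional_import(module_name):
    """Return (module, True) if module_name can be imported, else (None, False)."""
    try:
        module = __import__(module_name)
    except ImportError:
        # Missing dependency: report it through the flag instead of failing
        return None, False
    return module, True


# A standard-library module that is present, and a made-up one that is not
demo_json, demo_have_json = demo_optional_import("json")
demo_missing, demo_have_missing = demo_optional_import("demo_module_that_does_not_exist")
```

# Callers then branch on the flag, much as the html5 setting is consulted
# before the html5lib parser is used.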

ANTI_JS_RE=re.compile(r'j\s*a\s*v\s*a\s*s\s*c\s*r\s*i\s*p\s*t\s*:', re.IGNORECASE)
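
# Illustrative sketch only: what the pattern above does and does not catch.
# The names DEMO_ANTI_JS_RE and demo_looks_like_js_url are hypothetical. The
# last case in the usage note (entity-encoded whitespace) is an observation
# about the regex alone; whether entities are decoded before this check runs
# is not shown in this part of the file.

```python
import re

# Same pattern as ANTI_JS_RE, written as a raw string
DEMO_ANTI_JS_RE = re.compile(
    r'j\s*a\s*v\s*a\s*s\s*c\s*r\s*i\s*p\s*t\s*:', re.IGNORECASE)


def demo_looks_like_js_url(value):
    """True if value contains a javascript: scheme, even one padded with whitespace."""
    return DEMO_ANTI_JS_RE.search(value) is not None
```

# Whitespace between the letters and mixed case are both matched, so
# 'JaVa\tScRiPt :alert(1)' is flagged; 'jav&#x09;ascript:alert(1)' is not,
# because the regex sees the undecoded entity rather than a tab.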
#These tags and attrs are sufficiently liberal to let microformats through...
#it ruthlessly culls all the rdf, dublin core metadata and so on.
valid_tags = dict.fromkeys('p i em strong b u a h1 h2 h3 pre abbr br img dd dt ol ul li span sub sup ins del blockquote table tr td th address cite'.split()) #div?
valid_attrs = dict.fromkeys('href src rel title'.split())
valid_schemes = dict.fromkeys('http https'.split())
elem_map = {'b' : 'strong', 'i': 'em'}
attrs_considered_links = dict.fromkeys("src href".split()) #should include
#courtesy http://developer.mozilla.org/en/docs/HTML:Block-level_elements
block_elements = dict.fromkeys(["p", "h1","h2", "h3", "h4", "h5", "h6", "ol", "ul", "pre", "address", "blockquote", "dl", "div", "fieldset", "form", "hr", "noscript", "table"])

#convenient default filter lists.
paranoid_filters = ["strip_comments", "strip_tags", "strip_attrs",
    "strip_schemes", "rename_tags", "wrap_string", "strip_empty_tags", "strip_empty_tags", ]
complete_filters = ["strip_comments", "rename_tags", "strip_tags", "strip_attrs",
    "strip_cdata", "strip_schemes", "wrap_string", "strip_empty_tags", "rebase_links", "reparse"]

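# Illustrative sketch only: one way a list of filter names like the ones above
# could drive an ordered, repeatable pipeline. How Cleaner.clean() actually
# dispatches is not shown in this part of the file, so the getattr lookup, the
# class DemoPipeline and its three toy filters are all assumptions.

```python
class DemoPipeline(object):
    """Runs named methods in the order given; a name may appear more than once."""

    def __init__(self, *filters):
        self.filters = filters
        self.text = u""

    def strip_spaces(self):
        self.text = self.text.strip()

    def collapse_spaces(self):
        self.text = u" ".join(self.text.split())

    def lowercase(self):
        self.text = self.text.lower()

    def clean(self):
        for name in self.filters:
            # Resolve each filter name to a bound method and call it
            getattr(self, name)()
        return self.text
```

# Because the filters are an ordered sequence, the same step can run twice,
# which is what the repeated "strip_empty_tags" entry above relies on.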
#set some conservative default string processings
default_settings = {
    "filters" : paranoid_filters,
    "block_elements" : block_elements,
    "convert_entities" : "html", #xml or None for a more liberal version
    "valid_tags" : valid_tags,
    "valid_attrs" : valid_attrs,
    "valid_schemes" : valid_schemes,
    "attrs_considered_links" : attrs_considered_links,
    "elem_map" : elem_map,
    "wrapping_element" : "p",
    "auto_clean" : False,
    "original_url" : "",
    "new_url" : "",
    "html5" : html5
}
#processes I'd like but haven't implemented
#"encode_xml_specials", "ensure complete xhtml doc", "ensure_xhtml_fragment_only"
# and some handling of permitted namespaces for tags. for RDF, say. maybe.
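
# Illustrative sketch only: how per-instance settings can be layered over a
# defaults dict with copy() and update(), mirroring what Cleaner.__init__ does
# with default_settings. The names demo_defaults and demo_make_settings are
# hypothetical, and the two keys are a cut-down stand-in for the real table.

```python
demo_defaults = {
    "filters": ["strip_comments", "strip_tags"],
    "wrapping_element": "p",
}


def demo_make_settings(*filters, **overrides):
    """Build a settings dict: defaults, then keyword overrides, then positional filters."""
    settings = demo_defaults.copy()   # shallow copy: only the top-level dict is new
    settings.update(overrides)
    if filters:
        settings["filters"] = filters
    return settings
```

# Note that copy() is shallow: an instance that mutates settings["filters"] in
# place, rather than replacing it, would also change the shared default list.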

XML_ENTITIES = { u"'" : u"&apos;",
                 u'"' : u"&quot;",
                 u"&" : u"&amp;",
                 u"<" : u"&lt;",
                 u">" : u"&gt;"
               }
LINE_EXTRACTION_RE = re.compile(".+", re.MULTILINE)
BR_EXTRACTION_RE = re.compile("</?br ?/?>", re.MULTILINE)
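
# Illustrative sketch only: how an entity table and the two regexes above
# might be applied. The functions demo_escape and demo_split_lines are
# hypothetical; the module's own use of these constants is not shown in this
# part of the file. The DEMO_ constants repeat the definitions above so the
# block runs on its own.

```python
import re

DEMO_XML_ENTITIES = {
    u"'": u"&apos;",
    u'"': u"&quot;",
    u"&": u"&amp;",
    u"<": u"&lt;",
    u">": u"&gt;",
}
DEMO_LINE_RE = re.compile(".+", re.MULTILINE)
DEMO_BR_RE = re.compile("</?br ?/?>", re.MULTILINE)


def demo_escape(text):
    """Escape the five XML special characters in a single pass."""
    # Per-character lookup, so the "&" of an emitted entity is never re-escaped
    return u"".join(DEMO_XML_ENTITIES.get(ch, ch) for ch in text)


def demo_split_lines(fragment):
    """Treat <br>, <br/>, <br /> and </br> as line breaks and return non-empty lines."""
    return DEMO_LINE_RE.findall(DEMO_BR_RE.sub(u"\n", fragment))
```

# ".+" never matches an empty line, so runs of consecutive breaks collapse
# without any extra filtering.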

class Stop:
    """
    handy class that we use as a stop input for our state machine in lieu of falling
    off the end of lists
    """
    pass
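
# Illustrative sketch only: the sentinel idea described in the Stop docstring.
# Appending a marker after the last real input lets a state machine flush its
# final state inside the loop instead of in special-case code after it. The
# names DemoStop and demo_group_runs are hypothetical, and this toy grouping
# machine is not the one the module implements.

```python
class DemoStop:
    """Marker fed to the loop after the last real input."""
    pass


def demo_group_runs(items):
    """Group consecutive equal items into lists."""
    runs = []
    current = []
    for item in list(items) + [DemoStop]:
        # A change of value, or the sentinel, closes the run in progress
        if current and (item is DemoStop or item != current[-1]):
            runs.append(current)
            current = []
        if item is not DemoStop:
            current.append(item)
    return runs
```

# Without the sentinel, the last run would need a separate "append whatever is
# left" step after the loop.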


class Cleaner(object):
    r"""
    powerful and slow arbitrary HTML sanitisation. can deal (i hope) with most XSS
    vectors and layout-breaking badness.
    Probably overkill for content from trusted sources; defaults are accordingly
    set to be paranoid.
    >>> bad_html = '<p style="forbidden markup"><!-- XSS attach -->content</p'
    >>> good_html = u'<p>content</p>'
    >>> c = Cleaner()
    >>> c.string = bad_html
    >>> c.clean()
    >>> c.string == good_html
    True

    Also supports shorthand syntax:
    >>> c = Cleaner()
    >>> c(bad_html) == c(good_html)
    True
    """

    def __init__(self, string_or_soup="", *args, **kwargs):
        self.settings=default_settings.copy()
        self.settings.update(kwargs)
        if args :
            self.settings['filters'] = args
        super(Cleaner, self).__init__(string_or_soup, *args, **kwargs)
        self.string = string_or_soup

    def __call__(self, string = None, **kwargs):
        """
        convenience method allowing one-step calling of an instance and returning
        a cleaned string.

        TODO: make this method preserve internal state- perhaps by creating a new
        instance.

        >>> s = 'input string'
        >>> c1 = Cleaner(s, auto_clean=True)
        >>> c2 = Cleaner("")
        >>> c1.string == c2(s)
        True

        """
        self.settings.update(kwargs)
        if string is not None:
            self.string = string
        self.clean()
        return self.string

    def _set_contents(self, string_or_soup):
        if isinstance(string_or_soup, BeautifulSoup.BeautifulSoup) :
            self._set_soup(string_or_soup)
        else :
            self._set_string(string_or_soup)

    def _set_string(self, html_fragment_string):
        if self.settings['html5']:
            s = parser.parse(html_fragment_string).body
        else:
            s = BeautifulSoup.BeautifulSoup(
                html_fragment_string,
                convertEntities=self.settings['convert_entities'])
        self._set_soup(s)

    def _set_soup(self, soup):
        """
        Does all the work of set_string, but bypasses a potential autoclean to avoid
        loops upon internal string setting ops.
        """
        self._soup = BeautifulSoup.BeautifulSoup(
            '<rootrootroot></rootrootroot>'
        )
        self.root=self._soup.contents[0]

        if len(soup.contents) :
            backwards_soup = [i for i in soup.contents]
            backwards_soup.reverse()
        else :
            backwards_soup = []
        for i in backwards_soup :
            i.extract()
            self.root.insert(0, i)
9698749e2375
Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff
changeset
|
184 |
|
    def set_string(self, string) :
        ur"""
        sets the string to process and does the necessary input encoding too.
        really intended to be invoked as a property.
        note the godawful rootrootroot element, which we need because the
        BeautifulSoup object has all the same methods as a Tag, but
        behaves differently, silently failing on some inserts and appends.

        >>> c = Cleaner(convert_entities="html")
        >>> c.string = '&eacute;'
        >>> c.string
        u'\xe9'
        >>> c = Cleaner(convert_entities="xml")
        >>> c.string = u'&eacute;'
        >>> c.string
        u'&eacute;'
        """
        self._set_string(string)
        if len(string) and self.settings['auto_clean'] : self.clean()

    def get_string(self):
        return unicode(self.root.renderContents())

    string = property(get_string, set_string)

    def clean(self):
        """
        invoke all cleaning processes stipulated in the settings
        """
        for method in self.settings['filters'] :
            try :
                getattr(self, method)()
            except NotImplementedError :
                sys.stderr.write('Warning, called unimplemented method %s\n' % method)

    def strip_comments(self):
        r"""
        XHTML comments are used as an XSS attack vector. they must die.

        >>> c = Cleaner("", "strip_comments")
        >>> c('<p>text<!-- comment --> More text</p>')
        u'<p>text More text</p>'
        """
        for comment in self.root.findAll(
            text = lambda text: isinstance(text, BeautifulSoup.Comment)):
            comment.extract()

    def strip_cdata(self):
        for cdata in self.root.findAll(
            text = lambda text: isinstance(text, BeautifulSoup.CData)):
            cdata.extract()

    def strip_tags(self):
        r"""
        ill-considered tags break our layout. they must die.
        >>> c = Cleaner("", "strip_tags", auto_clean=True)
        >>> c.string = '<div>A <strong>B C</strong></div>'
        >>> c.string
        u'A <strong>B C</strong>'
        >>> c.string = '<div>A <div>B C</div></div>'
        >>> c.string
        u'A B C'
        >>> c.string = '<div>A <br /><div>B C</div></div>'
        >>> c.string
        u'A <br />B C'
        >>> c.string = '<p>A <div>B C</div></p>'
        >>> c.string
        u'<p>A B C</p>'
        >>> c.string = 'A<div>B<div>C<div>D</div>E</div>F</div>G'
        >>> c.string
        u'ABCDEFG'
        >>> c.string = '<div>B<div>C<div>D</div>E</div>F</div>'
        >>> c.string
        u'BCDEF'
        """
        # Beautiful Soup doesn't support dynamic .findAll results when the tree is
        # modified in place.
        # going backwards doesn't seem to help.
        # so find one at a time
        while True :
            next_bad_tag = self.root.find(
              lambda tag : tag.name not in self.settings['valid_tags']
            )
            if next_bad_tag :
                self.disgorge_elem(next_bad_tag)
            else:
                break

    def strip_attrs(self):
        """
        preserve only those attributes we need in the soup
        >>> c = Cleaner("", "strip_attrs")
        >>> c('<div title="v" bad="v">A <strong title="v" bad="v">B C</strong></div>')
        u'<div title="v">A <strong title="v">B C</strong></div>'
        """
        for tag in self.root.findAll(True):
            tag.attrs = [(attr, val) for attr, val in tag.attrs
                         if attr in self.settings['valid_attrs']]

    def _all_links(self):
        """
        finds all tags with link attributes sequentially. safe against modification
        of said attributes in-place.
        """
        start = self.root
        while True:
            # pass the generator straight to any(); wrapping it in a list
            # makes the test always true, so every tag would match
            tag = start.findNext(
              lambda tag : any(
                tag.get(i) for i in self.settings['attrs_considered_links']
              ))
            if tag:
                start = tag
                yield tag
            else :
                break

    def strip_schemes(self):
        """
        >>> c = Cleaner("", "strip_schemes")
        >>> c('<img src="javascript:alert();" />')
        u'<img />'
        >>> c('<a href="javascript:alert();">foo</a>')
        u'<a>foo</a>'
        """
        for tag in self._all_links() :
            for key in self.settings['attrs_considered_links'] :
                scheme_bits = tag.get(key, u"").split(u':',1)
                if len(scheme_bits) == 1 :
                    pass #relative link
                else:
                    if scheme_bits[0] not in self.settings['valid_schemes'] :
                        del tag[key]

    def br_to_p(self):
        """
        >>> c = Cleaner("", "br_to_p")
        >>> c('<p>A<br />B</p>')
        u'<p>A</p><p>B</p>'
        >>> c('A<br />B')
        u'<p>A</p><p>B</p>'
        """
        # note: this adds 'br' and 'p' to self.settings['block_elements'] in place
        block_elems = self.settings['block_elements']
        block_elems['br'] = None
        block_elems['p'] = None

        while True :
            next_br = self.root.find('br')
            if not next_br: break
            parent = next_br.parent
            self.wrap_string('p', start_at=parent, block_elems = block_elems)
            while True:
                useless_br=parent.find('br', recursive=False)
                if not useless_br: break
                useless_br.extract()
            if parent.name == 'p':
                self.disgorge_elem(parent)

    def rename_tags(self):
        """
        >>> c = Cleaner("", "rename_tags", elem_map={'i': 'em'})
        >>> c('<b>A<i>B</i></b>')
        u'<b>A<em>B</em></b>'
        """
        for tag in self.root.findAll(self.settings['elem_map']) :
            tag.name = self.settings['elem_map'][tag.name]

    def wrap_string(self, wrapping_element = None, start_at=None, block_elems=None):
        """
        takes an html fragment, which may or may not have a single containing element,
        and guarantees the tag name of the topmost elements.
        TODO: is there some simpler way than a state machine to do this simple thing?
        >>> c = Cleaner("", "wrap_string")
        >>> c('A <strong>B C</strong>D')
        u'<p>A <strong>B C</strong>D</p>'
        >>> c('A <p>B C</p>D')
        u'<p>A </p><p>B C</p><p>D</p>'
        """
        if not start_at : start_at = self.root
        if not block_elems : block_elems = self.settings['block_elements']
        e = (wrapping_element or self.settings['wrapping_element'])
        paragraph_list = []
        children = [elem for elem in start_at.contents]
        children.append(Stop())

        last_state = 'block'
        paragraph = BeautifulSoup.Tag(self._soup, e)

        for node in children :
            if isinstance(node, Stop) :
                state = 'end'
            elif hasattr(node, 'name') and node.name in block_elems:
                state = 'block'
            else:
                state = 'inline'

            if last_state == 'block' and state == 'inline':
                #collate inline elements
                paragraph = BeautifulSoup.Tag(self._soup, e)

            if state == 'inline' :
                paragraph.append(node)

            if ((state != 'inline') and last_state == 'inline') :
                paragraph_list.append(paragraph)

            if state == 'block' :
                paragraph_list.append(node)
9698749e2375
Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff
changeset
|
392 |
|
9698749e2375
Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff
changeset
|
393 |
last_state = state |
9698749e2375
Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff
changeset
|
394 |
|
9698749e2375
Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff
changeset
|
395 |
#can't use append since it doesn't work on empty elements... |
9698749e2375
Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff
changeset
|
396 |
paragraph_list.reverse() |
9698749e2375
Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff
changeset
|
397 |
for paragraph in paragraph_list: |
9698749e2375
Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff
changeset
|
398 |
start_at.insert(0, paragraph) |
9698749e2375
Add HtmlSanitizer python module to Melange repository.
Pawel Solyga <Pawel.Solyga@gmail.com>
parents:
diff
changeset
|
399 |
|
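The loop above is a small state machine: consecutive inline nodes are gathered into one new paragraph, and block-level nodes pass through untouched. A minimal sketch of the same grouping on plain `(name, text)` tuples follows. It does not use BeautifulSoup; `BLOCK_ELEMENTS` and `collate_inline_runs` are hypothetical names chosen for illustration, and a final flush stands in for the `Stop()` sentinel.

```python
# Hypothetical set of block-level tag names, for illustration only.
BLOCK_ELEMENTS = frozenset(["p", "div", "ul", "blockquote"])


def collate_inline_runs(nodes, block_elements=BLOCK_ELEMENTS):
    """Group consecutive inline nodes; pass block nodes through unchanged.

    `nodes` is a list of (name, text) pairs, where name is None for bare
    text. Each item of the result is either a block node or a list of inline
    nodes (the list stands in for a freshly created paragraph).
    """
    result = []
    run = []  # inline nodes seen since the last block element
    for node in nodes:
        if node[0] in block_elements:
            if run:  # a block element closes the current inline run
                result.append(run)
                run = []
            result.append(node)
        else:
            run.append(node)
    if run:  # flush the trailing run; the code above uses Stop() for this
        result.append(run)
    return result
```

Called on bare text, an `em`, a `p` and more bare text, this returns two inline groups with the `p` node between them.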
    def strip_empty_tags(self):
        """
        strip out all empty tags
        TODO: depth-first search
        >>> c = Cleaner("", "strip_empty_tags")
        >>> c('<p>A</p><p></p><p>B</p><p></p>')
        u'<p>A</p><p>B</p>'
        >>> c('<p><a></a></p>')
        u'<p></p>'
        """
        tag = self.root
        while True:
            next_tag = tag.findNext(True)
            if not next_tag:
                break
            if next_tag.contents or next_tag.attrs:
                tag = next_tag
                continue
            next_tag.extract()

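The docstring's TODO asks for a depth-first search: in the second doctest the `<p>` survives because it was visited before its empty child was removed. A sketch of the depth-first variant is given below, using the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs on its own. `strip_empty` is a hypothetical helper, not part of this module, and "empty" is taken to mean no children, no attributes and no text, mirroring the `contents or attrs` test above.

```python
import xml.etree.ElementTree as ET


def strip_empty(parent):
    """Remove empty descendants of parent, visiting children before parents."""
    for child in list(parent):  # copy, because parent is mutated in the loop
        strip_empty(child)      # clean the subtree first (post-order)
        if len(child) or child.attrib or child.text:
            continue            # still has children, attributes or text
        # ElementTree stores the text that follows an element on the element
        # itself (its tail), so hand the tail over before removing it.
        tail = child.tail or ""
        index = list(parent).index(child)
        if index == 0:
            parent.text = (parent.text or "") + tail
        else:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + tail
        parent.remove(child)
```

Because children are cleaned before their parent is tested, a `<p>` that only held an empty `<a>` is removed as well, which the single forward pass above leaves behind.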
    def rebase_links(self, original_url="", new_url=""):
        if not original_url:
            original_url = self.settings.get('original_url', '')
        if not new_url:
            new_url = self.settings.get('new_url', '')
        raise NotImplementedError

    # Because of its internal character set handling,
    # the following will not work in BeautifulSoup and is hopefully redundant.
    # def encode_xml_specials(self, original_url="", new_url ="") :
    #     """
    #     BeautifulSoup will let some dangerous xml entities hang around
    #     in the navigable strings. destroy all monsters.
    #     >>> c = Cleaner(auto_clean=True, encode_xml_specials=True)
    #     >>> c('<<<<<')
    #     u'<<<<'
    #     """
    #     for string in self.root.findAll(text=True) :
    #         sys.stderr.write("root" +"\n")
    #         sys.stderr.write(str(self.root) +"\n")
    #         sys.stderr.write("parent" +"\n")
    #         sys.stderr.write(str(string.parent) +"\n")
    #         new_string = unicode(string)
    #         sys.stderr.write(string +"\n")
    #         for special_char in XML_ENTITIES.keys() :
    #             sys.stderr.write(special_char +"\n")
    #             string.replaceWith(
    #                 new_string.replace(special_char, XML_ENTITIES[special_char])
    #             )


    def disgorge_elem(self, elem):
        """
        Remove the given element from the soup and replace it with its own
        contents. This is actually tricky, since you can't replace an element
        with a list of elements using replaceWith.
        >>> disgorgeable_string = '<body>A <em>B</em> C</body>'
        >>> c = Cleaner()
        >>> c.string = disgorgeable_string
        >>> elem = c._soup.find('em')
        >>> c.disgorge_elem(elem)
        >>> c.string
        u'<body>A B C</body>'
        >>> c.string = disgorgeable_string
        >>> elem = c._soup.find('body')
        >>> c.disgorge_elem(elem)
        >>> c.string
        u'A <em>B</em> C'
        >>> c.string = '<div>A <div id="inner">B C</div></div>'
        >>> elem = c._soup.find(id="inner")
        >>> c.disgorge_elem(elem)
        >>> c.string
        u'<div>A B C</div>'
        """
        if elem == self.root :
            raise AttributeError("Can't disgorge root")

        # With in-place modification, BeautifulSoup can occasionally return
        # elements that think they are orphans.
        # This lib is full of workarounds, but it's worth checking.
        parent = elem.parent
        if parent is None:
            raise AttributeError("AAAAAAAAGH! NO PARENTS! DEATH!")

        i = None
        for i in range(len(parent.contents)) :
            if parent.contents[i] == elem :
                index = i
                break

        elem.extract()

        # the following method breaks horribly, sporadically.
        # for i in range(len(elem.contents)) :
        #     elem.contents[i].extract()
        #     parent.contents.insert(index+i, elem.contents[i])
        # return
        self._safe_inject(parent, index, elem.contents)

    def _safe_inject(self, dest, dest_index, node_list):
        # BeautifulSoup result sets look like lists but don't behave right,
        # i.e. empty ones are still True
        if not len(node_list):
            return
        node_list = [i for i in node_list]
        node_list.reverse()
        for i in node_list :
            dest.insert(dest_index, i)


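`disgorge_elem` and `_safe_inject` together replace an element with its own children: record the element's position, detach it, then insert a copy of the children at that same position in reverse order, so they come out in their original order. The sketch below shows the same idea on nested Python lists rather than BeautifulSoup nodes; `inject` and `disgorge` are hypothetical stand-ins, not the methods above.

```python
def inject(dest, dest_index, node_list):
    """Insert every item of node_list into dest at dest_index, keeping order."""
    # Copy first: the source may be a live view that changes while we insert.
    # Inserting the items back to front at a fixed index preserves their order.
    for node in reversed(list(node_list)):
        dest.insert(dest_index, node)


def disgorge(parent, elem):
    """Replace elem (a nested list inside parent) with its own items."""
    # Compare by identity so an equal-looking sibling is never picked.
    index = next(i for i, child in enumerate(parent) if child is elem)
    contents = list(elem)
    del parent[index]
    inject(parent, index, contents)
```

For example, disgorging the inner list of `["A", ["B", "C"], "D"]` leaves `["A", "B", "C", "D"]`.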
class Htmlator(object) :
    """
    converts a string into a series of html paragraphs
    """
    settings = {
        "encode_xml_specials" : True,
        "is_plaintext" : True,
        "convert_newlines" : False,
        "make_links" : True,
        "auto_convert" : False,
        "valid_schemes" : valid_schemes,
    }

    def __init__(self, string = "", **kwargs):
        self.settings.update(kwargs)
        # object.__init__ takes no extra arguments
        super(Htmlator, self).__init__()
        self.string = string

    def _set_string(self, string):
        # store in a private attribute; assigning to self.string here would
        # re-enter this setter and recurse forever
        self._string = string
        if self.settings['auto_convert'] : self.convert()

    def _get_string(self):
        # Htmlator works on a plain string and has no _soup to serialise
        return self._string

    string = property(_get_string, _set_string)

    def __call__(self, string):
        """
        convenience method supporting one-step calling of an instance
        as a string cleaning function
        """
        self.string = string
        self.convert()
        return self.string

    def convert(self):
        for method in ["encode_xml_specials", "convert_newlines",
                       "make_links"] :
            # settings is a dict, so look the flag up rather than calling it
            if self.settings[method] :
                getattr(self, method)()

    def encode_xml_specials(self) :
        # replace '&' first so that the entities written for the other
        # characters are not escaped a second time; str.replace returns a
        # new string, so the result has to be stored
        for char in sorted(XML_ENTITIES.keys(), key=lambda c: c != '&') :
            self._string = self._string.replace(char, XML_ENTITIES[char])

    def make_links(self):
        raise NotImplementedError

    def convert_newlines(self) :
        self._string = ''.join([
            '<p>' + line + '</p>' for line in LINE_EXTRACTION_RE.findall(self._string)
        ])

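`Htmlator.convert` is meant to run plain text through two steps: escape the XML special characters, then wrap each line in a paragraph. A self-contained sketch of that pipeline follows. It uses `xml.sax.saxutils.escape` from the standard library rather than this module's `XML_ENTITIES` table, and `LINE_RE` is a hypothetical stand-in for `LINE_EXTRACTION_RE`, whose definition lies outside this excerpt; here it is assumed to match each non-empty line.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical stand-in for LINE_EXTRACTION_RE: one match per non-empty line.
LINE_RE = re.compile(r"[^\r\n]+")


def text_to_paragraphs(text):
    """Escape XML special characters, then wrap every line in <p>...</p>."""
    # escape() handles '&', '<' and '>', replacing '&' first so that the
    # entities it writes are not escaped again.
    escaped = escape(text)
    return "".join("<p>" + line + "</p>" for line in LINE_RE.findall(escaped))
```

Escaping before wrapping matters: done the other way round, the `<p>` tags themselves would be turned into entities.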
def _test():
    import doctest
    doctest.testmod()

if __name__ == "__main__":
    _test()


# def cast_input_to_soup(fn):
#     """
#     Decorate function to handle strings as BeautifulSoups transparently
#     """
#     def stringy_version(input, *args, **kwargs) :
#         if not isinstance(input,BeautifulSoup) :
#             input=BeautifulSoup(input)
#         return fn(input, *args, **kwargs)
#     return stringy_version
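
The commented-out decorator above coerces its first argument to a parsed document, so that the wrapped function accepts either a string or an already parsed object. A runnable sketch of that pattern is shown here. `Document` is a hypothetical stand-in for `BeautifulSoup`, and `cast_input` and `markup_length` are illustrative names only.

```python
import functools


class Document(object):
    """Hypothetical stand-in for a parsed document (BeautifulSoup above)."""

    def __init__(self, markup):
        self.markup = markup


def cast_input(fn):
    """Let fn accept either a Document or a plain string of markup."""
    @functools.wraps(fn)  # keep fn's name and docstring on the wrapper
    def wrapper(source, *args, **kwargs):
        if not isinstance(source, Document):
            source = Document(source)
        return fn(source, *args, **kwargs)
    return wrapper


@cast_input
def markup_length(doc):
    """Return the length of the document's markup."""
    return len(doc.markup)
```

With this in place, `markup_length("<p>A</p>")` and `markup_length(Document("<p>A</p>"))` take the same code path after the coercion.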