author | Lennard de Rijk <ljvderijk@gmail.com> |
Mon, 16 Mar 2009 18:00:39 +0000 | |
changeset 1892 | 51cdacd67ef1 |
parent 54 | 03e267d67478 |
permissions | -rw-r--r-- |
54
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
# Performance note: I benchmarked this code using a set instead of
# a list for the stopwords and was surprised to find that the list
# performed /better/ than the set - maybe because it's only a small
# list.

stopwords = '''
i
a
an
are
as
at
be
by
for
from
how
in
is
it
of
on
or
that
the
this
to
was
what
when
where
'''.split()

def strip_stopwords(sentence):
    "Removes stopwords - also normalizes whitespace"
    words = sentence.split()
    sentence = []
    for word in words:
        if word.lower() not in stopwords:
            sentence.append(word)
    return u' '.join(sentence)
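
A short usage sketch for `strip_stopwords`, restated in a self-contained form so it can be run on its own. The compact one-line word list and the generator expression are illustrative rewrites, not the file's exact code, and the sample sentence is made up.

```python
# Same 25 stopwords as in the file, written on one line for brevity.
stopwords = ("i a an are as at be by for from how in is it of on or "
             "that the this to was what when where").split()


def strip_stopwords(sentence):
    """Remove stopwords and collapse runs of whitespace to single spaces."""
    # split() with no argument drops leading, trailing and repeated whitespace.
    kept = [word for word in sentence.split() if word.lower() not in stopwords]
    return " ".join(kept)


# Hypothetical input with irregular spacing and mixed case.
result = strip_stopwords("What  is the   capital of France")
print(result)  # capital France
```

Matching is done on whole whitespace-separated tokens, so a token with attached punctuation such as `the,` is not recognized as a stopword and is kept unchanged.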
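
The performance note at the top of the file could be re-checked with the standard `timeit` module. The following is only a sketch: the sample sentence, the helper name `count_kept` and the iteration count are assumptions, and the timings depend on the interpreter version and the machine, so it makes no claim about which container wins.

```python
import timeit

# Same 25 stopwords as in the file, held in both container types.
words = ("i a an are as at be by for from how in is it of on or "
         "that the this to was what when where").split()
as_list = list(words)
as_set = set(words)

# Hypothetical sample input; real query text may behave differently.
sample = "how is it that the quick brown fox was at the river when it rained".split()


def count_kept(container):
    """Count the sample words that are not stopwords in the given container."""
    return sum(1 for word in sample if word.lower() not in container)


# Time the same membership tests against each container.
list_time = timeit.timeit(lambda: count_kept(as_list), number=20000)
set_time = timeit.timeit(lambda: count_kept(as_set), number=20000)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

Both containers give the same filtering result, so switching between them changes only the lookup cost, not the behavior of `strip_stopwords`.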