| author | Todd Larsen <tlarsen@google.com> |
| Thu, 28 Aug 2008 21:35:34 +0000 | |
| changeset 114 | 9998e95ce609 |
| parent 54 | 03e267d67478 |
| permissions | -rw-r--r-- |
|
54
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
1 |
# Performance note: I benchmarked this code using a set instead of |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
2 |
# a list for the stopwords and was surprised to find that the list |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
3 |
# performed /better/ than the set - maybe because it's only a small |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
4 |
# list. |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
5 |
|
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
6 |
stopwords = ''' |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
7 |
i |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
8 |
a |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
9 |
an |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
10 |
are |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
11 |
as |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
12 |
at |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
13 |
be |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
14 |
by |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
15 |
for |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
16 |
from |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
17 |
how |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
18 |
in |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
19 |
is |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
20 |
it |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
21 |
of |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
22 |
on |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
23 |
or |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
24 |
that |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
25 |
the |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
26 |
this |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
27 |
to |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
28 |
was |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
29 |
what |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
30 |
when |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
31 |
where |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
32 |
'''.split() |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
33 |
|
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
34 |
def strip_stopwords(sentence): |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
35 |
"Removes stopwords - also normalizes whitespace" |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
36 |
words = sentence.split() |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
37 |
sentence = [] |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
38 |
for word in words: |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
39 |
if word.lower() not in stopwords: |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
40 |
sentence.append(word) |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
41 |
return u' '.join(sentence) |
|
03e267d67478
Major reorganization of the soc svn repo, to merge into a single App Engine
Todd Larsen <tlarsen@google.com>
parents:
diff
changeset
|
42 |