More text processing
====================

``sort``
--------

Let's say we have a file that lists a few stalwarts of the open source
community, along with some details about them: their "other" name, their
homepage address, and their claim to fame.

::

    Richard Stallman%rms%GNU Project
    Eric Raymond%ESR%Jargon File
    Ian Murdock% %Debian
    Lawrence Lessig% %Creative Commons
    Linus Torvalds% %Linux Kernel
    Guido van Rossum%BDFL%Python
    Larry Wall% %Perl

Suppose we want this list sorted alphabetically by name. The ``sort``
command lets us do this in a flash! Running ``sort`` with the file name
as an argument sorts the lines of the file alphabetically and prints the
result on the terminal.

::

    $ sort stalwarts.txt
    Eric Raymond%ESR%Jargon File
    Guido van Rossum%BDFL%Python
    Ian Murdock% %Debian
    Larry Wall% %Perl
    Lawrence Lessig% %Creative Commons
    Linus Torvalds% %Linux Kernel
    Richard Stallman%rms%GNU Project

If you wish to sort the lines in reverse alphabetical order, you just
need to pass the ``-r`` option.
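Here is a quick sketch of the reverse sort; the sample file is recreated
inline so the snippet runs on its own:

```shell
# Recreate the sample file used throughout this section.
cat > stalwarts.txt <<'EOF'
Richard Stallman%rms%GNU Project
Eric Raymond%ESR%Jargon File
Ian Murdock% %Debian
Lawrence Lessig% %Creative Commons
Linus Torvalds% %Linux Kernel
Guido van Rossum%BDFL%Python
Larry Wall% %Perl
EOF

# -r reverses the sort order, so the lines come out from Z to A.
sort -r stalwarts.txt
```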
|
Now, you might want to sort the lines based on each person's "other"
name or claim to fame instead. For that, we can tell ``sort`` which
column to use. Below is an example that sorts the file based on "other"
names.

::

    $ sort -t % -k 2,2 stalwarts.txt
    Ian Murdock% %Debian
    Larry Wall% %Perl
    Lawrence Lessig% %Creative Commons
    Linus Torvalds% %Linux Kernel
    Guido van Rossum%BDFL%Python
    Eric Raymond%ESR%Jargon File
    Richard Stallman%rms%GNU Project

The ``sort`` command assumes whitespace to be the default delimiter
between the columns of each line. The ``-t`` option specifies the
delimiting character, which is ``%`` in this case.

The ``-k`` option starts a key at position 2 and ends it at position 2,
essentially telling ``sort`` that it should sort based on the 2nd
column, which is the "other" name. ``sort`` also supports resolving ties
using additional columns. You can see that the first four lines of the
output above have nothing in the "other" name column. We could break
these ties by sorting on the project name (the 3rd column).

::

    $ sort -t % -k 2,2 -k 3,3 stalwarts.txt
    Lawrence Lessig% %Creative Commons
    Ian Murdock% %Debian
    Linus Torvalds% %Linux Kernel
    Larry Wall% %Perl
    Guido van Rossum%BDFL%Python
    Eric Raymond%ESR%Jargon File
    Richard Stallman%rms%GNU Project

``sort`` also has a lot of other options, such as ignoring case
differences, sorting by month name (JAN < FEB < ... < DEC), and merging
already-sorted files. ``man sort`` would give you a lot more
information.
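As a quick sketch of those three options, which in GNU/POSIX ``sort``
are ``-f``, ``-M`` and ``-m``, using tiny made-up inputs:

```shell
# -f folds case while comparing, so "apple" and "Banana" interleave naturally.
printf 'cherry\nBanana\napple\n' | sort -f

# -M compares month abbreviations, so JAN < FEB < ... < DEC.
printf 'Mar\nJan\nDec\nFeb\n' | sort -M

# -m merges files that are each already sorted, without re-sorting.
printf 'a\nc\n' > first.txt
printf 'b\nd\n' > second.txt
sort -m first.txt second.txt
```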
|
``uniq``
--------

Suppose we have a list of items, say books, and we wish to obtain a list
that names each book only once, without any duplicates. We use the
``uniq`` command to achieve this.

::

    Programming Pearls
    The C Programming Language
    The Mythical Man Month: Essays on Software Engineering
    Programming Pearls
    The C Programming Language
    Structure and Interpretation of Computer Programs
    Programming Pearls
    Compilers: Principles, Techniques, and Tools
    The C Programming Language
    The Art of UNIX Programming
    Programming Pearls
    The Art of Computer Programming
    Introduction to Algorithms
    The Art of UNIX Programming
    The Pragmatic Programmer: From Journeyman to Master
    Programming Pearls
    Unix Power Tools
    The Art of UNIX Programming

Let us try to get rid of the duplicate lines in this file using the
``uniq`` command.

::

    $ uniq items.txt
    Programming Pearls
    The C Programming Language
    The Mythical Man Month: Essays on Software Engineering
    Programming Pearls
    The C Programming Language
    Structure and Interpretation of Computer Programs
    Programming Pearls
    Compilers: Principles, Techniques, and Tools
    The C Programming Language
    The Art of UNIX Programming
    Programming Pearls
    The Art of Computer Programming
    Introduction to Algorithms
    The Art of UNIX Programming
    The Pragmatic Programmer: From Journeyman to Master
    Programming Pearls
    Unix Power Tools
    The Art of UNIX Programming

The output is identical to the input! Why? The ``uniq`` command removes
duplicate lines only when they are next to each other. So we first
obtain a sorted version of the file, and work with that file henceforth.

::

    $ sort items.txt > items-sorted.txt
    $ uniq items-sorted.txt
    Compilers: Principles, Techniques, and Tools
    Introduction to Algorithms
    Programming Pearls
    Structure and Interpretation of Computer Programs
    The Art of Computer Programming
    The Art of UNIX Programming
    The C Programming Language
    The Mythical Man Month: Essays on Software Engineering
    The Pragmatic Programmer: From Journeyman to Master
    Unix Power Tools

``uniq -u`` prints only the lines that are unique, i.e. have no
duplicates in the file. ``uniq -d`` outputs only those lines that do
have duplicates. The ``-c`` option prefixes each line with the number of
times it occurs in the file.

::

    $ uniq -u items-sorted.txt
    Compilers: Principles, Techniques, and Tools
    Introduction to Algorithms
    Structure and Interpretation of Computer Programs
    The Art of Computer Programming
    The Mythical Man Month: Essays on Software Engineering
    The Pragmatic Programmer: From Journeyman to Master
    Unix Power Tools

    $ uniq -dc items-sorted.txt
          5 Programming Pearls
          3 The Art of UNIX Programming
          3 The C Programming Language
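As an aside, when all you need is the de-duplicated list itself,
``sort -u`` combines the sorting and the ``uniq`` step in one go. A
minimal sketch, using a tiny stand-in for ``items.txt``:

```shell
# A tiny stand-in for items.txt, with a duplicate line.
printf 'Programming Pearls\nUnix Power Tools\nProgramming Pearls\n' > items.txt

# -u sorts the lines and keeps only one copy of each;
# it is equivalent to: sort items.txt | uniq
sort -u items.txt
```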
|
``join``
--------

Now suppose we had the file ``stalwarts1.txt``, which lists the home
pages of all the people listed in ``stalwarts.txt``.

::

    Richard Stallman%http://www.stallman.org
    Eric Raymond%http://www.catb.org/~esr/
    Ian Murdock%http://ianmurdock.com/
    Lawrence Lessig%http://lessig.org
    Linus Torvalds%http://torvalds-family.blogspot.com/
    Guido van Rossum%http://www.python.org/~guido/
    Larry Wall%http://www.wall.org/~larry/

It would be nice to have a single file with the information from both
the files. To achieve this, we use the ``join`` command.

::

    $ join stalwarts.txt stalwarts1.txt -t %
    Richard Stallman%rms%GNU Project%http://www.stallman.org
    Eric Raymond%ESR%Jargon File%http://www.catb.org/~esr/
    Ian Murdock% %Debian%http://ianmurdock.com/
    Lawrence Lessig% %Creative Commons%http://lessig.org
    Linus Torvalds% %Linux Kernel%http://torvalds-family.blogspot.com/
    Guido van Rossum%BDFL%Python%http://www.python.org/~guido/
    Larry Wall% %Perl%http://www.wall.org/~larry/

The ``join`` command joins the two files based on the common field
present in both files, which is the name in this case.

The ``-t`` option again specifies the delimiting character. Unless it is
specified, ``join`` assumes that the fields are separated by whitespace.

Note that, for ``join`` to work, the common field should be in the same
order in both files. If it is not, you can use ``sort`` to sort each
file on the common field and then join them. In the above example, the
common field happens to be the first column in both files. When it is
not, the ``-1`` and ``-2`` options specify which field of the first and
second file, respectively, to join on. For instance, if the name were
the second field of a file ``stalwarts2.txt``, we would pass ``-2 2``.

::

    $ join -2 2 stalwarts.txt stalwarts2.txt -t %
    Richard Stallman%rms%GNU Project%http://www.stallman.org
    Eric Raymond%ESR%Jargon File%http://www.catb.org/~esr/
    Ian Murdock% %Debian%http://ianmurdock.com/
    Lawrence Lessig% %Creative Commons%http://lessig.org
    Linus Torvalds% %Linux Kernel%http://torvalds-family.blogspot.com/
    Guido van Rossum%BDFL%Python%http://www.python.org/~guido/
    Larry Wall% %Perl%http://www.wall.org/~larry/
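To make the point about ordering concrete, here is a sketch that sorts
two small out-of-order files (trimmed, made-up stand-ins for the files
above) before joining them, in the same style as the ``sort`` step used
in the ``uniq`` section:

```shell
# Two unsorted files that share the name as their first field.
printf 'Larry Wall%%Perl\nEric Raymond%%Jargon File\n' > claims.txt
printf 'Eric Raymond%%http://www.catb.org/~esr/\nLarry Wall%%http://www.wall.org/~larry/\n' > pages.txt

# join needs both inputs ordered on the join field, so sort them first.
sort claims.txt > claims-sorted.txt
sort pages.txt > pages-sorted.txt
join -t % claims-sorted.txt pages-sorted.txt
```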
|
Generating a word frequency list
================================

Now let us use the tools we have learnt, to generate a word frequency
list of a text file. We shall use the freely available text of Alice in
Wonderland.

The basic steps to achieve this task would be:

1. Eliminate the punctuation and spaces from the document.
2. Generate a list of words.
3. Count the words.

We first use ``grep`` and an elementary regular expression to pick out
the alphabetic characters.

::

    $ grep "[A-Za-z]*" alice-in-wonderland.txt

This outputs every line that contains any alphabetic characters, which
is nearly all of them, so it isn't of much use. We only want the
alphabetic characters themselves, without any of the other junk. ``man
grep`` shows us the ``-o`` option, which outputs only the text that
matches the regular expression.

::

    $ grep "[A-Za-z]*" -o alice-in-wonderland.txt

Not very surprisingly, we have all the words, spat out in the form of a
list! Now that we have a list of words, it is quite simple to count
their occurrences. You would have realized that we can make use of the
``sort`` and ``uniq`` commands: we pipe the output of ``grep`` to
``sort``, and then pipe its output to ``uniq``.

::

    $ grep "[A-Za-z]*" -o alice-in-wonderland.txt | sort | uniq -c

Notice that you get a list of all the words in the document in
alphabetical order, with the frequency of each written next to it. But
you might have observed that capitalized and lower-case words are being
counted as different words. We therefore replace all the upper-case
characters with lower-case ones, using the ``tr`` command.

::

    $ grep "[A-Za-z]*" -o alice-in-wonderland.txt | tr 'A-Z' 'a-z' | sort | uniq -c
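As an aside, ``tr`` also understands POSIX character classes;
``'[:upper:]'`` and ``'[:lower:]'`` do the same job as ``'A-Z'`` and
``'a-z'`` here, but behave sensibly regardless of the locale:

```shell
# Lower-case a sample string using character classes instead of ranges.
echo "Alice In WONDERLAND" | tr '[:upper:]' '[:lower:]'
```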
|
Now it would also be nice to have the list ordered by decreasing
frequency of the words. We sort the output of the ``uniq`` command
again, with the ``-n`` (numeric) and ``-r`` (reverse) options, to get
the desired output.

::

    $ grep "[A-Za-z]*" -o alice-in-wonderland.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
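Finally, to look at just the most frequent words, you could append
``head`` to the pipeline. Since the full text of Alice in Wonderland is
not reproduced here, the sketch below runs the same pipeline on a couple
of made-up lines instead:

```shell
# A tiny stand-in for alice-in-wonderland.txt.
printf 'The rabbit ran.\nThe Queen saw the rabbit.\n' > sample.txt

# Extract words, lower-case them, count, sort by count, show the top 3.
grep -o "[A-Za-z]*" sample.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | head -n 3
```

In this tiny sample, "the" comes out on top with a count of 3.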
|