author | amit@thunder |
Mon, 12 Jul 2010 15:39:29 +0530 | |
changeset 100 | 344a1d6f1e64 |
parent 99 | 799f1c2a0689 |
permissions | -rw-r--r-- |
57
042767d3dd0d
Added Session 4 of ULT; Scite tut to be done.
Puneeth Chaganti <puneeth@fossee.in>
More text processing
====================

``sort``
--------
Let's say we have a file, ``stalwarts.txt``, which lists a few of the stalwarts of the open source community along with a few details about them: their "other" name and what they are well known for, that is, their claim to fame. The fields on each line are separated by ``%``.

::

  Richard Stallman%rms%GNU Project
  Eric Raymond%ESR%Jargon File
  Ian Murdock% %Debian
  Lawrence Lessig% %Creative Commons
  Linus Torvalds% %Linux Kernel
  Guido van Rossum%BDFL%Python
  Larry Wall% %Perl

Suppose we want these lines in alphabetical order. The ``sort`` command does this in a flash! Running ``sort`` with the file name as its argument sorts the lines of the file alphabetically and prints the result on the terminal.
::

  $ sort stalwarts.txt
  Eric Raymond%ESR%Jargon File
  Guido van Rossum%BDFL%Python
  Ian Murdock% %Debian
  Larry Wall% %Perl
  Lawrence Lessig% %Creative Commons
  Linus Torvalds% %Linux Kernel
  Richard Stallman%rms%GNU Project

To sort the lines in reverse alphabetical order, just pass the ``-r`` option. Now, you might want to sort the lines based on each person's claim to fame or their "other" name. What do we do in that case?

Below is an example that sorts the file based on the "other" names.
::

  $ sort -t % -k 2,2 stalwarts.txt

  Ian Murdock% %Debian
  Larry Wall% %Perl
  Lawrence Lessig% %Creative Commons
  Linus Torvalds% %Linux Kernel
  Guido van Rossum%BDFL%Python
  Eric Raymond%ESR%Jargon File
  Richard Stallman%rms%GNU Project

By default, ``sort`` treats white space as the delimiter between the columns of each line. The ``-t`` option specifies the delimiting character, which is ``%`` in this case.

The ``-k`` option starts a key at position 2 and ends it at position 2, essentially telling ``sort`` to sort on the 2nd column, which is the "other" name. ``sort`` also supports conflict resolution using multiple columns. You can see that the first four lines have nothing in the "other" names column. We can resolve the conflict by also sorting on the project names (the 3rd column).

::

  $ sort -t % -k 2,2 -k 3,3 stalwarts.txt

  Lawrence Lessig% %Creative Commons
  Ian Murdock% %Debian
  Linus Torvalds% %Linux Kernel
  Larry Wall% %Perl
  Guido van Rossum%BDFL%Python
  Eric Raymond%ESR%Jargon File
  Richard Stallman%rms%GNU Project

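To experiment with multi-key sorting without the course files, the following sketch recreates the sample data in a scratch directory and repeats the two-key sort. The scratch directory from ``mktemp -d``, the variable names, and the ``LC_ALL=C`` setting (which forces plain byte-order comparison so that the result does not depend on your locale) are assumptions of this sketch, not part of the session material.

```shell
# Recreate the sample data in a scratch directory (location chosen by mktemp).
workdir=$(mktemp -d)
cat > "$workdir/stalwarts.txt" <<'EOF'
Richard Stallman%rms%GNU Project
Eric Raymond%ESR%Jargon File
Ian Murdock% %Debian
Lawrence Lessig% %Creative Commons
Linus Torvalds% %Linux Kernel
Guido van Rossum%BDFL%Python
Larry Wall% %Perl
EOF

# Primary key: field 2 (the "other" name).
# Tie-breaker: field 3 (the project), used only when field 2 compares equal.
by_other_then_project=$(LC_ALL=C sort -t '%' -k 2,2 -k 3,3 "$workdir/stalwarts.txt")
printf '%s\n' "$by_other_then_project"

# Remove the scratch directory again.
rm -rf "$workdir"
```

Quoting the delimiter as ``'%'`` is optional here, but it is a safe habit for characters that the shell might otherwise interpret.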

``sort`` has many other options, such as ignoring case differences, sorting by month (JAN < FEB < ...), and merging already sorted files. ``man sort`` describes them all.

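As a rough illustration of the options mentioned above, the sketch below runs ``-r``, ``-f``, ``-M`` and ``-m`` on a few made-up words. The sample words, the scratch files ``a.txt`` and ``b.txt``, and the use of ``paste`` to print each result on one line are assumptions of this sketch; ``LC_ALL=C`` is set only to make the results independent of the locale.

```shell
workdir=$(mktemp -d)

# -r: reverse the order of the result.
reversed=$(printf '%s\n' apple cherry banana | LC_ALL=C sort -r | paste -s -d ' ' -)

# -f: ignore case. Without -f, byte order puts "Banana" before "apple".
plain=$(printf '%s\n' cherry apple Banana | LC_ALL=C sort | paste -s -d ' ' -)
folded=$(printf '%s\n' cherry apple Banana | LC_ALL=C sort -f | paste -s -d ' ' -)

# -M: compare month abbreviations in calendar order.
months=$(printf '%s\n' MAR JAN FEB | LC_ALL=C sort -M | paste -s -d ' ' -)

# -m: merge files that are already sorted, without sorting them again.
printf '%s\n' alpha delta > "$workdir/a.txt"
printf '%s\n' beta gamma > "$workdir/b.txt"
merged=$(LC_ALL=C sort -m "$workdir/a.txt" "$workdir/b.txt" | paste -s -d ' ' -)

printf '%s\n' "$reversed" "$plain" "$folded" "$months" "$merged"
rm -rf "$workdir"
```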

``uniq``
--------

Suppose we have a list of items, say books, in a file ``items.txt``, and we wish to obtain a list that names each book only once, without any duplicates. We use the ``uniq`` command to achieve this.

::

  Programming Pearls
  The C Programming Language
  The Mythical Man Month: Essays on Software Engineering
  Programming Pearls
  The C Programming Language
  Structure and Interpretation of Computer Programs
  Programming Pearls
  Compilers: Principles, Techniques, and Tools
  The C Programming Language
  The Art of UNIX Programming
  Programming Pearls
  The Art of Computer Programming
  Introduction to Algorithms
  The Art of UNIX Programming
  The Pragmatic Programmer: From Journeyman to Master
  Programming Pearls
  Unix Power Tools
  The Art of UNIX Programming

Let us try to get rid of the duplicate lines in this file using the ``uniq`` command.

::

  $ uniq items.txt
  Programming Pearls
  The C Programming Language
  The Mythical Man Month: Essays on Software Engineering
  Programming Pearls
  The C Programming Language
  Structure and Interpretation of Computer Programs
  Programming Pearls
  Compilers: Principles, Techniques, and Tools
  The C Programming Language
  The Art of UNIX Programming
  Programming Pearls
  The Art of Computer Programming
  Introduction to Algorithms
  The Art of UNIX Programming
  The Pragmatic Programmer: From Journeyman to Master
  Programming Pearls
  Unix Power Tools
  The Art of UNIX Programming

Nothing happens! Why? The ``uniq`` command removes duplicate lines only when they are next to each other. So, we create a sorted copy of the original file and work with that copy from here on.

::

  $ sort items.txt > items-sorted.txt
  $ uniq items-sorted.txt
  Compilers: Principles, Techniques, and Tools
  Introduction to Algorithms
  Programming Pearls
  Structure and Interpretation of Computer Programs
  The Art of Computer Programming
  The Art of UNIX Programming
  The C Programming Language
  The Mythical Man Month: Essays on Software Engineering
  The Pragmatic Programmer: From Journeyman to Master
  Unix Power Tools

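If the sorted copy is not needed afterwards, the intermediate file can be avoided. The sketch below, which assumes familiarity with pipes and uses a shortened, made-up version of the book list in a scratch directory, shows two equivalent ways: piping ``sort`` into ``uniq``, and using ``sort -u``, which sorts and drops duplicate lines in one step.

```shell
workdir=$(mktemp -d)
cat > "$workdir/items.txt" <<'EOF'
Programming Pearls
The C Programming Language
Programming Pearls
Unix Power Tools
The C Programming Language
EOF

# The pipe feeds the sorted lines straight into uniq,
# so duplicates are adjacent by the time uniq sees them.
via_pipe=$(LC_ALL=C sort "$workdir/items.txt" | uniq)

# sort -u sorts and removes duplicate lines in a single command.
via_sort_u=$(LC_ALL=C sort -u "$workdir/items.txt")

printf '%s\n' "$via_pipe"
rm -rf "$workdir"
```

The separate ``uniq`` step is still needed when its own options, such as ``-c``, ``-d`` or ``-u``, are wanted.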

``uniq -u`` prints only the lines that occur exactly once in the file. ``uniq -d`` prints one copy of each line that has duplicates. The ``-c`` option prefixes each line with the number of times it occurs in the file.
::

  $ uniq -u items-sorted.txt
  Compilers: Principles, Techniques, and Tools
  Introduction to Algorithms
  Structure and Interpretation of Computer Programs
  The Art of Computer Programming
  The Mythical Man Month: Essays on Software Engineering
  The Pragmatic Programmer: From Journeyman to Master
  Unix Power Tools

  $ uniq -dc items-sorted.txt
  5 Programming Pearls
  3 The Art of UNIX Programming
  3 The C Programming Language

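The counts from ``uniq -c`` can themselves be sorted, which gives a simple ranking of the most frequent lines. The following sketch uses a shortened, made-up book list; the ``-n`` (numeric) option of ``sort`` and the ``sed`` call that strips the padding in front of the count are additions of this sketch and are not covered in the text above.

```shell
workdir=$(mktemp -d)
cat > "$workdir/items.txt" <<'EOF'
Programming Pearls
The C Programming Language
Programming Pearls
Unix Power Tools
The C Programming Language
Programming Pearls
EOF

# uniq -c prefixes every distinct line with its count; the second sort
# orders those counts numerically (-n), largest first (-r).
ranked=$(LC_ALL=C sort "$workdir/items.txt" | uniq -c | LC_ALL=C sort -n -r)
printf '%s\n' "$ranked"

# The first line now holds the most frequent title; drop the leading padding.
top=$(printf '%s\n' "$ranked" | head -n 1 | sed 's/^ *//')
printf '%s\n' "$top"

rm -rf "$workdir"
```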

``join``
--------

Now suppose we have a file ``stalwarts1.txt``, which lists the home pages of all the people listed in ``stalwarts.txt``.
::

  Richard Stallman%http://www.stallman.org
  Eric Raymond%http://www.catb.org/~esr/
  Ian Murdock%http://ianmurdock.com/
  Lawrence Lessig%http://lessig.org
  Linus Torvalds%http://torvalds-family.blogspot.com/
  Guido van Rossum%http://www.python.org/~guido/
  Larry Wall%http://www.wall.org/~larry/

It would be nice to have a single file with the information from both files. To achieve this, we use the ``join`` command.
::

  $ join stalwarts.txt stalwarts1.txt -t %
  Richard Stallman%rms%GNU Project%http://www.stallman.org
  Eric Raymond%ESR%Jargon File%http://www.catb.org/~esr/
  Ian Murdock% %Debian%http://ianmurdock.com/
  Lawrence Lessig% %Creative Commons%http://lessig.org
  Linus Torvalds% %Linux Kernel%http://torvalds-family.blogspot.com/
  Guido van Rossum%BDFL%Python%http://www.python.org/~guido/
  Larry Wall% %Perl%http://www.wall.org/~larry/

042767d3dd0d
Added Session 4 of ULT; Scite tut to be done.
Puneeth Chaganti <puneeth@fossee.in>
parents:
diff
changeset
|
176 |
The ``join`` command joins the two files, based on the common field present in both the files, which is the name, in this case. |
042767d3dd0d
Added Session 4 of ULT; Scite tut to be done.
Puneeth Chaganti <puneeth@fossee.in>
parents:
diff
changeset
|
177 |
|
042767d3dd0d
Added Session 4 of ULT; Scite tut to be done.
Puneeth Chaganti <puneeth@fossee.in>
parents:
diff
changeset
|
178 |
The ``-t`` option again specifies the delimiting character. Unless that is specified, join assumes that the fields are separated by spaces. |
042767d3dd0d
Added Session 4 of ULT; Scite tut to be done.
Puneeth Chaganti <puneeth@fossee.in>
parents:
diff
changeset
|
179 |
|
042767d3dd0d
Added Session 4 of ULT; Scite tut to be done.
Puneeth Chaganti <puneeth@fossee.in>
parents:
diff
changeset
|
180 |
Note that, for ``join`` to work, the common field should be in the same order in both the files. If this is not so, you could use ``sort``, to sort the files on the common field and then join the files. In the above example, we have the common field to be the first column in both the files. If this is not the case we could use the ``-1`` and ``-2`` options to specify the field to be used for joining the files. |
::

  $ join -2 2 stalwarts.txt stalwarts2.txt -t %
  Richard Stallman%rms%GNU Project%http://www.stallman.org
  Eric Raymond%ESR%Jargon File%http://www.catb.org/~esr/
  Ian Murdock% %Debian%http://ianmurdock.com/
  Lawrence Lessig% %Creative Commons%http://lessig.org
  Linus Torvalds% %Linux Kernel%http://torvalds-family.blogspot.com/
  Guido van Rossum%BDFL%Python%http://www.python.org/~guido/
  Larry Wall% %Perl%http://www.wall.org/~larry/


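When the files are not already in a matching order, sorting each one on its own join field first is one way to prepare them. The following is only a minimal sketch: the file names ``names.txt`` and ``urls.txt`` and their two-line contents are made up for illustration and are not the tutorial's ``stalwarts`` files.

```shell
# Use one fixed collation so that sort and join agree on the ordering.
export LC_ALL=C

# Hypothetical sample data, written to a temporary directory.
workdir=$(mktemp -d)
printf 'Larry Wall%%Perl\nGuido van Rossum%%Python\n' > "$workdir/names.txt"
printf 'http://www.python.org/~guido/%%Guido van Rossum\nhttp://www.wall.org/~larry/%%Larry Wall\n' > "$workdir/urls.txt"

# Sort each file on its own join field:
# field 1 of names.txt and field 2 of urls.txt.
sort -t % -k 1,1 "$workdir/names.txt" > "$workdir/names.sorted"
sort -t % -k 2,2 "$workdir/urls.txt" > "$workdir/urls.sorted"

# Join on field 1 of the first file and field 2 of the second file.
joined=$(join -t % -1 1 -2 2 "$workdir/names.sorted" "$workdir/urls.sorted")
printf '%s\n' "$joined"

rm -r "$workdir"
```

The join field is printed first, followed by the remaining fields of the first file and then those of the second file.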
Generating a word frequency list
================================

Now let us use the tools we have learnt so far to generate a word frequency list of a text file. We shall use the free text of Alice in Wonderland.

The basic steps to achieve this task are:

1. Eliminate the punctuation and spaces from the document.
2. Generate a list of words.
3. Count the words.

We first use ``grep`` with an elementary regular expression to eliminate the non-alphabetic characters.
::

  $ grep "[A-Za-z]*" alice-in-wonderland.txt

Because ``*`` also matches zero characters, this pattern matches every line, so the command simply prints the whole file. This isn't of much use, since we haven't done anything with the text. We only require the alphabetic characters, without any of the other junk. ``man grep`` shows us the ``-o`` option for outputting only the text which matches the regular expression.
::

  $ grep "[A-Za-z]*" -o alice-in-wonderland.txt

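The difference between matching whole lines and extracting only the matched text can be checked on a small input. This is a sketch on a made-up four-line sample, not on the Alice text. It uses the variant pattern ``[A-Za-z][A-Za-z]*``, which requires at least one letter; that choice is an assumption made here for portability and not the tutorial's pattern.

```shell
# Made-up sample: a sentence, an empty line, a digits-only line, a sentence.
sample=$(printf "Oh dear, it's late.\n\n12345\nIt's LATE.\n")

# "[A-Za-z]*" can match zero characters, so every line counts as a match.
all_lines=$(printf '%s\n' "$sample" | grep -c "[A-Za-z]*")

# "[A-Za-z]" needs one letter, so only lines containing a letter match.
letter_lines=$(printf '%s\n' "$sample" | grep -c "[A-Za-z]")

# -o prints each matched run of letters on a line of its own.
words=$(printf '%s\n' "$sample" | grep -o "[A-Za-z][A-Za-z]*")

printf 'lines matched by *: %s\n' "$all_lines"
printf 'lines with a letter: %s\n' "$letter_lines"
printf '%s\n' "$words"
```

Note that an apostrophe is not a letter, so ``it's`` comes out as the two words ``it`` and ``s``.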
Not very surprisingly, we have all the words, spit out in the form of a list! Now that we have a list of words, it is quite simple to count the occurrences of the words. You would've realized that we can make use of the ``sort`` and ``uniq`` commands. We pipe the output of ``grep`` to ``sort`` and then pipe its output to ``uniq``, using the ``-c`` option to count the occurrences.
::

  $ grep "[A-Za-z]*" -o alice-in-wonderland.txt | sort | uniq -c

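The ``sort`` step matters because ``uniq`` only collapses identical lines that are adjacent. A small sketch on a made-up three-word list shows the difference:

```shell
# Made-up word list in which the two "the" lines are not adjacent.
words=$(printf 'the\ncat\nthe\n')

# Without sorting, uniq sees three separate runs and counts each as 1.
unsorted_counts=$(printf '%s\n' "$words" | uniq -c)

# After sorting, identical words sit next to each other and are counted together.
sorted_counts=$(printf '%s\n' "$words" | sort | uniq -c)

printf 'unsorted:\n%s\nsorted:\n%s\n' "$unsorted_counts" "$sorted_counts"
```

The exact width of the padding before each count differs between implementations of ``uniq``.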
Notice that you get a list of all the words in the document in alphabetical order, each with its frequency written next to it. But you might have observed that capitalized words and lower case words are being counted as different words. We therefore replace all the upper case characters with lower case ones, using the ``tr`` command.
::

  $ grep "[A-Za-z]*" -o alice-in-wonderland.txt | tr 'A-Z' 'a-z' | sort | uniq -c

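``tr`` also accepts the character classes ``[:upper:]`` and ``[:lower:]``, which could be used in place of the explicit ranges. The sketch below applies them to a made-up three-line input to show that the differently capitalized spellings end up counted as one word:

```shell
# Made-up input: the same word in three different capitalizations.
mixed=$(printf 'Alice\nALICE\nalice\n')

# Lower-case everything first, then sort and count as before.
counts=$(printf '%s\n' "$mixed" | tr '[:upper:]' '[:lower:]' | sort | uniq -c)

printf '%s\n' "$counts"
```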
Now, it would also be nice to have the list in decreasing order of frequency. We sort the output of the ``uniq`` command with the ``-n`` (numeric) and ``-r`` (reverse) options to get the desired output.
::

  $ grep "[A-Za-z]*" -o alice-in-wonderland.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr

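For reuse, the whole pipeline could be wrapped in a small shell function and combined with ``head`` to keep only the most frequent words. This is a sketch under stated assumptions: ``word_freq`` is a hypothetical helper name, the sample file is made up, and the pattern is the variant that requires at least one letter.

```shell
# word_freq is a hypothetical helper name, not a standard command.
# It prints "count word" lines for the file named in $1, most frequent first.
word_freq() {
    grep -o "[A-Za-z][A-Za-z]*" "$1" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
}

# Made-up sample text, written to a temporary file.
sample_file=$(mktemp)
printf 'The cat saw the dog.\nThe dog ran.\n' > "$sample_file"

# Keep only the two most frequent words.
top=$(word_freq "$sample_file" | head -n 2)
printf '%s\n' "$top"

rm "$sample_file"
```

On this sample, ``the`` occurs three times and ``dog`` twice, so those two lines are the ones kept.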