More text processing
====================

``sort``
--------

Let's say we have a file, ``stalwarts.txt``, which lists a few of the stalwarts of the open source community and a few details about them: their "other" name and what they are well known for, or their claim to fame. The fields on each line are separated by a ``%`` character.

::

  Richard Stallman%rms%GNU Project
  Eric Raymond%ESR%Jargon File
  Ian Murdock% %Debian
  Lawrence Lessig% %Creative Commons
  Linus Torvalds% %Linux Kernel
  Guido van Rossum%BDFL%Python
  Larry Wall% %Perl
       
Suppose we want these lines in alphabetical order. The ``sort`` command enables us to do this in a flash! Running ``sort`` with the file name as a parameter sorts the lines of the file alphabetically and prints the output on the terminal.
::

  $ sort stalwarts.txt
  Eric Raymond%ESR%Jargon File
  Guido van Rossum%BDFL%Python
  Ian Murdock% %Debian
  Larry Wall% %Perl
  Lawrence Lessig% %Creative Commons
  Linus Torvalds% %Linux Kernel
  Richard Stallman%rms%GNU Project
       
If you wish to sort them in reverse alphabetical order, you just need to pass the ``-r`` option. Now, you might want to sort the lines based on each person's claim to fame or their "other" name. What do we do in that case?

Below is an example that sorts the file based on the "other" names.
::

  $ sort -t % -k 2,2 stalwarts.txt
  Ian Murdock% %Debian
  Larry Wall% %Perl
  Lawrence Lessig% %Creative Commons
  Linus Torvalds% %Linux Kernel
  Guido van Rossum%BDFL%Python
  Eric Raymond%ESR%Jargon File
  Richard Stallman%rms%GNU Project
       
The ``sort`` command treats white space as the default delimiter between the columns of each line. The ``-t`` option specifies the delimiting character, which is ``%`` in this case.

The ``-k 2,2`` option starts the sort key at field 2 and ends it at field 2, essentially telling ``sort`` to order the lines by the 2nd column, which is the "other" name. ``sort`` also supports breaking ties by using multiple columns as keys. You can see that the first four lines have nothing but a space in the "other" names column. We can break the tie by also sorting on the project names (the 3rd column).

::

  $ sort -t % -k 2,2 -k 3,3 stalwarts.txt
  Lawrence Lessig% %Creative Commons
  Ian Murdock% %Debian
  Linus Torvalds% %Linux Kernel
  Larry Wall% %Perl
  Guido van Rossum%BDFL%Python
  Eric Raymond%ESR%Jargon File
  Richard Stallman%rms%GNU Project
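
The options seen so far can be combined. The following is a minimal sketch, not part of the original session: it recreates a three-line excerpt of ``stalwarts.txt`` in a scratch directory and sorts it by project name in reverse order. The ``workdir`` and ``by_project_desc`` variables are illustrative names, and ``LC_ALL=C`` is an assumption added here so that the ordering does not depend on the local language settings.

```shell
# Recreate a three-line excerpt of stalwarts.txt in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'Richard Stallman%rms%GNU Project' \
  'Guido van Rossum%BDFL%Python' \
  'Ian Murdock% %Debian' > "$workdir/stalwarts.txt"

# -t % sets the delimiter, -k 3,3 restricts the key to the 3rd field,
# and -r reverses the order. LC_ALL=C makes the comparison byte-wise.
by_project_desc=$(LC_ALL=C sort -t % -k 3,3 -r "$workdir/stalwarts.txt")
printf '%s\n' "$by_project_desc"
```

The person working on Python comes first and the one working on Debian last, since the key is the project name rather than the person's name.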
       
``sort`` has many other options, such as ignoring case differences (``-f``), sorting by month name (``-M``, JAN < FEB < ...) and merging already sorted files (``-m``). ``man sort`` describes them all.
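
As a rough illustration of these three options, the sketch below uses ``-f`` (ignore case), ``-M`` (month sort) and ``-m`` (merge). The input words and the file names ``odd.txt`` and ``even.txt`` are made up for this example, and ``-M`` is an extension found in GNU and BSD ``sort`` rather than something every system is guaranteed to provide.

```shell
# Case-insensitive sort: -f folds lower case to upper case before comparing,
# so "Cherry" no longer sorts ahead of the lower-case words.
folded=$(printf '%s\n' banana Cherry apple | LC_ALL=C sort -f)

# Month sort: -M orders by three-letter month abbreviations.
months=$(printf '%s\n' MAR JAN FEB | LC_ALL=C sort -M)

# Merge: -m interleaves inputs that are already sorted, without re-sorting them.
workdir=$(mktemp -d)
printf '%s\n' a c e > "$workdir/odd.txt"
printf '%s\n' b d > "$workdir/even.txt"
merged=$(LC_ALL=C sort -m "$workdir/odd.txt" "$workdir/even.txt")

printf '%s\n' "$folded" "$months" "$merged"
```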
       
``uniq``
--------

Suppose we have a list of items, say books, in a file ``items.txt``, and we wish to obtain a list that names each book only once, without any duplicates. We use the ``uniq`` command to achieve this.

::

  Programming Pearls
  The C Programming Language
  The Mythical Man Month: Essays on Software Engineering
  Programming Pearls
  The C Programming Language
  Structure and Interpretation of Computer Programs
  Programming Pearls
  Compilers: Principles, Techniques, and Tools
  The C Programming Language
  The Art of UNIX Programming
  Programming Pearls
  The Art of Computer Programming
  Introduction to Algorithms
  The Art of UNIX Programming
  The Pragmatic Programmer: From Journeyman to Master
  Programming Pearls
  Unix Power Tools
  The Art of UNIX Programming
       
Let us try and get rid of the duplicate lines from this file using the ``uniq`` command.

::

  $ uniq items.txt
  Programming Pearls
  The C Programming Language
  The Mythical Man Month: Essays on Software Engineering
  Programming Pearls
  The C Programming Language
  Structure and Interpretation of Computer Programs
  Programming Pearls
  Compilers: Principles, Techniques, and Tools
  The C Programming Language
  The Art of UNIX Programming
  Programming Pearls
  The Art of Computer Programming
  Introduction to Algorithms
  The Art of UNIX Programming
  The Pragmatic Programmer: From Journeyman to Master
  Programming Pearls
  Unix Power Tools
  The Art of UNIX Programming
       
Nothing happens! Why? The ``uniq`` command removes duplicate lines only when they are next to each other, and in this file no two identical lines are adjacent. So we first sort the original file and work with the sorted file from here on.

::

  $ sort items.txt > items-sorted.txt
  $ uniq items-sorted.txt
  Compilers: Principles, Techniques, and Tools
  Introduction to Algorithms
  Programming Pearls
  Structure and Interpretation of Computer Programs
  The Art of Computer Programming
  The Art of UNIX Programming
  The C Programming Language
  The Mythical Man Month: Essays on Software Engineering
  The Pragmatic Programmer: From Journeyman to Master
  Unix Power Tools
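
The intermediate file is not strictly needed. As a small sketch (the three-line ``items.txt`` below is a shortened stand-in for the real list, and the variable names are illustrative), the output of ``sort`` can be piped straight into ``uniq``, and ``sort -u`` gives the same result in a single step.

```shell
# A shortened stand-in for items.txt, written to a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
  'Programming Pearls' \
  'Unix Power Tools' \
  'Programming Pearls' > "$workdir/items.txt"

# Pipe the sorted lines directly into uniq instead of saving them first.
piped=$(LC_ALL=C sort "$workdir/items.txt" | uniq)

# sort -u sorts and drops repeated lines in one command.
short=$(LC_ALL=C sort -u "$workdir/items.txt")

printf '%s\n' "$piped"
```

The sorted file is still handy when several ``uniq`` variations are run on the same data, as in the rest of this section.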
       
``uniq -u`` prints only the lines that occur exactly once in the file. ``uniq -d`` prints one copy of each line that is repeated. The ``-c`` option prefixes each line with the number of times it occurs.
::

  $ uniq -u items-sorted.txt
  Compilers: Principles, Techniques, and Tools
  Introduction to Algorithms
  Structure and Interpretation of Computer Programs
  The Art of Computer Programming
  The Mythical Man Month: Essays on Software Engineering
  The Pragmatic Programmer: From Journeyman to Master
  Unix Power Tools

  $ uniq -dc items-sorted.txt
  5 Programming Pearls
  3 The Art of UNIX Programming
  3 The C Programming Language
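
To see the three options side by side, here is a small sketch on a made-up list of one-word items. The words and variable names are illustrative only. The ``awk`` step is an addition of this sketch: it just trims the padding that ``uniq -c`` puts in front of each count.

```shell
# Five made-up items; "pearls" occurs three times, the others once.
sorted=$(printf '%s\n' pearls tools pearls algorithms pearls | LC_ALL=C sort)

# -u keeps lines that occur exactly once.
only_once=$(printf '%s\n' "$sorted" | uniq -u)

# -d keeps one copy of each line that is repeated.
repeated=$(printf '%s\n' "$sorted" | uniq -d)

# -c prefixes every distinct line with its count; awk normalizes the spacing.
counts=$(printf '%s\n' "$sorted" | uniq -c | awk '{print $1, $2}')

printf '%s\n' "$counts"
```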
       
``join``
--------

Now suppose we had the file ``stalwarts1.txt``, which lists the home pages of all the people listed in ``stalwarts.txt``.
::

  Richard Stallman%http://www.stallman.org
  Eric Raymond%http://www.catb.org/~esr/
  Ian Murdock%http://ianmurdock.com/
  Lawrence Lessig%http://lessig.org
  Linus Torvalds%http://torvalds-family.blogspot.com/
  Guido van Rossum%http://www.python.org/~guido/
  Larry Wall%http://www.wall.org/~larry/
       
It would be nice to have a single file with the information from both files. To achieve this we use the ``join`` command.
::

  $ join -t % stalwarts.txt stalwarts1.txt
  Richard Stallman%rms%GNU Project%http://www.stallman.org
  Eric Raymond%ESR%Jargon File%http://www.catb.org/~esr/
  Ian Murdock% %Debian%http://ianmurdock.com/
  Lawrence Lessig% %Creative Commons%http://lessig.org
  Linus Torvalds% %Linux Kernel%http://torvalds-family.blogspot.com/
  Guido van Rossum%BDFL%Python%http://www.python.org/~guido/
  Larry Wall% %Perl%http://www.wall.org/~larry/
       
The ``join`` command joins the two files based on a field common to both, which in this case is the name.

The ``-t`` option again specifies the delimiting character. Unless it is specified, ``join`` assumes that the fields are separated by blanks.

Note that, for ``join`` to work, the common field must appear in the same order in both files; strictly speaking, ``join`` expects both files to be sorted on that field. If this is not so, you can use ``sort`` to sort the files on the common field and then join them. In the above example, the common field is the first column in both files. If this is not the case, we can use the ``-1`` and ``-2`` options to specify which field of the first and of the second file, respectively, is used for joining. For instance, if a file ``stalwarts2.txt`` held the names in its second column, we would write:
::

  $ join -t % -2 2 stalwarts.txt stalwarts2.txt
  Richard Stallman%rms%GNU Project%http://www.stallman.org
  Eric Raymond%ESR%Jargon File%http://www.catb.org/~esr/
  Ian Murdock% %Debian%http://ianmurdock.com/
  Lawrence Lessig% %Creative Commons%http://lessig.org
  Linus Torvalds% %Linux Kernel%http://torvalds-family.blogspot.com/
  Guido van Rossum%BDFL%Python%http://www.python.org/~guido/
  Larry Wall% %Perl%http://www.wall.org/~larry/
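
The sort-then-join workflow described above could look roughly like the sketch below. The file names ``people.txt`` and ``pages.txt`` and their two-line contents are hypothetical stand-ins; ``pages.txt`` is assumed to hold the URL first and the name in its second column, so that ``-2 2`` is needed.

```shell
workdir=$(mktemp -d)

# Hypothetical inputs: names in column 1 of people.txt, column 2 of pages.txt.
printf '%s\n' \
  'Larry Wall% %Perl' \
  'Eric Raymond%ESR%Jargon File' > "$workdir/people.txt"
printf '%s\n' \
  'http://www.wall.org/~larry/%Larry Wall' \
  'http://www.catb.org/~esr/%Eric Raymond' > "$workdir/pages.txt"

# Sort each file on its own join field, using the same locale as join.
LC_ALL=C sort -t % -k 1,1 "$workdir/people.txt" > "$workdir/people-sorted.txt"
LC_ALL=C sort -t % -k 2,2 "$workdir/pages.txt" > "$workdir/pages-sorted.txt"

# Join field 1 of the first file with field 2 of the second file.
joined=$(LC_ALL=C join -t % -1 1 -2 2 \
  "$workdir/people-sorted.txt" "$workdir/pages-sorted.txt")
printf '%s\n' "$joined"
```

Each output line starts with the join field, followed by the remaining fields of the first file and then those of the second.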
       
Generating a word frequency list
================================

Now, let us use the tools we have learnt to generate a word frequency list of a text file. We shall use the freely available text of Alice in Wonderland.

The basic steps to achieve this task would be:

1. Eliminate the punctuation and spaces from the document.
2. Generate a list of words.
3. Count the words.

We first use ``grep`` and an elementary regular expression to pick out the alphabetic characters.
::

  $ grep "[A-Za-z]*" alice-in-wonderland.txt
       
This outputs every line of the file: ``*`` also matches the empty string, so the pattern matches each line, and ``grep`` prints the whole of every matching line. This isn't of much use, since we haven't done anything with the text. We only require the alphabetic characters, without any of the other junk. ``man grep`` shows us the ``-o`` option, which outputs only the text that matches the regular expression.
::

  $ grep -o "[A-Za-z]*" alice-in-wonderland.txt
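
To try this without the full book, here is a minimal sketch on a made-up sentence. The sample text and variable names are illustrative. The pattern is written as ``[A-Za-z][A-Za-z]*``, a variation chosen for this sketch so that it can never match the empty string.

```shell
# A made-up sentence with punctuation mixed in.
sample='Alice was tired, very tired!'

# -o prints each match on its own line; the pattern needs at least one letter.
words=$(printf '%s\n' "$sample" | LC_ALL=C grep -o '[A-Za-z][A-Za-z]*')
printf '%s\n' "$words"
```

The comma, the exclamation mark and the spaces are gone, and each word sits on a line of its own.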
       
Not very surprisingly, we have all the words, spit out in the form of a list! Now that we have a list of words, it is quite simple to count the occurrences of each word. You would have realized that we can make use of the ``sort`` and ``uniq`` commands. We pipe the output of ``grep`` to ``sort`` and then pipe its output to ``uniq -c``.
::

  $ grep -o "[A-Za-z]*" alice-in-wonderland.txt | sort | uniq -c
       
Notice that you get the list of all the words in the document in alphabetical order, each with its frequency written next to it. But you might have observed that capitalized words and lower case words are counted as different words. We therefore replace all the upper case characters with lower case ones, using the ``tr`` command.
::

  $ grep -o "[A-Za-z]*" alice-in-wonderland.txt | tr 'A-Z' 'a-z' | sort | uniq -c
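
The effect of the ``tr`` step is easy to check on a tiny input. In this sketch the sentence is made up, and the final ``awk`` call is an addition that only trims the padding in front of the counts.

```shell
# "The"/"the" and "cat"/"Cat" should each be counted as one word.
counts=$(printf '%s\n' 'The cat saw the Cat' |
  LC_ALL=C grep -o '[A-Za-z][A-Za-z]*' |
  tr 'A-Z' 'a-z' |
  LC_ALL=C sort |
  uniq -c |
  awk '{print $1, $2}')
printf '%s\n' "$counts"
```

Without the ``tr`` step, ``The`` and ``the`` would show up as two separate entries with a count of 1 each.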
       
Now, it would also be nice to have the list in decreasing order of frequency. We sort the output of the ``uniq`` command with the ``-n`` (numeric) and ``-r`` (reverse) options to get the desired output.
::

  $ grep -o "[A-Za-z]*" alice-in-wonderland.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
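
The whole pipeline can be wrapped in a small shell function. This is only a sketch: ``top_words`` is a hypothetical helper name, the one-line ``sample.txt`` stands in for the book, and ``head`` is added to keep just the most frequent entries.

```shell
# top_words FILE N: print the N most frequent words in FILE (hypothetical helper).
top_words() {
  LC_ALL=C grep -o '[A-Za-z][A-Za-z]*' "$1" |
    tr 'A-Z' 'a-z' |
    LC_ALL=C sort |
    uniq -c |
    LC_ALL=C sort -nr |
    head -n "$2"
}

# A one-line stand-in for the book, written to a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'The Queen said the King saw the Queen.' > "$workdir/sample.txt"

# awk only trims the padding that uniq -c puts in front of the counts.
top=$(top_words "$workdir/sample.txt" 2 | awk '{print $1, $2}')
printf '%s\n' "$top"
```

Calling ``top_words alice-in-wonderland.txt 10`` would then list the ten most frequent words of the book, assuming that file is in the current directory.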
       