ult/session4.rst
author amit@thunder
Mon, 12 Jul 2010 15:37:06 +0530
changeset 99 799f1c2a0689
parent 64 fb96a1e1c38c
permissions -rw-r--r--
Changes to More on text Processing
More text processing
====================

``sort``
--------
Let's say we have a file, ``stalwarts.txt``, which lists a few stalwarts of the open source community along with a few details about them: their "other" name and what they are best known for, their claim to fame. The fields on each line are separated by a ``%`` character.

::

  Richard Stallman%rms%GNU Project
  Eric Raymond%ESR%Jargon File
  Ian Murdock% %Debian
  Lawrence Lessig% %Creative Commons
  Linus Torvalds% %Linux Kernel
  Guido van Rossum%BDFL%Python
  Larry Wall% %Perl

Suppose we want these lines in alphabetical order. The ``sort`` command enables us to do this in a flash! Just running ``sort`` with the file name as a parameter sorts the lines of the file alphabetically and prints the output on the terminal.
::

  $ sort stalwarts.txt
  Eric Raymond%ESR%Jargon File
  Guido van Rossum%BDFL%Python
  Ian Murdock% %Debian
  Larry Wall% %Perl
  Lawrence Lessig% %Creative Commons
  Linus Torvalds% %Linux Kernel
  Richard Stallman%rms%GNU Project

If you wish to sort them in reverse alphabetical order, you just need to pass the ``-r`` option. Now, you might want to sort the lines based on each person's claim to fame or their "other" name. What do we do in that case?

Below is an example that sorts the file based on the "other" names.
::

  $ sort -t % -k 2,2 stalwarts.txt
  Ian Murdock% %Debian
  Larry Wall% %Perl
  Lawrence Lessig% %Creative Commons
  Linus Torvalds% %Linux Kernel
  Guido van Rossum%BDFL%Python
  Eric Raymond%ESR%Jargon File
  Richard Stallman%rms%GNU Project

By default, ``sort`` treats white space as the delimiter between columns in each line. The ``-t`` option specifies a different delimiting character, which is ``%`` in this case.
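
To see why ``-t`` matters here, the sketch below sorts on the second field once without it and once with it. It assumes ``stalwarts.txt`` holds the lines shown earlier; the scratch directory and the variable names are only for illustration. With the default white space delimiter, the second field of ``Richard Stallman%rms%GNU Project`` starts at ``Stallman``, so the lines come out ordered by the second word of the name rather than by "other" name. ``LC_ALL=C`` is set so that the ordering does not depend on local language settings.

```shell
# Recreate the sample file in a scratch directory (hypothetical location).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > stalwarts.txt <<'EOF'
Richard Stallman%rms%GNU Project
Eric Raymond%ESR%Jargon File
Ian Murdock% %Debian
Lawrence Lessig% %Creative Commons
Linus Torvalds% %Linux Kernel
Guido van Rossum%BDFL%Python
Larry Wall% %Perl
EOF

# No -t: fields are split at white space, so key 2 starts at the second word.
by_default_fields=$(LC_ALL=C sort -k 2,2 stalwarts.txt)

# With -t %: key 2 is the "other" name column.
by_other_name=$(LC_ALL=C sort -t % -k 2,2 stalwarts.txt)

printf '%s\n\n%s\n' "$by_default_fields" "$by_other_name"
```

The first listing starts with Lawrence Lessig, the second with Ian Murdock, which shows that the two commands are not comparing the same part of each line.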

The ``-k 2,2`` option starts a key at field 2 and ends it at field 2, essentially telling ``sort`` to sort based on the 2nd column, which is the "other" name. ``sort`` also supports breaking ties by using multiple columns for sorting. You can see that the first four lines have nothing in the "other" names column. We could resolve the tie by sorting based on the project names (the 3rd column).

::

  $ sort -t % -k 2,2 -k 3,3 stalwarts.txt
  Lawrence Lessig% %Creative Commons
  Ian Murdock% %Debian
  Linus Torvalds% %Linux Kernel
  Larry Wall% %Perl
  Guido van Rossum%BDFL%Python
  Eric Raymond%ESR%Jargon File
  Richard Stallman%rms%GNU Project
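
Sorting on the claim to fame, the other case raised earlier, follows the same pattern with the key moved to the 3rd column. A minimal sketch, assuming the same ``stalwarts.txt`` contents as above (recreated in a scratch directory so that the block runs on its own; the variable name is illustrative):

```shell
# Recreate the sample file in a scratch directory (hypothetical location).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > stalwarts.txt <<'EOF'
Richard Stallman%rms%GNU Project
Eric Raymond%ESR%Jargon File
Ian Murdock% %Debian
Lawrence Lessig% %Creative Commons
Linus Torvalds% %Linux Kernel
Guido van Rossum%BDFL%Python
Larry Wall% %Perl
EOF

# Key starts and ends at field 3, the project each person is known for.
by_project=$(LC_ALL=C sort -t % -k 3,3 stalwarts.txt)

printf '%s\n' "$by_project"
```

The listing runs from Creative Commons to Python, so Lawrence Lessig comes first and Guido van Rossum last.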

``sort`` also has many other options, such as ignoring case differences (``-f``), sorting by month (``-M``, JAN < FEB < ...) and merging already sorted files (``-m``). ``man sort`` gives the full list.
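
A few of the options mentioned in this section can be tried on throwaway input. The sketch below uses made-up data and illustrative variable and file names. Note that ``-M`` is an extension offered by GNU and BSD ``sort`` rather than part of POSIX, so it may be missing on other systems.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# -r reverses the result of the comparison.
reversed=$(printf 'apple\ncherry\nbanana\n' | LC_ALL=C sort -r)

# -f folds lowercase to uppercase, so "Cherry" no longer sorts before "apple".
folded=$(printf 'banana\nCherry\napple\n' | LC_ALL=C sort -f)

# -M compares month name abbreviations instead of plain text.
months=$(printf 'MAR\nJAN\nFEB\n' | LC_ALL=C sort -M)

# -m merges files that are already sorted, without sorting them again.
printf 'apple\ncherry\n' > first.txt
printf 'banana\ndate\n' > second.txt
merged=$(LC_ALL=C sort -m first.txt second.txt)

printf '%s\n\n%s\n\n%s\n\n%s\n' "$reversed" "$folded" "$months" "$merged"
```

Merging with ``-m`` only gives a sorted result if every input file is itself sorted; it does not check or repair the inputs.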

``uniq``
--------

Suppose we have a list of items, say books, in a file ``items.txt``, and we wish to obtain a list that names each book only once, without any duplicates. We use the ``uniq`` command to achieve this.

::

  Programming Pearls
  The C Programming Language
  The Mythical Man Month: Essays on Software Engineering
  Programming Pearls
  The C Programming Language
  Structure and Interpretation of Computer Programs
  Programming Pearls
  Compilers: Principles, Techniques, and Tools
  The C Programming Language
  The Art of UNIX Programming
  Programming Pearls
  The Art of Computer Programming
  Introduction to Algorithms
  The Art of UNIX Programming
  The Pragmatic Programmer: From Journeyman to Master
  Programming Pearls
  Unix Power Tools
  The Art of UNIX Programming

Let us try to get rid of the duplicate lines in this file using the ``uniq`` command.
::

  $ uniq items.txt
  Programming Pearls
  The C Programming Language
  The Mythical Man Month: Essays on Software Engineering
  Programming Pearls
  The C Programming Language
  Structure and Interpretation of Computer Programs
  Programming Pearls
  Compilers: Principles, Techniques, and Tools
  The C Programming Language
  The Art of UNIX Programming
  Programming Pearls
  The Art of Computer Programming
  Introduction to Algorithms
  The Art of UNIX Programming
  The Pragmatic Programmer: From Journeyman to Master
  Programming Pearls
  Unix Power Tools
  The Art of UNIX Programming

Nothing changes! Why? The ``uniq`` command removes duplicate lines only when they are next to each other. So, we create a sorted copy of the original file and work with that file from here on.
::

  $ sort items.txt > items-sorted.txt
  $ uniq items-sorted.txt
  Compilers: Principles, Techniques, and Tools
  Introduction to Algorithms
  Programming Pearls
  Structure and Interpretation of Computer Programs
  The Art of Computer Programming
  The Art of UNIX Programming
  The C Programming Language
  The Mythical Man Month: Essays on Software Engineering
  The Pragmatic Programmer: From Journeyman to Master
  Unix Power Tools
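
The intermediate file is convenient for the examples that follow, but it is not strictly needed. As a sketch, the output of ``sort`` can be piped straight into ``uniq``, and ``sort -u`` produces the same list in a single step. The shortened ``items.txt`` and the variable names below are made up for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > items.txt <<'EOF'
Programming Pearls
The C Programming Language
Programming Pearls
Unix Power Tools
The C Programming Language
EOF

# Sort first so that duplicates become adjacent, then drop the repeats.
piped=$(LC_ALL=C sort items.txt | uniq)

# sort -u keeps only one line out of each run of equal lines.
sort_u=$(LC_ALL=C sort -u items.txt)

printf '%s\n' "$piped"
```

Both variables end up holding the same three titles.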

The ``uniq -u`` command prints only the lines that are unique, i.e. those that have no duplicates in the file. ``uniq -d`` prints only the lines that have duplicates, once each. The ``-c`` option prefixes each line with the number of times it occurs.
::

  $ uniq -u items-sorted.txt
  Compilers: Principles, Techniques, and Tools
  Introduction to Algorithms
  Structure and Interpretation of Computer Programs
  The Art of Computer Programming
  The Mythical Man Month: Essays on Software Engineering
  The Pragmatic Programmer: From Journeyman to Master
  Unix Power Tools

  $ uniq -dc items-sorted.txt
  5 Programming Pearls
  3 The Art of UNIX Programming
  3 The C Programming Language
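
Since ``-c`` puts the count at the start of each line, its output can be fed back into ``sort`` to rank the lines by frequency. A small sketch with a made-up ``items.txt``: ``-n`` compares the leading numbers numerically and ``-r`` puts the largest first. The variable names are illustrative.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > items.txt <<'EOF'
Programming Pearls
The C Programming Language
Programming Pearls
Unix Power Tools
The C Programming Language
Programming Pearls
EOF

# Count each distinct line, then order the counts from highest to lowest.
ranked=$(LC_ALL=C sort items.txt | uniq -c | LC_ALL=C sort -rn)

# The first line now names the most frequent item.
most_common=$(printf '%s\n' "$ranked" | head -n 1)

printf '%s\n' "$ranked"
```

The exact amount of padding in front of each count varies between ``uniq`` implementations, so it is safer not to rely on it when processing the output further.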

``join``
--------

Now suppose we had the file ``stalwarts1.txt``, which lists the home pages of all the people listed in ``stalwarts.txt``.
::

  Richard Stallman%http://www.stallman.org
  Eric Raymond%http://www.catb.org/~esr/
  Ian Murdock%http://ianmurdock.com/
  Lawrence Lessig%http://lessig.org
  Linus Torvalds%http://torvalds-family.blogspot.com/
  Guido van Rossum%http://www.python.org/~guido/
  Larry Wall%http://www.wall.org/~larry/

It would be nice to have a single file with the information from both files. To achieve this we use the ``join`` command.
::

  $ join -t % stalwarts.txt stalwarts1.txt
  Richard Stallman%rms%GNU Project%http://www.stallman.org
  Eric Raymond%ESR%Jargon File%http://www.catb.org/~esr/
  Ian Murdock% %Debian%http://ianmurdock.com/
  Lawrence Lessig% %Creative Commons%http://lessig.org
  Linus Torvalds% %Linux Kernel%http://torvalds-family.blogspot.com/
  Guido van Rossum%BDFL%Python%http://www.python.org/~guido/
  Larry Wall% %Perl%http://www.wall.org/~larry/

The ``join`` command joins the two files based on the field common to both files, which is the name in this case.

The ``-t`` option again specifies the delimiting character. Unless it is specified, ``join`` assumes that the fields are separated by spaces.

Note that, strictly, ``join`` expects both files to be sorted on the common field. The example above gets away with unsorted files only because the names appear in the same order in both. If this is not so, you could use ``sort`` to sort both files on the common field and then join them. In the above example, the common field is the first column in both files. If this is not the case, we can use the ``-1`` and ``-2`` options to specify the field to be used for joining in the first and the second file respectively. The example below assumes a file ``stalwarts2.txt`` in which the name is the second column.
::

  $ join -t % -2 2 stalwarts.txt stalwarts2.txt
  Richard Stallman%rms%GNU Project%http://www.stallman.org
  Eric Raymond%ESR%Jargon File%http://www.catb.org/~esr/
  Ian Murdock% %Debian%http://ianmurdock.com/
  Lawrence Lessig% %Creative Commons%http://lessig.org
  Linus Torvalds% %Linux Kernel%http://torvalds-family.blogspot.com/
  Guido van Rossum%BDFL%Python%http://www.python.org/~guido/
  Larry Wall% %Perl%http://www.wall.org/~larry/
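
When the two files list the people in different orders, sorting each file on its own join field first makes the join reliable. The sketch below uses two small hypothetical files, ``people.txt`` and ``pages.txt``, with the name in column 1 of the first and column 2 of the second. ``sort`` and ``join`` are run under the same ``LC_ALL=C`` setting so that both agree on what "sorted" means.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical inputs, deliberately listed in different orders.
cat > people.txt <<'EOF'
Richard Stallman%rms
Eric Raymond%ESR
Guido van Rossum%BDFL
EOF
cat > pages.txt <<'EOF'
http://www.python.org/~guido/%Guido van Rossum
http://www.stallman.org%Richard Stallman
http://www.catb.org/~esr/%Eric Raymond
EOF

# Sort each file on the field that will be used for joining.
LC_ALL=C sort -t % -k 1,1 people.txt > people-sorted.txt
LC_ALL=C sort -t % -k 2,2 pages.txt > pages-sorted.txt

# Join field 1 of the first file with field 2 of the second.
joined=$(LC_ALL=C join -t % -1 1 -2 2 people-sorted.txt pages-sorted.txt)

printf '%s\n' "$joined"
```

Each output line starts with the join field, followed by the remaining fields of the first file and then those of the second.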
042767d3dd0d Added Session 4 of ULT; Scite tut to be done.
Puneeth Chaganti <puneeth@fossee.in>
parents:
diff changeset
   191
042767d3dd0d Added Session 4 of ULT; Scite tut to be done.
Puneeth Chaganti <puneeth@fossee.in>
parents:
diff changeset
   192
042767d3dd0d Added Session 4 of ULT; Scite tut to be done.
Puneeth Chaganti <puneeth@fossee.in>
parents:
diff changeset
   193
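As a rough sketch of the sort-then-join approach described above, the following uses two small, made-up files whose lines are in different orders. The names ``people.txt`` and ``homepages.txt`` and their contents are hypothetical and not part of this tutorial's data. Each file is sorted on its own join field, and the sorted files are then joined using ``-1`` and ``-2``. ``LC_ALL=C`` is used only so that ``sort`` and ``join`` agree on the ordering.

```shell
# Hypothetical sample data, made up for this sketch.
workdir=$(mktemp -d)
printf '%s\n' 'Linus Torvalds%Linux Kernel' \
              'Guido van Rossum%Python' > "$workdir/people.txt"
printf '%s\n' 'http://www.python.org/~guido/%Guido van Rossum' \
              'http://torvalds-family.blogspot.com/%Linus Torvalds' > "$workdir/homepages.txt"

# Sort each file on its own join field:
# field 1 of the first file, field 2 of the second file.
LC_ALL=C sort -t % -k 1,1 "$workdir/people.txt" > "$workdir/people.sorted"
LC_ALL=C sort -t % -k 2,2 "$workdir/homepages.txt" > "$workdir/homepages.sorted"

# Join field 1 of the first file with field 2 of the second file.
joined=$(LC_ALL=C join -t % -1 1 -2 2 "$workdir/people.sorted" "$workdir/homepages.sorted")
printf '%s\n' "$joined"

rm -r "$workdir"
```

Each output line starts with the join field, followed by the remaining fields of the first file and then those of the second file.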
Generating a word frequency list
================================

Now, let us use the tools we have learnt so far to generate a word frequency list of a text file. We shall use the freely available text of Alice in Wonderland.

The basic steps to achieve this task are:

1. Eliminate the punctuation and spaces from the document.
2. Generate a list of words.
3. Count the words.

We first use ``grep`` and an elementary regular expression to pick out the alphabetic characters and leave out everything else.
::

  $ grep "[A-Za-z]*" alice-in-wonderland.txt

This outputs every line of the file: since ``*`` also matches zero characters, the pattern matches even the lines that have no alphabetic characters on them. This isn't of much use, since we haven't done anything with the text yet. We only require the alphabetic characters, without any of the other junk. ``man grep`` shows us the ``-o`` option, which outputs only the text that matches the regular expression.
::

  $ grep "[A-Za-z]*" -o alice-in-wonderland.txt

Not very surprisingly, we have all the words, spit out in the form of a list! Now that we have a list of words, it is quite simple to count the occurrences of each word. You would have realized that we can make use of the ``sort`` and ``uniq`` commands. We pipe the output of ``grep`` to ``sort`` and then pipe its output to ``uniq -c``.
::

  $ grep "[A-Za-z]*" -o alice-in-wonderland.txt | sort | uniq -c

Notice that you get the list of all words in the document in alphabetical order, with each word's frequency written next to it. But you might have observed that capitalized and lower-case words are being counted as different words. We therefore replace all the upper-case characters with lower-case ones, using the ``tr`` command.
::

  $ grep "[A-Za-z]*" -o alice-in-wonderland.txt | tr 'A-Z' 'a-z' | sort | uniq -c

Now, it would also be nice to have the list ordered in decreasing order of word frequency. We sort the output of the ``uniq`` command using ``sort`` with the ``-n`` and ``-r`` options to get the desired output.
::

  $ grep  "[A-Za-z]*" -o alice-in-wonderland.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr

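If the pipeline is used often, it could be wrapped in a small shell function. The sketch below is one possible way to do that; the function name ``word_freq`` and its optional second argument are assumptions of this example, not something defined elsewhere in this tutorial. It adds ``head`` to keep only the most frequent words and, as a variation on the commands above, writes the pattern as ``[A-Za-z][A-Za-z]*``.

```shell
# word_freq FILE [N]: hypothetical helper that prints the N most frequent
# words in FILE (10 by default), most frequent first.
word_freq() {
  LC_ALL=C grep -o "[A-Za-z][A-Za-z]*" "$1" \
    | tr 'A-Z' 'a-z' \
    | LC_ALL=C sort \
    | uniq -c \
    | LC_ALL=C sort -nr \
    | head -n "${2:-10}"
}

# Try it on a made-up sample file.
sample_file=$(mktemp)
printf '%s\n' 'The cat and the dog and the bird' > "$sample_file"

# awk only strips the padding that uniq -c adds in front of the counts.
top=$(word_freq "$sample_file" 2 | awk '{print $1, $2}')
printf '%s\n' "$top"

rm "$sample_file"
```

For the book itself, ``word_freq alice-in-wonderland.txt 20`` would list its twenty most frequent words.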