Dealing with Files and Filesystems

In this first of a two-part article, you will learn how to get the most out of certain BSD commands, as well as some useful ways to handle your filesystem. It is excerpted from chapter two of the book BSD Hacks, written by Dru Lavigne (O’Reilly, 2005; ISBN: 0596006799). Copyright © 2005 O’Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O’Reilly Media.

Now that you’re a bit more comfortable with the Unix environment, it’s time to tackle some commands. It’s funny how some of the most useful commands on a Unix system have gained themselves a reputation for being user-unfriendly. Do find, grep, sed, tr, or mount make you shudder? If not, remember that you still have novice users who are intimidated by these commands and therefore aren’t using them to their full potential.

This chapter also addresses some useful filesystem manipulations. Have you ever inadvertently blown away a portion of your directory structure? Would you like to manipulate /tmp or your swap partition? Do your Unix systems need to play nicely with Microsoft systems? Might you consider ghosting your BSD system? If so, this chapter is for you.

HACK#13: Find Things

Finding files in Unix can be an exercise in frustration for a novice user. Here’s how to soften the learning curve.

Remember the first time you installed a Unix system? Once you successfully booted to a command prompt, I bet your first thought was, "Now what?" or possibly, "Okay, where is everything?" I’m also pretty sure your first foray into man find wasn’t all that enlightening.

How can you as an administrator make it easier for your users to find things? First, introduce them to the built-in commands. Then, add a few tricks of your own to soften the learning curve.

Finding Program Paths

Every user should become aware of the three w’s: which, whereis, and whatis. (Personally, I’d like to see some why and when commands, but that’s another story.)

Use which to find the path to a program. Suppose you’ve just installed xmms and wonder where it went:

  % which xmms
  /usr/X11R6/bin/xmms

Better yet, if you were finding out the pathname because you wanted to use it in a file, save yourself a step:

  % echo `which xmms` >> somefile

Remember to use the backticks ( ` ), often found on the far left of the keyboard on the same key as the tilde ( ~ ). If you instead use the single quote ( ' ) character, usually located on the right side of the keyboard on the same key as the double quote ( " ), your file will contain the echoed string which xmms instead of the desired path.

The user’s current shell will affect how which’s switches work. Here is an example from the C shell:

  % which -a xmms
  -a: Command not found.
  /usr/X11R6/bin/xmms

  % which which
  which: shell built-in command.

This is a matter of which which the user is using. Here, the user used the which which is built into the C shell and doesn’t support the options used by the which utility. Where then is that which ? Try the whereis command:

  % whereis -b which 
  which: /usr/bin/which

Here, I used -b to search only for the binary. Without any switches, whereis will display the binary, the manpage path, and the path to the original sources.

If your users prefer to use the real which command instead of the shell version and if they are only interested in seeing binary paths, consider adding these lines to /usr/share/skel/dot.cshrc [Hack #9]:

  alias which     /usr/bin/which -a
  alias whereis   whereis -b

The -a switch will list all binaries with that name, not just the first binary found.

Finding Commands

How do you proceed when you know what it is that you want to do, but have no clue which commands are available to do it? I know I clung to the whatis command like a life preserver when I was first introduced to Unix. For example, when I needed to know how to set up PPP:

  % whatis ppp
  i4bisppp(4)         - isdn4bsd synchronous PPP over ISDN B-channel network driver
  ng_ppp(4)           - PPP protocol netgraph node type
  ppp(4)              - point to point protocol network interface
  ppp(8)              - Point to Point Protocol (a.k.a. user-ppp)
  pppctl(8)           - PPP control program
  pppoed(8)           - handle incoming PPP over Ethernet connections
  pppstats(8)         - print PPP statistics

On the days I had time to satisfy my curiosity, I tried this variation:

  % whatis "(1)"

That will show all of the commands that have a manpage in section 1. If you’re rusty on your manpage sections, whatis intro should refresh your memory.

Finding Words

The previous commands are great for finding binaries and manpages, but what if you want to find a particular word in one of your own text files? That requires the notoriously user-unfriendly find command. Let’s be realistic. Even with all of your Unix experience, you still have to dig into either the manpage or a good book whenever you need to find something. Can you really expect novice users to figure it out?

To start with, the regular old invocation of find will find filenames, but not the words within those files. We need a judicious use of grep to accomplish that. Fortunately, find’s -exec switch lets it hand each file it finds to another utility, such as grep.

Start off with a find command that looks like this:

  % find . -type f -exec grep "word" {} \;

This invocation says to start in the current directory ( . ), look through files, not directories ( -type f ), while running the grep command ( -exec grep ) in order to search for the word word . Note that the syntax of the -exec switch always resembles:

  -exec command with_its_parameters {} \;

What happens if I search the files in my home directory for the word alias ?

  % find . -type f -exec grep "alias" {} \;
  alias h                  history 25
  alias j                  jobs -l
  Antialiasing=true
  Antialiasing arguments=-sDEVICE=x11 -dTextAlphaBits=4 -dGraphicsAlphaBits=2
  -dMaxBitmap=10000000
  (proc-arg 0 "antialiasing" "Apply antialiasing (TRUE/FALSE)")
  (proc-arg 0 "antialiasing" "Apply antialiasing (TRUE/FALSE)")

While it’s nice to see that find successfully found the word alias in my home directory, there’s one slight problem. I have no idea which file or files contained my search expression! However, adding /dev/null to that command will fix that:

  # find . -type f -exec grep "alias" /dev/null {} \;
  ./.cshrc:alias h            history 25
  ./.cshrc:alias j            jobs -l
  ./.kde/share/config/kghostviewrc:Antialiasing=true
  ./.kde/share/config/kghostviewrc:Antialiasing arguments=-sDEVICE=x11
  -dTextAlphaBits=4 -dGraphicsAlphaBits=2 -dMaxBitmap=10000000
  ./.gimp-1.3/pluginrc:  (proc-arg 0 "antialiasing" "Apply antialiasing
  (TRUE/FALSE)")
  ./.gimp-1.3/pluginrc:  (proc-arg 0 "antialiasing" "Apply antialiasing
  (TRUE/FALSE)")

Why did adding nothing, /dev/null, automagically cause the name of the file to appear next to the line that contains the search expression? Is it because Unix is truly amazing? After all, it does allow even the state of nothingness to be expressed as a filename.

Actually, it works because grep will list the filename whenever it searches multiple files. When you just use {}, find will pass each filename it finds one at a time to grep. Since grep is searching only one filename, it assumes you already know the name of that file. When you use /dev/null {}, find actually passes grep two files, /dev/null along with whichever file find happens to be working on. Since grep is now comparing two files, it’s nice enough to tell you which of the files contained the search string. We already know /dev/null won’t contain anything, so we just convinced grep to give us the name of the other file.
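
You can see the effect in miniature with a quick sketch, run in any scratch directory (the filename sample.cshrc is just an illustration):

```shell
# grep names the file only when it has more than one file to search;
# /dev/null supplies a harmless second file.
echo 'alias h history 25' > sample.cshrc
grep "alias" /dev/null sample.cshrc
# prints: sample.cshrc:alias h history 25
rm sample.cshrc
```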

That’s pretty handy. Now let’s make it friendly. Here’s a very simple script called fstring:

  % more ~/bin/fstring
  #!/bin/sh
  # script to find a string
  # replaces $1 with user's search string

  find . -type f -exec grep "$1" /dev/null {} \;

That $1 is a positional parameter. This script expects the user to give one parameter: the word the user is searching for. When the script executes, the shell will replace "$1" with the user’s search string. So, the script is meant to be run like this:

  % fstring word_to_search

If you’re planning on using this script yourself, you’ll probably remember to include a search string. If you want other users to benefit from the script, you may want to include an if statement to generate an error message if the user forgets the search string:

  #!/bin/sh
  # script to find a string
  # replaces $1 with user's search string
  # or gives error message if user forgets to include search string
  if test "$1"
  then
      find . -type f -exec grep "$1" /dev/null {} \;
  else
      echo "Don't forget to include the word you would like to search for"
      exit 1
  fi

Don’t forget to make your script executable with chmod +x and to place it in the user’s path. /usr/local/bin is a good location for other users to benefit.

See Also

  • man which
  • man whereis
  • man whatis
  • man find
  • man grep

HACK#14: Get the Most Out of grep

You may not know where its odd name originated, but you can’t argue with the usefulness of grep.

Have you ever needed to find a particular file and thought, "I don’t recall the filename, but I remember some of its contents"? The oddly named grep command does just that, searching inside files and reporting on those that contain a given piece of text.

Finding Text

Suppose you wish to search your shell scripts for the text $USER . Try this:

  % grep -s '$USER' *
  add-user:if [ "$USER" != "root" ]; then
  bu-user: echo " [-u user] - override $USER as the user to backup"
  bu-user:if [ "$user" = "" ]; then user="$USER"; fi
  del-user:if [ "$USER" != "root" ]; then
  mount-host:mounted=$(df | grep "$ALM_AFP_MOUNT/$USER")
  .....
  mount-user: echo " [-u user] - override $USER as the user to backup"
  mount-user:if [ "$user" = "" ]; then user="$USER"; fi

In this example, grep has searched through all files in the current directory, displaying each line that contained the text $USER . Use single quotes around the text to prevent the shell from interpreting special characters. The -s option suppresses error messages when grep encounters a directory.

Perhaps you only want to know the name of each file containing the text $USER . Use the -l option to create that list for you:

  % grep -ls '$USER' *
  add-user
  bu-user
  del-user
  mount-host
  mount-user

Searching by Relevance

What if you’re more concerned about how many times a particular string occurs within a file? That’s known as a relevance search. Use a command similar to:

  % grep -sc '$USER' * | grep -v ':0' | sort -k 2 -t : -r
  mount-host:6
  mount-user:2
  bu-user:2
  del-user:1
  add-user:1

How does this magic work? The -c flag lists each file with a count of matching lines, but it unfortunately includes files with zero matches. To counter this, I piped the output from grep into a second grep, this time searching for ':0' and using a second option, -v, to reverse the sense of the search by displaying lines that don’t match. The second grep reads from the pipe instead of a file, searching the output of the first grep.

For a little extra flair, I sorted the subsequent output by the second field of each line with sort -k 2, assuming a field separator of colon ( -t : ) and using -r to reverse the sort into descending order.
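
One caveat: without the -n flag, sort compares the counts as strings, so a file with 10 matches would sort below one with 6. If your sort supports -n (most do), add it to compare numerically. A small sketch with made-up counts:

```shell
# Numeric reverse sort on the second colon-separated field:
printf 'a:6\nb:12\nc:1\n' | sort -t : -k 2 -rn
# prints:
#   b:12
#   a:6
#   c:1
```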

Document Extracts

Suppose you wish to search a set of documents and extract a few lines of text centered on each occurrence of a keyword. This time we are interested in the matching lines and their surrounding context, but not in the filenames. Use a command something like this:

  % grep -rhiw -A4 -B4 'preferences' *.txt > research.txt
  % more research.txt

This grep command searches all files with the .txt extension for the word preferences. It performs a recursive search ( -r ) to include all subdirectories, hides ( -h ) the filename in the output, matches in a case-insensitive ( -i ) manner, and matches preferences as a complete word but not as part of another word ( -w ). The -A4 and -B4 options display the four lines immediately after and before the matched line, to give the desired context. Finally, I’ve redirected the output to the file research.txt.

You could also send the output straight to the vim text editor with:

  % grep -rhiw -A4 -B4 'preferences' *.txt | vim -
  Vim: Reading from stdin...

vim can be installed from /usr/ports/editors/vim.

Specifying vim - tells vim to read stdin (in this case the piped output from grep ) instead of a file. Type :q! to exit vim .

To search files for several alternatives, use the -e option to introduce extra search patterns:

  % grep -e 'text1' -e 'text2' *

Q. How did grep get its odd name?

A. grep was written as a standalone program to simulate a commonly performed command available in the ancient Unix editor ex. The command in question searched an entire file for lines containing a regular expression and displayed those lines. The command was g/re/p: globally search for a regular expression and print the line.

Using Regular Expressions

To search for text that is more vaguely specified, use a regular expression. grep understands both basic and extended regular expressions, though it must be invoked as either egrep or grep -E when given an extended regular expression. The text or regular expression to be matched is usually called the pattern.

Suppose you need to search for lines that end in a space or tab character. Try this command (to insert a tab, press Ctrl-V and then Ctrl-I, shown as <tab> in the example):

  % grep -n '[ <tab>]$' test-file
  2:ends in space
  3:ends in tab

I used the [...] construct to form a regular expression listing the characters to match: space and tab. The expression matches exactly one space or one tab character. $ anchors the match to the end of a line. The -n flag tells grep to include the line number in its output.

Alternatively, use:

  % grep -n '[[:blank:]]$' test-file
  2:ends in space
  3:ends in tab

Regular expressions provide many preformed character groups of the form [[:description:]]. Example groups include all control characters, all digits, or all alphanumeric characters. See man re_format for details.
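
For instance, [[:digit:]] matches any single digit, which makes it easy to try from the command line:

```shell
# Only lines containing at least one digit match:
printf 'room 101\nno digits here\n' | grep '[[:digit:]]'
# prints: room 101
```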

We can modify a previous example to search for either preferences or preference as a complete word, using an extended regular expression such as this:

  % egrep -rhiw -A4 -B4 'preferences?' *.txt > research.txt

The ? symbol specifies zero or one of the preceding character, making the s of preferences optional. Note that I use egrep because ? is available only in extended regular expressions. If you wish to search for the ? character itself, escape it with a backslash, as in \?.

An alternative method uses an expression of the form (string1|string2) , which matches either one string or the other:

  % egrep -rhiw -A4 -B4 'preference(s|)' *.txt > research.txt

As a final example, use this to seek out all bash , tcsh , or sh shell scripts:

  % egrep '^#\!/bin/(ba|tc|)sh[[:blank:]]*$' *

The caret ( ^ ) character at the start of a regular expression anchors it to the start of the line (much as $ at the end anchors it to the end). (ba|tc|) matches ba, tc, or nothing. The * character specifies zero or more of [[:blank:]], allowing trailing whitespace but nothing else. Note that the ! character must be escaped as \! to avoid shell interpretation in tcsh (but not in bash).

Here’s a handy tip for debugging regular expressions: if you don’t pass a filename to grep, it will read standard input, allowing you to enter lines of text to see which match. grep will echo back only matching lines.
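
For example, you can confirm the trailing-blank pattern from earlier by piping a couple of test lines straight into grep:

```shell
# Only the line with a trailing space survives the filter:
printf 'ends in space \nno trailing blank\n' | grep '[[:blank:]]$'
# prints: ends in space (with its trailing space intact)
```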

Combining grep with Other Commands

grep works well with other commands. For example, to display all tcsh processes:

  % ps axww | grep -w 'tcsh'
  saruman 10329 0.0 0.2 6416 1196 p1 Ss Sat01PM 0:00.68 -tcsh (tcsh)
  saruman 11351 0.0 0.2 6416 1300 std Ss Sat07PM 0:02.54 -tcsh (tcsh)
  saruman 13360 0.0 0.0 1116    4 std R+ 10:57PM 0:00.00 grep -w tcsh
  %

Notice that the grep command itself appears in the output. To prevent this, use:

  % ps axww | grep -w '[t]csh'
  saruman 10329 0.0 0.2 6416 1196 p1 Ss Sat01PM 0:00.68 -tcsh (tcsh)
  saruman 11351 0.0 0.2 6416 1300 std Ss Sat07PM 0:02.54 -tcsh (tcsh)
  %

I’ll let you figure out how this works.

See Also

  • man grep 
  • man re_format (regular expressions)

HACK#15: Manipulate Files with sed

If you’ve ever had to change the formatting of a file, you know that it can be a time-consuming process.

Why waste your time making manual changes to files when Unix systems come with many tools that can very quickly make the changes for you?

Removing Blank Lines

Suppose you need to remove the blank lines from a file. This invocation of grep will do the job:

  % grep -v '^$' letter1.txt > tmp ; mv tmp letter1.txt

The pattern ^$ anchors to both the start and the end of a line with no intervening characters: the regexp definition of a blank line. The -v option reverses the search, printing all nonblank lines, which are then written to a temporary file, and the temporary file is moved back to the original.

grep must never output to the same file it is reading, or the file will end up empty.

You can rewrite the preceding example in sed as:

  % sed '/^$/d' letter1.txt > tmp ; mv tmp letter1.txt

'/^$/d' is actually a sed script. sed’s normal mode of operation is to read each line of input, process it according to the script, and then write the processed line to standard output. In this example, /^$/ is a regular expression matching a blank line, and the trailing d is a sed function that deletes the line. Blank lines are deleted and all other lines are printed. Again, the results are redirected to a temporary file, which is then copied back to the original file.
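
You can watch the script work without touching any files by feeding sed a few lines on standard input:

```shell
# The blank middle line is deleted; the others pass through.
printf 'one\n\ntwo\n' | sed '/^$/d'
# prints:
#   one
#   two
```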

Searching with sed

sed can also do the work of grep :

  % sed -n '/$USER/p' *

This command will yield the same results as:

  % grep '$USER' *

The -n (no-print, perhaps) option prevents sed from outputting each line. The pattern /$USER/ matches lines containing $USER , and the p function prints matched lines to standard output, overriding -n .

Replacing Existing Text

One of the most common uses for sed is to perform a search and replace on a given string. For example, to change all occurrences of 2003 into 2004 in a file called date, include the two search strings in the format 's/oldstring/newstring/', like so:

  % sed 's/2003/2004/' date
  Copyright 2004
  …
  This was written in 2004, but it is no longer 2003.
  …

Almost! Notice that the last 2003 remains unchanged. This is because without the g (global) flag, sed will change only the first occurrence on each line. This command will give the desired result:

  % sed 's/2003/2004/g' date

Search and replace takes other flags too. To output only changed lines, use:

  % sed -n 's/2003/2004/gp' date

Note the use of the -n flag to suppress normal output and the p flag to print changed lines.

Multiple Transformations

Perhaps you need to perform two or more transformations on a file. You can do this in a single run by specifying a script with multiple commands:

  % sed 's/2003/2004/g;/^$/d' date

This performs both substitution and blank line deletion. Use a semicolon to separate the two commands.
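
Most sed implementations also accept multiple -e options, one per command, which some find easier to read than the semicolon form. The input lines here are made up for illustration:

```shell
# Substitute first, then delete blank lines:
printf '2003 report\n\nend 2003 2003\n' | sed -e 's/2003/2004/g' -e '/^$/d'
# prints:
#   2004 report
#   end 2004 2004
```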

Here is a more complex example that translates HTML tags of the form <font> into PHP bulletin board tags of the form [font] :

  % cat index.html
  <title>hello
  </title>

  % sed 's/<\(.*\)>/[\1]/g' index.html
  [title]hello
  [/title]

How did this work? The script searched for an HTML tag using the pattern <\(.*\)>. Angle brackets match literally. In a regular expression, a dot ( . ) represents any character and an asterisk ( * ) means zero or more of the previous item. Escaped parentheses, \( and \), capture the matched pattern lying between them and place it in a numbered buffer. In the replace string, \1 refers to the contents of the first buffer. Thus the text between the angle brackets in the search string is captured into the first buffer and written back inside square brackets in the replace string. sed takes full advantage of the power of regular expressions to copy text from the pattern to its replacement.

  % cat index1.html
  <title>hello</title>
  % sed 's/<\(.*\)>/[\1]/g' index1.html
  [title>hello</title]

This time the same command fails because the pattern .* is greedy and grabs as much as it can, matching up to the second >. To prevent this behavior, we need to match zero or more of any character except < . Recall that [...] is a regular expression that lists characters to match, but if the first character is the caret ( ^ ), the match is reversed. Thus the regular expression [^<] matches any single character other than < . I can modify the previous example as follows:

  % sed 's/<\([^<]*\)>/[\1]/g' index1.html
  [title]hello[/title]

Remember, grep will perform a case-insensitive search if you provide the -i flag. sed , unfortunately, does not have such an option. To search for title in a case-insensitive manner, form regular expressions using [...] , each listing a character of the word in both upper- and lowercase forms:

  % sed 's/[Tt][Ii][Tt][Ll][Ee]/title/g' title.html
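
The bracket-expression trick is clumsy but portable; here it is in action on a sample line. (GNU sed also offers an I modifier to s///, but that is an extension you cannot count on in every BSD sed.)

```shell
# Every case variation of "title" collapses to lowercase:
printf 'TITLE TiTlE title\n' | sed 's/[Tt][Ii][Tt][Ll][Ee]/title/g'
# prints: title title title
```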

See Also

  • man grep
  • man sed
  • man re_format (regular expressions)
  • "sed & Regular Expressions" at http://main.rtfiber.com.tw/~changyj/sed/
  • Cool sed tricks at http://www.wagoneers.com/UNIX/SED/sed.html
  • The sed FAQ (http://doc.ddart.net/shell/sedfaq.htm)
  • The sed Script Archive (http://sed.sourceforge.net/grabbag/scripts/)

HACK#16: Format Text at the Command Line

Combine basic Unix tools to become a formatting expert.

Don’t let the syntax of the sed command scare you off. sed is a powerful utility capable of handling most of your formatting needs. For example, have you ever needed to add or remove comments from a source file? Perhaps you need to shuffle some text from one section to another.

In this hack, I’ll demonstrate how to do that. I’ll also show some handy formatting tricks using two other built-in Unix commands, tr and col .

Adding Comments to Source Code

sed allows you to specify an address range using a pattern, so let’s put this to use. Suppose we want to comment out a block of text in a source file by adding // to the start of each line we wish to comment out. We might use a text editor to mark the block with bc-start and bc-end :

  % cat source.c
    if (tTd(27, 1))
      sm_dprintf("%s (%s, %s) aliased to %s\n",
         a->q_paddr, a->q_host, a->q_user, p);
    bc-start
      if (bitset(EF_VRFYONLY, e->e_flags))
      {
        a->q_state = QS_VERIFIED;
        return;
      }
    bc-end
    message("aliased to %s", shortenstring(p, MAXSHORTSTR));

and then apply a sed script such as:

  % sed '/bc-start/,/bc-end/s/^/\/\//' source.c

to get:

    if (tTd(27, 1))
        sm_dprintf("%s (%s, %s) aliased to %s\n",
            a->q_paddr, a->q_host, a->q_user, p);
  //bc-start
  //  if (bitset(EF_VRFYONLY, e->e_flags))
  //  {
  //      a->q_state = QS_VERIFIED;
  //      return;
  //  }
  //bc-end
    message("aliased to %s", shortenstring(p, MAXSHORTSTR));

The script used search and replace to add // to the start of all lines ( s/^/\/\// ) that lie between the two markers ( /bc-start/,/bc-end/ ). This will apply to every block in the file between the marker pairs. Note that in the sed script, the / character has to be escaped as \/ so it is not mistaken for a delimiter.
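
If the leaning toothpicks bother you, remember that the s command accepts almost any character as its delimiter; choosing | means the slashes in // no longer need escaping. A sketch with dummy marker lines:

```shell
# Same idea as the script above, but with | as the s-command delimiter:
printf 'start\ncode\nend\n' | sed '/start/,/end/s|^|//|'
# prints:
#   //start
#   //code
#   //end
```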

Removing Comments

When we need to delete the comments and the two bc- lines (let’s assume that the edited contents were copied back to source.c), we can use a script such as:

  % sed '/bc-start/d;/bc-end/d;/bc-start/,/bc-end/s/^/\/\//' source.c

Oops! My first attempt won’t work. The bc- lines must be deleted after they have been used as address ranges. Trying again we get:

  % sed '/bc-start/,/bc-end/s/^/\/\//;/bc-start/d;/bc-end/d' source.c

If you want to leave the two bc- marker lines in but comment them out, use this piece of trickery:

  % sed '/bc-start/,/bc-end/{/^\/\/bc-/\!s/\/\///;}' source.c

to get:

  if (tTd(27, 1))
      sm_dprintf("%s (%s, %s) aliased to %s\n",
          a->q_paddr, a->q_host, a->q_user, p);
  //bc-start
  if (bitset(EF_VRFYONLY, e->e_flags))
  {
      a->q_state = QS_VERIFIED;
      return;
  }
  //bc-end
  message("aliased to %s", shortenstring(p, MAXSHORTSTR));

Note that in the bash shell you must use:

  % sed '/bc-start/,/bc-end/{/^\/\/bc-/!s/\/\///;}' source.c

because the bang character ( ! ) does not need to be escaped as it does in tcsh.

What’s with the curly braces? They prevent a common mistake. You may imagine that this example:

  % sed -n '/$USER/p;p' *

prints each line containing $USER twice because of the p;p commands. It doesn’t, though, because the second p is not restrained by the /$USER/ line address and therefore applies to every line. To print twice just those lines containing $USER, use:

  % sed -n '/$USER/p;/$USER/p' *

or:

  % sed -n '/$USER/{p;p;}' *

The construct {…} introduces a function list that applies to the preceding line address or range.

A line address followed by ! (or \! in the tcsh shell) reverses the address range, and so the function (list) that follows is applied to all lines not matching. The net effect is to remove // from all lines that don’t start with //bc- but that do lie within the bc- markers.

Using the Holding Space to Mark Text

sed reads input into the pattern space, but it also provides a buffer (called the holding space) and functions to move text from one space to the other. All other functions (such as s and d ) operate on the pattern space, not the holding space.

Check out this sed script:

  % cat case.script
  # Sed script for case insensitive search
  #
  # copy pattern space to hold space to preserve it
  h
  y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
  # use a regular expression address to search for lines containing:
  /test/ {
  i\
  vvvv
  a\
  ^^^^
  }
  # restore the original pattern space from the hold space
  x;p

First, I have written the script to a file instead of typing it in on the command line. Lines starting with # are comments and are ignored. Other lines specify a sed command, and commands are separated by either a newline or ; character. sed reads one line of input at a time and applies the whole script file to each line. The following functions are applied to each line as it is read:

h

Copies the pattern space (the line just read) into the holding space.

y/ABC/abc/

Operates on the pattern space, translating A to a , B to b , and C to c and so on, ensuring the line is all lowercase.

/test/ {…}

Matches the line just read if it includes the text test (whatever the original case, because the line is now all lowercase) and then applies the list of functions that follow. This example appends text before ( i ) and after ( a ) the matched line to highlight it.

x

Exchanges the pattern and hold space, thus restoring the original contents of the pattern space.

p

Prints the pattern space.

Here is the test file:

  % cat case
  This contains text      Hello
  that we want to         TeSt
  search for, but in      test
  a case insensitive      XXXX
  manner using the sed    TEST
  editor.                 Bye bye.
  %

Here are the results of running our sed script on it:

  % sed -n -f case.script case
  This contains text      Hello
  vvvv
  that we want to         TeSt
  ^^^^
  vvvv
  search for, but in      test
  ^^^^
  a case insensitive      XXXX
  vvvv
  manner using the sed    TEST
  ^^^^
  editor.                 Bye bye.

Notice the vvvv and ^^^^ markers around lines that contain test.

Translating Case

The tr command can translate one character to another. To change the contents of case into all lowercase and write the results to the file lower-case, we could use:

  % tr 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' 'abcdefghijklmnopqrstuvwxyz' < case > lower-case

tr works with standard input and output only, so to read and write files we must use redirection.
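
If your tr understands POSIX character classes (modern BSD and GNU versions do), the long alphabet strings can be abbreviated:

```shell
# [:upper:] and [:lower:] stand in for the spelled-out alphabets:
printf 'Hello TeSt\n' | tr '[:upper:]' '[:lower:]'
# prints: hello test
```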

Translating Characters

To translate carriage return characters into newline characters, we could use:

  % tr \\r \\n < cr > lf

where cr is the original file and lf is a new file containing line feeds in place of carriage returns. \n represents a line feed character, but we must escape the backslash character in the shell, so we use \\n instead. Similarly, a carriage return is specified as \\r.

Removing Duplicate Line Feeds

tr can also squeeze multiple consecutive occurrences of a particular character into a single occurrence. For example, to remove duplicate line feeds from the lines file:

  % tr -s \\n < lines > tmp ; mv tmp lines

Here we use the tmp file trick again because tr , like grep and sed , will trash the input file if it is also the output file.
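
Here’s the squeeze in miniature, run on standard input so no files are involved:

```shell
# Three consecutive newlines collapse to one:
printf 'a\n\n\nb\n' | tr -s '\n'
# prints:
#   a
#   b
```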

Deleting Characters

tr can also delete selected characters. If, for instance, you hate vowels, run your documents through this:

  % tr -d aeiou < file

Translating Tabs to Spaces

To translate tabs into multiple spaces, use col with the -x flag:

  % cat tabs
  col     col    col

  % od -x tabs
  0000000    636f 6c09 636f 6c09 636f 6c0a 0a00
  0000015

  % col -x < tabs > spaces
  % cat spaces
  col  col  col

  % od -h spaces
  0000000    636f 6c20 2020 2020 636f 6c20 2020 2020
  0000020    636f 6c0a 0a00
  0000025

In this example I have used od -x to dump in hexadecimal the contents of the before and after files, which shows more clearly that the translation has worked. ( 09 is the code for Tab and 20 is the code for Space.)

See Also

  • man sed
  • man tr
  • man col
  • man od

HACK#17: Delimiter Dilemma

Deal with double quotation marks in delimited files.

Importing data from a delimited text file into an application is usually painless. Even if you need to change the delimiter from one character to another (from a comma to a colon, for example), you can choose from many tools that perform simple character substitution with great ease.

However, one common situation is not solved as easily: many business applications export data into a space- or comma-delimited file, enclosing individual fields in double quotation marks. These fields often contain the delimiter character. Importing such a file into an application that processes only one delimiter (PostgreSQL for example) may result in an incorrect interpretation of the data. This is one of those situations where the user should feel lucky if the process fails.

One solution is to write a script that tracks the use of double quotes to determine whether it is working within a text field. This is doable by creating a variable that acts as a text/nontext switch for the character substitution process. The script should change the delimiter to a more appropriate character, leave the delimiters that were enclosed in double quotes unchanged, and remove the double quotes. Rather than make the changes to the original datafile, it’s safer to write the edited data to a new file.

Attacking the Problem

The following algorithm meets our needs:

  1. Create the switch variable and assign it the value of 1, meaning "nontext". We'll declare the variable tswitch and define it as tswitch = 1.
  2. Create a variable for the delimiter and define it. We'll use the variable delim with a space as the delimiter, so delim = ' '.
  3. Decide on a better delimiter. We'll use the tab character, so new_delim = '\t'.
  4. Open the datafile for reading. 
  5. Open a new file for writing.


Now, for every character in the datafile:

  1.  Read a character from the datafile.
  2. If the character is a double quotation mark, tswitch = tswitch * -1 .
  3. If the character equals the character in delim and tswitch equals 1, write new_delim to the new file.
  4. If the character equals that in delim and tswitch equals -1, write the value of delim to the new file.
  5. If the character is anything else, write the character to the new file.

The Code

The Python script redelim.py implements the preceding algorithm. It prompts the user for the original datafile and a name for the new datafile. The delim and new_delim variables are hardcoded, but those are easily changed within the script.

This script copies a space-delimited text file with text values in double quotes to a new, tab-delimited file without the double quotes. The advantage of using this script is that it leaves spaces that were within double quotes unchanged.

There are no command-line arguments for this script. The script will prompt the user for source and destination file information.

You can redefine the variables for the original and new delimiters, delim and new_delim , in the script as needed.

  #!/usr/local/bin/python

  import os

  print """Change text file delimiters."""

  # Ask user for source and target files.
  sourcefile = raw_input('Please enter the path and name of the source file:')
  targetfile = raw_input('Please enter the path and name of the target file:')

  # Open files for reading and writing.
  source = open(sourcefile,'r')
  dest   = open(targetfile,'w')

  # The variable 'tswitch' acts as a text/non-text switch that reminds python
  # whether it is working within a text or non-text data field.
  tswitch = 1

  # If the source delimiter that you want to change is not a space,
  # redefine the variable delim in the next line.
  delim = ' '

  # If the new delimiter that you want is not a tab,
  # redefine the variable new_delim in the next line.
  new_delim = '\t'

  for charn in source.read():
      if tswitch == 1:
          if charn == delim:
              dest.write(new_delim)
          elif charn == '"':
              tswitch = tswitch * -1
          else:
              dest.write(charn)
      elif tswitch == -1:
          if charn == '"':
              tswitch = tswitch * -1
          else:
              dest.write(charn)

  source.close()
  dest.close()

Use of redelim.py assumes that you have installed Python, which is available through the ports collection or as a binary package. The Python module used in this code is installed by default.

Hacking the Hack

If you prefer working with Perl, DBD::AnyData is another good solution to this problem.

See Also

  • The Python home page (http://www.python.org/)

Please check back next week for the conclusion of this article.
