Home arrow Site Administration arrow Page 5 - Dealing with Files and Filesystems

HACK#17: Delimiter Dilemma - Administration

In this first of a two-part article, you will learn how to get the most out of certain BSD commands, as well as some useful ways to handle your filesystem. It is excerpted from chapter two of the book BSD Hacks, written by Dru Lavigne (O'Reilly, 2005; ISBN: 0596006799). Copyright 2005 O'Reilly Media, Inc. All rights reserved. Used with permission from the publisher. Available from booksellers or direct from O'Reilly Media.

  1. Dealing with Files and Filesystems
  2. HACK#14: Get the Most Out of grep
  3. HACK#15: Manipulate Files with sed
  4. HACK#16: Format Text at the Command Line
  5. HACK#17: Delimiter Dilemma
By: O'Reilly Media
Rating: starstarstarstarstar / 5
December 28, 2006

print this article



Deal with double quotation marks in delimited files.

Importing data from a delimited text file into an application is usually painless. Even if you need to change the delimiter from one character to another (from a comma to a colon, for example), you can choose from many tools that perform simple character substitution with great ease.

However, one common situation is not solved as easily: many business applications export data into a space- or comma-delimited file, enclosing individual fields in double quotation marks. These fields often contain the delimiter character. Importing such a file into an application that processes only one delimiter (PostgreSQL for example) may result in an incorrect interpretation of the data. This is one of those situations where the user should feel lucky if the process fails.

One solution is to write a script that tracks the use of double quotes to determine whether it is working within a text field. This is doable by creating a variable that acts as a text/nontext switch for the character substitution process. The script should change the delimiter to a more appropriate character, leave the delimiters that were enclosed in double quotes unchanged, and remove the double quotes. Rather than make the changes to the original datafile, it's safer to write the edited data to a new file.

Attacking the Problem

The following algorithm meets our needs:

  1. Create the switch variable and assign it the value of1, meaning "nontext". Well declare the variabletswitchand define it astswitch = 1.
  2. Create a variable for the delimiter and define it. We'll use the variabledelimwith a space as the delimiter, sodelim = ' '.
  3. Decide on a better delimiter. We'll use the tab character, sonew_delim = '\t'.
  4. Open the datafile for reading. 
  5. Open a new file for writing.

Now, for every character in the datafile:

  1.  Read a character from the datafile.
  2. If the character is a double quotation mark,tswitch = tswitch * -1.
  3. If the character equals the character indelimandtswitchequals 1, writenew_delimto the new file.
  4. If the character equals that indelim andtswitchequals -1, write the value ofdelimto the new file.
  5. If the character is anything else, write the character to the new file.

The Code

The Python script redelim.py implements the preceding algorithm. It prompts the user for the original datafile and a name for the new datafile. Thedelimandnew_delimvariables are hardcoded, but those are easily changed within the script.

This script copies a space-delimited text file with text values in double quotes to a new, tab-delimited file without the double quotes. The advantage of using this script is that it leaves spaces that were within double quotes unchanged.

There are no command-line arguments for this script. The script will prompt the user for source and destination file information.

You can redefine the variables for the original and new delimiters,delim andnew_delim, in the script as needed.

import os
print """ Change text file delimiters.
# Ask user for source and target files. 
  sourcefile = raw_input('Please enter the path and name of the source file:') 
  targetfile = raw_input('Please enter the path and name of the target file:')
  # Open files for reading and writing. 
  source = open(sourcefile,'r')
  dest   = open(targetfile,'w')
# The variable 'm' acts as a text/non-text switch that reminds python
  # whether it is working within a text or non-text data field.
  tswitch = 1
# If the source delimiter that you want to change is not a space,
  # redefine the variable delim in the next line.
  delim = ' '
# If the new delimiter that you want to change is not a tab,
  # redefine the variable new_delim in the next line.
  new_delim = '\t'
  for charn in source.read():
          if tswitch == 1:
               if charn == delim:
               elif charn == '"':
                        tswitch = tswitch * -1
      elif tswitch == -1:
               if charn == '"':
                      tswitch = tswitch *

Use of redelim.py assumes that you have installed Python, which is available through the ports collection or as a binary package. The Python module used in this code is installed by default.

Hacking the Hack

If you prefer working with Perl, DBD::AnyData is another good solution to this problem.

See Also

  • The Python home page (http://www.python.org/)

Please check back next week for the conclusion of this article.

>>> More Site Administration Articles          >>> More By O'Reilly Media

blog comments powered by Disqus
escort Bursa Bursa escort Antalya eskort


- Coding: Not Just for Developers
- To Support or Not Support IE?
- Administration: Networking OSX and Win 7
- DotNetNuke Gets Social
- Integrating MailChimp with Joomla: Creating ...
- Integrating MailChimp with Joomla: List Mana...
- Integrating MailChimp with Joomla: Building ...
- Integrating MailChimp with Joomla
- More Top WordPress Plugins for Social Media
- Optimizing Security: SSH Public Key Authenti...
- Patches and Rejects in Software Configuratio...
- Configuring a CVS Server
- Managing Code and Teams for Cross-Platform S...
- Software Configuration Management
- Back Up a Joomla Site with Akeeba Backup

Developer Shed Affiliates


Dev Shed Tutorial Topics: