Deal with double quotation marks in delimited files.
Importing data from a delimited text file into an application is usually painless. Even if you need to change the delimiter from one character to another (from a comma to a colon, for example), you can choose from many tools that perform simple character substitution with great ease.
However, one common situation is not solved as easily: many business applications export data into a space- or comma-delimited file, enclosing individual fields in double quotation marks. These fields often contain the delimiter character. Importing such a file into an application that processes only one delimiter (PostgreSQL for example) may result in an incorrect interpretation of the data. This is one of those situations where the user should feel lucky if the process fails.
One solution is to write a script that tracks the use of double quotes to determine whether it is working within a text field. This is doable by creating a variable that acts as a text/nontext switch for the character substitution process. The script should change the delimiter to a more appropriate character, leave the delimiters that were enclosed in double quotes unchanged, and remove the double quotes. Rather than make the changes to the original datafile, it's safer to write the edited data to a new file.
Attacking the Problem
The following algorithm meets our needs:
Create the switch variable and assign it the value of1, meaning "nontext". Well declare the variabletswitchand define it astswitch = 1.
Create a variable for the delimiter and define it. We'll use the variabledelimwith a space as the delimiter, sodelim = ' '.
Decide on a better delimiter. We'll use the tab character, sonew_delim = '\t'.
Open the datafile for reading.
Open a new file for writing.
Now, for every character in the datafile:
Read a character from the datafile.
If the character is a double quotation mark,tswitch = tswitch * -1.
If the character equals the character indelimandtswitchequals 1, writenew_delimto the new file.
If the character equals that indelim andtswitchequals -1, write the value ofdelimto the new file.
If the character is anything else, write the character to the new file.
The Code
The Python script redelim.py implements the preceding algorithm. It prompts the user for the original datafile and a name for the new datafile. Thedelimandnew_delimvariables are hardcoded, but those are easily changed within the script.
This script copies a space-delimited text file with text values in double quotes to a new, tab-delimited file without the double quotes. The advantage of using this script is that it leaves spaces that were within double quotes unchanged.
There are no command-line arguments for this script. The script will prompt the user for source and destination file information.
You can redefine the variables for the original and new delimiters,delim andnew_delim, in the script as needed.
#!/usr/local/bin/python import os print """ Change text file delimiters. # Ask user for source and target files. sourcefile = raw_input('Please enter the path and name of the source file:') targetfile = raw_input('Please enter the path and name of the target file:') # Open files for reading and writing. source = open(sourcefile,'r') dest = open(targetfile,'w') # The variable 'm' acts as a text/non-text switch that reminds python # whether it is working within a text or non-text data field. tswitch = 1 # If the source delimiter that you want to change is not a space, # redefine the variable delim in the next line. delim = ' ' # If the new delimiter that you want to change is not a tab, # redefine the variable new_delim in the next line. new_delim = '\t' for charn in source.read(): if tswitch == 1: if charn == delim: dest.write(new_delim) elif charn == '"': tswitch = tswitch * -1 else: dest.write(charn) elif tswitch == -1: if charn == '"': tswitch = tswitch * -1 else: dest.write(charn) source.close() dest.close()
Use of redelim.py assumes that you have installed Python, which is available through the ports collection or as a binary package. The Python module used in this code is installed by default.
Hacking the Hack
If you prefer working with Perl, DBD::AnyData is another good solution to this problem.
See Also
The Python home page (http://www.python.org/)
Please check back next week for the conclusion of this article.