Parsing a Querystring With Perl

Having trouble parsing a querystring with Perl? If so, then Jeff’s step-by-step guide will save you headaches and have you up and running in no time!

At first, parsing a CGI query sounds like a simple task. A query is just a list of key-value pairs, so a hash sounds like the right tool. But many CGI query parsers fail on anything other than the specific kind of query they expect. What if it’s a POST when you expected GET? What if the pairs are delimited by ; instead of &? What if the query is encoded as multipart/form-data (m/fd) instead of application/x-www-form-urlencoded (a/xwfu)? What if there’s more than one value for a field? And what about file uploads, a topic I’ve not covered yet?

How quickly a simple function becomes complex. But if we break the cases down into simple procedures, we’ll see it’s not as bad as we thought.

A Simple ISINDEX Query
http://www.server.com/cgi-bin/prog?key1+key2+key3

We already know the arguments are stored in the @ARGV array, so there is technically no need to parse the $ENV{QUERY_STRING} variable. However, the parsing here will be the simplest we encounter.

I mentioned that some servers (such as Apache 1.3.1 for Win32 systems) will allow space characters (character 32) as well as ampersands (character 38) in between keywords. For this functionality, send a true value (1, for example) as the first argument to the function. Also, your system may expect just a single space between keywords; to allow for multiple spaces (and ampersands, if that option is true), send a true value as the second argument to the function. The final option causes leading and trailing spaces to be removed, and this is achieved by sending a true value as the third argument.

This function also makes adjustments for an ISMAPed image that sends XX,YY as the query string.

While the @ARGV array has shell characters escaped, this function merely decodes the query, and does not escape characters.

# @keywords = isindex_query($amp, $squash, $strip);

sub isindex_query {
    my ($amp,$squash,$strip) = @_;
    my $str = $ENV{QUERY_STRING};
    my @kw;

    # handle XX,YY
    if ($str =~ /^(\d+),(\d+)$/) {
        return ($1,$2);
    }

    # change %26 (encoding for ampersand) to a + character
    $str =~ s/%26/+/g if $amp;

    # squish more than one + into one
    $str =~ tr/+//s if $squash;

    # remove leading and trailing + signs
    $str =~ s/^\++//, $str =~ s/\++$// if $strip;

    # split query string by + signs
    @kw = split /\+/, $str;

    # return decoded keywords
    return map { url_decode($_) } @kw;
}
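
To see what the three options do, here is a small sketch that applies the same steps to a string passed in directly rather than read from $ENV{QUERY_STRING}. The helper name split_keywords and the sample query are made up for illustration, and the hex decoding is inlined so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: same steps as isindex_query(), but the query
# string is an argument so the example needs no CGI environment.
sub split_keywords {
    my ($str, $amp, $squash, $strip) = @_;

    $str =~ s/%26/+/g if $amp;                     # encoded & becomes a separator
    $str =~ tr/+//s   if $squash;                  # collapse runs of +
    $str =~ s/^\++//, $str =~ s/\++$// if $strip;  # trim leading/trailing +

    my @kw = split /\+/, $str;

    # decode hex escapes in each keyword
    s/%([a-fA-F0-9]{2})/pack "H2", $1/eg for @kw;

    return @kw;
}

my $query = '+perl++cgi%26parsing+';

my @plain = split_keywords($query);           # no options set
my @clean = split_keywords($query, 1, 1, 1);  # all three options set

print scalar(@plain), " keywords without options\n";
print join(',', @clean), "\n";
```

Without the options, the leading and doubled + signs produce empty keywords; with all three set, only the real keywords remain.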


We should define the url_decode() and url_encode() functions right now, too, since they will be used over and over.

# $decoded = url_decode($string);
# $decoded = url_decode;

sub url_decode {
    # default argument is $_
    local $_ = @_ ? shift : $_;
    defined $_ or return;

    # change + signs to spaces
    tr/+/ /;

    # change hex escapes to the proper characters
    s/%([a-fA-F0-9]{2})/pack "H2", $1/eg;

    return $_;
}


The URLEncode Routine:

# $encoded = url_encode($string);
# $encoded = url_encode;

sub url_encode {
    # default argument is $_
    local $_ = @_ ? shift : $_;
    defined $_ or return;

    # change unsafe characters (except for space) to encoded value
    # (the - goes last in the class so it is not read as a range)
    s/([^ a-zA-Z0-9._!~*'()-])/sprintf '%%%02X', ord($1)/eg;

    # change spaces to +
    tr/ /+/;

    return $_;
}
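
A quick round trip is a reasonable sanity check for the pair. The sketch below repeats both routines in a compact, explicit-argument form so that it runs on its own; the names encode and decode are placeholders, not part of the code above.

```perl
use strict;
use warnings;

# Compact stand-ins for url_encode() and url_decode(); names are placeholders.
sub encode {
    my ($s) = @_;
    # escape everything outside the safe set (space is handled below)
    $s =~ s/([^ a-zA-Z0-9._!~*'()-])/sprintf '%%%02X', ord($1)/eg;
    $s =~ tr/ /+/;
    return $s;
}

sub decode {
    my ($s) = @_;
    $s =~ tr/+/ /;
    $s =~ s/%([a-fA-F0-9]{2})/pack "H2", $1/eg;
    return $s;
}

for my $text ('Jeff Pinyan', 'japhy@pobox.com', 'a+b=c&d') {
    my $enc = encode($text);
    print "$text => $enc\n";
    print "round trip failed for $text\n" if decode($enc) ne $text;
}
```

Note that a literal + in the input has to be escaped on the way out, since a bare + means a space on the way back in.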


A GET Query
http://www.server.com/cgi-bin/prog?name=Jeff+Pinyan&email=japhy%40pobox.com

For a GET query, we need to figure out how elements are separated. The simplest method is to split() on & or ; to get the pairs, and then again with = to get at the field and value.

# %kv_pairs = get_query($squash, $strip);

sub get_query {
    my ($squash,$strip) = @_;
    my $str = $ENV{QUERY_STRING};
    my %kv;

    # & and ; squishing
    $str =~ tr/&;/&/s if $squash;

    # leading/trailing & and ; removal
    $str =~ s/^[&;]+//, $str =~ s/[&;]+$// if $strip;

    # for each k=v pair
    for (split /[&;]/, $str) {
        # third arg of '2' because $_ might be 'a=b=c'
        my ($k,$v) = split /=/, $_, 2;

        # don't allow for blank key
        next if $k eq "";

        # XXX: this only allows one value per key!
        $kv{url_decode($k)} = url_decode($v);
    }

    return %kv;
}


As the comment states, this query parser does not allow for multiple values for a key, such as in the query take=box&take=candle&take=sword. There are generally two ways to get around this: make the value of the key a single string that joins the values with a comma (or some other character, like NUL (\0)), or make it an array reference to the values. The first approach makes it hard to be sure you have chosen a character (or sequence of characters) that is not found in the data. So I suggest the array reference method:

# %kv_pairs = get_query($squash, $strip);

sub get_query {
    my ($squash,$strip) = @_;
    my $str = $ENV{QUERY_STRING};
    my %kv;

    # ; to & translation
    $str =~ tr/;/&/;

    # & squishing
    $str =~ tr/&//s if $squash;

    # leading/trailing & removal
    $str =~ s/^&+//, $str =~ s/&+$// if $strip;

    # for each k=v pair
    for (split /&/, $str) {
        # third arg of '2' because $_ might be 'a=b=c'
        my ($k,$v) = split /=/, $_, 2;

        # don't allow for blank key
        next if $k eq "";

        ($k,$v) = map { url_decode($_) } ($k,$v);

        if (not exists $kv{$k}) { $kv{$k} = $v }
        elsif (not ref $kv{$k}) { $kv{$k} = [ $kv{$k}, $v ] }
        else { push @{ $kv{$k} }, $v }
    }

    return %kv;
}
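
Because a value may now be either a plain scalar or an array reference, the caller has to check which one it got. The following sketch shows one way to do that. Here parse_pairs is a hypothetical cut-down parser that takes the query as an argument instead of reading the environment, and the sample query is invented.

```perl
use strict;
use warnings;

# Hypothetical cut-down parser: takes the query string as an argument
# and stores repeated keys as array references.
sub parse_pairs {
    my ($str) = @_;
    my %kv;

    for my $pair (split /[&;]/, $str) {
        my ($k, $v) = split /=/, $pair, 2;
        next if !defined $k or $k eq "";

        # decode key and value in place (a missing value stays undef)
        for ($k, $v) {
            next unless defined $_;
            tr/+/ /;
            s/%([a-fA-F0-9]{2})/pack "H2", $1/eg;
        }

        if    (not exists $kv{$k}) { $kv{$k} = $v }
        elsif (not ref $kv{$k})    { $kv{$k} = [ $kv{$k}, $v ] }
        else                       { push @{ $kv{$k} }, $v }
    }

    return %kv;
}

my %kv = parse_pairs('take=box&take=candle;take=sword&name=Jeff+Pinyan&flag&empty=');

# Normalize to a list whether one value or several were sent
my @taken = ref $kv{take} ? @{ $kv{take} } : ($kv{take});

print "took: @taken\n";
print "name: $kv{name}\n";
print "flag is ",  (defined $kv{flag}  ? "defined" : "undef"), "\n";
print "empty is ", (defined $kv{empty} ? "defined" : "undef"), "\n";
```

The last two lines show the difference between a pair with no = at all and a pair with an = but nothing after it.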


If you have noticed that we need to know when to call which one of these functions, you’re thinking ahead. After I show how to parse a simple POST and then a multipart/form-data (m/fd) POST (not file uploads yet; that’s later), I will show a “multiplexor”, a function that decides which parser to call.

You’ll also notice that the GET parser excludes empty field names. This can be changed, if you like, by removing that line. The final code will have that feature as an option to the parser. Also note that if there is a pair without an = (such as “a=b&foo&c=d”) then the value is undef, whereas an = with no value after it (such as “a=b&foo=&c=d”) sets the value as the empty string.

A Simple POST Query
Content-type: application/x-www-form-urlencoded
Content-length: 40

name=Jeff+Pinyan&email=japhy%40pobox.com


Decoding an application/x-www-form-urlencoded (a/xwfu) POST query is much like the GET method, except that we must first read CONTENT_LENGTH bytes from standard input. That acts as our pseudo query string.

It is important to know that a POST query can be made at the same time as a GET query, by placing data in the query string. This is valid, and deserves to be decoded.

# %kv_pairs = simple_post_query($squash, $strip);

sub simple_post_query {
    my ($squash,$strip) = @_;
    read STDIN, my($str), $ENV{CONTENT_LENGTH};
    my %kv;

    # ; to & translation
    $str =~ tr/;/&/;

    # & squishing
    $str =~ tr/&//s if $squash;

    # leading/trailing & removal
    $str =~ s/^&+//, $str =~ s/&+$// if $strip;

    # for each k=v pair
    for (split /&/, $str) {
        # third arg of '2' because $_ might be 'a=b=c'
        my ($k,$v) = split /=/, $_, 2;

        # don't allow for blank key
        next if $k eq "";

        ($k,$v) = map { url_decode($_) } ($k,$v);

        if (not exists $kv{$k}) { $kv{$k} = $v }
        elsif (not ref $kv{$k}) { $kv{$k} = [ $kv{$k}, $v ] }
        else { push @{ $kv{$k} }, $v }
    }

    return %kv;
}


See? Hardly anything new, except for a call to read() and the use of the $ENV{CONTENT_LENGTH} variable. I prefixed the function with simple_, because the multipart/form-data (m/fd) query parser is going to be much more involved.

A Complex POST Query

CONTENT_LENGTH => 344
CONTENT_TYPE => multipart/form-data; boundary=5154532515249

--5154532515249
Content-Disposition: form-data; name="feature"

123
--5154532515249
Content-Disposition: form-data; name="comment"

what's up?
--5154532515249
Content-Disposition: form-data; name="foobar"

456
--5154532515249--


The real difficulty comes in parsing a multipart/form-data (m/fd) POST query, because different browsers behave in different ways. This is A Bad Thing, since specifications are supposed to be adhered to. But we will have to make do, and get around special cases whenever possible. Some of the discrepancies are:
  • MSIE 3.01 and 3.02 on the Macintosh don’t use two leading hyphens in the boundary string
  • Many browsers don’t escape " characters in the field names or filenames, which creates quite a problem when trying to get the value between quotes
  • Certain browsers (such as MSIE 5.0 for Win32) automatically remove fields with no name (like <input type="text" name="">)

As for our quoting problem, no work-around seems to exist. Imagine the following HTML tag:

<input type="text" name='something"; filename="foo.bar'>

A multipart/form-data POST query would contain

Content-Disposition: form-data; name="something"; filename="foo.bar"

We just fooled a query parser into thinking that there’s a file upload present. That’s not very nice of us, but then, the browser isn’t too smart. Sadly, this has no workaround. We can only hope that browsers start to escape their quotes — and we will make a pseudo-browser that does this, later. For now, let’s look at the code:

# %kv_pairs = complex_post_query($strip)

sub complex_post_query {
    my ($strip) = @_;
    my ($CRLF,$boundary,%kv);

    # different OSs define \r and \n differently
    # so adjust to make sure we get the line-ending right
    $CRLF = $^O =~ /VMS/i  ? "\n" :       # VMS
            "\t" ne "\011" ? "\r\n" :     # EBCDIC (non-ASCII)
                             "\015\012";  # others

    # for reading from STDIN
    local $/ = $CRLF;

    # for reading binary data on sensitive OSs
    binmode STDIN if $^O =~ /WIN|VMS|OS2/i;

    # Mac MSIE 3.01/3.02 doesn't put '--' at
    # the beginning of the boundary string
    # (so says Lincoln Stein in CGI.pm)
    chomp($boundary = <STDIN>);
    $boundary =~ s/^--// if
        $ENV{HTTP_USER_AGENT} =~ /MSIE\s+3\.0[12];\s*Mac/;

    FORM_DATA:
    while (1) {
        my (%hd,$header,$value,$param,$filename,$skip);

        # parse headers
        while (<STDIN>) {
            chomp;
            last if /^$/;

            # header continuation (see RFC 822 3.4.8
            # on wrapping long header lines)
            $hd{$header} .= $_, next if s/^\s+/ /;

            ($header,$value) = split /:\s+/, $_, 2;

            # change Content-type to CONTENT_TYPE
            ($header = uc $header) =~ tr/-/_/;

            $hd{$header} = $value;
        }

        # avoid quotes in the fieldname and filename, PLEASE
        $hd{CONTENT_DISPOSITION} =~
            / name=(?:"([^\\"]*(?:\\.[^\\"]*)*)"|([^\s;]*))/i and
            $param = $+;
        $hd{CONTENT_DISPOSITION} =~
            / filename=(?:"([^\\"]*(?:\\.[^\\"]*)*)"|([^\s;]*))/i and
            $filename = $+;

        # some versions of MSIE do this automatically :(
        $skip = 1 if $strip and (
            $param eq "" or
            (defined $filename and $filename eq "")
        );

        $kv{$param} = { HEADERS => \%hd } unless $skip;

        # file upload not supported (yet)
        next if defined $filename;

        # here's the actual data
        while (<STDIN>) {
            chomp;

            # go to next form element
            $_ eq $boundary and next FORM_DATA;

            # done with form processing
            $_ eq "$boundary--" and last FORM_DATA;

            # if we don't care about this element
            next if $skip;

            if (not exists $kv{$param}{VALUE}) {
                $kv{$param}{VALUE} = $_;
            }
            elsif (not ref $kv{$param}{VALUE}) {
                $kv{$param}{VALUE} = [ $kv{$param}{VALUE}, $_ ];
            }
            else {
                push @{ $kv{$param}{VALUE} }, $_;
            }
        }
    }

    return %kv;
}


You’ll notice I go to great lengths to find the name and filename fields in the header. Let’s examine this regular expression:

(?:
  "              Match a "
  (              Save to $1
    [^\\"]*      0 or more non-\ and non-" chars
    (?:
      \\.        \ followed by a char
      [^\\"]*    0 or more non-\ and non-" chars
    )*           This group 0 or more times
  )              End of $1
  "              Followed by a "
  |              OR...
  (              Save to $2
    [^\s;]*      0 or more non-whitespace and non-; chars
  )              End of $2
)

We store $+ to the variable, which is the last parenthesized pattern matched, so it’s a faster way of saying:

$param = defined $1 ? $1 : $2;
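
A small check that the two spellings agree, using Content-Disposition values made up for the purpose (one quoted, one unquoted):

```perl
use strict;
use warnings;

# Field-name pattern: quoted value in $1, unquoted value in $2
my $name_re = qr/ name=(?:"([^\\"]*(?:\\.[^\\"]*)*)"|([^\s;]*))/i;

my @results;
for my $header ('form-data; name="comment"',
                'form-data; name=foobar; filename=x.txt') {
    next unless $header =~ $name_re;

    my $via_plus  = $+;                      # last group that matched
    my $via_tests = defined $1 ? $1 : $2;    # the longhand version

    push @results, [ $via_plus, $via_tests ];
    print "$via_plus / $via_tests\n";
}
```

In the first header only $1 is set, in the second only $2, and $+ picks up whichever one took part in the match.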

This regex will break if the field name or filename has a " in it, but common sense should dictate that you don’t do that anyway.

A File-Upload POST Query
CONTENT_LENGTH => 166
CONTENT_TYPE => multipart/form-data; boundary=xyzzy

--xyzzy
Content-Disposition: form-data; name="to_save"; filename="c:\me.html"
Content-Type: text/html

<html>
<body>
Not Much Here
</body>
</html>
--xyzzy--


This is not complex, really, but it involves a bit more care than the normal fields. The main point is “what do we do with the file that is uploaded?” I suggest we make a temporary file, put the data in it, and then return the name of the temporary file. This is only for the first draft of this function. I will show you later how to turn this into a more stylish solution.

One potential problem is the cleaning up of the temporary files when the program ends. We’ll see how to do that later.
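
One possible way to handle the cleanup, offered as a sketch rather than as the approach the rest of this guide takes, is to let the core File::Temp module create the file and remove it at program exit. The save_upload name, the in-memory input handle, and the sample body are assumptions made so the example runs outside a CGI environment.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: copy lines from $in to a temp file until the
# boundary line is seen, and return the temp file's name.
sub save_upload {
    my ($in, $stop) = @_;

    # UNLINK => 1 asks File::Temp to delete the file when the program exits
    my ($out, $tmpfile) = tempfile('CGI-temp-XXXXXX', TMPDIR => 1, UNLINK => 1);
    binmode $out;

    while (my $line = <$in>) {
        last if $line eq $stop . $/ or $line eq $stop . '--' . $/;
        print {$out} $line;
    }
    close $out or return;

    return $tmpfile;
}

# Simulate the request body with an in-memory filehandle
my $body = "<html>\nNot Much Here\n</html>\n--xyzzy--\n";
open my $in, '<', \$body or die "can't open in-memory handle: $!";

my $tmpfile = save_upload($in, '--xyzzy');
print "saved upload to $tmpfile\n";
```

File::Temp also picks a temporary directory suited to the platform, which sidesteps a hard-coded /tmp path.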

# file_upload(\%data, $filename, $boundary)

sub file_upload {
    my ($data,$file,$stop) = @_;

    # XXX: Unix-specific -- this needs fixing
    my $tmpfile = "/tmp/CGI-temp-$$-" . time;

    $data->{NAME} = $file;

    open TMPFILE, "> $tmpfile" or warn("can't save to $tmpfile: $!"), return;
    binmode TMPFILE if $^O =~ /WIN|VMS|OS2/i;
    while (<STDIN>) {
        last if $_ eq "$stop$/" or $_ eq "$stop--$/";
        print TMPFILE;
    }
    close TMPFILE;

    $data->{FILENAME} = $tmpfile;
}