Counting Queries: PostgreSQL Analysis

There are a number of ways to get your database-backed Web application to run faster on the same hardware, allowing you to postpone upgrades for at least a little while and thus cutting costs. One way involves examining how your applications are interacting with the database. Tom Copeland explains how to “tune the queries” for a PostgreSQL database.

Everyone wants their database-backed Web application to run faster on the same hardware – if a software tweak can postpone a hardware upgrade for a year, it’s usually well worth it. One way to improve performance is to examine how the application is interacting with the database, and see if there’s anything that we can do to speed things up. We’ll take a look at a handy script that can examine PostgreSQL logs and let you know what queries are getting run the most.

It’s worth a moment to think of the various types of optimization that you can do on a database. There’s tuning the database server itself – for PostgreSQL this means tweaking the shared memory parameters and write-ahead log sizes and such in the PostgreSQL configuration file. There’s tuning the operating system on which the database is running – making sure you’ve got enough threads to prevent the server from starving, making sure you’ve got a proper filesystem selected, and so forth. We’re going to tune things at a slightly higher level – the database queries themselves.
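For reference, the server-level settings mentioned above live in postgresql.conf and look something like this; the values here are purely illustrative and would need to be sized to your own hardware and workload:

shared_buffers = 4096         # shared memory buffers, 8 KB each (illustrative value only)
sort_mem = 8192               # per-sort memory in KB (illustrative value only)
wal_buffers = 8               # write-ahead log buffers (illustrative value only)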

Naturally, “tuning the queries” can be further broken down as well. One option is to work on optimizing queries that will help make specific parts of your application run faster. For example, if you notice that bringing up a customer detail screen takes a long time – whatever “long” means in your environment – it might be worth seeing what queries are being run to bring up the information. If the application is doing twenty different queries to get the data for that screen, perhaps the queries can be combined, or perhaps an expensive table scan is being done on a large but infrequently updated table, and an index could be added somewhere.
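To make the index idea concrete, here is a hypothetical sketch; the orders table and customer_id column are invented for the example and are not part of GForge:

-- Hypothetical: speed up repeated lookups of a single customer's orders
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- EXPLAIN shows whether the planner now uses the index instead of a table scan
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;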

Another type of optimization is a sort of general optimization that you can do by seeing what queries are getting run the most and working on those. This doesn’t have the immediate satisfaction of seeing a screen display faster, but it’s not a bad idea if you’ve got some time and you want to do some future load-proofing. That’s the sort of optimization we’re going to focus on here.

As background information, I help develop GForge (http://gforge.org/), an Open Source tool for collaborative software development that allows you to organize and manage any number of projects. I also administer a fairly active GForge installation called RubyForge (http://rubyforge.org/); it hosts 170 projects and over 500 users, and it’s backed by a PostgreSQL database that executes about 60,000 queries per day. Thus, I’ve got a vested interest in helping GForge run efficiently on commodity hardware. There are many other sites running GForge as well; you can find a list of them here (http://gforge.org/docman/view.php/1/52/gforge-sites.html). Several of these installations are much larger than mine, and general optimizations will help them too.

{mospagebreak title=How to get the data}

Now that we’ve narrowed our focus a bit, we need to figure out how to get some data to analyze. A list of SQL queries that the database has executed is a good place to start, and fortunately PostgreSQL provides an easy way to get that. Note that I’m running PostgreSQL 7.3 installed from an RPM on a RedHat Linux machine – your PostgreSQL installation may be different if you’re on another platform, but hopefully the principles will be the same.

To turn on query logging, edit your postgresql.conf – mine is at the path /var/lib/pgsql/data/postgresql.conf. We’ll need to set the following options:

log_statement = true
syslog = 2
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'

You’ll need to not only uncomment the log_statement option, but also set it to “true”. There are some other logging options like log_duration that can supply more information, but we’ll leave those for later. Now, tell PostgreSQL to reload the configuration file by issuing a “reload” command:

[root@rubyforge tom]# /etc/rc.d/init.d/postgresql reload

Click around your application a bit and PostgreSQL will start sending data to syslog. You can ensure this is happening by checking the last few lines of the syslog file:

[root@rubyforge tom]# tail -5 /var/log/messages
Mar  4 16:40:06 rubyforge postgres[24574]: [78] LOG:  query: SELECT * FROM groups WHERE group_id='1'
Mar  4 16:40:06 rubyforge postgres[24574]: [79-1] LOG:  query: SELECT * FROM user_group
Mar  4 16:40:06 rubyforge postgres[24574]: [79-2] ^I^I^IWHERE user_id='102'
Mar  4 16:40:06 rubyforge postgres[24574]: [79-3] ^I^I^IAND group_id='1'
Mar  4 16:40:06 rubyforge postgres[24574]: [80] LOG:  query: BEGIN;ROLLBACK;
[root@rubyforge tom]#

Yep, looks like we’re collecting data. Now we need to leave it running for a while to collect “live” data. This is a bit of a balancing act – we don’t want to let it run too long, since it’s logging a lot of data and that slows things down, but at the same time we need to collect a fair bit of data to ensure we have useful results. In my case, I let the logging run for 4 hours, during which it collected about 2.5 MB of data representing around 10,000 SQL queries.

When you’ve got enough data, you can turn off logging by editing your postgresql.conf and commenting out the changes you made earlier:

#log_statement = true
#syslog = 2
#syslog_facility = 'LOCAL0'
#syslog_ident = 'postgres'

After you’ve commented those lines out, run the “reload” command again: /etc/rc.d/init.d/postgresql reload. Of course, if you’re going to do this a few times, it’s worth making both a postgresql.conf.nologging and a postgresql.conf.logging file and copying the appropriate one on top of the active postgresql.conf, as shown below.
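With those two files in place, switching logging on (or off, with the .nologging file) becomes just a copy and a reload:

[root@rubyforge tom]# cp /var/lib/pgsql/data/postgresql.conf.logging /var/lib/pgsql/data/postgresql.conf
[root@rubyforge tom]# /etc/rc.d/init.d/postgresql reload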

{mospagebreak title=Analyze the data} 

Now you’ve got a pile of data in your syslog file. I usually run the SQL analysis on another machine, so I’ll tar up the syslog file and copy it to my workstation:

[root@rubyforge tmp]# tar -zcf sql_statement_data.tar.gz /var/log/messages
[root@rubyforge tmp]# scp sql_statement_data.tar.gz tom@hal:

Of course, you can run the analysis on the database server as well; just do whatever’s appropriate for your environment.

The next step is to actually analyze the data and see what we find. We could just scroll through the raw data, but that would take forever. So I wrote a little Ruby (http://ruby-lang.org/) library called PostgreSQL Query Analyzer (PQA) (http://rubyforge.org/projects/pqa/) to parse the data file and analyze the queries. If you’re not familiar with Ruby, it’s an object-oriented scripting language that’s great for various system administration and data manipulation tasks. To run PQA, you’ll need to have Ruby installed.

On to the results. Here’s a sample run showing the top 8 queries:

[tom@hal pqa]$ ./pqa.rb -top 8 -file overnight_query_data.txt
10077 queries (2826 unique) parsed in 1.984141 seconds
8 most frequent queries
931 times:  BEGIN;ROLLBACK;
780 times:  SELECT * FROM plugins
574 times:  SELECT language_code FROM supported_languages WHERE language_id='1'
574 times:  SELECT language_id FROM supported_languages WHERE classname='English'
295 times:  select * from supported_languages where language_code = 'en'
162 times:  select * from supported_languages where language_code = 'en-us'
85 times:  SELECT total FROM forum_group_list_vw WHERE group_forum_id='721'
[tom@hal pqa]$

Already we see something very interesting – the supported_languages table is queried the same number of times by two rather similar queries. This is an indication that there might be a sequence of two queries that could be combined into one, which is a big win – we’ll eliminate five percent of the total number of queries in one fell swoop. Looks like our analysis efforts have been repaid already. And if we were to look at the top 30 or 40 queries, we could probably find some more instances of this sort of thing.
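Assuming the application is first looking up the language_id by classname and then fetching the language_code for that id, the pair could presumably be collapsed into a single query along these lines:

SELECT language_id, language_code FROM supported_languages WHERE classname='English'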

We can also see that we’re selecting the entire contents of the plugins table. Is there a way we can add some constraints to the query to limit the results? Is that moving a lot of data around? In this specific case it isn’t a problem, since the plugins table is very small. Still, this is a good thing to be aware of if we start adding plugins; this query may be one to watch.
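For instance, if a given page only needs a single plugin’s row, a constrained query might look like the following; the plugin_name column is hypothetical, since we haven’t looked at the actual plugins table definition:

SELECT * FROM plugins WHERE plugin_name='tracker'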

{mospagebreak title=Normalize the queries}

But let’s not stop there. Let’s see what happens if we “normalize” the SQL queries. By that I mean we remove all the literal values, so that SELECT language_id FROM supported_languages WHERE classname='English' gets turned into SELECT language_id FROM supported_languages WHERE classname=''. That way we can get an even better picture of what queries it might be helpful to rewrite as stored procedures. Let’s see:

[tom@hal pqa]$ ./pqa.rb -top 8 -normalize -file overnight_query_data.txt
10077 queries (260 unique) parsed in 1.977645 seconds
8 most frequent queries
931 times:  BEGIN;ROLLBACK;
780 times:  SELECT * FROM plugins
780 times:  INSERT INTO activity_log (day,hour,group_id,browser,ver,platform,time,page,type) VALUES (20040225,'','','','','','','','');
688 times:  SELECT language_id FROM supported_languages WHERE classname=''
688 times:  SELECT language_code FROM supported_languages WHERE language_id=''
644 times:  select * from supported_languages where language_code = ''
634 times:  SELECT total FROM forum_group_list_vw WHERE group_forum_id=''
[tom@hal pqa]$

Note that the number of unique queries has been reduced from 2826 to 260 – normalizing them shows that there really aren’t that many unique queries. Now we can also see that combining those two supported_languages queries will save us even more than we thought – we’ll be able to eliminate almost 7% of our queries. Obviously that activity_log query is a frequently used one – but with differing values each time, which is why it only showed up when we normalized. It might be a candidate for a stored procedure.
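As a rough sketch of what that could look like, here is a plain SQL function wrapping the insert. The parameter types are guesses based only on the column names visible in the log, so the real GForge schema may well differ:

-- Column types below are assumptions, not taken from the GForge schema
CREATE FUNCTION log_activity(integer, integer, integer, varchar, varchar, varchar, integer, varchar, integer)
RETURNS integer AS '
    INSERT INTO activity_log (day, hour, group_id, browser, ver, platform, time, page, type)
        VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9);
    SELECT 1;
' LANGUAGE SQL;

-- The application would then issue one short call per page view, for example:
SELECT log_activity(20040225, 16, 1, 'Mozilla', '1.6', 'Linux', 1078432806, '/index.php', 0);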

We’ve discussed a couple different types of optimizations and when they are useful. We’ve poked around the PostgreSQL database configuration file and learned how to log SQL statements. We’ve seen how we can use the open source PostgreSQL Query Analyzer (http://rubyforge.org/projects/pqa/) utility to analyze the data and report useful results, and we’ve done a bit of reasoning about the results to see what queries might actually be worthy of rewriting. Also, development on PQA will continue – drop by the project site if you have any suggestions, bug reports, or comments. Thanks!

Many thanks to the folks on the #ruby-lang Internet Relay Chat (IRC) channel for helping me figure out some regular expressions that PQA uses. Thanks also to the folks on the #pgsql channel for a philosophical discussion on general optimization techniques. Finally, thanks to my co-workers Joe Coffman, Dave Craine, and Rich Kilmer, who helped me work through various technical issues in the development of PQA.
