SQL Performance and Tuning Considerations

This article, the first of two parts, will show you how to make the SQL statements you write run faster and more efficiently. It is excerpted from chapter 11 of the book SQL DeMYSTiFied, written by Andy Oppel (McGraw-Hill/Osborne, 2005; ISBN: 0072262249).

This final chapter covers performance and tuning considerations for making the SQL statements you write run faster and more efficiently. First, we’ll have a look at general things that can be done at the DBMS level, followed by guidelines for writing efficient SQL. The many differences across RDBMSs (even different versions from the same vendor) make universal solutions to performance issues difficult to develop. What works well in one DBMS often has a lesser, or even negative, effect on another. While I have made every attempt to include guidelines in this chapter that are universally applicable, I have also included some of the most important vendor-specific guidelines. In all cases, be sure to check your RDBMS documentation for current information.

Any seasoned DBA (Database Administrator) will tell you that database performance tuning is a never-ending task. It seems there is always something that can be tweaked to make the database run better. The key to success is managing your time and the expectations of the database users, and setting the performance requirements for an application before it is even written. Simple statements such as “every database update must complete within four seconds” are usually the best. With that done, performance tuning becomes a simple matter of looking for things that do not conform to the performance requirement and tuning them until they do. The law of diminishing returns applies to database tuning, and you can put a lot of effort into tuning a database process for little or no gain. The beauty of having a standard performance requirement is that you can stop when the process meets the requirement and then move on to the next problem.

General RDBMS Tuning Considerations

Most RDBMS performance problems are the result of poorly written SQL statements, but there are things that the DBA can do to make the entire database system run more efficiently. So, before we move on to tuning SQL queries, the sections that follow offer some general considerations for tuning the RDBMS.

Minimize Disk Reads and Writes

This may seem obvious, but the slowest operation on most computer systems is I/O (input/output), such as reading data from and writing data to the storage system (the hard disks). Moving data around in memory and performing various calculations and transformations on the data are both much faster than an I/O operation. So why wouldn’t you just put the entire database in memory and keep it there? Actually, that may happen someday. However, for now, memory is too expensive to be a complete replacement for disks, and most memory is volatile, meaning that the data held in it is lost when the computer system is shut down. There is nonvolatile memory, but it costs considerably more. Therefore, the best thing the DBA can do is to make efficient use of the available memory to minimize I/O operations and the time spent waiting for them to complete. Here are some ways to do that:

  • Allocate correctly sized buffers. You may recall that a buffer is an area of memory used to hold data that has been recently read from or is destined to be written to the storage system. With properly sized buffers, recently read data will stay in memory for a reasonable time, improving the probability that a new query will find the data it needs already sitting in a buffer. On the output side, if there is room in the correct buffer, the RDBMS can write the data to the buffer and copy the buffer from memory to the storage system for permanent storage at a later time. This is called asynchronous I/O, which is fundamental to an efficient RDBMS.
  • Spread out disk I/O. If you have multiple physical disk drives, spread the database files out to allow for parallel I/O. (Multiple partitions on a personal computer hard drive don’t count because those partitions are all on the same physical drive.) In general, a disk drive can access only one spot on the drive at a time, so multiple I/O operations on a single drive must be handled serially. By spreading files out, multiple I/O operations can happen at the same time (one per drive at any instant).
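
As a concrete sketch of the two techniques just listed, here is what they might look like in two products. The buffer pool size, tablespace name, and file path shown are hypothetical, and the equivalent settings go by different names in other DBMSs, so treat this as an illustration rather than a recipe:

    -- MySQL: size the InnoDB buffer pool (a hypothetical 2GB value shown).
    -- In older versions this can only be set in the configuration file
    -- and requires a server restart.
    SET GLOBAL innodb_buffer_pool_size = 2147483648;

    -- Oracle: place a new tablespace's data file on a second physical
    -- drive (hypothetical path) so its I/O can proceed in parallel with
    -- I/O against files stored on other drives.
    CREATE TABLESPACE order_data
        DATAFILE '/disk2/oradata/order_data01.dbf' SIZE 500M;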

Tune the Computer System and Environment

It should be obvious that the computer system on which the DBMS runs should be as fast and efficient as possible. Here are some general considerations:

  • Select fast and reliable hardware. The faster the hardware on your computer system, particularly disk drives and processors, the faster your database queries will run.
  • Tune the operating system. While there is a lot of variation across vendors and operating system versions, there are usually some things that can be done to improve performance. For example, in Microsoft Windows the size of the swap file can have a significant effect on operating system performance. Consult a tuning manual or white paper on your operating system for specifics.

Design the Tables Efficiently

Relational table design can have a huge effect on performance. Most popular RDBMSs have manuals or white papers that cover best practices for designing efficient tables. Here are some general guidelines that apply to all SQL implementations:

  • CHAR vs. VARCHAR  For character columns that are five characters or less, the CHAR data type is most appropriate, even if the data length varies. The VARCHAR data type is better for variable-length columns that are usually longer than five characters. For columns that always contain data of the same length, CHAR is an obvious choice. The reason for using CHAR for short columns is that VARCHAR columns include 1 to 3 bytes to hold the length of the column data, and there is additional processing overhead for calculating the length when the column data changes. (A table definition illustrating this guideline and the next two appears after this list.)
  • Numeric column data types  Use the smallest data type in which the data will fit. A lot of space is wasted if, for example, the BIGINT data type is used when the data would always fit in INT or SMALLINT.
  • Identical data types for primary and foreign key columns  Joins work much more efficiently if the columns used in the join predicate (usually the primary key in one table and a foreign key in the other table) have identical data types. When the data types are different, the DBMS has to convert one of them so that the data values can be compared correctly, and such conversion amounts to unnecessary overhead.
  • Beware of triggers  Triggers placed on database tables are sometimes convenient and/or necessary ways to solve specific problems such as enforcing complicated constraints. However, the work that the trigger does is part of the unit of work of the SQL statement that modifies the table (the statement that caused the trigger to fire), and therefore, it slows down that SQL statement.
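
To illustrate the first three guidelines above, here is a sketch of a pair of table definitions in generic SQL; the table and column names are invented for the example:

    CREATE TABLE customer (
        customer_id    INT          NOT NULL PRIMARY KEY,
        state_code     CHAR(2)      NOT NULL,  -- always exactly two characters
        name           VARCHAR(50)  NOT NULL,  -- variable, usually well over five
        loyalty_level  SMALLINT     NOT NULL   -- small range; no need for INT or BIGINT
    );

    CREATE TABLE customer_order (
        order_id     INT NOT NULL PRIMARY KEY,
        customer_id  INT NOT NULL REFERENCES customer (customer_id)
        -- identical data type to customer.customer_id, so joins between
        -- the two tables require no conversion
    );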

Tuning SQL Queries

About 80 percent of database query performance problems can be solved by adjusting the SQL statement. However, you must understand how the particular DBMS being used processes SQL statements in order to know what to tweak. For example, placing SQL statements inside stored procedures can yield remarkable performance improvement in Microsoft SQL Server and Sybase, but the same is not true in Oracle.
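
For example, here is a minimal Transact-SQL sketch of the stored procedure approach for SQL Server or Sybase; the procedure, table, and column names are hypothetical:

    -- The plan for the SELECT is compiled once and reused on later calls.
    CREATE PROCEDURE get_customer_orders
        @customer_id INT
    AS
        SELECT order_id, order_date, total_amount
        FROM   customer_order
        WHERE  customer_id = @customer_id;

    -- Invoked as:
    EXEC get_customer_orders @customer_id = 1001;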

A query execution plan is a description of how an RDBMS will process a particular query, including index usage, join logic, and estimated resource cost. It is important to learn how to use the “explain plan” utility in your DBMS, if one is available, because it will show you exactly how the DBMS will process the SQL statement you are attempting to tune. Examples of obtaining execution plans for Oracle and Microsoft SQL Server databases appear later in this chapter.
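
As a preview of those examples, here is one common pattern for requesting a plan in each of the two products, run against a hypothetical CUSTOMER table:

    -- Oracle: store the plan, then format and display it.
    EXPLAIN PLAN FOR
        SELECT * FROM customer WHERE state_code = 'NY';
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

    -- Microsoft SQL Server: show the plan instead of running the query.
    SET SHOWPLAN_TEXT ON;
    GO
    SELECT * FROM customer WHERE state_code = 'NY';
    GO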

The next section covers some general query design and tuning tips for SQL, followed by sections that offer recommendations for several of the most popular RDBMS products currently on the market.

General RDBMS Considerations

This section covers design and tuning considerations that apply to most SQL implementations. As always, consult the documentation for the DBMS you are using for applicability information.

Know Your Optimizer

The query optimizer is the software component in the RDBMS that analyzes an SQL statement to determine the best way to execute it. Most modern optimizers are cost-based, which means that they estimate the cost of all possible ways to execute a statement and then choose the one with the lowest cost. An important component of cost-based optimizers is statistics gathered from the database, such as the number of rows in each table and the number of unique values each indexed column has. However, some optimizers are rule-based, which means that they apply a set of rules to the SQL statement instead of cost estimates when deciding how to best execute it. A few optimizers, such as Oracle’s, allow a choice between cost-based and rule-based optimization. Here are some considerations regarding query optimizers:

  • Order of table names  Does the order of the table names in the FROM or JOIN clause have any influence on the order in which tables are accessed when performing the joins? This is more likely to be the case with a rule-based optimizer. Ideally, the DBMS should access the most selective table (the one that will eliminate the greatest number of rows from the result set) first. For rule-based optimizers, you may find some surprising differences. For example, Oracle version 7 processed the table name list from right to left, but this was changed to left-to-right order starting with version 8.
  • Order of search predicates  Does the order of the predicates in the WHERE clause have any influence on the order in which the predicates are evaluated? Ideally, the most restrictive predicate (the one that eliminates the greatest number of rows) should be evaluated first.
  • Lack of statistics  If a cost-based optimizer is being used, what does it do when statistics have not been collected for one or more tables? Some optimizers, such as Oracle’s, revert to rule-based optimization, while others assume default values for the required statistics or simply refuse to use any indexes and do full table scans of all the tables. In a full table scan, the DBMS reads every row in the table to find the desired ones, which of course can be a performance disaster for tables with very large numbers of rows. (Statements for refreshing statistics are sketched after this list.)
  • Query rewrites  Are queries rewritten into more efficient forms by the optimizer? For example, many optimizers automatically rewrite subselects into equivalent joins in order to simplify subsequent processing. In some cases, you may find that certain DBMS options must be enabled in order to allow the optimizer to rewrite queries.
  • View definition merges  For queries that use views, at what point does the DBMS merge the view definition (the query that defines the view) into the SQL statement submitted by the database user? This has obvious implications for the optimizer: the sooner it can evaluate the entire SQL statement, the smarter its decision should be.
  • Other criteria  What other criteria influence the optimizer? For example, some optimizers will favor unique indexes over nonunique ones, and some will favor the use of an index to sequence rows over sorting the result set.
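
Because cost-based optimizers lean so heavily on statistics, it is worth knowing how to refresh them. Here is what that looks like in three products; the schema and table names are hypothetical:

    -- Oracle: gather statistics for one table using the DBMS_STATS package.
    EXEC DBMS_STATS.GATHER_TABLE_STATS('SALES', 'CUSTOMER_ORDER');

    -- Microsoft SQL Server: update statistics for one table.
    UPDATE STATISTICS customer_order;

    -- MySQL: analyze a table's key distribution.
    ANALYZE TABLE customer_order;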

Efficient Query Design

Many application developers simply write SQL statements off the top of their heads without giving much thought to designing them for efficient processing. This is an easy trap to fall into because SQL is a nonprocedural language that gives the false impression that the form of the statement does not matter provided it produces the correct result set. Statements that have not been designed for efficiency can introduce enormous performance problems. For example, I recently reviewed an SQL statement that retrieved about six million rows of data each time it was executed. The application that issued the SQL required only the first row found. When this statement was run, the entire database seized up until the processing was complete. It felt like the lights dimmed in the building from all the resources this one query consumed. It took the developer who fixed the problem only about an hour to turn the query into a very efficient one, but a little bit of thought by the original developer would have avoided the crisis entirely. Here are some considerations regarding query design:

  • Know your data.  When writing the SQL statement, you should have some idea of how many rows are in each table, how selective your WHERE predicates are, and how many rows you expect in the result set. The larger the number of rows involved, the more time you should spend thinking about the best way to write the SQL statement.
  • Minimize returned rows.  The fewer the rows in the result set, the more efficiently the query will run.
  • Avoid scans of large tables.  For tables over 1000 rows or so, scanning all the rows in the table instead of using an index can be expensive in terms of resources required. And, of course, the larger the table, the more expensive a table scan becomes. Full table scans occur in the following situations:
    • The query does not contain a WHERE clause to limit rows.
    • None of the columns referenced in the WHERE clause matches the leading column of an index on the table.
    • Index and table statistics have not been updated. Most RDBMS query optimizers use statistics to evaluate available indexes, and without statistics, a table scan may be seen as more efficient than using an index.
    • At least one column in the WHERE clause does match the first column of an available index, but the comparison used obviates the use of an index (a rewrite example appears after this list). These cases include the following:

      • Use of the NOT operator (for example, WHERE NOT CITY = ‘New York’). In general, indexes can be used to find what is in a table, but cannot be used to find what is not in a table.
      • Use of the NOT EQUAL operator (for example, WHERE CITY <> ‘New York’).
      • Use of a wildcard in the first position of a comparison string (for example, WHERE CITY LIKE ‘%York%’).
      • Use of an SQL function in the comparison (for example, WHERE UPPER(CITY) = ‘NEW YORK’).
  • Avoid unnecessary columns.  The wider the data in each row of the result set, the more disk space and memory are required to hold the result set and to perform intermediate operations such as sorts.
  • Avoid unnecessary tables.  The fewer the tables, the more efficient the query.
  • Avoid sorts of large result sets.  Sorts are expensive, especially when the rows being sorted will not fit in memory. When the result set is expected to be very large, sorts should be avoided. Most optimizers will use an index if it can eliminate the need for a sort, but creating an index solely to avoid a sort of a large number of rows is probably not wise because of the overhead required to maintain the index.
  • Match data types in predicates.  Whether a predicate compares two column values as is done with joins, or a column value and a literal as is done when filtering rows, it is important for the data types to match. When the data types do not match, the DBMS must convert one of them before performing the comparison, and while the work to do this is relatively small, it mounts quickly when it has to be done for each row in a large table.
  • Use IN instead of OR when possible.  The IN operator can be rewritten as a JOIN, but the OR operator often requires multiple queries to be run with the results combined by the DBMS. The former is far more efficient (a rewrite example appears after this list).
  • Use GROUP BY instead of DISTINCT.  In most DBMSs, a GROUP BY is a more efficient way to eliminate duplicate rows compared with the DISTINCT keyword. The reason for this is that a GROUP BY invokes the sort required to find the duplicates earlier in the processing of the query, while a DISTINCT applies the sort as the very last step (applied to the final result set). The sooner the duplicate rows are eliminated, the more efficiently the remainder of the processing on that result set can be performed.
  • Use hints if you must.  Hints are special syntax that can be placed in the SQL statement to direct the optimizer to take certain actions, such as forcing the use of a particular index or a particular method to join tables. While this can be a very attractive option, it should only be used as a last resort because hints are not portable across database vendors, and they sometimes stop working when the DBMS software is upgraded to a newer version. The Oracle, MySQL, and Microsoft SQL Server optimizers all respond to hints, but the syntax accepted by each is different.
  • Temporary tables may help.  Temporary tables can help in some situations, such as assembling large result sets and then adding an index or two to support multiple subsequent queries against the temporary table. However, remember that you’re doubling up the reads and writes when you do this because all the data selected from the original (base) tables must be written to the temporary table(s) and then read back from there. In short, there are no free lunches.
  • Views may help.  Views can help because they hide complex operations such as nested aggregate functions. And with DBMSs that don’t have an SQL statement cache, views may process more efficiently than ordinary queries because the SQL statement that defines the view has already been parsed and optimized, which means this work does not have to be done every time the view is accessed. But above all, remember that views are also SQL queries, so they are subject to all the tuning considerations you apply to any SQL statement.
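
To make the index-defeating comparisons listed earlier more concrete, here is a sketch of one common rewrite. It assumes a hypothetical CUSTOMER table with an index on the CITY column; the first form cannot use that index, while the second can:

    -- Defeats the index: the function must be applied to CITY in every row.
    SELECT customer_id FROM customer WHERE UPPER(city) = 'NEW YORK';

    -- Can use the index, provided the stored data is consistently cased.
    SELECT customer_id FROM customer WHERE city = 'New York';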
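
Likewise, the IN-versus-OR and GROUP BY-versus-DISTINCT guidelines translate into rewrites like these, shown against the same hypothetical table:

    -- Often less efficient: the DBMS may process the ORs as separate passes.
    SELECT customer_id FROM customer
    WHERE  city = 'New York' OR city = 'Boston' OR city = 'Chicago';

    -- Preferred: a single IN list.
    SELECT customer_id FROM customer
    WHERE  city IN ('New York', 'Boston', 'Chicago');

    -- Eliminating duplicates with GROUP BY rather than DISTINCT.
    SELECT city FROM customer GROUP BY city;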

Use Indexes Wisely

Indexes can greatly improve data access times. However, always keep in mind that indexes take up storage space and they have to be maintained. Here are some considerations related to the use of indexes to improve query performance:

  • Avoid indexes on frequently updated columns. Creating an index on a column that is frequently updated doubles the amount of write activity required when the column is updated. Always remember that when column data is updated, the DBMS must also update any indexes that include that column.
  • Create only selective indexes.  Index selectivity is the ratio of the number of distinct values a column has to the number of rows in a table. For example, if a table has 1000 rows and a column has 800 distinct values, the selectivity of the index is 0.8, which is considered good. However, a column such as gender that has only two distinct values (M and F) has very poor selectivity (0.002 in this case). Unique indexes always have a selectivity ratio of 1.0, which is the best possible. A good rule of thumb is to avoid indexes with a selectivity of less than 0.33 unless they are indexes especially designed for low selectivity, such as bitmap indexes. (A query for measuring selectivity appears after this list.)
  • Foreign key indexes improve joins.  With most optimizers, an index on a foreign key column greatly improves join performance, and it can enable additional join methods for the optimizer to use.
  • Index columns frequently used in predicates. For large tables, every query should contain a WHERE predicate that references an indexed column. Therefore, it is best to find the columns that are most frequently referenced by predicates and to index them.
  • Don’t overindex.  As a rule of thumb, don’t create more than three or four indexes for any table. As already stated, indexes take up storage and must be maintained. Too many indexes on a table can cripple the performance of an INSERT or UPDATE issued against that table.
  • Avoid overlapping indexes.  Nearly every RDBMS can use an index even when the WHERE predicate references only the first column of the index. Therefore, overlapping indexes (those that have the same leading column) are redundant and unnecessary.
  • Consider unique indexes.  With some RDBMSs, such as DB2, unique indexes are so superior that DBAs often add otherwise unnecessary columns to an index just to make it unique.
  • Drop indexes for bulk loads.  For mass loads of data into a table, consider dropping some of the indexes and re-creating them after the load is complete. This can save a substantial amount of time in some DBMSs.
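
Several of these guidelines lend themselves to short sketches. The table and index names below are hypothetical, and note that the exact DROP INDEX syntax varies by product:

    -- Measure a candidate column's selectivity before indexing it.
    SELECT COUNT(DISTINCT city) * 1.0 / COUNT(*) AS selectivity
    FROM   customer;

    -- Index a foreign key column to speed up joins to its parent table.
    CREATE INDEX ix_order_customer ON customer_order (customer_id);

    -- Bulk-load pattern: drop the index, load, then re-create it.
    DROP INDEX ix_order_customer;
    -- ...perform the mass load here...
    CREATE INDEX ix_order_customer ON customer_order (customer_id);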

MySQL Considerations

Here are a few tuning considerations that are particular to the MySQL DBMS:

  • Storage engine  MySQL has a unique feature that provides multiple storage engines, one of which may be selected for each new table. These storage engines include MyISAM (a replacement for the original ISAM), HEAP, MERGE, InnoDB, and BDB (Berkeley DB). The differences are too involved to explain here, but there are several chapters of the MySQL Reference Manual devoted to them. Suffice it to say that the choice of storage engine has a profound effect on the optimizer and the performance of SQL statements issued against the table.
  • Hash indexes  MySQL supports hash indexes, where the data values are sent through a hashing algorithm before being added to the index. While this technique scatters sequentially assigned values so that new rows of data end up more uniformly spread out in the index, hash indexes have the disadvantage of not being useful as a replacement for sorts because the index entries are not in key sequence. (A sketch showing both of these features appears after this list.)
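
As an illustration of both features, here is a MySQL table definition that selects a storage engine and a hash index; the table and column names are hypothetical:

    CREATE TABLE session_cache (
        session_id  CHAR(32) NOT NULL,
        user_id     INT      NOT NULL,
        PRIMARY KEY (session_id) USING HASH  -- hash index: fast equality
                                             -- lookups, but no help with
                                             -- ranges or sorts
    ) ENGINE = MEMORY;                       -- MEMORY is the HEAP engine's
                                             -- current name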

Please check back next week for the conclusion of this article.
