HomeMySQL Page 4 - SQL Performance and Tuning Considerations
Efficient Query Design - MySQL
This article, the first of two parts, will show you how to make the SQL statements you write run faster and more efficiently. It is excerpted from chapter 11 of the book SQL DeMYSTiFied, written by Andy Oppel (McGraw-Hill/Osborne, 2005; ISBN: 0072262249).
Many application developers simply write SQL statements off the top of their heads without giving much thought to designing them for efﬁcient processing. This is an easy trap to fall into because SQL is a nonprocedural language that gives the false impression that the form of the statement does not matter provided it produces the correct result set. Statements that are not designed can introduce enormous performance issues. For example, I recently reviewed an SQL statement that retrieved about six million rows of data each time it was executed. The application that issued the SQL only required the ﬁrst row found. When this statement was run, the entire database seized until the processing was complete. It felt like the lights dimmed in the building from all the resources this one query consumed. It took the developer who ﬁxed the problem only about an hour to turn the query into a very efﬁcient one, but a little bit of thought by the original developer would have avoided the crisis entirely. Here are some considerations regarding query design:
Know your data. When writing the SQL statement, you should have some idea of how many rows are in each table, how selective your WHERE predicates are, and how many rows you expect in the result set. The larger the number of rows involved, the more time you should spend thinking about the best way to write the SQL statement.
Minimize returned rows. The fewer the rows in the result set, the more efﬁciently the query will run.
Avoid scans of large tables. For tables over 1000 rows or so, scanning all the rows in the table instead of using an index can be expensive in terms of resources required. And, of course, the larger the table, the more expensive a table scan becomes. Full table scans occur in the following situations:
The query does not contain a WHERE clause to limit rows.
None of the columns referenced in the WHERE clause matches the leading column of an index on the table.
Index and table statistics have not been updated. Most RDBMS query optimizers use statistics to evaluate available indexes, and without statistics, a table scan may be seen as more efﬁcient than using an index.
At least one column in the WHERE clause does match the ﬁrst column of an available index, but the comparison used obviates the use of an index. These cases include the following:
Use of the NOT operator (for example, WHERE NOT CITY = 'New York'). In general, indexes can be used to ﬁnd what is in a table, but cannot be used to ﬁnd what is not in a table.
Use of the NOT EQUAL operator (for example, WHERE CITY <> 'New York').
Use of a wildcard in the ﬁrst position of a comparison string (for example, WHERE CITY LIKE '%York%').
Use of an SQL function in the comparison (for example, WHERE UPPER(CITY) = 'NEW YORK').
Avoid unnecessary columns. The wider the data in each row in the result set, the more disk space and memory that is required for intermediate operations such as sorts and to hold the result set.
Avoid unnecessary tables. The fewer the tables, the more efﬁcient the query.
Avoid sorts of large result sets. Sorts are expensive, especially when the rows being sorted will not ﬁt in memory. When the result set is expected to be very large, sorts should be avoided. Most optimizers will use an index if it can eliminate the need for a sort, but creating an index solely to avoid a sort of a large number of rows is probably not wise because of the overhead required to maintain the index.
Match data types in predicates. Whether a predicate compares two column values as is done with joins, or a column value and a literal as is done when ﬁltering rows, it is important for the data types to match. When the data types do not match, the DBMS must convert one of them before performing the comparison, and while the work to do this is relatively small, it mounts quickly when it has to be done for each row in a large table.
Use IN instead of OR when possible. The IN operator can be rewritten as a JOIN, but the OR operator often requires multiple queries to be run with the results combined by the DBMS. The former is far more efﬁcient.
Use GROUP BY instead of DISTINCT. In most DBMSs, a GROUP BY is a more efﬁcient way to eliminate duplicate rows compared with the DISTINCT keyword. The reason for this is that a GROUP BY invokes the sort required to ﬁnd the duplicates earlier in the processing of the query, while a DISTINCT applies the sort as the very last step (applied to the ﬁnal result set). The sooner the duplicate rows are eliminated, the more efﬁciently the remainder of the processing on that result set can be performed.
Use hints if you must. Hints are special syntax that can be placed in the SQL statement to direct the optimizer to take certain actions, such as forcing the use of a particular index or a particular method to join tables. While this can be a very attractive option, it should only be used as a last resort because hints are not portable across database vendors, and they sometimes stop working when the DBMS software is upgraded to a newer version. The Oracle, MySQL, and Microsoft SQL Server optimizers all respond to hints, but the syntax accepted by each is different.
Temporary tables may help. Temporary tables can help in some situations, such as assembling large result sets and then adding an index or two to support multiple subsequent queries against the temporary table. However, remember that you're doubling up the reads and writes when you do this because all the data selected from the original (base) tables must be written to the temporary table(s) and then read back from there. In short, there are no free lunches.
Views may help. Views can help because they hide complex operations such as nested aggregate functions. And with DBMSs that don't have an SQL statement cache, views may process more efﬁciently than ordinary queries because the SQL statement that deﬁnes the view has already been parsed and optimized, which means this work does not have to be done every time the view is accessed. But above all, remember that views are also SQL queries, so they are subject to all the tuning considerations you apply to any SQL statement.