Sybase ASE: Helping indexes work better

Drew Montgomery wrote up an excellent description of how to use indexes more wisely:
There are times when it seems that the optimizer just doesn’t “get it”. In other words, it doesn’t use what seems like an obvious choice for an index. There are several possible reasons for that, and we are going to explore a couple of them here…
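
Before exploring those reasons, it is worth confirming what the optimizer actually chose. A minimal sketch using ASE's session-level diagnostics (the query here is just a stand-in for your own):

set showplan on        -- report the chosen plan, including index choice
set statistics io on   -- report logical and physical I/O per table
go
select name from local_users where name like "Smith%"
go
set showplan off
set statistics io off
go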

The first could be that the index in question has old statistics. The statistics for an index are used by the optimizer when evaluating queries to estimate the cost of using that index. The optimizer evaluates each index on a table, along with some other complex interactions, to determine whether an index (or a set of indices, if we are dealing with multiple tables) should be used. If the statistics for an index are out of date - old, tired, worn out - they may no longer accurately represent the distribution of the data, and the optimizer, working from the older statistics, may decide that another course of action is best. This is why we want to update statistics periodically. Of course, this isn't a cure-all for query woes. Far from it! But stale statistics are a real source of optimizer confusion.
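
Refreshing statistics is straightforward. A sketch, using the local_users table from the example below (update index statistics and sp_recompile are optional follow-ups worth knowing about):

-- Refresh statistics for the leading column of each index on the table.
update statistics local_users
go
-- Heavier but more thorough: refresh statistics on all columns
-- of all indexes.
update index statistics local_users
go
-- Make stored procedures that reference the table recompile,
-- so their plans see the new statistics.
sp_recompile local_users
go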

Another reason could be that the query itself doesn't let the optimizer use an index. You may be asking yourself, "What could I have done to make the optimizer mad at me?" It's not personal. It may just be the way you have written your query. One little-known fact is that if you apply a function to a column, the optimizer can't use an index on that column. In other words, if you have a query that looks like this:

select * from local_users
where upper(name) like "%SMITH%"

You have prevented the optimizer from using an index based on name; the act of upper-casing the column hides it from the optimizer.
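
One workaround - a sketch only, with a hypothetical name_upper column and a made-up column size - is to maintain a normalized copy of the data so the predicate stays sargable. Note that the leading "%" in the pattern would also have to go, since a pattern that starts with a wildcard cannot use an index either:

-- Hypothetical normalized column; it must be kept in step with name
-- by a trigger or by the application.
alter table local_users add name_upper varchar(40) null
go
update local_users set name_upper = upper(name)
go
create index local_users_name_upper_ix on local_users (name_upper)
go

-- The predicate now references a bare, indexed column.
select * from local_users
where name_upper like "SMITH%"
go

The same would be true if you had used an arithmetic function or even an equation: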

select * from accounts
where account_limit * 10 < expected_account_limit

In the above case, the problem is that the optimizer could use an index on expected_account_limit OR on account_limit, but in no case could it use both. You would have to reformulate the query to move the * 10 to the other side of the comparison - for example, account_limit < expected_account_limit / 10 (minding integer truncation) - so the optimizer can consider an index on the account_limit column.

In the same vein, if there is a data type mismatch between the columns being used in a query, the optimizer has to make the column types "match" before it can complete the query. This means the optimizer applies an implicit conversion (convert) to align the data types. Say you have the following query:

select p.name, p.address, j.title
from personnel p, jobs j
where p.j_id = j.j_id

If the column j_id in personnel allows nulls (for those people who are working under cover!) and j_id in jobs does not allow nulls, since it is the primary key, the optimizer will not use an index based on the j_id column in the jobs table. In order to "match" the data types, it has to convert the j_id column in the jobs table from "int not null" to "int null" to match the definition of the j_id column in the personnel table. Sometimes the difference between the columns can be as subtle as a varchar(10) vs. a char(10), or a varchar(20) vs. a varchar(21). In each case, the optimizer converts to the broadest common type to make the match, and in many cases that conversion will prevent the optimizer from using a given index. One fix is to align the column definitions themselves, as sketched below.
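
A minimal sketch of aligning those definitions, assuming the personnel and jobs tables above, that ASE's alter table ... modify is available (it is in ASE 12.0 and later), and that personnel.j_id contains no nulls:

-- Make personnel.j_id match jobs.j_id (int not null) so the join
-- no longer needs an implicit convert.
alter table personnel modify j_id int not null
go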

The last possibility that I will discuss is the case where the wrong index exists, or where a better index could be created. There are two distinct cases here. The first is where one index has a built-in advantage of some kind - it may be unique, or it may be clustered - and that advantage can be enough to sway the optimizer toward it. The other case is that even though the query is using an index, another index would be more efficient. I recently ran across this specific case. A query was using four different columns in its where clause, and two of those columns in an order by clause. The query was trying to retrieve data on the basis of the date of the record, in descending order (i.e., they wanted the "youngest" record first), and in fact the select had a "top 1" restriction based on these criteria. A facsimile of the query is as follows:

select top 1 name, balance_date, amount
from bank_info
where name = @name
  and account_type = @type
  and branch_id = @branch
  and balance_date < @eval_date
order by name, balance_date desc

The query was using an index that contained two of the four columns involved. Although this was better than a table scan (the table had 82 million rows), it was not hugely efficient. The index did reduce the data set being sifted through, but the query was still taking 30,000 I/Os to complete. Part of the problem was that once the basic query had been satisfied, the result was being put into a sort table that was then interrogated for the youngest record. This is a case for a new index. But the question is: what does the index go on? We could build an index against an id column and the date. That would help the sort, but would not help the where clause much. We could build an index on the four columns in the where clause, but the data found would still end up in a sort table. Probably more efficient, but not as good as it gets.

For this particular query, three of the four columns are being matched exactly, with the fourth, the date column, acting as a cutoff. If the cutoff date is midnight on March 11th, we are looking for the latest record from March 10th or earlier. The best solution is to create an index on the four columns in the where clause, with the date column last and (in the definition of the index) descending. That way, once the other three columns are restricted, we start with the newest record. The other half of this is perhaps a little less obvious: use all of the columns of the new index in the order by clause. The change would be as follows:

select top 1 name, balance_date, amount
from bank_info
where name = @name
  and account_type = @type
  and branch_id = @branch
  and balance_date < @eval_date
order by name, account_type, branch_id, balance_date desc

That way the optimizer can use the index for the where clause and the order by clause simultaneously. The bottom line is that we went from tens of thousands of I/Os with the old index, to hundreds of I/Os with the new index but the old order by, and finally to a much more reasonable 10-15 I/Os. The difference in performance is dramatic. A sketch of the supporting index follows.
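
A sketch of that index (the index name is made up; descending index keys have been supported since ASE 11.9):

-- Key order mirrors the where clause, with the date key descending
-- so the newest qualifying row is found first.
create index bank_info_latest_ix
    on bank_info (name, account_type, branch_id, balance_date desc)
go

With this index in place, the top 1 row can be read straight off the index, and the sort table should disappear from the plan.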

Bottom line: be aware not only of what indices exist, but of how they are being used. Take care of them and they will be like a good motor oil - they will improve the performance of your system while extending the life of the application.