Friday 3 August 2012

Introduction to Indexes




Good indexes are the key to good performance in SQL Server and the key to creating good indexes is to understand what indexes are and how SQL Server uses them to evaluate queries.
In this first part of a three part series, I’m going to look at the very basics of what indexes are, what types exist in SQL and how they’re used.

What is an index?

An index is a structure within SQL that is used to quickly locate specific rows within a table. It can be useful to imaging an index at the back of a textbook when thinking about SQL indexes. They both serve the same purpose – to find specific information quickly.

General Structure

An index is defined on one or more columns, called key columns. The key columns (also referred to as the index key) can be likened to the terms listed in a book index. They are the values that the index will be used to search for. As with the index found at the back of a text book (see figure 1), the index is sorted by the key columns.
Index sample
Figure 1: Book index.             Image copyright Simple Talk publishing.
If an index is created with more than one key column, it is known as a composite index.
The general structure of an index is that of a balanced tree (b-tree). The index will have a single root page, zero or more intermediate levels and then a leaf level. A page is an 8 kilobyte chunk of the data file, with a header and footer and is identified by a combination of File ID and Page number.
Index Structure
Figure 2: Index Structure
Note: Commonly the root page is shown at the top of the tree diagram and the leaf pages at the bottom. Think of it as an inverted tree.
In the leaf level, there’s one entry for each row in the index1. The entries in the index are ordered logically2 in the order of the index key.
The non-leaf levels of the index contain one row per page of the level below, referencing the lowest index key value on each page.  If all of those rows fit onto a single page, then that page is considered the root and the index is only two levels deep. If all of those rows will not fit on a single page, then one (or more) intermediate levels are added to the index.
The number of levels in an index is referred to as the depth of the index. This is an important consideration for evaluating the efficiency of the index. The index illustrated in figure 2 has a depth of 3.
(1)   With the exception of SQL 2008’s filtered indexes, an index will have the same number of rows at the leaf level as the table.
(2)   I’m using the phrase ‘logically ordered’ because the index does not necessarily define the physical storage of the rows. The rows are stored in a way that SQL can retrieve them ordered.

Clustered and nonclustered

There are two main types of indexes in SQL Server, the clustered index and the nonclustered index
Clustered indexes define the logical order of the table. The leaf level of the clustered index has the actual data pages of the table. Because of this there can only be one clustered index per table. A table that does not have a clustered index is referred to as a heap.
Nonclustered indexes are separate from the table. The leaf level of a nonclustered index has a pointer as part of each index row. That pointer is either the clustered index key in the cases where the base table has a clustered index or the RID (Row Identifier) in the cases where the table is a heap. The RID is an 8-byte structure comprised of File ID, Page Number and Slot Index and will uniquely identify a row in the underlying heap. Either way, the each row of a nonclustered index has a reference to the complete data row.

Index Limits

There are a number of built-in limitations on indexes
Key size
The size of an index key is limited to a maximum of 900 bytes and a maximum of 16 columns. This is definitely a limit, not a goal, as the larger the index key gets, the more pages in the index and the deeper the index tree. As the number of pages and the depth of the tree increases so the index becomes less efficient to use. Larger indexes also use more storage space and result in less efficient use of SQL’s data cache.
Number of indexes
In SQL 2005 and earlier there was a limitation of 250 indexes per table, one clustered and 249 non-clustered. In SQL 2008, with the addition of filtered indexes, that limitation was increased to 1000, one clustered and 999 non-clustered indexes.
Both of these limits are very high and there are few circumstances where a well-designed system should approach that limit.
The reason for this is twofold.
·        As the number of indexes increases so the total size occupied by the table (with all of its indexes) increases. Sure, hard drives are cheap and storage is abundant but increasing the size of a database has other effects, Maintenance operations (backups, restores, consistency checks and index rebuilds) all take longer as the size of a database increases.
·        Indexes have to be kept up to date as data changes and the more indexes there are on a table, the more places the data has to be changed. If there are 10 nonclustered indexes on a table, an insert must be done in 11 places (the table and each of those nonclustered indexes). On databases that are mostly read-only (decision support, data warehouses) that may be acceptable. On databases that have frequent inserts, updates and deletes (OLTP systems), the overhead impose by multiple indexes may not be acceptable

How SQL uses indexes

If a table does not have index, the only way to find all occurrences of a value within a table is to read the entire table. If a table has an index, it speeds up the locating of values within that index in two ways.
1.      The index is sorted in the order of the key columns. This means that once all the matching values have been found, the remaining portion of the table can be ignored. This is the same as a telephone directory, where once all entries with a particular surname have been found, the rest of the book can be ignored as no further matches are possible
2.      The tree structure of the index allows a divide-and-conquer approach to locating rows, where large portions of the table can be quickly excluded from the search. This is illustrated in Figure 3
There are four basic operations that SQL can do on an index. It can scan the index, it can seek on the index, it can do lookups to the index and it can update the index

Scans

An index scan is a complete read of all of the leaf pages in the index. When an index scan is done on the clustered index, it’s a table scan in all but name.
When an index scan is done by the query processor, it is always a full read of all of the leaf pages in the index, regardless of whether all of the rows are returned. It is never a partial scan.
A scan does not only involve reading the leaf levels of the index, the higher level pages are also read as part of the index scan.

Seeks

An index seek is an operation where SQL uses the b-tree structure to locate either a specific value or the beginning of a range of value. For an index seek to be possible, there must be a SARGable3 predicate specified in the query and a matching (or partially matching) index. A matching index is one where the query predicate used a left-based subset of the index columns. This will be examined in much greater detail in a part 3 of this series.
The seek operation is evaluated starting at the root page. Using the rows in the root page, the query processor will locate which page in the next lower level of the index contains the 1st row that is being searched for. It will then read that page. If that is the leaf level of the index, the seek ends there. If it is not the leaf then the query processor again identifies which page in the next lower level contains the specified value. This process continues until the leaf level is reached.
Once the query processor has located the leaf page containing either the specified key value or the beginning of the specified range of key values then it reads along the leaf pages until all rows that match the predicate have been returned. Figure 2 shows how a seek would be done on an index when searching for the value 4.
Index Seek
If the index contains all the columns that the query needs, then the index is said to be covering for that query. If the index does not contain all the columns then SQL will do a lookup to the base table to fetch the other columns in order to process the query.
(3)   SARGable is a made-up word, constructed from the phrase Search ARGument. It refers to a predicate that is of a form that SQL can use for an index seek. For more details see: http://www.sql-server-performance.com/tips/t_sql_where_p2.aspx

Lookups

Lookups occur when SQL uses an index to locate rows affected by a query but that index does not contain all the columns required to satisfy the query, that is the index is not covering for that query. To fetch the remaining columns, SQL does a lookup to either the clustered index or heap.
A lookup to a clustered index is essentially a single row clustered index seek and it is always a single row seek. So if a lookup is needed for 500 rows, that involves 500 individual clustered index seeks.

Updates

Anytime that a row is changed, those changes must be made not only in the base table (clustered index or heap) but also in any index that contains the columns that were affected by the change.  This applies to insert, update and delete operations.

Considerations for creating indexes

I’ll be going into more detail on considerations for indexes in the next two parts, but in general:
  • Clustered index should be narrow, because the clustering key is part of all nonclustered indexes.
  • Composite nonclustered indexes are generally more useful than single column indexes, unless all queries against the table filter on one column at a time.
  • Indexes should be no wider than they have to be. Too many columns wastes space and increases the amount of places that data must be changed when an insert/update/delete occurs.
  • If an index is unique, specify that it is unique. The optimiser can sometimes use that information to generate more optimal execution plans.
  • Be careful of creating lots of indexes on frequently modified tables as it can slow down data modifications.
In part 2 of this series I’ll be looking in more detail into clustered indexes. What they are, how they differ from nonclustered indexes and what the considerations are for creating a clustered index.

Good indexes are the key to good performance in SQL Server and the key to creating good indexes is to understand what indexes are and how SQL Server uses them to evaluate queries.
In the previous article I looked at the very basics of what indexes are, what types exist in SQL and how they’re used. In this article I’m going to take a closer look at clustered indexes, what makes them different from nonclustered indexes, how SQL uses clustered indexes and some of the recommendations for selecting the clustered index key.

What is a clustered index?

A clustered index is an index where the leaf level of the index contains the actual data rows of the table. Like any other index, a clustered index is defined on one or more columns – the index key. The key columns for a clustered index are often referred to as the clustering key.
The clustered index can be looked at as a balanced tree (b-tree) structure built on top of a table, with the rows in the table stored logically in the order of the clustering key (see figure 1). The clustered index essentially is the table and hence there cannot be more than one on a table.
Architecture of a Clustered Index
Figure 1: Architecture of a clustered index
The clustering key is, in some ways, the ‘address’ of a data row. It can be used, through the tree structure of the clustered index, to locate a row of data. Because it is used to locate a data row, the clustering key must be unique. If, when the clustered index is created it is defined as unique (either by specifying the UNIQUE keyword as part of the create index statement or by defining the clustered index as part of a primary key or unique constraint), then everything is fine. What if the clustered index is not defined as unique? In that case, SQL automatically adds another column to the index, a 4-byte integer column that’s called the Uniquifier. This column is hidden, it is not displayed in the table design and it cannot be queried directly.

Difference between a clustered index and a heap

A table that has no clustered index is referred to as a heap. Whereas a clustered index has the data stored logically in the order of the index key, a heap has no ordering of rows or pages.
When a row is added to a table with a clustered index, that row has a defined page that it must go on to, according to the value of the clustered index key. Using the index depicted in Figure 1 as an example, if a new row is added with a value of 4 for the clustered index key, that row must go onto the second leaf page. If that page is full, a new page gets allocated, joined into the index and some of the rows from the full page are moved to the new one. This is called a page split, it can be an expensive operation and it leads to fragmentation.
When a row is added to a table without a clustered index (a heap) there’s no specific place that the row has to go. A heap has no defined order. The row will be added wherever there’s a space large enough to put the row.

How a clustered index is used

There are three ways that a clustered index can be used by a query to locate – seek, scan and lookup.

Seek

For the clustered index to be used in a seek operation, the query must contain a SARGable1 predicate based on the clustering key or part of the clustering key.
A simple example of a clustered index seek is as follows (based on the AdventureWorks database)
 SELECT ProductID, ProductNumber, Name
  FROM Production.Product
  WHERE ProductID = 380
ProductID is the clustering key for the table Production.Product and the filter on ProductID is SARGable.
Execution Plan
(1)   SARGable is a made-up word, constructed from the phrase Search ARGument. It refers to a predicate that is of a form that SQL can use for an index seek. For more details see: http://www.sql-server-performance.com/tips/t_sql_where_p2.aspx

Scan

A scan of the clustered index is equivalent to a table scan. It’s a read of all of the data pages in the table
Because the clustered index is the table it can be seen as the default access path for queries to get data to retrieve rows. If a query does not have a SARGable predicate1 based on the clustering key  and there are no non-clustered indexes which are appropriate for a query (and what makes a non-clustered index appropriate will be discussed in detail in part 3), then a scan of the clustered index will be done to retrieve the data for the query. This is a full scan of all the data pages, i.e. a table scan.
In this example a clustered index scan is done because there are no indexes on either of the columns in the where clause.
 SELECT ProductID, ProductNumber, Name
  FROM Production.Product
  WHEREColor = 'Blue' AND DaysToManufacture BETWEEN 2 AND 4
Execution Plan
SARGable is a made-up word, constructed from the phrase Search ARGument. It refers to a predicate that is of a form that SQL can use for an index seek. For more details see: http://www.sql-server-performance.com/tips/t_sql_where_p2.aspx

Lookups

A lookup occurs when a nonclustered index was used to locate rows for a query, but the nonclustered index did not contain all of the columns required for the query. To fetch the remaining columns, a lookup is done to the clustered index.
A lookup is equivalent to a single row clustered index seek. Lookups are always done one row at a time. For this reason, they are very expensive operations, especially when lots of rows are involved.
SELECT ProductID, ProductNumber, Name
  FROM production.Product
  WHERE ProductNumber = 'HN-1224'
Execution Plan

Considerations for selecting the clustering key


There are two main schools of thought in what makes a good clustering. One school says to put the clustered index on the column or set of columns that would be most useful for queries, either frequently run queries or ones doing queries of large ranges of data. The other school says to use the clustered index primarily to organise the table and leave data access to the nonclustered indexes.
I hold to the second school, so the rest of this article will be written from that perspective.
There are four main attributes that are desirable for a clustering key. The clustering key should be:
  • Narrow
  • Unique
  • Unchanging
  • Ever increasing

Narrow

The width of the clustering key affects the depth of the clustered index and hence it’s efficiency. A clustered index with a very deep b-tree requires queries to read more intermediate pages to locate rows. This increased page reads makes deeper clustered indexes less efficient for data access, especially for lookups. Additionally, more pages means more space used on disk and more space used in memory when the index is in the data cache
The width of the clustering key does not, however, only affect the clustered index. The clustering key, being the rows’ address, is located in every single nonclustered index. Hence a wide clustering key increases the size of all nonclustered indexes, reducing their efficiency as well.

Unique

The clustered index has to be unique. It’s used as the address for a row. If the clustered index is not defined as unique, SQL makes it unique by adding a hidden 4 byte integer column. This makes the index wider than it needs to be and makes the nonclustered indexes wider than they should be.
This attribute needs to be considered in relation to the others. If making the cluster unique requires adding several wide columns then it may be better to keep the index narrower and leave it as not unique.

Unchanging

The clustering key defines where a row is found, which page it is in. If the value of the clustering key is changed, the row must be moved from the page where it currently is to a new location. This means that the update is essentially a delete-insert combination.

Ever-increasing

An ever-increasing column is one where every value inserted is higher than all values currently in the table. This is important for a clustered index as, if inserts occurred all over the table, there would be a high number of page splits and the index would become rapidly fragmented.
Since the clustered index is the largest index on a table (as it contains the entire data row) and because it tends to be used for scans more than other indexes, the clustered index is the one that is affected the most when fragmented and the one that, generally, is the most expensive to rebuild.
For this reason, an ever-increasing clustering key is usually a good choice, especially for tables that are subject to frequent inserts.

3)

What is a nonclustered index?

A nonclustered index is the second type of index in SQL Server. They have the same b-tree structure as the clustered index (see previous article). Unlike the clustered index nonclustered indexes does not contain the entire data row at the leaf level. Rather, the nonclustered index contains just the columns defined in the index, and a pointer to the actual data row. See figure 1 for a high-level architecture. Assume that the nonclustered index depicted there is based off the clustered index depicted in the previous part.
Nonclustered index image.
Figure 1: Nonclustered index structure
When the underlying table is a heap (has no clustered index) then the pointer to the data row is the RID, an 8 byte structure comprised of the File ID, Page No and Slot index, i.e. the actual location of the row within the data file.
When the underlying table has a clustered index, the pointer to the actual data row is the clustering key. As mentioned in the previous article, this has important considerations for the selection of a clustering key.
There can be multiple nonclustered indexes on a table. In SQL 2005 and earlier, there was a limit of 249 nonclustered index per table. In SQL 2008, with the introduction of filtered indexes, the limit on indexes has been increased to 999.

Include columns

Columns specified as include columns are stored at the leaf level of the nonclustered index, but not at the intermediate or root levels. They are not part of the index key and do not count towards the 16 column/900 byte limit on indexes. Any data type other than the old LOB types (TEXT, NTEXT, IMAGE) are allowed for include columns, although columns defined as varbinary(MAX) Filestream are not permitted as include columns. The ability to specify include columns was added in SQL 2005.
Include columns are useful in that they are columns available in the index but not contributing to the size of the index key hence they make it easier to create covering indexes (see next section) than could be done without them.
Typically columns that appear only in the select clause of a query and not within the WHERE, FROM or GROUP BY are candidates for INCLUDE columns.
If LOB columns are specified as include columns, the entire contents of the LOB columns are duplicated. It’s not just a pointer to the existing LOB pages added to the nonclustered index.
The ability to specify include columns was added in SQL 2005.

Covering indexes

Covering is not a property of an index, it is not possible to say, without additional information that a query is covering or not covering. Instead, covering deals with an index and a query together.
An index is said to be covering for a specific query if that index contains within it all of the columns necessary for the query. The columns may be part of the key or they may be include columns. This means that a query that has a covering index can be evaluated completely from that index, without needing to go to the base table (heap or cluster) to fetch additional columns.
It has been said before that creating a covering index for a query is just about the best way to speed a query up. This is true. It is not always possible or desirable to cover every query. Covering every query may lead to unacceptably large indexes or unacceptably large numbers of indexes with all the attendant problems (as mentioned in part 1)
It is possible (and often desirable) for an index to be covering for more than one query.

Filtered indexes

Filtered indexes are a new feature in SQL 2008 and they allow for an index to contain only some of the rows in the table. Prior to this, an index would always have, at the leaf level, the same number of rows as the table. With filtered indexes, an index can be based on just a subset of the rows.
When creating a filtered index, a predicate is specified as part of the index creation statement. There are several limitations on this predicate. The comparison cannot reference a computed column, a column that is declared as a user-defined type, a spatial column or a hierarchyid column.
A clustered index cannot be filtered as it is the actual table.

How a nonclustered index is used

Seek

For SQL to do a seek on a nonclustered index, the query must have a SARGable1 predicate referencing the index key or a left-based subset of the index key. In addition to this, the index must be either covering or return sufficiently small number of rows that the required lookups to the clustered index/heap (to retrieve the remainder of the columns) is not considered too expensive by the optimiser. If the required lookups are considered too expensive then the index will not be used.
It usually works out that the number of rows where the optimiser decides that key/RID lookups are too expensive is somewhere around 0.5%-1% of the total rows in the table.
(1) - SARGable is a made-up word, constructed from the phrase Search ARGument. It refers to a predicate that is of a form that SQL can use for an index seek. For more details see: http://www.sql-server-performance.com/tips/t_sql_where_p2.aspx

Scan

An index scan is a read of all of the leaf pages in the nonclustered index. A scan of a nonclustered index is generally more efficient than a scan of the clustered index, because nonclustered indexes are generally smaller.
A nonclustered index scan usually indicates that the index contains all the columns required by the query but they are in the wrong order, the query predicates are not SARGable or there are no query predicates.

Update

When a change is made to a column that is either a key column or an include column of a nonclustered index, that index will be modified as part of the insert statement. Indexes are never left out of sync with the underlying table data.
Inserts and deletes will affect all nonclustered indexes on a table.

Considerations for selecting nonclustered indexes

The decision as to what columns should be indexed should be based on the queries that are run against the table. There’s no point in indexing a column that is never used in a query.

Selectivity

In general, a nonclustered index should be selective. That is, the values in the column should be fairly unique and queries that filter on it should return small portions of the table.
The reason for this is that key/RID lookups are expensive operations and if a nonclustered index is to be used to evaluate a query it needs to be covering or sufficiently selective that the costs of the lookups aren’t deemed to be too high.
If SQL considers the index (or the subset of the index keys that the query would be seeking  on) insufficiently selective then it is very likely that the index will be ignored and the query executed as a clustered index (table) scan.
It is important to note that this does not just apply to the leading column. There are scenarios where a very unselective column can be used as the leading column, with the other columns in the index making it selective enough to be used.

Single column vs. Multi-column indexes

In general, wider nonclustered indexes are more useful than single column nonclustered indexes. This is because it is very unusual for SQL to use multiple nonclustered indexes on the same table to evaluate a query. An index defined over more than one column is referred to as a composite index.
Consider a hypothetical table (Table1) that has three indexes on it, one on Col1, one on Col2 and one on Col3. If a query with predicates on all three of those columns is executed against that table, SQL might seek on all three indexes and intersect the results, it might seek on two, intersect the results and do lookups for the column needed for the last filter and apply that filter, it might seek on one, do lookups for the other two columns and then filter them, it might just scan the cluster/heap. All of these ae possible options, but they are not the most optimal option.
If there was a single index defined with all three columns, the matching rows could be located as a single seek operation.
three indexes in a query plan.
Figure 2: Three separate indexes
Execution plan details.
Figure 3: One composite index
For a more details analysis of single vs composite indexes see http://sqlinthewild.co.za/index.php/2010/09/14/one-wide-index-or-multiple-narrow-indexes/

Column order

The order of the columns in an index key should be chosen based on several factors: the selectivity of the columns, what % of queries will filter on that column and whether the queries will filter with equality or inequality matches.
If the query predicate is based on multiple columns and there is an index where the combination of columns forms a left-based subset of the index key, then retrieving the rows is a single seek operation. If the query predicate is based on multiple columns and there is an index that contains those columns as part of the index key, but they do not all form a left-based subset of the index key, then retrieving those rows is a 2-step process; seek on the columns that do form a left-based subset and then filter out rows that don’t match the other conditions.
Let’s look at a quick example.
Imagine we have a hypothetical table named Table1 and that there is an index on that table with the index key consisting of three columns – Col1, Col2 and Col3.
 CREATE INDEX idx_Table1_Demo
      ON Table1 (Col1, Col2, Col3)

Let’s say further that there are two queries against that table.
SELECT Col1 FROM Table1 WHERE Col1 = @Var1 AND Col2 = @Var2
This query can be evaluated by a single seek operation against the two columns. 
Execution plan details
SELECT Col1 FROM Table1 WHERE Col1 = @Var1 AND Col3 = @Var3
This query cannot be evaluated by a single seek operation against the two columns.  Rather it has to be done as a seek just on one column, Col1, and the rows that are returned from that seek get checked to see if they match the second predicate or not.
Execution Plan details
The other thing that has to be considered is the type of predicates that the queries will be using.
It is often said that the most selective column should be made the leading column of an index. This is true, but must be considered in light of the other factors. There is little point in making a column the leading column of an index if only 5% of the queries that use that index filter on that column. It would mean that only those queries could seek on the index and the other 95% would either scan or use a different index (if one existed)

In conclusion

This concludes the introductory coverage of indexes. I hope that this has been of use to people. For further and deeper information there are a few posts on my blog, also see the Inside SQL Server 2005 books, specifically the 1st book (The Storage Engine) and the 4th book (Query tuning and optimisation) and SQL Server 2008 Query Performing Distilled.

No comments:

Post a Comment