Introduction to Indexes
Good indexes are the key to good performance in SQL Server and the key
to creating good indexes is to understand what indexes are and how SQL
Server uses them to evaluate queries.
In this first part of a three part series, I’m going to look at the
very basics of what indexes are, what types exist in SQL and how they’re
used.
What is an index?
An index is a structure within SQL that is used to quickly locate
specific rows within a table. It can be useful to imaging an index at
the back of a textbook when thinking about SQL indexes. They both serve
the same purpose – to find specific information quickly.
General Structure
An index is defined on one or more columns, called key columns. The
key columns (also referred to as the index key) can be likened to the
terms listed in a book index. They are the values that the index will be
used to search for. As with the index found at the back of a text book
(see figure 1), the index is sorted by the key columns.
Figure 1: Book index. Image copyright Simple Talk publishing.
If an index is created with more than one key column, it is known as a composite index.
The general structure of an index is that of a balanced tree (b-tree).
The index will have a single root page, zero or more intermediate
levels and then a leaf level. A page is an 8 kilobyte chunk of the data
file, with a header and footer and is identified by a combination of
File ID and Page number.
Figure 2: Index Structure
|
Note: Commonly the root page is shown at the top of the tree diagram
and the leaf pages at the bottom. Think of it as an inverted tree.
In the leaf level, there’s one entry for each row in the index
1. The entries in the index are ordered logically
2 in the order of the index key.
The non-leaf levels of the index contain one row per page of the level
below, referencing the lowest index key value on each page. If all of
those rows fit onto a single page, then that page is considered the root
and the index is only two levels deep. If all of those rows will not
fit on a single page, then one (or more) intermediate levels are added
to the index.
The number of levels in an index is referred to as the depth of the
index. This is an important consideration for evaluating the efficiency
of the index. The index illustrated in figure 2 has a depth of 3.
(1) With the exception of SQL 2008’s filtered indexes, an index will
have the same number of rows at the leaf level as the table.
(2) I’m using the phrase ‘logically ordered’ because the index does
not necessarily define the physical storage of the rows. The rows are
stored in a way that SQL can retrieve them ordered.
Clustered and nonclustered
There are two main types of indexes in SQL Server, the clustered index and the nonclustered index
Clustered indexes define the logical order of the table. The leaf
level of the clustered index has the actual data pages of the table.
Because of this there can only be one clustered index per table. A table
that does not have a clustered index is referred to as a heap.
Nonclustered indexes are separate from the table. The leaf level of a
nonclustered index has a pointer as part of each index row. That pointer
is either the clustered index key in the cases where the base table has
a clustered index or the RID (Row Identifier) in the cases where the
table is a heap. The RID is an 8-byte structure comprised of File ID,
Page Number and Slot Index and will uniquely identify a row in the
underlying heap. Either way, the each row of a nonclustered index has a
reference to the complete data row.
Index Limits
There are a number of built-in limitations on indexes
Key size
The size of an index key is limited to a maximum of 900 bytes and a
maximum of 16 columns. This is definitely a limit, not a goal, as the
larger the index key gets, the more pages in the index and the deeper
the index tree. As the number of pages and the depth of the tree
increases so the index becomes less efficient to use. Larger indexes
also use more storage space and result in less efficient use of SQL’s
data cache.
Number of indexes
In SQL 2005 and earlier there was a limitation of 250 indexes per
table, one clustered and 249 non-clustered. In SQL 2008, with the
addition of filtered indexes, that limitation was increased to 1000, one
clustered and 999 non-clustered indexes.
Both of these limits are very high and there are few circumstances where a well-designed system should approach that limit.
The reason for this is twofold.
· As the number of indexes increases so the total size occupied
by the table (with all of its indexes) increases. Sure, hard drives are
cheap and storage is abundant but increasing the size of a database has
other effects, Maintenance operations (backups, restores, consistency
checks and index rebuilds) all take longer as the size of a database
increases.
· Indexes have to be kept up to date as data changes and the
more indexes there are on a table, the more places the data has to be
changed. If there are 10 nonclustered indexes on a table, an insert must
be done in 11 places (the table and each of those nonclustered
indexes). On databases that are mostly read-only (decision support, data
warehouses) that may be acceptable. On databases that have frequent
inserts, updates and deletes (OLTP systems), the overhead impose by
multiple indexes may not be acceptable
How SQL uses indexes
If a table does not have index, the only way to find all occurrences
of a value within a table is to read the entire table. If a table has an
index, it speeds up the locating of values within that index in two
ways.
1. The index is sorted in the order of the key columns. This
means that once all the matching values have been found, the remaining
portion of the table can be ignored. This is the same as a telephone
directory, where once all entries with a particular surname have been
found, the rest of the book can be ignored as no further matches are
possible
2. The tree structure of the index allows a divide-and-conquer
approach to locating rows, where large portions of the table can be
quickly excluded from the search. This is illustrated in Figure 3
There are four basic operations that SQL can do on an index. It can
scan the index, it can seek on the index, it can do lookups to the index
and it can update the index
Scans
An index scan is a complete read of all of the leaf pages in the
index. When an index scan is done on the clustered index, it’s a table
scan in all but name.
When an index scan is done by the query processor, it is always a full
read of all of the leaf pages in the index, regardless of whether all
of the rows are returned. It is never a partial scan.
A scan does not only involve reading the leaf levels of the index, the
higher level pages are also read as part of the index scan.
Seeks
An index seek is an operation where SQL uses the b-tree structure to
locate either a specific value or the beginning of a range of value. For
an index seek to be possible, there must be a SARGable
3
predicate specified in the query and a matching (or partially matching)
index. A matching index is one where the query predicate used a
left-based subset of the index columns. This will be examined in much
greater detail in a part 3 of this series.
The seek operation is evaluated starting at the root page. Using the
rows in the root page, the query processor will locate which page in the
next lower level of the index contains the 1
st row that is
being searched for. It will then read that page. If that is the leaf
level of the index, the seek ends there. If it is not the leaf then the
query processor again identifies which page in the next lower level
contains the specified value. This process continues until the leaf
level is reached.
Once the query processor has located the leaf page containing either
the specified key value or the beginning of the specified range of key
values then it reads along the leaf pages until all rows that match the
predicate have been returned. Figure 2 shows how a seek would be done on
an index when searching for the value 4.
If the index contains all the columns that the query needs, then the
index is said to be covering for that query. If the index does not
contain all the columns then SQL will do a lookup to the base table to
fetch the other columns in order to process the query.
(3) SARGable is a made-up word, constructed from the phrase Search
ARGument. It refers to a predicate that is of a form that SQL can use
for an index seek. For more details see:
http://www.sql-server-performance.com/tips/t_sql_where_p2.aspx
Lookups
Lookups occur when SQL uses an index to locate rows affected by a
query but that index does not contain all the columns required to
satisfy the query, that is the index is not covering for that query. To
fetch the remaining columns, SQL does a lookup to either the clustered
index or heap.
A lookup to a clustered index is essentially a single row clustered
index seek and it is always a single row seek. So if a lookup is needed
for 500 rows, that involves 500 individual clustered index seeks.
Updates
Anytime that a row is changed, those changes must be made not only in
the base table (clustered index or heap) but also in any index that
contains the columns that were affected by the change. This applies to
insert, update and delete operations.
Considerations for creating indexes
I’ll be going into more detail on considerations for indexes in the next two parts, but in general:
-
Clustered index should be narrow, because the clustering key is part of all nonclustered indexes.
-
Composite nonclustered indexes are generally more useful than single
column indexes, unless all queries against the table filter on one
column at a time.
-
Indexes should be no wider than they have to be. Too many columns
wastes space and increases the amount of places that data must be
changed when an insert/update/delete occurs.
-
If an index is unique, specify that it is unique. The optimiser can
sometimes use that information to generate more optimal execution plans.
-
Be careful of creating lots of indexes on frequently modified tables as it can slow down data modifications.
In part 2 of this series I’ll be looking in more detail into clustered
indexes. What they are, how they differ from nonclustered indexes and
what the considerations are for creating a clustered index.
Good indexes are the key to good performance in SQL Server and the key
to creating good indexes is to understand what indexes are and how SQL
Server uses them to evaluate queries.
In the previous article I looked at the very basics of what indexes
are, what types exist in SQL and how they’re used. In this article I’m
going to take a closer look at clustered indexes, what makes them
different from nonclustered indexes, how SQL uses clustered indexes and
some of the recommendations for selecting the clustered index key.
What is a clustered index?
A clustered index is an index where the leaf level of the index
contains the actual data rows of the table. Like any other index, a
clustered index is defined on one or more columns – the index key. The
key columns for a clustered index are often referred to as the
clustering key.
The clustered index can be looked at as a balanced tree (b-tree)
structure built on top of a table, with the rows in the table stored
logically in the order of the clustering key (see figure 1). The
clustered index essentially is the table and hence there cannot be more
than one on a table.
Figure 1: Architecture of a clustered index
The clustering key is, in some ways, the ‘address’ of a data row. It
can be used, through the tree structure of the clustered index, to
locate a row of data. Because it is used to locate a data row, the
clustering key must be unique. If, when the clustered index is created
it is defined as unique (either by specifying the UNIQUE keyword as part
of the create index statement or by defining the clustered index as
part of a primary key or unique constraint), then everything is fine.
What if the clustered index is not defined as unique? In that case, SQL
automatically adds another column to the index, a 4-byte integer column
that’s called the Uniquifier. This column is hidden, it is not displayed
in the table design and it cannot be queried directly.
Difference between a clustered index and a heap
A table that has no clustered index is referred to as a heap. Whereas a
clustered index has the data stored logically in the order of the index
key, a heap has no ordering of rows or pages.
When a row is added to a table with a clustered index, that row has a
defined page that it must go on to, according to the value of the
clustered index key. Using the index depicted in Figure 1 as an example,
if a new row is added with a value of 4 for the clustered index key,
that row must go onto the second leaf page. If that page is full, a new
page gets allocated, joined into the index and some of the rows from the
full page are moved to the new one. This is called a page split, it can
be an expensive operation and it leads to fragmentation.
When a row is added to a table without a clustered index (a heap)
there’s no specific place that the row has to go. A heap has no defined
order. The row will be added wherever there’s a space large enough to
put the row.
How a clustered index is used
There are three ways that a clustered index can be used by a query to locate – seek, scan and lookup.
Seek
For the clustered index to be used in a seek operation, the query must contain a SARGable
1 predicate based on the clustering key or part of the clustering key.
A simple example of a clustered index seek is as follows (based on the AdventureWorks database)
SELECT ProductID, ProductNumber, Name
FROM Production.Product
WHERE ProductID = 380
ProductID is the clustering key for the table Production.Product and the filter on ProductID is SARGable.
(1) SARGable is a made-up word, constructed from the phrase Search
ARGument. It refers to a predicate that is of a form that SQL can use
for an index seek. For more details see:
http://www.sql-server-performance.com/tips/t_sql_where_p2.aspx
Scan
A scan of the clustered index is equivalent to a table scan. It’s a read of all of the data pages in the table
Because the clustered index is the table it can be seen as the default
access path for queries to get data to retrieve rows. If a query does
not have a SARGable predicate
1 based on the clustering key
and there are no non-clustered indexes which are appropriate for a query
(and what makes a non-clustered index appropriate will be discussed in
detail in part 3), then a scan of the clustered index will be done to
retrieve the data for the query. This is a full scan of all the data
pages, i.e. a table scan.
In this example a clustered index scan is done because there are no indexes on either of the columns in the where clause.
SELECT ProductID, ProductNumber, Name
FROM Production.Product
WHEREColor = 'Blue' AND DaysToManufacture BETWEEN 2 AND 4
SARGable is a made-up word, constructed from the phrase Search
ARGument. It refers to a predicate that is of a form that SQL can use
for an index seek. For more details see:
http://www.sql-server-performance.com/tips/t_sql_where_p2.aspx
Lookups
A lookup occurs when a nonclustered index was used to locate rows for a
query, but the nonclustered index did not contain all of the columns
required for the query. To fetch the remaining columns, a lookup is done
to the clustered index.
A lookup is equivalent to a single row clustered index seek. Lookups
are always done one row at a time. For this reason, they are very
expensive operations, especially when lots of rows are involved.
SELECT ProductID, ProductNumber, Name
FROM production.Product
WHERE ProductNumber = 'HN-1224'
Considerations for selecting the clustering key
There are two main schools of thought in what makes a good clustering.
One school says to put the clustered index on the column or set of
columns that would be most useful for queries, either frequently run
queries or ones doing queries of large ranges of data. The other school
says to use the clustered index primarily to organise the table and
leave data access to the nonclustered indexes.
I hold to the second school, so the rest of this article will be written from that perspective.
There are four main attributes that are desirable for a clustering key. The clustering key should be:
-
Narrow
-
Unique
-
Unchanging
-
Ever increasing
Narrow
The width of the clustering key affects the depth of the clustered
index and hence it’s efficiency. A clustered index with a very deep
b-tree requires queries to read more intermediate pages to locate rows.
This increased page reads makes deeper clustered indexes less efficient
for data access, especially for lookups. Additionally, more pages means
more space used on disk and more space used in memory when the index is
in the data cache
The width of the clustering key does not, however, only affect the
clustered index. The clustering key, being the rows’ address, is located
in every single nonclustered index. Hence a wide clustering key
increases the size of all nonclustered indexes, reducing their
efficiency as well.
Unique
The clustered index has to be unique. It’s used as the address for a
row. If the clustered index is not defined as unique, SQL makes it
unique by adding a hidden 4 byte integer column. This makes the index
wider than it needs to be and makes the nonclustered indexes wider than
they should be.
This attribute needs to be considered in relation to the others. If
making the cluster unique requires adding several wide columns then it
may be better to keep the index narrower and leave it as not unique.
Unchanging
The clustering key defines where a row is found, which page it is in.
If the value of the clustering key is changed, the row must be moved
from the page where it currently is to a new location. This means that
the update is essentially a delete-insert combination.
Ever-increasing
An ever-increasing column is one where every value inserted is higher
than all values currently in the table. This is important for a
clustered index as, if inserts occurred all over the table, there would
be a high number of page splits and the index would become rapidly
fragmented.
Since the clustered index is the largest index on a table (as it
contains the entire data row) and because it tends to be used for scans
more than other indexes, the clustered index is the one that is affected
the most when fragmented and the one that, generally, is the most
expensive to rebuild.
For this reason, an ever-increasing clustering key is usually a good
choice, especially for tables that are subject to frequent inserts.
3)
What is a nonclustered index?
A nonclustered index is the second type of index in SQL Server. They
have the same b-tree structure as the clustered index (see previous
article). Unlike the clustered index nonclustered indexes does not
contain the entire data row at the leaf level. Rather, the nonclustered
index contains just the columns defined in the index, and a pointer to
the actual data row. See figure 1 for a high-level architecture. Assume
that the nonclustered index depicted there is based off the clustered
index depicted in the previous part.
Figure 1: Nonclustered index structure
When the underlying table is a heap (has no clustered index) then the
pointer to the data row is the RID, an 8 byte structure comprised of the
File ID, Page No and Slot index, i.e. the actual location of the row
within the data file.
When the underlying table has a clustered index, the pointer to the
actual data row is the clustering key. As mentioned in the previous
article, this has important considerations for the selection of a
clustering key.
There can be multiple nonclustered indexes on a table. In SQL 2005 and
earlier, there was a limit of 249 nonclustered index per table. In SQL
2008, with the introduction of filtered indexes, the limit on indexes
has been increased to 999.
Include columns
Columns specified as include columns are stored at the leaf level of
the nonclustered index, but not at the intermediate or root levels. They
are not part of the index key and do not count towards the 16
column/900 byte limit on indexes. Any data type other than the old LOB
types (TEXT, NTEXT, IMAGE) are allowed for include columns, although
columns defined as varbinary(MAX) Filestream are not permitted as
include columns. The ability to specify include columns was added in SQL
2005.
Include columns are useful in that they are columns available in the
index but not contributing to the size of the index key hence they make
it easier to create covering indexes (see next section) than could be
done without them.
Typically columns that appear only in the select clause of a query and
not within the WHERE, FROM or GROUP BY are candidates for INCLUDE
columns.
If LOB columns are specified as include columns, the entire contents
of the LOB columns are duplicated. It’s not just a pointer to the
existing LOB pages added to the nonclustered index.
The ability to specify include columns was added in SQL 2005.
Covering indexes
Covering is not a property of an index, it is not possible to say,
without additional information that a query is covering or not covering.
Instead, covering deals with an index and a query together.
An index is said to be covering for a specific query if that index
contains within it all of the columns necessary for the query. The
columns may be part of the key or they may be include columns. This
means that a query that has a covering index can be evaluated completely
from that index, without needing to go to the base table (heap or
cluster) to fetch additional columns.
It has been said before that creating a covering index for a query is
just about the best way to speed a query up. This is true. It is not
always possible or desirable to cover every query. Covering every query
may lead to unacceptably large indexes or unacceptably large numbers of
indexes with all the attendant problems (as mentioned in part 1)
It is possible (and often desirable) for an index to be covering for more than one query.
Filtered indexes
Filtered indexes are a new feature in SQL 2008 and they allow for an
index to contain only some of the rows in the table. Prior to this, an
index would always have, at the leaf level, the same number of rows as
the table. With filtered indexes, an index can be based on just a subset
of the rows.
When creating a filtered index, a predicate is specified as part of
the index creation statement. There are several limitations on this
predicate. The comparison cannot reference a computed column, a column
that is declared as a user-defined type, a spatial column or a
hierarchyid column.
A clustered index cannot be filtered as it is the actual table.
How a nonclustered index is used
Seek
For SQL to do a seek on a nonclustered index, the query must have a SARGable
1
predicate referencing the index key or a left-based subset of the index
key. In addition to this, the index must be either covering or return
sufficiently small number of rows that the required lookups to the
clustered index/heap (to retrieve the remainder of the columns) is not
considered too expensive by the optimiser. If the required lookups are
considered too expensive then the index will not be used.
It usually works out that the number of rows where the optimiser
decides that key/RID lookups are too expensive is somewhere around
0.5%-1% of the total rows in the table.
(1) - SARGable is a made-up word, constructed from the phrase Search
ARGument. It refers to a predicate that is of a form that SQL can use
for an index seek. For more details see:
http://www.sql-server-performance.com/tips/t_sql_where_p2.aspx
Scan
An index scan is a read of all of the leaf pages in the nonclustered
index. A scan of a nonclustered index is generally more efficient than a
scan of the clustered index, because nonclustered indexes are generally
smaller.
A nonclustered index scan usually indicates that the index contains
all the columns required by the query but they are in the wrong order,
the query predicates are not SARGable or there are no query predicates.
Update
When a change is made to a column that is either a key column or an
include column of a nonclustered index, that index will be modified as
part of the insert statement. Indexes are never left out of sync with
the underlying table data.
Inserts and deletes will affect all nonclustered indexes on a table.
Considerations for selecting nonclustered indexes
The decision as to what columns should be indexed should be based on
the queries that are run against the table. There’s no point in indexing
a column that is never used in a query.
Selectivity
In general, a nonclustered index should be selective. That is, the
values in the column should be fairly unique and queries that filter on
it should return small portions of the table.
The reason for this is that key/RID lookups are expensive operations
and if a nonclustered index is to be used to evaluate a query it needs
to be covering or sufficiently selective that the costs of the lookups
aren’t deemed to be too high.
If SQL considers the index (or the subset of the index keys that the
query would be seeking on) insufficiently selective then it is very
likely that the index will be ignored and the query executed as a
clustered index (table) scan.
It is important to note that this does not just apply to the leading
column. There are scenarios where a very unselective column can be used
as the leading column, with the other columns in the index making it
selective enough to be used.
Single column vs. Multi-column indexes
In general, wider nonclustered indexes are more useful than single
column nonclustered indexes. This is because it is very unusual for SQL
to use multiple nonclustered indexes on the same table to evaluate a
query. An index defined over more than one column is referred to as a
composite index.
Consider a hypothetical table (Table1) that has three indexes on it,
one on Col1, one on Col2 and one on Col3. If a query with predicates on
all three of those columns is executed against that table, SQL might
seek on all three indexes and intersect the results, it might seek on
two, intersect the results and do lookups for the column needed for the
last filter and apply that filter, it might seek on one, do lookups for
the other two columns and then filter them, it might just scan the
cluster/heap. All of these ae possible options, but they are not the
most optimal option.
If there was a single index defined with all three columns, the matching rows could be located as a single seek operation.
Figure 2: Three separate indexes
Figure 3: One composite index
For a more details analysis of single vs composite indexes see
http://sqlinthewild.co.za/index.php/2010/09/14/one-wide-index-or-multiple-narrow-indexes/
Column order
The order of the columns in an index key should be chosen based on
several factors: the selectivity of the columns, what % of queries will
filter on that column and whether the queries will filter with equality
or inequality matches.
If the query predicate is based on multiple columns and there is an
index where the combination of columns forms a left-based subset of the
index key, then retrieving the rows is a single seek operation. If the
query predicate is based on multiple columns and there is an index that
contains those columns as part of the index key, but they do not all
form a left-based subset of the index key, then retrieving those rows is
a 2-step process; seek on the columns that do form a left-based subset
and then filter out rows that don’t match the other conditions.
Let’s look at a quick example.
Imagine we have a hypothetical table named Table1 and that there is an
index on that table with the index key consisting of three columns –
Col1, Col2 and Col3.
CREATE INDEX idx_Table1_Demo
ON Table1 (Col1, Col2, Col3)
Let’s say further that there are two queries against that table.
SELECT Col1 FROM Table1 WHERE Col1 = @Var1 AND Col2 = @Var2
This query can be evaluated by a single seek operation against the two columns.
SELECT Col1 FROM Table1 WHERE Col1 = @Var1 AND Col3 = @Var3
This query cannot be evaluated by a single seek operation against the
two columns. Rather it has to be done as a seek just on one column,
Col1, and the rows that are returned from that seek get checked to see
if they match the second predicate or not.
The other thing that has to be considered is the type of predicates that the queries will be using.
It is often said that the most selective column should be made the
leading column of an index. This is true, but must be considered in
light of the other factors. There is little point in making a column the
leading column of an index if only 5% of the queries that use that
index filter on that column. It would mean that only those queries could
seek on the index and the other 95% would either scan or use a
different index (if one existed)
In conclusion
This concludes the introductory coverage of indexes. I hope that this
has been of use to people. For further and deeper information there are a
few posts on my
blog, also see the Inside SQL Server 2005 books, specifically the 1
st book (The Storage Engine) and the 4
th book (Query tuning and optimisation) and SQL Server 2008 Query Performing Distilled.