Introduction
------------
This paper discusses the query optimization process using the cost
based optimization (CBO). The working details of the cost optimizer are explained.
It covers concepts such as cardinality, selectivity, join orders, etc., and also
how these are implemented to compute the costs used in evaluating the execution
plans.
Query Optimization
------------------
A query is a non-procedural request of information from the database. To
process a query, the kernel has to characterize the retrieval strategy or
formulate an execution plan for fetching the candidate rows. Typically,
to execute a query, several execution plans may be feasible. For example,
tables participating in a join could be processed in a number of different
orders, depending on the join conditions and the join methods implemented in
the kernel. To choose between alternative plans, the kernel must use some
realistic unit to measure the resources used by each plan. It can then choose
between competing plans on the basis of costs and discard all except the one
which consumes the least.
Rule Based (Heuristic) Optimization
-----------------------------------
Prior to the introduction of the cost based optimizer, Oracle used the
Heuristic Approach to query optimization. Execution plans were generated
on the basis of certain heuristics or "Rules". For example, if the
"where clause" of a query contained a predicate referencing an indexed
column, and the index was the best index to use based on predetermined
ranking of indexes, then the table would be accessed using that index
irrespective of other considerations such as the size of the table,
variability of the data for that column, selectivity of the index etc.
There was no knowledge on the statistical descriptions of the data in tables
and indexes. Instance specific parameters such as the capability of multi
block i/o, available sort memory etc were not considered. If data in the tables
was inconsistent with the "Rules", then sub-optimal plans would be generated.
Cost Based Optimization
-----------------------
Oracle addresses this by incorporating a Cost Engine in the kernel to
estimate and select execution plans on the basis of costs. Costs quantify the
resource consumption of the query. Resources used by a query can be broken into
three principal parts - I/O cost, CPU Costs and Network Costs. I/O cost is the
cost of transferring information from disk to main memory. Rows in tables are
stored in database blocks which reside in disk files. Accessing the rows
involves reading the blocks into the buffer pool in SGA (Shared Global Area)
which resides in the main memory. Typically this is the most significant cost
of processing a query. The CPU cost is the cost of processing the information
once its available in memory. Once the blocks are cached, the rows have to be
identified and processed to perform operations such as join, sort etc.
For queries which access information across nodes of a distributed database,
there is a Network Cost which measures the cost of information transfer
(network I/O). Queries which access remote tables or perform distributed
joins would have significant element of this cost. Currently, the cost model
incorporates functions to cost I/O operations and to some extent CPU costs.
Network Costs are not directly computed but taken as certain weight of
the I/O's estimated. As the cost optimizer model is enhanced, these
elements of cost will be measured better and costed more accurately.
IMPORTANT:
To enable costing of execution plans, detailed statistical descriptions of
the data relating to objects in the query is required. The statistics are
generated by the ANALYZE facility. There are two modes in which analyze may
be performed - compute or estimate. Compute scans each member of the object
whereas estimate mode would look at a sample of the total. If there are
no statistics, then the cost optimizer uses hard coded estimates or "guesses".
Optimization Process
--------------------
The optimization process starts with performing transitivity analysis on the
predicates of the query. Then, each query block is optimized by determining
and evaluating access paths open to table in the query. Joins involve join
order enumeration and costing. To influence the optimizer, users can specify
"Hints" to force the optimizer to consider a certain access strategy superior
to others irrespective of the costs. Each of this steps is considered in detail
below.
Transitivity
------------
Consider a query with a join involving tables T1 and T2:
SELECT ... FROM t1,t2 WHERE
t1.column1=t2.column1 AND
t1.column1=100;
This query is equivalent to the query with the additional predicate on T2:
SELECT ... FROM t1,t2 WHERE
t1.column1=T2.column1 AND
t1.column1=100 AND
t2.column1=100;
The advantage we have in optimization is that the additional predicate opens
an extra access path for T2.column1 which did not exist in the original query.
Transitivity analysis involves generating additional predicates based on
existing predicates to open extra access paths to objects. This however does
not mean that the new path will be chosen. The cost optimizer will generate
transitive predicates as a first step in optimization.
Cardinality
-----------
The cost of I/O is directly proportional to the number of rows to be
retrieved and manipulated by a query. A query may specify retrieval of all the
rows of a table or may contain restrictions based on some criteria specified in
the "Where" clause. Optionally it may demand the results to be aggregated based
on "Group By" clause or sorted on some expression or column values using a
"Order By" clause. The number of rows expected to be returned by the query
constitutes the cardinality of the query. The Cost Engine, to be able to cost
queries, must have some mechanism to compute the cardinality of the query.
Selectivity Analysis
--------------------
Cardinality determination requires the use of selectivity factors.
Selectivity factors are estimates about the number of rows that will satisfy a
certain criteria on column or attribute values. For example, we wish to
retrieve all rows that satisfy the column Sex with value Female. A selectivity
of 1/2 would indicate that half of the rows possess this value. Typically,
selectivity of predicate of form "column=constant value" would be:
selectivity( Of Predicate P<n>) = 1/NDV
(NDV = No. Of distinct Values of the column.
(Predicate - a single Where clause condition)
Selectivity estimate is thus a value ranging from 0 to 1 with one indicating
poorest selectivity. Predicates come in different forms and many other
relational operators (e.g <,> etc), column expressions, joins etc. Using the
statistical descriptors such as high/low values, number of distinct values,
etc. and by resolving certain complex forms to simpler types the cost optimizer
calculates the selectivity of each predicate. Note that so far we have dealt
with selectivity associated with predicates. A query typically has multiple
predicates that are joined by boolean operators (AND,OR,NOT). Selectivity of
individual predicates are combined in the following way to determine the
overall selectivity which will be used to determine the cardinality of the
query.
Logical Operator Resultant Selectivity
---------------- ---------------------
AND S1 * S2
OR S1 + S2 -( S1 * S2 )
NOT 1 - S1
S1=Selectivity of Predicate P1
S2=Selectivity of Predicate P2
Retrieval Methods
-----------------
Once the cardinality of the query is known, the next thing to consider are the
costs associated with each available retrieval method. Currently, Oracle has
two possible methods to retrieve rows from a table:
1. Table can be scanned sequentially starting from the first data block
of first extent to the last block allocated. This is called the
"Full Table Scan".
2. Table can accessed block at a time by first retrieving the address of
the row by performing a lookup on one or more indexes. This is called
"Index Scan".
Each method involves I/O usage and no assumptions are made on availability of
blocks in cache. The I/Os involved in each case are detailed using the
statistical descriptions of the objects accessed and certain instance
initialization parameters e.g. multi-block read factor, etc.
A table may have several access indexes on it. Given the query, it may be
possible to use one or more of the indexes. The cost engine considers all
the possible access paths and computes the cost of each path.
Full Scan Cost
--------------
Full table scan cost is dependent on the maximum number of blocks ever used
by the table and the multi-block read factor. As more rows are inserted into
the table, blocks are acquired from the extents assigned to the table. Even if
some rows are later deleted from the table and may cause blocks to be empty,
the kernel has to still read all blocks ever used by the table. Multi-Block
read factor is the number of adjacent database blocks that can be read in
a single I/O operation.
Index Scan Cost
---------------
The cost of using an indexed access can be split into two components :
1. Cost of traversing the index (Oracle indexes are B*tree structures
made up of one or more levels of branch blocks and leaf blocks) and
2. Cost of performing table lookup. In certain cases, this may not
be needed if the optimizer determines that index lookup is sufficient
to satisfy the query requirements.
The following statistical descriptors are available about indexes :
1. Bl - Number of levels in the index including the leaf level
2. Lb - Number of leaf blocks
3. Bk - the average number of index leaf blocks per index key value
4. Dk - the average number of data blocks in the table that are pointed
to by a key value.
5. Cf - Clustering Factor
This value indicates the ordering of the rows in the table blocks
based on the values of the index. If this value is closer to the number of
blocks in the table, then the table rows are ordered by the values of this
index. This means that row addresses in the index leaf block are most
likely to point to rows in the same table data block. If this value is
closer to the number of rows in the table, then the table rows are
randomly distributed in relation to the values of this index. This means
that row addresses in the index leaf block are most likely to point to
rows in the different table data blocks.
Using one or more of these descriptors, the index traversal cost is
calculated. For example, consider a query which has predicates of the form
"column=<constant>" for all columns of a unique index -
Cost of Index Access = Number of Levels + 1
= Bl + 1
The cost, in other words, is the cost of accessing each block as
we traverse down in the index plus the cost of accessing the table block.
The Cost Engine uses several functions to evaluate the costs for
complicated cases. Predicates may use non equality operators such as
"column > value" or LIKE operator or may specify only some columns in a
concatenated index. Using selectivity estimates and table and index
statistics, the cost engine computes the number of distinct blocks that
will be touched by the predicates and thus derives the costs.
Sort Costs
----------
Certain SQL operations require a sort of the rows to be performed. This may be
explicit in the SQL statement e.g. when an Order By, Group By clause is used or
be implicit e.g when a sort merge join is done to join two tables. No matter
what triggers the sort, to perform the sort, the kernel uses up the system
resources which need to be costed. Depending on how much data is to be sorted,
the kernel may be able to complete a sort in memory. This depends on the
setting of the initialization parameter "sort_area_size". If the data to be
sorted does not fit in the sort_area_size buffer, the kernel splits the sort
into manageable segments and does smaller sorts writing the results to disk in
areas called as the temporary segments.
When all segments are sorted, we do a final merge pass. In estimating sort
I/O costs, we first determine the size of the data based on the average row
lengths. We then ascertain if this can fit in the sort_area_size and if it
can then no I/Os are involved. Otherwise, the I/Os are estimated based on the
writes/reads from temporary segments. Multi-block I/O capability is considered
in the calculations. The calculations also take into account the work done
in the final merge pass.
Join Processing
---------------
In a query involving more than one table, join processing is involved to
determine the most optimal plan to perform the join. This comprises:
1. Join cardinality
2. Enumerating join orders
3. Evaluating the costs associated with each join path under each
join method.
Join Cardinality
----------------
Join cardinality is the determination of the number of rows that will make up
the joined relation. In the worst case, we have no join predicates and the
cardinality is the cartesian product of the two or more tables involved. But
typically when two or more tables participate in a join, the join is based on
values of some columns. Using the statistical descriptions on the columns and
the number of rows in each table, we calculate the cardinality of the joined
form.
Join Orders
-----------
A "Join Order" is a particular permutation of ordering the access to tables
participating in the join to complete the joined relation. Join orders depend
on the type of join or the join topology. Depending on the structuring of the
join predicates, joins can be classified into various types viz chain, star,
cycle, complete graph, etc. For example, consider the query with a three table
join -
SELECT ... FROM t1,t2,t3 WHERE
t1.col1 = t2.col1 AND
t2.col1 = t3.col1 ;
Tables t1, t2 and t3 are joined in a chain order.
Tables t1,t2,t3 can be joined in n! i.e n factorial way .
where n is number of tables .
So in This case it will be 3*2*1=6
Possible join orders:
t1->t2->t3, t2->t1->t3, t2->t3->t1, t3->t2->t1
t1->t3->t2, t3->t1->t2
Note the Join orders such as t1->t3->t2, t3->t1->t2 are evaluated using
Cartesian product as there is no join predicate specified between t1 and t3.
Each of the join orders are possible and need to be evaluated for resource
usage. As number of tables increase, depending on the join topology, the
search space or the possible permutations go up steeply. Several techniques
are used to prune the search space and reduce the work involved in identifying
the best join order.
Evaluating Join Path Costs
--------------------------
For each join order, the optimizer evaluates the cost of performing the join
by breaking the order into pairs. The first part is made of the relations
already joined and the second part of the next table to be joined. Costs are
considered using both join methods currently implemented in the kernel, the
sort-merge join method and the nested-loops join method.
In the sort-merge method, the tables participating in the join are first
sorted by the join columns. The sorted relations are then joined. The cost of
performing join by this method includes the sorting costs and the cost of
retrieving sorted records to perform the join.
In the nested-loops method, we scan the outer table row one at a time and for
each row retrieved, we access the inner table based on the join columns. The
outer table will actually be a joined relation where more than two tables are
joined. The cost of performing the join in this method includes the cost of
accessing the outer table and for each row retrieved and the cost of
fetching the group of rows from the inner table based on the join columns.
For each table in a particular join order, there is a large number of
possible access paths, The table can be accessed using a ROWID or could be
accessed by doing a table scan. Alternatively, it may be accessed using a
single index or a cluster scan or a merge of several single column non unique
indexes.
Several rules are used to narrow down or prune the search space to speed up
access path selection. For example, if there is a lookup by ROWID, it is
considered the best and we don't bother to evaluate other access paths open to
the table.
Selecting the Execution Plan
----------------------------
As the optimizer evaluates the cost of each plan it compares it with the best
cost plan seen so far. It keeps doing that until it gets to the best from the
range of plans it considers. If the best plan selected is not the "ideal"
plan based on the users knowledge of the data, the user can force a different
plan by using "Hints".
Influencing the Cost Optimizer - Hints
--------------------------------------
The rule based Optimizer can be influenced in many ways. The query could
be rearranged with the tables ordered differently to force a certain join
order, functions could be put around indexed columns to avoid using the index,
etc. These techniques were developed by users over a period of time and were
valuable in achieving desired performance. However, this is hard to document
and difficult to generalize. To avoid these difficulties, the cost optimizer
incorporates a mechanism for users to directly influence the behavior of the
optimizer in cases where the optimizer does not produce the best result. Users
can specify directives which include access methods, join processing techniques,
join orders, etc. through hints.
Hints are specified in comments to SQL statements to keep them portable
across platforms. Hints which are invalid e.g referencing an index
on a column not referenced in the predicates, will be ignored. Also, some hints
may be sometimes lost e.g. when view optimization is performed, the query on
the view is merged into the underlying tables and the hints specified with the
view may not be used.
The cost optimizer by default tries to optimize a query to reduce the overall
cost of retrieving all the rows to be fetched by a query. This can be changed
in some cases where the user, running an interactive application like
SQL*Forms, might want to optimize for the first row. This can be done by
setting session level goal for the optimizer using the "alter session set
optimizer_goal" statement. What the cost optimizer does is to re-evaluate the
cost by dividing the cost of all rows by the cardinality of the query. This is
done to make certain operations such as indexed scan or nested loops superior
to other access methods.
For more information on hints see: Note.29236.1 QREF: SQL Statement HINTS
Note: The optimizer_mode values RULE and CHOOSE are no longer used as of 10gR1 and higher. We now automatically calculate statistics to calculate the best possible plan using the Cost Based Optimizer:
Document 1270930.1 Upgrade From 9i OPTIMIZER_MODE RULE Or CHOOSE No Longer Used In 10g And Higher: Use ALL_ROWS Or FIRST_ROWS _n
Comments
Post a Comment