The steps of query optimization are explained below.

a) Cast into some Internal Representation – This step involves representing each SQL query into some internal representation which is more suitable for machine manipulation. The internal form typically chosen is a query tree as shown below.

Query Tree for the SELECT statement discussed above:

b)Convert to Canonical Form – In this second step, the optimizer makes use of some transformation laws or rules for sequencing the internal operations involved. Some examples are given below.
(Note: In all these examples the second form will be more efficient irrespective of the actual data values and physical access paths that exist in the stored database. )

Rule 1:

(A JOIN B) WHERE restriction_A AND restriction_B

(A WHERE restriction_A) JOIN (B WHERE restriction_B)
Restrictions when applied first, cause eliminations and hence better performance.

Rule 2:

(A WHERE restriction_1) WHERE restriction_2

A WHERE restriction_1 AND restriction_2
Two restrictions applied as a single compound one instead applying the two individual restrictions separately.

Rule 3:

(A[projection_1])[projection_2]

A[projection_2]

If there is a sequence of successive projections applied on the same relation, all but the last one can be ignored. i.e., The entire operation is equivalent to applying the last projection alone.

Rule 4:

(A[projection]) WHERE restriction

(A WHERE restriction)[projection]
Restrictions when applied first, cause eliminations and hence better performance.

Reference [1] gives more such general transformation laws.
c)Choose Candidate Low-level Procedures – In this step, the optimizer decides how to execute the transformed query. At this stage factors such as existence of indexes or other access paths, physical clustering of records, distribution of data values etc. are considered.

The basic strategy here is to consider the query expression as a set of low-level implementation procedures predefined for each operation. For eg., there will be a set of procedures for implementing the restriction operation: one (say, procedure 'a') for the case where the restriction attribute is indexed, one (say, procedure 'b') where the restriction attribute is hashed and so on.

Each such procedure has and associated cost measure indicating the cost, typically in terms of disk I/Os.

The optimizer chooses one or more candidate procedures for each low-level operations in the query. The information about the current state of the database (existence of indexes, current cardinalities etc.) which is available from the system catalog will be used to make this choice of candidate procedures.

d)Generate Query Plans and Choose the Cheapest – In this last step, query plans are generated by combining a set of candidate implementation procedures. This can be explained with the following example(A trivial one but illustrative enough).

Assume that there is a query expression comprising a restriction, a join and a projection. Some examples, of implementation procedures available for each of these operations can be assumed as given in the table below.

Operation
Condition Existing
Implementation Procedure
Restriction Restriction attribute is indexed
a
Restriction Restriction attribute is hashed
b
Restriction Restriction attribute is neither indexed nor hashed
c
Join
d
Join
e
Projection
f
Projection
g

Now the various query plans for the original query expression can be generated by making permutations of implementation procedures available for different operations. Thus the query plans can be

– adf
- adg
– aef
– aeg
– bdf
...
...

It has to be noted that in reality, the number of such query plans possible can be too many and hence generating all such plans and then choosing the cheapest will be expensive by itself. Hence a heuristic reduction of search space rather than exhaustive search needs to be done. Considering the above example, one such heuristic method can be as follows:

If the system knows that the restriction attribute is neither indexed nor hashed, then the query plans involving implementation procedure 'c ' alone (and not 'a' and 'b') need to be considered and the cheapest plan can be chosen from the reduced set of query plans.