All About Oracle Latches

What Are Latches?
Latches vs Enqueues
How Latches Work
Causes of Contention for Specific Latches
Reducing Contention for Internal Latches
Measuring Latch Contention
Tuning Applications to Avoid Latch Contention
Identifying Hot Blocks
What Are Latches?
Latches are serialization mechanisms that protect areas of Oracle’s
shared memory (the SGA). In simple terms, latches prevent two processes
from simultaneously updating — and possibly corrupting — the same area
of the SGA. A latch is a type of lock that can be very quickly
acquired and freed. Latches are typically used to prevent more than one
process from executing the same piece of code at a
given time.
Oracle sessions need to update or read from the SGA for almost all
database operations. For instance:
• When a session reads a block from disk, it must modify a free block
in the buffer cache and adjust the buffer cache LRU chain
• When a session reads a block from the SGA, it will modify the LRU
chain.
• When a new SQL statement is parsed, it will be added to the library
cache within the SGA.
• As modifications are made to blocks, entries are placed in the redo
buffer.
• The database writer periodically writes buffers from the cache to
disk (and must update their status from “dirty” to “clean”).
• The redo log writer writes entries from the redo buffer to the redo
logs.
Latches prevent any of these operations from colliding and possibly
corrupting the SGA.
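At its core, a latch is little more than an atomically tested flag. The sketch below is an illustrative model (not Oracle internals): Python's non-blocking lock acquire stands in for the hardware test-and-set instruction a real latch would use; the class and method names are invented for the example.

```python
# Hedged sketch: a latch modeled as an atomic test-and-set flag.
# Names (Latch, try_acquire) are illustrative, not Oracle internals.
import threading

class Latch:
    """Minimal latch: a single flag guarded by an atomic operation.
    lock.acquire(blocking=False) stands in for the hardware
    test-and-set instruction a real latch implementation would use."""
    def __init__(self):
        self._flag = threading.Lock()

    def try_acquire(self):
        # Succeeds immediately or fails immediately; no wait queue.
        return self._flag.acquire(blocking=False)

    def release(self):
        self._flag.release()

latch = Latch()
assert latch.try_acquire()       # first taker wins
assert not latch.try_acquire()   # a second taker fails instantly
latch.release()
assert latch.try_acquire()       # free again after release
```

The key property, in contrast to heavier locks, is that acquisition either succeeds or fails immediately; what happens on failure is covered in the sections that follow.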
Latches vs Enqueues
Enqueues are another type of locking mechanism used in Oracle. An
enqueue is a more sophisticated mechanism that permits several
concurrent processes to have varying degrees of sharing of "known"
resources. Any object that can be concurrently used can be protected
with enqueues. A good example is table locks: we allow varying
levels of sharing on tables, e.g., two processes can lock a table in
share mode or in share update mode. One difference is that the
enqueue is obtained using an OS-specific locking mechanism. An enqueue
allows the user to store a value in the lock, i.e., the mode in which it
is being requested. The OS lock manager keeps track of the resources
locked. If a process cannot be granted the lock because its request is
incompatible with the mode already held, and the lock is requested with
wait, the OS puts the requesting process on a wait queue which is
serviced in FIFO order. Another difference between latches and enqueues
is that with latches there is no ordered queue of waiters as there is
with enqueues. Latch waiters either use timers to wake up and retry, or
spin (only on multiprocessors). Since all waiters retry concurrently
(depending on the scheduler), any of them might get the latch, and
conceivably the first one to try might be the last one to get it.
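The contrast between the two wait disciplines can be sketched as follows. This is an illustrative model only: the function names are invented, and the random selection simply stands in for "whichever waiter's retry happens to land first under the scheduler."

```python
# Hedged sketch contrasting the two wait disciplines described above:
# an enqueue services waiters strictly FIFO, while latch waiters all
# retry and any of them may win. Illustrative only, not Oracle code.
from collections import deque
import random

def enqueue_grant_order(waiters):
    """Enqueue: the OS lock manager wakes waiters in arrival order."""
    queue = deque(waiters)
    return [queue.popleft() for _ in range(len(queue))]

def latch_grant_order(waiters, seed=0):
    """Latch: every waiter retries concurrently; whoever's retry lands
    first gets the latch, so the grant order is effectively arbitrary."""
    rng = random.Random(seed)
    pool = list(waiters)
    order = []
    while pool:
        order.append(pool.pop(rng.randrange(len(pool))))
    return order

sessions = ["s1", "s2", "s3", "s4"]
print(enqueue_grant_order(sessions))  # always arrival order
print(latch_grant_order(sessions))    # arbitrary; s1 may even be last
```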
How Latches Work
Because the duration of operations against memory is very small
(typically in the order of nanoseconds) and the frequency of latch
requests very high, the latching mechanism needs to be very
lightweight.
If the latch is already in use, Oracle can assume that it will not be
in use for long, so rather than go into a passive wait (e.g.,
relinquish the CPU and go to sleep) Oracle will retry the operation a
number of times before giving up and going to passive wait. This
algorithm is called acquiring a spinlock and the number of “spins”
before sleeping is controlled by the Oracle initialization parameter
“_spin_count”.
The first time a session fails to acquire the latch by spinning, it
sleeps and attempts to awaken after 10 milliseconds. Subsequent waits
increase in duration and in extreme circumstances may exceed one
second. In a system suffering from intense latch contention, these
waits have a severe impact on response time and throughput.
Causes of Contention for Specific Latches
If a required latch is busy, the process requesting it spins: it
retries in a tight loop, up to a maximum number of times determined by
the initialization parameter _SPIN_COUNT. If after this entire loop the
latch is still not available, the process must yield the CPU and go to
sleep. Initially it sleeps for one centisecond, and this time is
doubled on every subsequent sleep. This causes a slowdown and results
in additional CPU usage until the latch becomes available. The CPU
usage is a consequence of the "spinning": between its sleeps, the
process actively keeps testing the latch for availability.
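The spin-then-sleep loop just described can be sketched as below. This is a conceptual model under stated assumptions: the SPIN_COUNT value, the one-second cap, and the function names are illustrative stand-ins, not Oracle's actual implementation.

```python
# Hedged sketch of the spin-then-sleep loop described above: spin up
# to SPIN_COUNT times, then sleep 10 ms, doubling the sleep on each
# subsequent failure (capped here at 1 s). Values are illustrative.
import time

SPIN_COUNT = 2000  # stand-in for the _SPIN_COUNT parameter

def acquire_with_backoff(try_acquire, max_sleep=1.0):
    sleep = 0.01  # first wait: one centisecond
    while True:
        for _ in range(SPIN_COUNT):        # active spin on the CPU
            if try_acquire():
                return
        time.sleep(sleep)                  # passive wait, yield the CPU
        sleep = min(sleep * 2, max_sleep)  # exponential backoff

# Usage: a toy latch that becomes free on the 3rd test.
attempts = {"n": 0}
def try_acquire():
    attempts["n"] += 1
    return attempts["n"] >= 3

acquire_with_backoff(try_acquire)
```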
The latches that most frequently affect performance are those
protecting the buffer cache, Redo Log Buffer and Shared Pool.
• Buffer cache latches (Cache buffers
chains latch, Cache buffers LRU chain latch): These latches
protect the structures of the buffer cache, such as the hash chains
used to locate cached blocks and the LRU lists. Contention on these
latches is typically
caused by concurrent access to a very "hot" block, and the most common
type of such a hot block is an index root or branch block (since any
index based query must access the root block).
• Redo copy/redo allocation latches:
These latches protect the redo log
buffer, which buffers entries made to the redo log. Recent improvements
(from Oracle 7.3 onwards) have reduced the frequency and severity of
contention for these latches.
• Shared pool latches (Row cache
objects latch, Library cache latch, Shared pool latch): These
latches are held when allocations or
de-allocations of memory occur in the shared pool, and when the
library cache (in which sharable SQL is stored) is searched or
modified. In a well designed application there should be little or no
contention for these latches, but in an application that uses literals
instead of bind variables (for instance "WHERE surname='HARRISON'"
rather than "WHERE surname=:surname"), library cache latch contention
is common. Prior to Oracle 8.1.7, another common cause of shared pool
latch contention was an overly large shared pool and/or failure to
make use of the reserved area of the shared pool.
Reducing Contention for Internal Latches
We can reduce contention for these latches and tune them by adjusting
certain init.ora parameters.
Cache buffers chains latch:
Contention on this latch may be related to the size of the buffer
cache, but it may also be caused by a "hot block" (a block that is
very heavily accessed). Before increasing the parameter
DB_BLOCK_BUFFERS, check that specific blocks are not causing the
contention, so as to avoid wasting memory.
Cache buffers LRU chain latch:
Using multiple buffer pools and adjusting the parameter
DB_BLOCK_LRU_LATCHES to provide multiple LRU latches will help reduce
contention on this latch.
Redo allocation latch:
Contention for this latch in Oracle7 can be reduced by decreasing the
value of LOG_SMALL_ENTRY_MAX_SIZE on multi-cpu systems to force the use
of the redo copy latch. In Oracle8i this parameter is obsolete, so
instead consider increasing the size of the LOG_BUFFER or reducing the
load on the log buffer by using NOLOGGING features where possible.
Redo copy latch:
This latch is waited for on both single and multi-cpu systems. On
multi-cpu systems, contention can be reduced by increasing the value of
LOG_SIMULTANEOUS_COPIES (Hidden in Oracle8i) and/or increasing
LOG_ENTRY_PREBUILD_THRESHOLD (undocumented in Oracle7).
Row cache objects latch:
In order to reduce contention for this latch, we need to tune the data
dictionary cache. In Oracle7 this basically means increasing the size
of the shared pool (SHARED_POOL_SIZE) as the dictionary cache is a part
of the shared pool.
Library cache and shared pool latches:
The first step in reducing contention on these latches is to ensure
that the application reuses SQL statements as much as possible. If the
application is already well tuned, SHARED_POOL_SIZE can be increased.
Be aware that if the application is not using the library cache
appropriately, contention may be worse with a larger structure to
manage.
Measuring Latch Contention
A - Ratio-Based Techniques (the old, wrong way)
We see that for each latch, the number of gets (requests for the
latch), misses (number of times the first request fails) and sleeps
(number of times a session failed to obtain a latch by spinning) are
recorded. In the past, queries such as the following were often used to
determine latch health:
select substr(name,1,45) name,
gets, misses, misses*100/gets misspct
from v$latch
where gets > 0
order by 4 desc;
NAME                                     GETS      MISSES  MISSPCT
---------------------------------  ----------  ----------  -------
latch wait list                            32           1     3.13
process allocation                         28           0      .00
session allocation                  1,223,068          84      .01
session switching                       8,009           0      .00
process group creation                     40           0      .00
session idle bit                    2,426,940           1      .00
shared java pool                        1,188           0      .00
event group latch                          28           0      .00
messages                            2,128,851         461      .02
enqueues                            3,168,279           7      .00
enqueue hash chains                 1,747,312           2      .00
channel handle pool latch                  40           0      .00
checkpoint queue latch              1,459,929          26      .00
cache buffers chains               13,851,179      16,254      .12
This approach was flawed on a number of levels:
• It is actually the number of sleeps that most accurately reflects
the impact of latch contention on response time.
• A high miss rate is expected for certain latches.
• A latch with a high miss rate (or sleep rate) that is not frequently
accessed is probably not impacting performance.
• Even if a latch is experiencing a high sleep rate, we can't determine
the impact on performance without taking into account waits for other
resources. If sessions spend 90% of their wait time on IO, 8% on CPU
and 2% on latches, then expending effort on halving the latch sleep
waits provides only a 1% improvement in response time, which is
probably not noticeable.
In the above example the "latch wait
list" latch has the highest miss rate. However, this is totally
irrelevant since it was only requested 32 times, while the "cache
buffers chains" latch appears to have only a moderate miss rate, but
has been requested almost 14 million times and, as we shall see, is
the latch most affecting performance.
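The arithmetic behind this point can be made concrete with the figures from the sample output above (the numbers below are taken directly from that listing):

```python
# Hedged illustration of the point above: a high miss *rate* on a
# rarely requested latch matters far less than a modest rate on a hot
# one. Figures are taken from the sample v$latch output earlier.
latches = {
    "latch wait list":      {"gets": 32,         "misses": 1},
    "cache buffers chains": {"gets": 13_851_179, "misses": 16_254},
}
for name, s in latches.items():
    rate = s["misses"] * 100 / s["gets"]
    print(f"{name}: miss rate {rate:.2f}%, absolute misses {s['misses']:,}")
# "latch wait list" has the worse rate (about 3%) but only 1 miss in
# total; "cache buffers chains" misses 16,254 times and dominates.
```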
B - Wait Interface-Based Techniques (the better way)
A better approach to estimating the impact of latch contention is to
consider the relative amount of time being spent waiting for latches.
The following query gives us some indication of this:
SELECT substr(event,1,50) event, time_waited,
       round(time_waited*100/SUM(time_waited) OVER(),2) wait_pct
  FROM (SELECT event, time_waited
          FROM v$system_event
         WHERE event NOT IN
               ('Null event', 'client message', 'rdbms ipc reply',
                'smon timer', 'rdbms ipc message', 'PX Idle Wait',
                'PL/SQL lock timer', 'file open', 'pmon timer',
                'WMON goes to sleep', 'virtual circuit status',
                'dispatcher timer', 'SQL*Net message from client',
                'parallel query dequeue wait', 'pipe get')
        UNION
        SELECT name, value
          FROM v$sysstat
         WHERE name LIKE 'CPU used when call started')
 ORDER BY 2 DESC;
EVENT                           TIME_WAITED   WAIT_PCT
------------------------------  -----------  ---------
latch free                            40144      31.67
CPU used when call started            30341      23.94
control file sequential read          12341       9.74
direct path read                      11933       9.41
control file parallel write            6487       5.12
file identify                          5666       4.47
log file sync                          3492       2.75
log file parallel write                3213       2.53
instance state change                  3064       2.42
log file switch completion             3049       2.41
db file sequential read                2290       1.81
Now we can look at the sleeps in v$latch to determine which latches are
likely to be contributing most to this problem:
select substr(name,1,35) name,
gets, sleeps,
sleeps*100/sum(sleeps) over() sleep_pct, sleeps*100/gets sleep_rate
from v$latch
where gets > 0
order by sleeps desc;
NAME                                  GETS   SLEEPS  SLEEP_PCT  SLEEP_RATE
------------------------------  ----------  -------  ---------  ----------
cache buffers chains            13,863,552   38,071      99.48       .2746
session allocation               1,223,982      110        .29       .0090
checkpoint queue latch           1,461,039       39        .10       .0027
library cache                    9,239,751       22        .06       .0002
shared pool                        869,652       16        .04       .0018
messages                         2,130,515        6        .02       .0003
redo writing                     1,330,987        6        .02       .0005
latch wait list                         33        0        .00       .0000
session switching                    8,014        0        .00       .0000
session idle bit                 2,428,851        0        .00       .0000
enqueues                         3,171,018        0        .00       .0000
channel handle pool latch               40        0        .00       .0000
message pool operations parent           3        0        .00       .0000
Now we are in a position to make some reasonable conclusions:
• Latch sleeps contribute about 30% of database response time (very
excessive), and
• it is the cache buffers chains latch that contributes the vast
majority of these waits.
Note that if we had used the conventional “ratio based” analysis
outlined in the previous section we would have discounted cache buffers
chains latches as a problem because the miss rate was “only” 0.15%.
Tuning the Application to Avoid Latch Contention
There are some things we can do within our application design that can
reduce contention for latches.
Using Bind Variables
As noted earlier, failure to use bind variables within an application
is the major cause of library cache latch contention. All Oracle
applications should make use of bind variables whenever possible.
However, all is not lost if you are unable to modify your application
code. From 8.1.6 onwards you can use the CURSOR_SHARING parameter to
cause Oracle to modify SQL on the fly to use bind variables. A setting
of FORCE causes all literals to be converted to bind variables. A
setting of SIMILAR causes a statement to be rewritten only if doing so
would not change the statement's execution plan (a change which can
happen if there are histogram statistics defined on a column referenced
in the WHERE clause). The default value for this parameter is EXACT.
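Conceptually, CURSOR_SHARING=FORCE normalizes statement text so that statements differing only in their literals share one cursor. The sketch below models that idea only; the regex and the :SYS_B_n placeholder format are illustrative simplifications, not Oracle's actual rewriting logic.

```python
# Hedged sketch of what CURSOR_SHARING=FORCE does conceptually:
# literals in the SQL text are replaced by generated bind placeholders
# so textually different statements share one cursor. The regex and
# placeholder names are illustrative, not Oracle's implementation.
import re

def force_binds(sql):
    binds = []
    def repl(m):
        binds.append(m.group(0))
        return f":SYS_B_{len(binds) - 1}"
    # Replace quoted strings and bare integer literals, left to right.
    rewritten = re.sub(r"'[^']*'|\b\d+\b", repl, sql)
    return rewritten, binds

sql1, _ = force_binds("SELECT * FROM emp WHERE surname = 'HARRISON'")
sql2, _ = force_binds("SELECT * FROM emp WHERE surname = 'SMITH'")
# Both rewrite to the same shareable text:
assert sql1 == sql2 == "SELECT * FROM emp WHERE surname = :SYS_B_0"
```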
Avoiding Hot Blocks
Cache buffers chains latch contention is one of the most intractable
types of latch contention. There are a couple of things you can do at
the application level to reduce the severity of this type of contention.
Firstly, identify the blocks that are "hot." Metalink note 163424.1,
"How to Identify a Hot Block Within The Database," describes how to do
this. Having identified the hot block, you will most likely find that
it is an index root or branch block. If this is the case, there are
two application design changes that may help.
1) Consider partitioning the table and
using local indexes. This might allow you to spread the heat amongst
multiple indexes (you will probably want to use a hash partition to
ensure an even spread of load amongst the partitions).
2) Consider converting the table to a hash cluster keyed on the columns
of the index. This allows the index to be bypassed completely and may
also result in some other performance improvements. However, hash
clusters are suitable only for tables of relatively static size, and
determining an optimal setting for the SIZE and HASHKEYS storage
parameters is essential.
Investigating _spin_count
Prior to Oracle 8.1, the spin count parameter (_spin_count or
latch_spin_count) was a documented parameter and many DBAs attempted to
adjust it to resolve latch contention. However, as of Oracle8i the
parameter is "undocumented": it does not appear in v$parameter and is
not documented in the Oracle reference manual. Why did Oracle do this?
The official Oracle corporate line is that the value of _spin_count is
correct for almost all systems and that adjusting it can degrade
performance. For instance, Metalink Note:30832.1 says: "If a system is
not tight on CPU resource _spin_count can be left at higher values but
anything above 2000 is unlikely to be of any benefit." However, I
believe that higher values of _spin_count can relieve latch contention
in many circumstances, and I think Oracle deprecated the parameter
incorrectly.
Identifying Hot Blocks
Metalink Note:163424.1
Goals:
How to identify blocks which cause latch contention on the 'cache
buffers chains' latch.
How to identify a hot block within the database buffer cache
Possible hot blocks in the buffer cache can normally be identified by a
high or rapidly increasing wait count on the cache buffers chains
latch. This latch is acquired when searching for data blocks cached in
the buffer cache. Since the buffer cache is implemented as a set of
hash chains of buffers, each of those chains is protected by a child of
this latch whenever it needs to be scanned. Contention on this latch
can be caused by very heavy access to a single block, which may require
the application to be reviewed.
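The hash-chain structure just described can be sketched as follows. This is a conceptual model only: the number of child latches and the hash function are illustrative assumptions, not Oracle's actual values or algorithm.

```python
# Hedged sketch of the structure described above: cached buffers are
# hashed (by data block address) onto chains, and each chain is
# protected by a child of the cache buffers chains latch. A single
# very hot block therefore funnels all its lookups through one child
# latch. N_CHILD_LATCHES and the hash are illustrative assumptions.
N_CHILD_LATCHES = 1024

def child_latch_for(file_id, block_id):
    # Every lookup of this block must take the same child latch.
    return hash((file_id, block_id)) % N_CHILD_LATCHES

# A hot index root block: every access maps to one latch...
hot = child_latch_for(4, 474)
assert all(child_latch_for(4, 474) == hot for _ in range(1000))
# ...while accesses to distinct blocks spread over many latches.
spread = {child_latch_for(4, b) for b in range(1000)}
assert len(spread) > 1
```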
By examining the waits on this latch, information about the segment and
the specific block can be obtained using the following queries.
First, determine which latch ids (ADDR) are interesting by examining
the number of sleeps for this latch. The higher the sleep count, the
more interesting the latch id (ADDR) is:
select CHILD# "cCHILD", ADDR "sADDR",
       GETS "sGETS", MISSES "sMISSES", SLEEPS "sSLEEPS"
  from v$latch_children
 where name = 'cache buffers chains'
 order by 4, 1, 2, 3;
Run the above query a few times to establish the id (ADDR) that has
the most consistent amount of sleeps. Once the id(ADDR) with the
highest sleep count is found then this latch address can be used to get
more details about the blocks currently in the buffer cache protected
by this latch. The query below should be run just after determining the
ADDR with the highest sleep count.
column segment_name format a35
select /*+ RULE */
       e.owner ||'.'|| e.segment_name segment_name,
       e.extent_id extent#,
       x.dbablk - e.block_id + 1 block#,
       x.tch,
       l.child#
  from sys.v$latch_children l,
       sys.x$bh x,
       sys.dba_extents e
 where x.hladdr = 'ADDR' and
       e.file_id = x.file# and
       x.hladdr = l.addr and
       x.dbablk between e.block_id and e.block_id + e.blocks - 1
 order by x.tch desc;
SEGMENT_NAME           EXTENT#  BLOCK#  TCH  CHILD#
---------------------  -------  ------  ---  ------
SCOTT.EMP_PK                 5     474   17   7,668
SCOTT.EMP                    1     449    2   7,668
Depending on the TCH column (the number of times the block is hit by a
SQL statement), you can identify a hot block. The higher the value of
the TCH column, the more frequently the block is accessed by SQL
statements.
In order to reduce contention for this object, the following mechanisms
can be put in place:
1) Examine the application to see if the execution of certain DML and
SELECT statements can be reorganized to eliminate contention on the
object.
2) Decrease the buffer cache, although this may only help in a small
number of cases.
3) DBWR throughput may be a factor in this as well; if using multiple
database writers, consider increasing the number of DBWR processes.
4) Increase PCTFREE for the table storage parameters via ALTER TABLE
or a rebuild. This will result in fewer rows per block.
5) Consider implementing reverse key indexes (if range scans aren't
commonly used against the segment).
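The reason a reverse key index helps with a hot block can be sketched as follows: by reversing the key bytes, consecutively generated key values no longer sort adjacently, so inserts and lookups no longer all land on the same right-hand leaf block. The byte widths and sample values below are illustrative only.

```python
# Hedged sketch of why a reverse key index spreads a hot leaf block:
# reversing the key bytes makes consecutive key values land far apart
# in index key order. Key width and sample values are illustrative.
def reverse_key(key_bytes):
    return bytes(reversed(key_bytes))

# Consecutive sequence numbers as 4-byte big-endian keys:
keys = [n.to_bytes(4, "big") for n in (255, 256, 257)]
# Normally these sort adjacently, all hitting the same leaf block:
assert sorted(keys) == keys
# Reversed, the low-order byte leads, scattering them in key order:
reversed_keys = [reverse_key(k) for k in keys]
assert sorted(reversed_keys) != reversed_keys
```

The trade-off, as the note above implies, is that range scans on the original key order are no longer served efficiently by the index.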