Oracle Database Backup And Recovery Information
Options
Backup Modes:
Offline or Cold Backup
Online or Hot Backup
A quick review of Oracle backup and recovery
options.
       |   Backup Types     |  Backup Modes
       | Logical | Physical | Offline | Online
Export |    Y    |          |         |   Y
Cold   |         |    Y     |    Y    |
Hot    |         |    Y     |         |   Y
Online backups are performed while the database is
still running; off-line backups are performed while the database is
down. Logical backups can also be used for migrating between database
versions and Oracle platforms. In Oracle these are performed using the
IMPORT/EXPORT utilities.
Copying the datafiles, control file and online redo log files must be done using an operating system copy utility. This is considered a complete backup of the database. Any changes made after this backup will be unrecoverable if the database is running in NOARCHIVELOG mode. All transactions are recorded in the online redo log files whether or not archiving is enabled. When redo logs are archived (ARCHIVELOG mode), ORACLE allows you to apply these transactions after restoring files that were damaged (assuming an active redo log file was not among the files damaged).

Whenever the schema of the database is changed, i.e., a new datafile is added, a file is renamed, or a tablespace is created or dropped, shut down the database and at least make a copy of the control file and the newly added datafile. A complete backup of the database is preferred.
Procedure
1. Execute script to generate suitable backup script to
accomplish (5) below (where applicable).
2. Backup the control file to trace file.
3. Shut down application (where applicable).
4. Shut Oracle down cleanly. If a SHUTDOWN ABORT has
been issued, the instance must be restarted and closed again.
5. Copy all relevant database files to a staging area.
These comprise the datafiles, the control files and the online redo log files.
6. Restart database.
7. Copy the files created to offline storage media (e.g.
tape). Ideally these copies will be left on disk to minimise the time
taken to recover the database.
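As an illustration, steps 2, 4 and 6 might look like this from Server Manager (a minimal sketch; the shutdown mode and the copy step are illustrative, not prescriptive):

ALTER DATABASE BACKUP CONTROLFILE TO TRACE;
SHUTDOWN IMMEDIATE;
-- copy the datafiles, control files and online redo logs to the
-- staging area with an O/S utility while the database is down
STARTUP;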
At sites where the database must operate 24 hours per day and it is not feasible to take offline backups, ORACLE provides an alternative: physical backups performed while the database remains available for both reading and updating. For this kind of backup the database must be in ARCHIVELOG mode. Only the data files and the current control file need to be backed up. Unlike offline backups, the unit of an online backup is the tablespace, and any or all tablespaces can be backed up whenever needed. Different datafiles can be backed up at different times. This process should be fully automated if possible. Procedure:
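The procedure steps themselves appear to be missing from this copy. As a rough sketch, an online backup of a single tablespace usually follows this pattern (the USERS tablespace and the paths are illustrative assumptions):

ALTER TABLESPACE users BEGIN BACKUP;
-- copy the tablespace's datafiles with an O/S utility
ALTER TABLESPACE users END BACKUP;
-- force a log switch so the redo covering the backup is archived
ALTER SYSTEM SWITCH LOGFILE;
ALTER DATABASE BACKUP CONTROLFILE TO '/backups/control.bkp';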
Not all data files need be backed up in this operation; for example, read-only tablespaces only need to be backed up once, after the tablespace is made read-only. All of the above processing can be performed while the database is open. Ideally it should occur during a period of low transaction activity.
Frequency of online backups
In order to determine how frequently to back up the files of a database, balance the amount of time available for taking backups against the time available for recovery after media failure. The time for recovery depends on how old your most recent copy of the damaged file is. The older the backup, the greater the number of redo log files needed to be applied, and the longer recovery will take.
Backup strategies must be tested before being used to protect a production database.
Ensure that backups of all datafiles and of all necessary redo logs are kept, and that backups are restored correctly. (If file compression is used, verify that the file is correct after decompression.)
What happens between BEGIN BACKUP and END BACKUP?
Once the ALTER TABLESPACE ts_name BEGIN BACKUP command is issued, two things happen:
(1) Extra information is recorded in the redo logs. Because of this, it is important that on-line backups are done as quickly as possible, and also, if possible, during a quieter period when there is less update activity on the database.
(2) The status in the datafile header is changed to indicate that the datafile is being backed up. Oracle stops recording the occurrence of checkpoints in the header of the database files. This means that when a database file is restored, it will have knowledge of the most recent checkpoint that occurred BEFORE the backup, not any that occurred during the backup. This way, the system will ask for the appropriate set of redo log files to apply should recovery be needed. Since vital information needed for recovery is recorded in the Redo logs, these REDO LOGS are considered as part of the backup. Hence, while backing up the database in this way the database must be in ARCHIVELOG mode. Status in the datafile header is not reset until END BACKUP is issued.
On END BACKUP, the system ceases writing this extra information to the redo-logs, and recommences noting the occurrence of the checkpoints in each file of the database. The checkpoint in the datafile header is changed during the next log switch after END BACKUP is issued. The above information will allow the tablespace to be recovered as if the database had been offline when the backup took place.
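To see which datafiles are currently marked as being backed up, the V$BACKUP view can be queried; a STATUS of ACTIVE means the file is still between BEGIN BACKUP and END BACKUP:

SELECT file#, status FROM v$backup;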
All the files belonging to the database are important. Along with other tablespaces, special care should be taken to ensure that the SYSTEM tablespace and tablespaces containing rollback segments are protected by backups. Also back up the control file and any newly added datafile immediately after adding the datafile to a tablespace or after creating a tablespace, if archiving is enabled. If media failure damages a datafile that has not been backed up, recovering its tablespace is not possible. After backing up the newly added datafile, include it in the regular datafile backup rotation.
Identifying the files to back up: use the V$ dynamic
performance views to find the file names:
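The queries themselves are missing here; the usual approach is one query per file type against the standard dynamic performance views:

SELECT name FROM v$datafile;        -- datafiles
SELECT name FROM v$controlfile;     -- control files
SELECT member FROM v$logfile;       -- online redo log files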
Archiving is the process of writing data from a
filled online redo log file to a backup destination.
An Oracle database can operate in either NOARCHIVELOG or
ARCHIVELOG mode. Oracle writes to the current online redo log file
until it fills up, then begins writing to the next one. When the
last online redo log file is full, Oracle cycles back to the first one.
In ARCHIVELOG mode each log file is saved before being overwritten: the
archive process will back up each redo log file as soon as it is
filled. The ARCHIVELOG function enables a whole range of possible
backup options:
Points to note:
Recommendations:
It's important to identify exactly what recovery facilities are required by the site. This will determine the frequency and type of backup that will have to be taken. The answers to several questions must be established before attempting to implement any backup procedures.
Implementing a backup strategy requires consideration of the following.
How much data can you afford to lose?
If you can afford to lose up to a day's worth of data,
then running in NOARCHIVELOG mode and taking nightly off-line backups
is sufficient. Any changes applied to the database after the most
recent backup will be unrecoverable.
If you cannot afford to lose transactions performed
since the last backup then you must run in ARCHIVELOG mode. The
archived redo log files contain a record of all transactions applied
since the last backup. These transactions can be reapplied after
restoring from backup any datafiles that were damaged.
How often, and for how long, can the database be
off-line to perform backups?
Many sites operate 24 hours per day. Even at sites where
the database can be shutdown for maintenance, a complete backup may
take longer than the window available. When it is not feasible to take
off-line backups often enough to provide the protection from data loss
the site requires, then online backups are the only option.
Should recovery be needed, how quickly do you
need to recover your data?
The cost of downtime (or database outage) may be a key
factor in configuring the backup strategy. Recovery times vary by
backup type: logical backups (export) take the longest, while
physical backups provide the fastest recovery times.
Do you need the capability to reverse changes
made to the database?
Do you wish to be able to protect against inadvertent
changes made to both the content and structure of the database? For
example, bringing back a dropped table, or removing a datafile added to
the wrong tablespace.
How to minimize vulnerabilities during backup and recovery?
How can physical and logical errors be monitored and checked for?
Guidelines for scheduling Routine backups
Non routine backups are required in the following circumstances:
Backup and recovery performance can be improved using the following guidelines:
Implementing a Backup Strategy Using ARCHIVELOG
Procedures to enable Archive Log Mode
Enabling automatic archiving (init.ora) method
1. Shut down the database from Server Manager:
SHUTDOWN;
2. Take an O/S backup of all database files, redo logs, and control files.
3. Set two parameters in INIT.ORA.
4. Start up in ARCHIVELOG mode from Server Manager:
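The two INIT.ORA parameters are typically LOG_ARCHIVE_START and LOG_ARCHIVE_DEST (LOG_ARCHIVE_FORMAT is often set at the same time); the values below are illustrative:

# INIT.ORA
log_archive_start = true            # start the ARCH process automatically
log_archive_dest = /u01/arch        # where archived logs are written
log_archive_format = arch_%s.arc    # %s is the log sequence number

Then, from Server Manager:

STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;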
Enabling manual archiving method
Disabling automatic archiving
From within Server Manager:
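The commands for these last two methods did not survive in this copy. Typically, manual archiving means running with LOG_ARCHIVE_START = FALSE and archiving filled logs by hand, while automatic archiving can be stopped at runtime; a sketch:

ALTER SYSTEM ARCHIVE LOG ALL;    -- manually archive all filled online redo logs
ALTER SYSTEM ARCHIVE LOG STOP;   -- disable automatic archiving by ARCH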
Checking backups for corruption
If you use RMAN to perform backups, the contents of this paper don’t apply to you. Being an Oracle utility, RMAN knows the structure of Oracle blocks, and can therefore detect corruption as it is backing things up. If any corruption is encountered, it will throw an error stack the size of the Empire State Building, and proceed no further.
If you are using Operating System techniques to take your backups, though, then there is a real risk that the resulting copies of Data Files and so forth might get corrupted as part of the backup process, or that the Data Files themselves already contain corruption, which the O/S just happily copies across, oblivious to its presence.
Fortunately, Oracle supplies a handy little utility
called "dbverify" which can be used to check the backed up copies of
the Data Files for corruption, once the backup has finished. This
utility is run from the command line, by typing in the command dbv. You
simply tell it what files to check, and supply some other run-time
parameters if needed. You might type something like this, therefore:
C:\> DBV FILE=D:\BACKUPS\SYSTEM01.BKP BLOCKSIZE=8192 LOGFILE=DBVLOG.TXT
Note that you must tell dbverify what size Oracle blocks to expect when it scans the Data File copy. If you miss that parameter out, dbverify will expect to find 2K blocks, and when it encounters anything else, will simply abort with an error.
You’ll also notice there a ‘logfile’ parameter. That’s so that dbverify’s work can be output to a text file for later reading. If you miss it out, the results of the check are simply displayed on screen, which is probably not exactly what you want.
There are some other parameters that can be
supplied, too. For example, you can request a check of parts of a Data
File copy, by specifying the start and end block addresses. If you need
the complete list of available parameters, just type:
C:\> DBV HELP=Y
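For example, to check only part of a file copy (the parameter values here are purely illustrative):

C:\> DBV FILE=D:\BACKUPS\SYSTEM01.BKP BLOCKSIZE=8192 START=1 END=1000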
Now all the Oracle documentation states that dbverify can only be applied to ‘cache managed files’, i.e., ones which are read into the Buffer Cache. So that theoretically rules out running it against the Control Files and seeing whether they are internally corrupt. However, in Oracle versions 8.0, 8i and 9i, you can in fact run the tool against Control Files.
Well, dbverify honestly cannot be used to verify redo log files, whether of the Online or Archived variety (if you try it, you’ll get a report indicating that the entire file is corrupted, which simply means it doesn’t understand their contents). Therefore, the only really viable tool is Log Miner - and that was only introduced in Oracle 8.1.x (though it can be used from there to examine 8.0.x version logs).
If Log Miner encounters corruption during its
analysis session (begun with the EXECUTE
DBMS_LOGMNR.START_LOGMNR(DICTFILENAME=>’HJRDICT.ORA’) command), then
it simply bombs out with an error, at which point you know you have a
problem.
This section is divided into two parts.
An overview of database recovery
A step through the options available to the DBA including example commands.
Prevention, detection and repair
A more detailed look at recovery based on the database component which has failed.
Instance Failure
Instance failure is a hardware, software, or system
failure that prevents an instance from continuing work. It can be
caused by a CPU failure, an operating system failure, a power outage,
failure of one of the ORACLE background processes or a failure to
access a required database file when the file is not lost or damaged.
Instance Recovery
Instance recovery is automatic. Restarting the database
performs the instance recovery. It involves two steps.
1) Rolling forward.
2) Rolling back transactions that have been explicitly rolled
back or have not been committed.
Any resources held by transactions in process at the time of the failure will be released. Instance recovery is not necessary if the database is shutdown normally.
Media failure
Media failure is a hardware, software, or system failure
that prevents reading or writing to files that are required to operate
the database. Media failure is a failure caused by the loss of the
control file, database files or redo log file.
Media Recovery
The database must be operating in ARCHIVELOG mode. In
addition, you must have the latest backup of the database, all online
redo log files, archived logs and the current control file.
You can use different commands to recover your database; they are:
1) RECOVER DATABASE
This command is only used with a current control file.
The database must be mounted, but not OPEN. Oracle compares the control
file with the datafile headers and brings the datafiles up to date by
applying archived redo logs until the datafile headers match the
control file entries. Online redo logs are then applied to make the
datafiles current.
Once the recovery is complete open the database with:
ALTER DATABASE OPEN
2) RECOVER DATAFILE ‘filename’
This command is used when the database is up and can't be
brought down. It can also be used when the database is in the mounted state.
The tablespace which contains these datafiles must be taken offline.
Issue RECOVER DATAFILE ‘filename’. You will be prompted
for log files. The changes will be applied only to these files.
Once the media recovery is complete the tablespace can
be brought online. This command allows 'multi-tasking' recovery:
different datafiles can be recovered in parallel using different
sessions or terminals, which is very useful when there are several
datafiles to be recovered.
3) RECOVER TABLESPACE ‘tsname’
The tablespace must be offline. The database must be in the OPEN
state. This recovers a single tablespace to the current state. The command
can't be used on the SYSTEM tablespace or on a tablespace containing
rollback segments with a status of "in use".
If having the database OPEN is not a requirement, you can recover using standard recovery (RECOVER DATABASE) instead.
4) RECOVER DATABASE UNTIL CANCEL
Manual recovery after media failure enables you to
control how many redo log files are applied to the database. This can be
used to undo an inadvertent change to the database by stopping recovery
before the change took place. Manual recovery is also needed when
recovering with a control file backup (old copy) because the current
control file is not available.
The database must be MOUNT-ed but not OPEN. After MOUNT-ing
the database, connect internal and issue the RECOVER DATABASE UNTIL
CANCEL command. You will then be prompted for redo log files, beginning
with the earliest one recorded in the header of each database file. The
recovery process will continue to prompt for redo log files until
CANCEL is typed when prompted for the next redo log file. Recovery can
be cancelled at the boundary of any redo log.
5) RECOVER DATABASE UNTIL TIME <date&time>
This is the same as RECOVER DATABASE UNTIL CANCEL, except that
recovery is stopped at a specified point in time within a log
file.
The format of <date&time> is YYYY-MM-DD:hh:mm:ss,
where YYYY-MM-DD is the year, month and day component of the
date, and hh is the hour, mm the minutes, and ss the
seconds component of the time.
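For example, to stop recovery just before 2:25 p.m. on a given day (the timestamp is illustrative):

RECOVER DATABASE UNTIL TIME '1998-11-02:14:25:00';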
6) RECOVER DATABASE ... USING BACKUP CONTROLFILE
Can be used for recovery with an old copy (backup) of
control file. Everything else is similar to other RECOVER DATABASE
commands.
Opening the Database
As a security measure, before starting recovery back up the
datafiles, online logs, and control file. If space is a constraint then
at least back up the online logs and control files.
Open the database with: ALTER DATABASE OPEN [NO]RESETLOGS
RESETLOGS
Before using the RESETLOGS option take a cold backup of the
database. Once RESETLOGS is used, the existing redo log files can't be
used at all.
The RESETLOGS option clears all the online redo logs and
modifies all the online data files to indicate that no recovery is
needed. After resetting the redo logs none of the existing log files or
data file backups can be used. The log sequence number in the control
file is reset, which is very important for recovery purposes: recovery
will be applied only to log files whose sequence number is greater than
the log sequence number in the control file. One has to be very
cautious when using the RESETLOGS option. One more important factor to
remember is that all datafiles must be online, otherwise they will
become useless once the database is up.
NORESETLOGS
The NORESETLOGS option allows the online redo logs to be
applied and does not clear the redo log files during startup; it
leaves the online logs intact. It is only used in the scenario where
MANUAL RECOVERY is started and CANCELled, and then RECOVER DATABASE is
started.
Prevention, detection and repair
This report outlines possible database outages and
focuses on procedures needed to prevent, detect and repair these
outages. The report is sectioned by database component and includes
suggestions and recommendations to prevent, detect and repair the loss
of the component. Ideally, proactive mechanisms are implemented to
reduce the likelihood of outages but if they do occur, recovery should
be automated as much as possible.
Control files are key components to recovery. A
control file that reflects the physical structure and current status of
the database must always be available. The control file should be
backed up at the completion of every structural change. For example,
after adding a data file or log file and after the completion of a hot
backup to the database, the control file should be backed up.
Because a control file is an unreadable binary file, Oracle provides a
method to create a readable SQL script. Such a script is generated for
a given control file by issuing an alter database backup controlfile
to trace command. Multiple backup copies of control files
from different times should be kept to enable point-in-time recovery.
Oracle recommends that a backup control file be created after the
completion of a hot backup or after a structural change to the
database. In addition, backing up of the control file to a trace file
should be done as frequently as backing up the control files. The trace
files should have a naming standard which includes the timestamp. Trace
files with and without the resetlog option should be created and kept
readily available in the event the DBA needs to recreate the control
file quickly in a recovery situation. This should be standard practice
in all database environments.
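Both forms of control file backup are single commands; the binary-copy path below is an illustrative assumption:

ALTER DATABASE BACKUP CONTROLFILE TO TRACE;
ALTER DATABASE BACKUP CONTROLFILE TO '/backups/control.bkp' REUSE;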
At least three control files should be used. Each should be on
different controllers and disks. Since loss of any control file will
result in the database crashing, it is always advisable to have the
control files hardware mirrored as well. Hardware striping, which does
not give any performance gains since control files are small, should
not be used. By using hardware striping, the probability of losing a
control file due to disk failure increases dramatically because the
control file is striped or distributed across several disks.
In the event of losing a control file and its mirrored copy due to user
error or corruption, the DBA can easily restart the instance with the
remaining valid control files. If all control files are lost, the DBA
needs to startup nomount the instance, recreate the control file by
running a valid create control file script, and open the
instance.
Failure to write to any one of the control files will result in the
database aborting or crashing. Oracle does recommend having multiple
copies of the control file spread over multiple disks to reduce the
probability of losing all control files due to corruption or disk
crash. However, having several copies of the control files should not
be the only protection; we recommend that the control files also be
mirrored at the hardware level.
If you lose a single control file and have at least one good control
file, we recommend that you
If you lose all your control files, then you should issue the command
CREATE CONTROLFILE
The create controlfile command should already be
in a script; such a script can easily be generated while the database is
up by issuing an alter database backup controlfile to trace. The
output from the RESETLOGS option is dramatically different from the
output without RESETLOGS (NORESETLOGS is default).
The NORESETLOGS option produces a script that does not contain the
thread numbers of each group. This may be a serious problem if you
modify this script to contain the RESETLOGS option, since the resulting
script will ignore the other threads.
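For reference, a heavily trimmed sketch of what such a generated script looks like; the database name, limits and file names are all illustrative, and a real script should always be taken from a recent backup controlfile to trace:

CREATE CONTROLFILE REUSE DATABASE "PROD" NORESETLOGS ARCHIVELOG
    MAXLOGFILES 16
    MAXDATAFILES 254
LOGFILE
    GROUP 1 '/u01/oradata/prod/redo01.log' SIZE 10M,
    GROUP 2 '/u01/oradata/prod/redo02.log' SIZE 10M
DATAFILE
    '/u02/oradata/prod/system01.dbf',
    '/u03/oradata/prod/rbs01.dbf',
    '/u04/oradata/prod/users01.dbf';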
Risks and disadvantages of the create controlfile command:
When should you backup your controlfile?
What to check for when using the backup controlfile option during recovery?
Make sure that the contents of the controlfile reflect the physical structure of the database you are recovering to.
Alter database open resetlogs is needed when using backup controlfile. A backup is always recommended after resetlogs because the logs are cleared and the sequence numbers are reinitialized.
In the event that media recovery needs to be
performed, the current control file should be used if it reflects
the physical structure of the database that you are planning to recover
to. The current control file preserves crucial timestamp and stop SCN
information which may optimize and ease recovery, whereas the
"backup control file" syntax ignores some timestamp information
found in the control file. The "backup control file" syntax
relinquishes control of recovery to the user.
Control Files | Preventive Steps | Detection Mechanisms
Loss of any control file will result in an outage. | |
Types of Outage | Steps | Early Detection | Error Detection
Loss of a single control file. (Notes: the time metrics do not include analysis of the trace files and modification to the init.ora file. These tests were performed on a 5 GB database.) | Shutdown instance (abort); start background processes; mount exclusive; crash recovery; open DB. | No detection is possible until a checkpoint is performed. | The CKPT process if present, otherwise the LGWR process, will generate the error stack OPIRIP, ORA-00447, ORA-00201, ORA-00202. Error number 202 reports the file affected. The full syntax is: ORA-00202: CONTROL FILE: ‘file name’
Loss of all control files. (Notes: the time metric assumes that the ‘alter database backup controlfile to trace’ output has been correctly edited and is available for execution.) | Shutdown instance (abort); start background processes; create controlfile; media recovery; archive log current; open DB (includes crash & other checks): 1.56 mins. | No detection is possible until a checkpoint is performed. | The CKPT process if present, otherwise the LGWR process, will generate the error stack OPIRIP, ORA-00447, ORA-00201, ORA-00202. Error number 202 reports the file affected. The full syntax is: ORA-00202: CONTROL FILE: ‘file name’
On-line redo logs are crucial for instance and media recovery. They contain the redo to roll the database forward and the undo needed to rollback any uncommitted transactions. The loss of a current on-line redo log or loss of an unarchived on-line redo log will result in an outage. An unarchived log cannot be dropped unless NOARCHIVELOG mode is enabled; unfortunately, in order to switch to this mode, one needs to shutdown and remount the instance.
To protect from the above outages, Oracle recommends Oracle multiplexing of the redo log files over disk mirroring if both techniques cannot be used. Oracle multiplexing (multiple members per group) protects against user error (e.g. an accidental delete) while hardware mirroring does not. Ideally, each member of the on-line redo group should be on different drives and different controllers and not striped across disks. Here is an example which contrasts OS mirroring of log files versus log file groups. If there is a single operator or configuration error to a mirrored log, the mirrored logs become corrupt and operations are stopped. In contrast, Oracle continues to run if only one member of the multiplexed on-line log is corrupted or destroyed. Thus, Oracle multiplexing of groups is more resilient. Although it requires allocating more on-line redo logs and adds a small performance overhead, Oracle recommends its use in place of mirroring.
For extreme cases where the instance aborts due to a background process dying, the customer may need to bring up the database and perform crash recovery within a small interval of time. This interval of time should reflect the checkpoint interval setting or checkpoint_timeout setting in your database. A checkpoint is the process of writing out dirty buffers from the SGA to the physical data files and updating the files on completion to synchronize them to a point in time. Smaller checkpoint intervals will enable crash recovery to complete faster although it imposes a performance overhead of writing dirty buffers and updating file headers. Oracle, however, always does a checkpoint on a log switch. A typical recommended setting is a checkpoint every 10 to 15 minutes.
Furthermore, the customer should monitor the frequency of checkpoints and archiver wait problems. Although the database is up and running, if it is hung due to checkpoint problems or archiver wait problems, the database is essentially not available. If the checkpoint initiated in a log is not completed before the log needs to be recycled, the database will temporarily freeze all user activity until the checkpoint is finished. Async IO should be available and should be the default; with async IO, this problem does not usually appear. If async IO is not available, multiple database writers (DBWR) and a checkpoint process (CKPT) can reduce checkpoint wait errors, but this is usually only a temporary fix.

However, if this happens frequently, it may be due to the size and number of log files being small relative to the rate of changes being made in the database. A similar situation arises when log writer waits on archiver to complete archiving of a redo log. Again, the solution is to add more log file groups or increase the size of the log file groups to allow the archiver process more time to do its job. One can also increase log_archive_buffer_size to increase the size of a single read from the on-line redo logs by archiver (ARCH). Multiple archivers are a common solution to archive waits. One can enable another "archiver" process by systematically issuing an ‘alter system archive log current’.

Checkpoints should always be monitored to determine if activity predictions are accurate or changing. Log_checkpoints_to_alert can be set to TRUE in the init.ora to write checkpoint messages to the alert.log; however, this may cause a flood of messages in the alert.log. Alternatively, checkpoints can be monitored by querying V$SYSSTAT.
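A simple way to watch checkpoint activity without flooding the alert.log is a query along these lines:

SELECT name, value FROM v$sysstat WHERE name LIKE '%checkpoint%';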
Oracle recommends that log file status be monitored very closely, especially for the STALE or INVALID status. The INVALID log file errors do appear in the alert.log as an IO error and would be detected by alert.log parsing. The STALE status indicates that the state of any pending write on that log is unknown or the log is not complete (a log being written to prior to a shutdown abort). This potential problem does not appear in the alert.log and can only be monitored by V$LOGFILE. If frequent STALE or INVALID status is found on a running database, it may be indicative of hardware problems.
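A minimal monitoring query for the condition described above:

SELECT group#, status, member FROM v$logfile
WHERE status IN ('STALE', 'INVALID');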
Log files have the highest IO activity along with rollback segments. They should be placed on their own set of disks if possible and definitely separated from the archive files.
Standards:
What do you do when you lose or discover a stale or invalid redo log member of a group?
What if you lose the entire group that has been archived?
What if you lose a group that has not been archived but is not the current or next group?
What do you do if you lose the current online redo log?
You need to restore and perform incomplete database recovery.
Multiplexing the logs would have the effect of LGWR
multiplexing writes and ARCH multiplexing reads. Should one operation
be lost, the affected instance would not be crippled. Using mirroring,
a read or write could be lost only via a dual hardware or firmware
failure; an operating system or mirroring bug; or, more likely, human
error overwriting the partition. The contrast between OS mirroring of
log files and Oracle log file groups is the same as described above.
Online Redo Logs | Preventive Steps | Detection Mechanisms
Online redo logs are needed for recovery. | Implement multiple archivers. |
Types of Outage | Steps | Early Detection | Error Detection
Loss of one (but not all) of the online redo log members of a particular group | Drop log file member; add log file member (40 MB). | No detection is possible until a log switch is performed either exiting or entering the file. | The alert log records the following error stack when switching out of the redo log group: ORA-00316, ORA-00312, ORA-00321, ORA-00312. In addition the LGWR process will generate the same error stack. Error number 312 reports the file affected. The full syntax is: ORA-00312: online log 1 thread 1: ‘file name’. The view V$LOGFILE should be monitored for statuses of ‘STALE’ or ‘INVALID’.
Loss of an inactive, archived redo log group (Notes: 1. All recovery times are quite static except for the crash recovery. 2. The addition of a new redo log group can be done afterwards.) | Shutdown abort; startup mount; drop the problem redo log; alter database open (crash recovery). | No detection is possible until the LGWR process writes to the redo log. | The LGWR process will generate the error stack OPIRIP, the general ORA-00447 followed by more specific ones; in this test case, ORA-00313, ORA-00312 and ORA-07366. The failure of LGWR will cause other background processes to fail with some general error messages produced, and finally the alert log will report a background process failure (it may not be the LGWR process).
Types of Outage | Steps | Early Detection | Error Detection
Loss of an inactive redo log group that has not been archived (Notes: 2. The addition of a new redo log group can be done afterwards. 3. Cannot drop the unarchived redo log without setting NOARCHIVELOG mode (ORA-00350). 4. Cannot set ARCHIVELOG after dropping the problem redo log group since instance recovery is required (ORA-00265).) | Shutdown abort; startup mount; alter database noarchivelog; drop the problem redo log; alter database open (crash recovery); shutdown normal; startup mount; alter database archivelog; alter database open. | No detection until the ARCH process archives the redo log. | The ARCH process will generate an archival stoppage message - ARCH: Archival stopped, error occurred. Will continue retrying - followed by informative error messages (ORA-00255, ORA-00312) reporting the problem online redo log and more specific error message(s) giving the cause (in the test case, ORA-00286). The same set of error messages will also appear in the alert log file together with the archival stoppage message.
Other Online Redo Log Outages | Detection | Steps
Loss of the current online redo log group. | The LGWR process will generate the error stack OPIRIP, the general ORA-00447 followed by more specific ones; in this test case, ORA-00313, ORA-00312 and ORA-07366. The failure of LGWR will cause other background processes to fail with some general error messages produced, and finally the alert log will report a background process failure (it may not be the LGWR process). V$LOG and V$LOGFILE will indicate if this is the current log. If so, we must switch to CRF. | 1. Restore and commence incomplete recovery.
Silent redo log corruption | Error (ORA 600 [3020]) during application of an archive log on the standby database site. | 1. Rebuild the standby database if there is one present.
Internal redo corruption | ORA 600 [3020] implies that this change in the redo log is corrupted or inconsistent with the changes in the data block. All ORA 600 errors during application or writing of redo logs may be evidence of corruption in the online redo log. | 1. If it does not affect the primary, then refresh or rebuild the standby database.
The system tablespace is the most crucial of all tablespaces because it contains the data dictionary tables and objects. The data files should be protected by redundant hardware and controllers. It is not advisable to stripe the system data files across too many disks since this increases the number of potential failure points. Loss of the system data file will result in database recovery or switching to a disaster recovery site.
The system tablespace should be configured and sized properly. By convention, no user objects or temporary segments are created in SYSTEM. This should be enforced using privileges and roles with periodic monitoring to catch any deviations. Basically, the RESOURCE role and UNLIMITED TABLESPACE privilege should not be granted to a user. A typical size of 30-50 MB is common for the system tablespace, while an additional 300 MB - 500 MB may be required if PL/SQL and/or stored procedures are used. Free space and fragmentation should be monitored.
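Periodic monitoring for deviations can be a simple query for non-dictionary segments in SYSTEM; the owner list below is a minimal assumption and may need extending:

SELECT owner, segment_name, segment_type
FROM dba_segments
WHERE tablespace_name = 'SYSTEM'
AND owner NOT IN ('SYS', 'SYSTEM');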
Loss of a system data file:
System Tablespace | Preventive Steps | Detection Mechanisms
System tablespace contains the data dictionary tables. | Spare disks in case of disk failures. |
System Tablespace Outages | Comments / Detection Mechanisms | Steps
Lack of space or system tablespace fragmentation. | This should be monitored very closely and the monitor tool should ensure that space is sufficient in all tablespaces. However, if this occurs, the customer needs to add a datafile. | 1. Add another data file.
Loss of system tablespace. | Monitor tool should track disk failures and correlate them with the corresponding data file. | 1. Restore and recover.
Corruption of a data dictionary object. | ORA-1578 on a data dictionary object, and ORA 600, may be an indication of data dictionary corruption. These errors should be parsed from the alert.log by the monitoring tool. | 1. Restore and recover.
Since rollback segment tablespaces typically experience the highest IO activity, RAID 1 is a good choice over RAID 5 to allow for sufficient redundancy without sacrificing too much performance.
Rollback segment configuration is important to allow for continual availability for all applications modifying data. Rollback segments should be configured to minimize space usage, to reduce undo contention, to distribute IO, and to minimize rollback segment maintenance due to application error. The optimum number of rollback segments will be determined by the peak transactional concurrency the database needs to support, while the size is determined by the largest amount of rollback that a single transaction can generate. The rollback segment tablespace should have ample free space to allow for rollback segment growth, while the OPTIMAL parameter should be set on all rollback segments to reclaim space in case an exceptionally large transaction extends a rollback segment past its average size.
If the system is limited for disk space, the system tablespace and rollback segment tablespace can coexist on the same disk units. Loss of any one of these data files will result in database recovery or resorting to a disaster recovery site. The rollback segments contain information to undo the changes of an aborted transaction and are also used for read consistency. Loss or corruption of an active rollback segment can result in serious data inconsistency. We cannot guarantee the integrity of the database if we lose an active rollback segment.
Loss of a rollback segment data file:
Rollback segment with "needs recovery" status:
A rollback segment is considered "corrupt" when it cannot rollback an active transaction. Depending on the objects involved and the extent of the "problem", you may be able to startup the database by just taking the segment name out of the init.ora file.
You should also check if all data files are online. Often, this error can be resolved by onlining your data files and allowing Oracle access to apply the undo to the data files’ data blocks.
Set event 10015 in the init.ora file if the database cannot startup. event="10015 trace name context forever"
It prints out the rollback segments it is recovering while it is rolling back active transactions during startup. At times, we can drop the object that needs the undo and the "needs recovery" status will go away because there is no undo to apply to existing objects.
Call Support for further suggestions.
If the rollback segment is indeed corrupted, or a rollback segment containing active transactions is lost, we suggest restoring and attempting database recovery to bring the database back to a consistent state. Sometimes recreation of the database is necessary if no other backup is available.
One can check XIDUSN from V$TRANSACTION to find the
active rollback segments by comparing it with the SEGMENT_ID column in
DBA_ROLLBACK_SEGS.
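As a sketch, the comparison can be done in a single query:

SELECT segment_name
FROM dba_rollback_segs
WHERE segment_id IN (SELECT xidusn FROM v$transaction);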
Rollback Tablespace | Preventive Steps | Detection Mechanisms
Rollback segments contain the undo needed to rollback any statement and for transaction consistency. | Monitor for disk failures. |
Rollback Tablespace Outages or Rollback-related Problems | Detection Mechanisms / Comments | Realistic / Tested TTR | Steps
Cannot allocate extents or tablespace fragmentation. Rollback segment maintenance errors. | The customer should monitor any ORA 1538, 1551, 1552, 1553, 1554, 1555, 1556, 1557, 1558, 1559, 1562. | Database availability is not lost. Application errors may occur. | 1. Add more space by adding a datafile.
ORA-1555 or snapshot too old errors. | ORA 1555 | Database availability is not lost. Application errors may occur. | 1. A larger rollback segment is needed.
Rollback segment corruption. | ORA 600, 1555, 1578 | Commence disaster recovery plan. | 1. Attempt to drop the rollback segment. If not successful, then commence the disaster recovery plan.
Loss of a rollback segment data file that does not contain active transactions. | Monitor tool needs to detect disk failures and correlate OS files to Oracle files. | Several rollback segment tablespaces are available. Database availability is not lost. Application errors may occur. | 1. Drop the rollback segments in that data file and drop the tablespace.
Loss of rollback segment data files that are currently being used. | Monitor tool needs to detect disk failures and correlate OS files to Oracle files. If we cannot offline and drop the affected rollback segment tablespace, then there are active transactions. | | 1. Restore and recover.
Although the loss of the temporary tablespaces will not result in losing access to data, the application, in particular operations that require sort space or temp space, may falter. Temporary tablespaces should be hardware mirrored to protect from disk failure. At least two temporary tablespaces should be available, with precreated alter user scripts established, to prevent system tablespace fragmentation or application failure in case of loss of a temporary tablespace. Furthermore, the default initial and next extent sizes for TEMP, with a pctincrease of 0, should be set to ensure that the largest sort space requirement can be met without running out of extents (121 for 2K block size). The initial and next extents should be set greater than sort_area_size and larger than the average sort if this does not conflict with available memory requirements.
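A sketch of setting such defaults on a hypothetical TEMP tablespace (the sizes are illustrative and should be derived from sort_area_size and the largest expected sort):

ALTER TABLESPACE temp DEFAULT STORAGE (INITIAL 1M NEXT 1M PCTINCREASE 0);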
Multiple temporary tablespaces will be beneficial in distributing sort allocations for different users and improve the performance of concurrent sorts. Furthermore, if several index rebuilds have to occur due to corruption or maintenance, multiple temp tablespaces will allow the parallelism to restore and create the indexes quickly. For index rebuilds, TEMP space of up to twice the size of indexed data may need to be allocated. As this TEMP space may not be required during normal operation, it is useful to keep spare disk space.
Following the initial index building phase,
temporary segment usage is limited to sort activity. For performance
reasons, the sort_area_size setting should be high enough to keep disk
sorts to a minimum assuming there is enough real memory available. All
complex queries which may potentially require disk sorts will then
write to the temp tablespace.
Temporary Tablespace | Preventive Steps | Detection Mechanisms
Temporary segments are used for sorting and temporary scratch work when the allotted space in memory is insufficient. | Extra disks should be available. |
Temporary Tablespace Outages or Related Problems | Detection Mechanisms / Comments | Realistic / Tested TTR | Steps
Cannot allocate extents or tablespace fragmentation. | These errors do not appear in the alert.log; thus, they must be trapped within the application. ORA 1547 or ORA 1562 during a sort operation indicates a temp tablespace configuration problem. | The customer should configure the TEMP tablespace to avoid problems like this. Fixing this problem once it is detected encompasses adding another data file or creating another tablespace. | 1. Add another data file.
Losing a temporary tablespace data file. | | | 1. Run alter user scripts to switch temporary tablespace usage for users. 2. Create another temp tablespace if possible.
Types of Outage | Steps | Early Detection | Error Detection
Loss of temporary tablespace data file | Drop the current temporary tablespace; create another temporary tablespace; alter users to use the new temporary tablespace. | Monitor tool needs to detect disk failures and correlate OS files to Oracle files. | Monitor tool needs to detect disk failures and correlate OS files to Oracle files.
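The precreated alter user scripts mentioned above reduce to one statement per user; the user and tablespace names here are illustrative:

ALTER USER scott TEMPORARY TABLESPACE temp2;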
All application data files should be hardware mirrored on different controllers and disks for additional resilience. More importantly, the application should have integrity checks and error checking mechanisms to prevent errors and to automatically recover in case of errors. Security policies and proper prevention techniques will be essential to avoid dropping, deleting or modifying critical data. We also recommend partitioning of data and implementing business rules and database integrity constraints to detect for application data inconsistencies.
Even with built in prevention, user errors, application bugs, hardware failures, etc still occur which will result in some type of recovery. There are several possible solutions to resolve these problems:
Application Data | Preventive Steps | Detection Mechanisms
Index segments should only hold indexes. | |
Case 1: Loss of index data file
Tablespace recovery and object recovery are two options
that are available. Tablespace recovery would entail restoring the
affected tablespace from the hot backup and rolling forward. Object
recovery consists of recreating the tablespace and recreating the
indices. Precreated create index scripts (implemented in
parallel or with degrees of parallelism) should be kept ready and run
concurrently, touching different temporary tablespaces, to expedite
recovery. In addition, large temporary tablespaces should be made
available to allow for successful index creation.
Although both recovery descriptions allow for full
availability of the database and the tables, this is still a serious
outage due to the slow performance.
Loss of Index Data File | Steps | Early Detection | Error Detection
Recreate index tablespace | Drop the index tablespace; create another index tablespace; create all indices in the tablespace in parallel. | Monitor tool should detect disk failure. If the failure is in an index tablespace, automated steps are needed to recreate the indices. | Monitor tool should detect disk failures.
Restore from backup and recover | Offline tablespace; restore from hot backup; set autorecovery on; bring up recovery processes; alter database recover automatic tablespace; alter tablespace online. | Monitor tool should detect disk failure. If the failure is in an index tablespace, automated steps are needed to restore and recover. | Monitor tool should detect disk failures.
Switch to disaster recovery plan | Switch to standby database or replicated database. | Monitor tool needs to detect disk failures and correlate OS files to Oracle files. | Monitor tool needs to detect disk failures and correlate OS files to Oracle files.
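The "Restore from backup and recover" steps above map onto commands roughly as follows (the INDX tablespace name is an illustrative assumption):

ALTER TABLESPACE indx OFFLINE IMMEDIATE;
-- restore the affected datafiles from the hot backup with an O/S utility
SET AUTORECOVERY ON
RECOVER TABLESPACE indx;
ALTER TABLESPACE indx ONLINE;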
Case 2: Loss of application table datafiles
Hopefully, the customer does implement some type of
object level backup and recovery strategy to reduce MTTR. If they do
not, they must either restore from backup and recover or implement
their disaster recovery plan.
Object level recovery only works when there are mechanisms within the application to resynchronize from a determined point in time. When the subset has been restored, the subset should be able to be resynchronized with the rest of the database via the application.
The database should be partitioned in such a way
that recovery can occur to subsets of the database while other
functionality is not impeded.
Loss of Data File | Steps | Early Detection | Error Detection
Recreate table tablespace | Only applicable if exports or unloads of the tables are taken, or if there is an object level recovery plan (snapshots, unloader, replicated database, etc.). | Monitor tool should detect disk failure. If the failure is in a table tablespace, automated steps are needed to restore and recover. | IO errors or ORA 1100s errors.
Restore from backup and recover | Offline tablespace; restore from hot backup; set autorecovery on; bring up recovery processes; alter database recover automatic tablespace; alter tablespace online. | Monitor tool should detect disk failure. If the failure is in a table tablespace, automated steps are needed to restore and recover. | Monitor tool should detect disk failures.
Commence disaster recovery plan | Only applicable if there is a standby database or a replicated database. This can also be accomplished by breaking three-way mirrors to maintain hot backups on site. | Monitor tool needs to detect disk failures and correlate OS files to Oracle files. | Monitor tool needs to detect disk failures and correlate OS files to Oracle files.
Types of Outage | Steps | Early Detection | Error Detection
Cannot allocate extents or tablespace fragmentation | Add datafile. | Monitor space usage and fragmentation of tablespaces. | ORA-1547 and other errors need to be trapped within the application.
Loss of index due to user error or corruption. | Drop index; create index. | Monitor alert.log for ORA-1578 errors that affect index data. Periodic index validation with the analyze command. | ORA-1578 while using or validating an index.
Loss or corruption of a single table. | Object level recovery plan. Restore and recover. | Monitor alert.log for ORA-1578 errors that affect table data. Tables can be analyzed to check for block corruptions. | ORA-1578 while accessing or analyzing table data.
Reorganize table - scheduled outage. | Drop / create; unload / direct load. Use of parallelism would be helpful. | Monitor and alarm when objects exceed more than 20 extents or some threshold. | ORA-1556 max extent error should be trapped within the application.
Reorganize index - scheduled outage. | Drop / create. | Monitor and alarm when objects exceed more than 20 extents or some threshold. | ORA-1556 max extent error should be trapped within the application.
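The periodic validation mentioned above can be scripted with ANALYZE; the object names are illustrative:

ANALYZE INDEX scott.emp_pk VALIDATE STRUCTURE;
ANALYZE TABLE scott.emp VALIDATE STRUCTURE CASCADE;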
Read-only tablespaces
Read-only tablespaces contain information that is static;
the data is accessible only for viewing and not for modification.
Read-only tablespaces NEED to be backed up once they become READ-ONLY.
The recovery session will get an error if a file is READ-ONLY and a
backup controlfile is being utilized for recovery. All read-only
tablespace files must be taken offline prior to incomplete recovery.
This is what makes backing up a READ-ONLY tablespace after changing it
to READ-ONLY very important. The moral of the story is that you need to
restore the READ-ONLY tablespace from your previous backup and offline
the read-only tablespace files prior to recovery with a backup
controlfile. After opening the database, you can then online the
READ-ONLY data files.
If you use the current control file, then recovery will complete successfully even if the tablespace changes from read-write to read-only through the redo logs. Read-only tablespace files are still online files and recovery will still apply changes to these files. Changing a file between read-write and read-only generates some redo, and during recovery we will apply that redo; thus the flags are changed.
Read-only tablespaces reduce the amount of work during checkpoints. Read-only data files are not checkpointed or updated; only read-write files are. Recovery will ignore read-only files when possible, streamlining recovery even further.
Although read-only files can be restored from
backup and their loss should not affect the rest of the database,
the application may depend on the data and may stop if the data is not
accessible. In that case, we recommend that read-only data should also
be mirrored to reduce the possibility of an outage due to media failure.
Read-Only Data | Preventive Steps | Detection Mechanisms
Index segments should only hold indexes. | 1. Partition data according to functionality. 2. Hardware mirroring and redundant controllers. | 1. Monitor disks and for IO errors.
Data segments should hold both tables and clusters. | 1. Partition data according to functionality. 2. Hardware mirroring and redundant controllers. | 1. Monitor disks and for IO errors.
Read-Only Data | Steps
Accidental drop of read-only data. | 1. Restore from backup. 2. Commence disaster recovery plan.
Loss of a read-only data file. | 1. Restore from backup. 2. Commence disaster recovery plan.
Archive logs
Archive log files provide the means to roll forward from
a previous backup of the database. Since archive log files are applied
in order, it is extremely important that all the archive log files be
available in order to roll forward to the desired point in time. For
that reason, Oracle recommends that archive files are also mirrored to
protect from disk failure. For extra protection, copies of the archive
files can be kept onsite and offsite in case of some large disaster.
Furthermore, archive files from all threads in a parallel instance
environment are required for media recovery.
One of the most common errors found on customer sites is a poorly organized archiving procedure which may be backing up incomplete archive files or not detecting corrupted tapes. A scripted methodology and a well-tested tape management tool may be required to safeguard against corrupted archive files. Prior to backing up or copying an archive file, the procedure should check in the V$LOG view that the corresponding on-line redo log has been completely archived.
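A minimal pre-copy check along these lines guards against picking up a partially written archive file:

SELECT sequence#, archived, status FROM v$log;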
Another common problem is that the archive destination becomes full or inadequate to support the number of archive files being created. First, adequate disk space should be provided to allow for sufficient time for the log files to be backed up to tape and to allow at least a day’s worth of archive logs to be onsite or all archive logs starting from the previous database backup (cold or hot) to remain onsite. Second, multiple archival destinations should be created with automated switchover of destination when a certain threshold for available free space is reached. Finally, the archive procedure should dynamically monitor the archive destinations for lack of free space. The following procedure describes in detail how this may be accomplished.
First, two archive destinations or more need to be created with sufficient disk space. Archiving procedure should switch from one destination to another when the previous destination reaches a certain space usage threshold by issuing alter system archive log to destination. More scripts should be developed to periodically monitor free space of the current archive destination. When space usage reaches a certain threshold (75% or some value that will allow for at least two more logs in the current ARCH destination), the monitoring process should alter the archive destination, preferably the oldest archive destination. Concurrently, a separate process or shell script can be created to clean up and backup archive destinations to tape starting from the oldest to the most recent archive destination. The backup script should check that the archive files have been successfully archived and then back them up to tape. Removal of the oldest archive files should only occur prior to switching to the oldest archive destination. The archive destinations should always accommodate at least one day’s worth of archive files onsite. If the archive files are onsite, the recovery time will be reduced by the time needed to restore from tape backup.
Since compression is usually done, we suggest that the archive files are thoroughly tested to ensure that the compression utility is not inadvertently corrupting the archive files or backups. This may be done by comparing checksums of the source and tape backed-up files, or by restoring from this backup and checking whether the database starts up.
In the event of incomplete or complete database recovery, caution must be exercised to prevent confusion between the current set of archive files and the archive files from the previous version of the database. For example, after successful incomplete recovery, the archive log sequence numbers will reset to one. Therefore, new archive files will be created with sequence numbers starting from one and up, which should not be confused with the older archive logs with the same numbers. Another example is restoring from an old cold backup and starting up without any roll forward media recovery. The archive logs generated from that point will begin with the number that was stored in the control file. Again, it is necessary to distinguish between the current database’s logs and the previous versions of those logs.
What if you lose an archive log or set of logs?
A complete backup of the database is necessary and should be accomplished as soon as possible: cold backup or hot backup.
Archive Log Files | Preventive Steps | Detection Mechanisms
Archive log files are required for media recovery and for standby databases. | |
Archive Log Files and Archivelog | Detection Mechanisms | Steps
Archiver stuck due to lack of free space in the archive destination. | Monitor free space in the archive destination and alarm when close to a reasonable threshold (70%). | 1. Free up space in the archive destination. 2. Switch to a different archive destination.
Loss of archive file(s) due to media failure or user error. | Monitor tool should check for disk failures. Alerts should be available when media failure affects archive logs. When this occurs, automatic procedures should switch to another archive destination. | 1. Back up the primary. 2. Refresh the standby database if one exists.
Archiver cannot write to the archive destination due to media failure. | Monitor tool should check for disk failures. Alerts should be available when media failure affects archive logs. When this occurs, automatic procedures should switch to another archive destination. | 1. Switch archive destination.
Archive logs are not successfully backed up. | Checksums must be in place during the backup. Investigate the problem and retry. | If the backup is not successful because of a bad archive log file, one needs to back up the primary and recreate the standby site.
Archiver hangs because someone deactivates archivelog or disables archiving. | This should be checked before every open command; the archive log list command will show the current archiving status. | Shutdown database.