Wednesday, October 24, 2012

Connect Time Failover & Transparent Application Failover for Data Guard


Connect Time Failover & Transparent Application Failover for Data Guard

 

I was giving a 10g Data Guard course this week in Düsseldorf, demonstrating amongst others the possibility to configure Transparent Application Failover (TAF) for Data Guard. I always try to keep things as simple as seriously possible, in order to achieve an easy and good understanding of what I like to explain. Later on, things are getting complex by themselves soon enough

In my simple scenario, I have one Primary Database (prima) and one Physical Standby Database (physt). On my downloads page, you may find an installation guide for that scenario for 10g and for 11g. After a switchover or after a failover, the primary is going to be physt. The challenge is now to get the connect from the client side to the right (primary) database. That is called Connect Time Failover and is achieved as follows:

First, we make sure that the client uses a tnsnames.ora with a connect descriptor that uses a SERVICE_NAME instead of a SID

MYAPP =

 (DESCRIPTION =

 (ADDRESS_LIST =

 (ADDRESS = (PROTOCOL = TCP)(HOST = HostA)(PORT = 1521))

 (ADDRESS = (PROTOCOL = TCP)(HOST = HostB)(PORT = 1521))

 )

 (CONNECT_DATA =

 (SERVICE_NAME = myapp)

 )

 )

HostA is the host on which prima runs, HostB has physt running.

Second, we take care that the service myapp is offered only at the right database – only on the primary.Notice that the PMON background processes of both databases must be able to communicate with the (local) listeners in order to register the service myapp. If you don’t use the listener port 1521, they can’t. You have to point to that listener port then with the initialization parameter LOCAL_LISTENER.

We create and start now the service myapp manually on the primary:

begin

 dbms_service.create_service('myapp','myapp');

end;

/

begin

 DBMS_SERVICE.START_SERVICE('myapp');

end;

/

Then we create a trigger, that ensures that this service is only offered, if the database is in the primary role:

create trigger myapptrigg after startup on database

declare

 v_role varchar(30);

begin

 select database_role into v_role from v$database;

 if v_role = 'PRIMARY' then

 DBMS_SERVICE.START_SERVICE('myapp');

 else

 DBMS_SERVICE.STOP_SERVICE('myapp');

 end if;

end;

/

The event after startup is fired, if an instance changes from status MOUNT to OPEN. If you use a logical standby, it is not fired, because the logical standby remains in status OPEN. You may use the event after db_role_change in this case. The creation of the trigger and of the service is accompanied with redo protocol (the Data Dictionary has changed) and therefore also present at physt without any additional work to do there for the DBA. With the present setup, we have already achieved Connect Time Failover: Clients can use the same connect descriptor (myapp) to get to the right (primary) database now, regardless of switchover or failover.

But sessions that are connected to prima are disconnected if a switchover or failover to physt takes place. They have got to connect again then. We can change that, so that a Runtime Failover is possible, under ideal circumstances, that failover is even completely transparent to the client and proceeds without error messages. To achieve that, you don’t have to touch the tnsnames.ora on the client side. Instead, you do the following on the primary database:

begin

 dbms_service.modify_service

 ('myapp',

 FAILOVER_METHOD => 'BASIC',

 FAILOVER_TYPE => 'SELECT',

 FAILOVER_RETRIES => 200,

 FAILOVER_DELAY => 1);

end;

/

Connections to the service myapp are now automatically failed over together with the service to the new primary. Should they have done nothing during the time of the failover/switchover, or even if they had run a select statement, they will not receive any error but only notice a short interruption (about 20 seconds, in a typical case). Only if sessions have open transactions during the failover/switchover, they will receive error messages (“transaction must roll back”) after they try commit then.

I use to demonstrate that with a select on a table with 100000 rows that starts on the primary. Then I kill the SMON of that primary and the select stops at row 30000 something, waits a couple of seconds (maximal 200, with the above settings) and then continues on the new primary after the failover, fetching exactly the 100000 rows! That is always quite impressive and shows how robust Oracle Databases – especially combined with Data Guard – are

 
by Uwe Hesse,