How To Work On Cross Site Failure in Exchange 2010 Environment (Failover and Failback)

Let's consider two sites, SITE ALPHA and SITE BETA. SITE ALPHA runs 2 Exchange 2010 servers and SITE BETA runs 1 Exchange 2010 server. SITE ALPHA is configured as the active (production) site and SITE BETA as the passive (disaster recovery) site. In this document we will get the passive site up and running after the active site goes down, and then do a failback. The entire infrastructure is designed around the high availability options of Exchange 2010.

This is how my Exchange infrastructure looks:-

EXCHANGE SERVER      SITE                DAG     FSW     CAS ARRAY   DATABASE COPY
Act1.exchange.com    Alpha (Prod Site)   DAG 1   FSW 1   Mapi1       D1, D2, D3
Act2.exchange.com    Alpha (Prod Site)   DAG 1   FSW 1   Mapi1       D1, D2, D3
Pass.exchange.com    Beta (DR Site)      DAG 1   FSW 2   Mapi2       D1, D2, D3

Prod Site  - Production Site
DR Site     - Disaster Recovery Site

Before moving further one should understand some important concepts or phrases that we will use in this documentation.

Database Availability Group (DAG)

A database availability group (DAG) is the base component of the high availability and site resilience framework built into Microsoft Exchange Server 2010. A DAG is a group of up to 16 Mailbox servers that hosts a set of databases and provides automatic database-level recovery from failures that affect individual servers or databases.


File Share Witness :-

As the name suggests, the file share witness is a file share residing on a server outside of the DAG, used to maintain quorum in the cluster. We can provide the file share witness directory and share name while creating the DAG. Unlike Exchange 2007, where Microsoft recommended hosting the file share witness on the Hub Transport server, Exchange 2010 gives you full control to host it on a non-Exchange server as well, but you have to add the Exchange Trusted Subsystem universal security group to the local Administrators group on the FSW server.


CAS Array :-

A Client Access array is, as the name implies, an array of CAS servers. More specifically, it is an array consisting of all the CAS servers in the Active Directory site where the array is created. So instead of connecting to the FQDN of a single CAS server, an Outlook client can connect to the FQDN of the CAS array (such as outlook.domain.com). This keeps Outlook clients that connect via MAPI connected even during mailbox database failovers and switchovers.
Note: When the CAS array has been created you should create an “A record” in your internal DNS named outlook.domain.com pointing to the virtual IP address of your internal load balancing solution.


Majority Cluster Node :-

Both Exchange 2007 and Exchange 2010 clusters use Majority Node Set (MNS) clustering. This means that more than 50% of your votes (server votes and/or 1 file share witness) need to be up and running. The formula for the number of votes required is (n / 2) + 1, rounded down, where n is the number of nodes in the DAG. If a DAG (cluster) has an odd number of nodes, it has an odd number of votes, so it does not use a witness. If it has an even number of nodes, it uses a file share witness: should half of the nodes go down, the witness acts as that extra +1 vote.
Eg:- Let’s say we have 3 servers. (3 / 2) + 1 equals 2, as you round down since you can’t have half a server/witness. This means that at any given time we need 2 of our nodes online, so we can sustain only 1 server failure in our DAG (with 3 nodes there is no witness vote). Now let’s say we have 4 servers. (4 / 2) + 1 equals 3. Together with the witness there are 5 votes, so at any given time we need 3 of our servers/witness online, which means we can sustain 2 server failures, or 1 server failure and 1 witness failure.
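The arithmetic above can be sketched in plain PowerShell. This is only an illustration: the function names Get-QuorumVotesRequired and Get-DagQuorumSummary are made up for this post and are not Exchange or cluster cmdlets. The sketch applies the formula to the total number of voters (DAG members, plus the witness when the member count is even), which gives the same results as the examples above.

```powershell
# Hypothetical helper: votes needed for a majority = floor(voters / 2) + 1.
function Get-QuorumVotesRequired {
    param([int]$Voters)
    return [int][math]::Floor($Voters / 2.0) + 1
}

# Hypothetical helper: summarise quorum for a DAG with the given member count.
# Assumption: the file share witness only votes when the member count is even.
function Get-DagQuorumSummary {
    param([int]$DagMembers)
    $witness  = if ($DagMembers % 2 -eq 0) { 1 } else { 0 }
    $voters   = $DagMembers + $witness
    $required = Get-QuorumVotesRequired -Voters $voters
    New-Object PSObject -Property @{
        DagMembers        = $DagMembers
        Voters            = $voters
        VotesRequired     = $required
        FailuresTolerated = $voters - $required
    }
}

Get-DagQuorumSummary -DagMembers 3
Get-DagQuorumSummary -DagMembers 4
```

For the 3 member DAG in this document, 2 votes are required and only 1 failure is tolerated. That is why the single server at SITE BETA (1 vote) cannot keep quorum on its own when both SITE ALPHA servers are lost.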


Database Activation Coordination :-

Database Activation Coordination (DAC) mode is an optional addition to the new High Availability model to prevent split brain syndrome from occurring during a site failover when utilizing a multi-site DAG configuration with at least 3 DAG members and more than one Active Directory Site.
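DAC mode is set per DAG and is off by default, so it is worth confirming before you rely on the steps in this document. A minimal sketch, assuming the DAG is named DAG1 as in the rest of this document:

```powershell
# Enable DAC mode on the DAG (DAG1 is the DAG name assumed throughout this document).
Set-DatabaseAvailabilityGroup -Identity DAG1 -DatacenterActivationMode DagOnly

# Confirm the setting; it should read DagOnly rather than Off.
Get-DatabaseAvailabilityGroup -Identity DAG1 | Format-List Name,DatacenterActivationMode
```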


------ Now that you have gone through some important phrases and concepts, let's start with the Failover Process ------

Lets Consider a Scenario -

For some reason SITE ALPHA goes down. In this situation Exchange servers Act1 and Act2 will fail to respond. SITE BETA holds only 1 of the 3 votes, so it cannot form a cluster majority and cannot become active automatically. That means we will have to do a manual failover / switchover.

Steps To Be Performed In Such Situation - Failover Process (Keep Infrastructure in Mind)

We need to run the following commands, which will help us perform a manual failover to SITE BETA (DR).

Log on to Pass Server on SITE BETA and run the following commands from Exchange Management Shell -

Stop-DatabaseAvailabilityGroup -Identity DAG1 -Mailboxserver Act1 -Configurationonly
Stop-DatabaseAvailabilityGroup -Identity DAG1 -Mailboxserver Act2 -Configurationonly
Or
Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite SITEALPHA -ConfigurationOnly


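What do these commands do?

By running these commands we mark the mailbox servers Act1.exchange.com and Act2.exchange.com as stopped in DAG1. Because the SITEALPHA servers cannot be reached, -ConfigurationOnly makes the change in Active Directory only.
In other words, we are telling DAG1 that the SITEALPHA Exchange servers are out of service so that they can be evicted from the cluster in the next step.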

After this we need to make sure the configuration now points to the DR server. For that we will run :-
Get-DatabaseAvailabilityGroup |fl

You will notice a change under :-

StoppedMailboxServers: <Act1.exchange.com, Act2.exchange.com>
StartedMailboxServers: <Pass.exchange.com>

After this we will start working on the SITEBETA server, Pass.exchange.com.

Open Windows PowerShell on the Pass server and type NET STOP CLUSSVC (this will stop the Cluster service).

Once the Cluster service is stopped, open Exchange Management Shell and run -

Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite SITEBETA

This command will :-

-Evict the servers on the StoppedMailboxServers list (Act1 and Act2) from the cluster, so they no longer count toward quorum for DAG1.
-Re-establish cluster quorum at SITEBETA on the Pass server.
-Switch the DAG to the alternate File Share Witness (FSW2) at SITE BETA if the remaining membership needs a witness.
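To see what the restore actually left behind, you can look at the cluster from the Pass server. This is a sketch that assumes Windows Server 2008 R2, where the FailoverClusters PowerShell module is available; on Windows Server 2008 you would use cluster.exe instead.

```powershell
# Assumption: Windows Server 2008 R2 or later (FailoverClusters module present).
Import-Module FailoverClusters

# List the nodes that are still cluster members and their state.
Get-ClusterNode | Format-Table Name,State -AutoSize

# Show which quorum model the cluster is using after the restore.
Get-ClusterQuorum | Format-List Cluster,QuorumResource,QuorumType
```

Only the SITE BETA server should remain in the node list once the restore has completed.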

Now we will run the following commands to resume the database copies on the SITEBETA "Pass" server. This clears any activation block on the copies so that the databases can be activated there :-

Resume-MailboxDatabaseCopy -Identity D1\Pass
Resume-MailboxDatabaseCopy -Identity D2\Pass
Resume-MailboxDatabaseCopy -Identity D3\Pass
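Resuming a copy does not by itself prove that the database is mounted, so it is worth checking. A minimal sketch, assuming the DR server is named Pass as in this document:

```powershell
# Copy status on the DR server; the active copy of a database shows Mounted.
Get-MailboxDatabaseCopyStatus -Server Pass |
    Format-Table Name,Status,CopyQueueLength,ReplayQueueLength -AutoSize

# Confirm where each database is mounted.
Get-MailboxDatabase -Server Pass -Status |
    Format-Table Name,Mounted,MountedOnServer -AutoSize
```

If a database is still dismounted at this point, Mount-Database or Move-ActiveMailboxDatabase with -ActivateOnServer Pass are the usual next things to try.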

Now that we have activated the databases on the SITEBETA Pass server, MAPI clients should also point to the correct CAS array.

Remember the concept of CAS Array..."It is an array consisting of all the CAS servers in the Active Directory site where the array is created. So instead of connecting to the FQDN of a single CAS server, an Outlook client can connect to the FQDN of the CAS array (such as outlook.domain.com)."

To get that done we will run the following command from Exchange Shell :

Get-MailboxDatabase |Set-MailboxDatabase -RPCClientAccessServer MAPI2.exchange.com

After this MAPI clients will talk to the correct CAS Array and SITEBETA will be running as Active Node.
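To confirm the change took effect, you can list the value on each database and compare it with the CAS arrays defined in the organization. A short sketch:

```powershell
# Each database should now show the SITE BETA CAS array FQDN.
Get-MailboxDatabase | Format-Table Name,RpcClientAccessServer -AutoSize

# List the CAS arrays with their FQDN and Active Directory site for comparison.
Get-ClientAccessArray | Format-List Name,Fqdn,Site,Members
```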

Now we need to get SITEALPHA up and running and do a failback to SITEALPHA.

When the primary SITEALPHA Exchange servers are up and running again, we will perform the following steps to do a failback.
Note :- SITEALPHA databases will not mount automatically. Because DAC mode is enabled, the DACP bit stops the returning servers from mounting databases on their own, which prevents automatic failback and a split brain scenario.

Start the Cluster service on the SITEALPHA servers by running NET START CLUSSVC from Windows PowerShell.

Open Exchange Management Shell on Pass Server and run

Start-DatabaseAvailabilitygroup -Identity DAG1 -mailboxServer Act1
Start-DatabaseAvailabilitygroup -Identity DAG1 -mailboxServer Act2
Or
Start-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite SITEALPHA

After the command is successfully completed we need to make sure our Mailbox Servers are listed under StartedMailboxServers.

Get-DatabaseAvailabilityGroup |fl

You will notice change under :-

StoppedMailboxServers: < >
StartedMailboxServers: <Act1.exchange.com, Act2.exchange.com, Pass.exchange.com>

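Once the SITEALPHA servers are started in the DAG, their database copies resume replication from SITEBETA. Next we run Set-DatabaseAvailabilityGroup without any other parameters so that the DAG re-evaluates its quorum model and witness now that all three members are back :-

Set-DatabaseAvailabilityGroup -Identity DAG1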
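Before moving the active copies back, it helps to confirm that the copies on the SITEALPHA servers have caught up. A minimal sketch, assuming the server names Act1 and Act2 from this document:

```powershell
# Check copy health and queue lengths on each SITE ALPHA server.
foreach ($server in 'Act1', 'Act2') {
    Get-MailboxDatabaseCopyStatus -Server $server |
        Format-Table Name,Status,CopyQueueLength,ReplayQueueLength,ContentIndexState -AutoSize

    # Run the built-in replication health checks against the same server.
    Test-ReplicationHealth -Identity $server
}

# If a copy is Failed or too far behind, it may need a reseed, for example:
# Update-MailboxDatabaseCopy -Identity "<Database>\Act1" -DeleteExistingFiles
# (<Database> is a placeholder for your own database name.)
```

Copies that show Healthy with queue lengths near zero are ready to be activated.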

After the synchronization, the database copies at SITEALPHA will be passive. We need to make SITEALPHA active and SITEBETA passive again. To make this magic happen we will run :-

Move-ActiveMailboxDatabase D1 -ActivateOnServer Act1 -MountDialOverride:GoodAvailability
Move-ActiveMailboxDatabase D2 -ActivateOnServer Act1 -MountDialOverride:GoodAvailability
Move-ActiveMailboxDatabase D3 -ActivateOnServer Act2 -MountDialOverride:GoodAvailability

To prevent an unwanted automatic failover to the DR site in future, we will have to disable automatic database activation on SITEBETA. Run the following commands :-

Suspend-MailboxDatabaseCopy -Identity D1\Pass -ActivationOnly
Suspend-MailboxDatabaseCopy -Identity D2\Pass -ActivationOnly
Suspend-MailboxDatabaseCopy -Identity D3\Pass -ActivationOnly
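You can check that the activation block is in place while replication keeps running. A short sketch, again assuming the DR server is named Pass. The Set-MailboxServer line is an alternative that is not part of the steps above, so it is left commented out.

```powershell
# ActivationSuspended should be True while Status stays Healthy.
Get-MailboxDatabaseCopyStatus -Server Pass |
    Format-Table Name,Status,ActivationSuspended -AutoSize

# Alternative (not used in this document): block automatic activation for the whole server.
# Set-MailboxServer -Identity Pass -DatabaseCopyAutoActivationPolicy Blocked
```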

Now that the databases are back and replicated, we need to point them back to the SITEALPHA CAS array.

Get-MailboxDatabase |Set-MailboxDatabase -RPCClientAccessServer MAPI1.exchange.com

After these commands SITEALPHA will be Active and SITEBETA will be Passive.

Hope this document will help you while working on a Cluster Failover/Failback...


