This document provides some methods for testing whether your failover cluster is configured properly.

Test 1: Verifying Connectivity and Name Resolution

To verify that the private and public networks are communicating properly, perform the following steps. It is imperative to know the IP address for each network adapter in the cluster, as well as for the IP cluster resources

From each Server Node ping

Public IP of the other node
Private (heartbeat) IP of the other node
Cluster IP Address
Cluster Name
SQL Server IP Address
SQL Server Virtual Server Name

From a Client machine ping

Cluster IP Address
Cluster Name
SQL Server IP Address
SQL Server Virtual Server Name

Test 2: Verifying the Service Account

For SQL Server to be able to manage its resources and perform correctly the service account must be part of the cluster ACL. To ensure that this is configured properly, execute the following in a SQL Query Analyzer window:

select * from ::fn_virtualservernodes()

If there is no output, ensure the account SQL Server is running under is part of the cluster ACL.

Test3: Verifying SQL Server 2000 Failover Cluster Dependencies

Cluster resources may be dependent on other resources to start before they are brought online. Make sure the following dependencies exist.

Resource	Dependencies
Cluster IP Address	None
Cluster Name	Cluster IP Address
Quorum	None
MS DTC	Cluster Name, Cluster IP, Disk Resource (Quorum is the default)
SQL Server IP Address	None
SQL Server Network Name	SQL Server IP Address
SQL Server (the virtual server itself)	SQL Server Network Name, Disk Resources associated with the instance
SQL Server Agent	SQL Server
SQL Server FullText	SQL Server

Test 4: Failover Validation

Perform a failover of SQL Server virtual server to ensure that all resources fail over and restart on another node with no problems or affecting any other groups. Check Windows Event Log and SQL Server Error Log to ensure that no error occurred.

Test 5: Initiate Failure

Initiating a failure simulates failure of cluster resource and makes cluster service perform all the steps necessary to failover to the other node. By default the cluster service checks the failure condition three times before it decides that it is a real failure and it has to perform a failover. So initiating failure does not actually fail over the resource but performs all checks.

To initiate a failure, right click on any resource under SQL Server Group and select Initiate Failure.

Check Windows Event log and SQL Server Error Log to ensure that no error occurred.

Next three tests involve breaking the cluster physically. Engineering is against doing this as it might damage the disk when you pull the plug.

Test 6: Turn Off Each Node

In this test, you will first need to ensure that all of the default groups are located on the primary node. Then you will physically turn off (flip the switch) the primary node. If it is not possible to switch off the node, shutdown the cluster service in the primary node.

All resources should failover to the secondary node.

Check Windows Event log and SQL Server Error Log for any potential error messages after this occurs. Connect to the database from a SQL Server Client.

Once you have checked for any potential problems, turn on the primary node and wait until it fully boots (or restart the cluster service).

Now turn off the secondary node (or stop the cluster service). All resources should failover to the primary node. (If you have Cluster Administrator open in primary node it will generate an error, open Cluster Administrator in secondary node to see if the node failed over)

Check Windows Event log and SQL Server Error Log for any potential error messages after this occurs. Connect to the database from a SQL Server Client.

Once the groups fail back to the primary node, turn the secondary node back on (or restart the cluster service), and wait until it boots up fully.

Test 7: Break Network Connectivity

This test is similar in concept as the above test, which initiates a failover. But instead of simulating a computer failure, it simulates a network-related error.

From the primary node, remove the network cable from the public network card (or disable the network card from Control Panel - Network and Dial-up Connections). This will simulate a failure of the primary node, and should initiate a failover to the secondary node.

All resources should failover to the secondary node.

Check Windows Event log and SQL Server Error Log for any potential error messages after this occurs. Connect to the database from a SQL Server Client.

If there are no errors plug the network cable back into the primary node (or re-enable the network card), and then remove the network cable from the public network card on the secondary node (or disable the network card).

As before, all resources should failover to the primary node.

Check Windows Event log and SQL Server Error Log for any potential error messages after this occurs. Connect to the database from a SQL Server Client.

If there are no errors plug the network cable back into the public network card on the secondary node (or re-enable the network card).

If you are doing this test remotely, create local user account in each node with administrative privileges and connect to the node which you are testing using remote desktop on the heartbeat IP so that you can re-enable the public network card after the test.

Test 8: Break Heartbeat Network

The cluster should continue to function as other two networks (private and public) are configured to allow internal communications.

Test 9: Break Shared Array Connectivity

This test is designed to help uncover potential issues with the shared disk array. This test is designed to simulate what would happen if the controller card or cable connected from a node to the shared disk array fails.

From the primary node, remove the cable from the card used to connect to the shared array (or disable the Private Network card that connects the node to filer). This will simulate a failure of the primary node, and should initiate a failover to the secondary node.

All resources should failover to the secondary node. It will take some time for cluster to realize that the disks are not accessible, but eventually cluster service will stop and SQL Server will fail over. It takes more time for fail over if the private network card in the node that owns the Quorum drive is disabled.

Check Windows Event log and SQL Server Error Log for any potential error messages after this occurs. Connect to the database from a SQL Server Client.

Once you have checked for any potential problems, plug the cable back into the primary node (or enable the Private Network card that connects the node to filer).

Now remove the cable from the card used to connect to the shared array on the secondary node (or disable the Private Network card that connects the node to filer).

All resources should failover to the primary node.

Check Windows Event log and SQL Server Error Log for any potential error messages after this occurs. Connect to the database from a SQL Server Client.

Once you have checked for any potential problems, plug the cable back into the secondary node (or enable the Private Network card that connects the node to filer).

Test 9: Additional Database Tests

Failover the cluster during the following activities

Activity	Expected Result
Running a large transaction	Rollback
Backup	Incomplete backup � RESTORE HEADERONLY command reveals that the backup file is incomplete. If the backup is appended to an existing file, the file rollbacks to original size.
Restore	Restore fails leaving database in Loading state. Data and log files remain in disks.
DBCC CHECKDB	Fails with connectivity error. No Impact on database

Test 1: Verifying Connectivity and Name Resolution

Test 2: Verifying the Service Account

SQL Server IP Address

Test 4: Failover Validation

Test 5: Initiate Failure

Test 6: Turn Off Each Node

Test 7: Break Network Connectivity

Test 8: Break Heartbeat Network

Test 9: Break Shared Array Connectivity

Test 9: Additional Database Tests