This document provides some methods for testing whether your failover cluster is configured properly.

 

Test 1: Verifying Connectivity and Name Resolution

 

To verify that the private and public networks are communicating properly, perform the following steps. It is imperative to know the IP address for each network adapter in the cluster, as well as for the IP cluster resources

 

From each Server Node ping

 

From a Client machine ping

 

Test 2: Verifying the Service Account

 

For SQL Server to be able to manage its resources and perform correctly the service account must be part of the cluster ACL. To ensure that this is configured properly, execute the following in a SQL Query Analyzer window:

 

select * from ::fn_virtualservernodes()

 

If there is no output, ensure the account SQL Server is running under is part of the cluster ACL.

 

Test3: Verifying SQL Server 2000 Failover Cluster Dependencies

Cluster resources may be dependent on other resources to start before they are brought online. Make sure the following dependencies exist.

 

Resource

Dependencies

Cluster IP Address

None

Cluster Name

Cluster IP Address

Quorum

None

MS DTC

Cluster Name, Cluster IP, Disk Resource (Quorum is the default)

SQL Server IP Address

None

SQL Server Network Name

SQL Server IP Address

SQL Server (the virtual server itself)

SQL Server Network Name, Disk Resources associated with the instance

SQL Server Agent

SQL Server

SQL Server FullText

SQL Server

 

Test 4: Failover Validation

 

Perform a failover of SQL Server virtual server to ensure that all resources fail over and restart on another node with no problems or affecting any other groups. Check Windows Event Log and SQL Server Error Log to ensure that no error occurred.

 

Test 5: Initiate Failure

 

Initiating a failure simulates failure of cluster resource and makes cluster service perform all the steps necessary to failover to the other node. By default the cluster service checks the failure condition three times before it decides that it is a real failure and it has to perform a failover. So initiating failure does not actually fail over the resource but performs all checks.

 

To initiate a failure, right click on any resource under SQL Server Group and select Initiate Failure.

 

Check Windows Event log and SQL Server Error Log to ensure that no error occurred.

Next three tests involve breaking the cluster physically. Engineering is against doing this as it might damage the disk when you pull the plug.

 

 

Test 6: Turn Off Each Node

 

In this test, you will first need to ensure that all of the default groups are located on the primary node. Then you will physically turn off (flip the switch) the primary node. If it is not possible to switch off the node, shutdown the cluster service in the primary node.

 

All resources should failover to the secondary node.

 

Check Windows Event log and SQL Server Error Log for any potential error messages after this occurs. Connect to the database from a SQL Server Client.

 

Once you have checked for any potential problems, turn on the primary node and wait until it fully boots (or restart the cluster service).

 

Now turn off the secondary node (or stop the cluster service). All resources should failover to the primary node. (If you have Cluster Administrator open in primary node it will generate an error, open Cluster Administrator in secondary node to see if the node failed over)

 

Check Windows Event log and SQL Server Error Log for any potential error messages after this occurs. Connect to the database from a SQL Server Client.

 

Once the groups fail back to the primary node, turn the secondary node back on (or restart the cluster service), and wait until it boots up fully.

 

Test 7: Break Network Connectivity

 

This test is similar in concept as the above test, which initiates a failover. But instead of simulating a computer failure, it simulates a network-related error.

 

From the primary node, remove the network cable from the public network card (or disable the network card from Control Panel - Network and Dial-up Connections). This will simulate a failure of the primary node, and should initiate a failover to the secondary node.

 

All resources should failover to the secondary node.

 

Check Windows Event log and SQL Server Error Log for any potential error messages after this occurs. Connect to the database from a SQL Server Client.

 

If there are no errors plug the network cable back into the primary node (or re-enable the network card), and then remove the network cable from the public network card on the secondary node (or disable the network card).

 

As before, all resources should failover to the primary node.

 

Check Windows Event log and SQL Server Error Log for any potential error messages after this occurs. Connect to the database from a SQL Server Client.

 

If there are no errors plug the network cable back into the public network card on the secondary node (or re-enable the network card).

 

If you are doing this test remotely, create local user account in each node with administrative privileges and connect to the node which you are testing using remote desktop on the heartbeat IP so that you can re-enable the public network card after the test.

 

 

Test 8: Break Heartbeat Network

 

The cluster should continue to function as other two networks (private and public) are configured to allow internal communications.

 

Test 9: Break Shared Array Connectivity

 

This test is designed to help uncover potential issues with the shared disk array. This test is designed to simulate what would happen if the controller card or cable connected from a node to the shared disk array fails.

 

From the primary node, remove the cable from the card used to connect to the shared array (or disable the Private Network card that connects the node to filer). This will simulate a failure of the primary node, and should initiate a failover to the secondary node.

 

All resources should failover to the secondary node. It will take some time for cluster to realize that the disks are not accessible, but eventually cluster service will stop and SQL Server will fail over. It takes more time for fail over if the private network card in the node that owns the Quorum drive is disabled.

 

Check Windows Event log and SQL Server Error Log for any potential error messages after this occurs. Connect to the database from a SQL Server Client.

 

Once you have checked for any potential problems, plug the cable back into the primary node (or enable the Private Network card that connects the node to filer).

 

Now remove the cable from the card used to connect to the shared array on the secondary node (or disable the Private Network card that connects the node to filer).

 

All resources should failover to the primary node.

 

Check Windows Event log and SQL Server Error Log for any potential error messages after this occurs. Connect to the database from a SQL Server Client.

 

Once you have checked for any potential problems, plug the cable back into the secondary node (or enable the Private Network card that connects the node to filer).

 

Test 9: Additional Database Tests

 

Failover the cluster during the following activities

 

Activity

Expected Result

Running a large transaction

Rollback

Backup

Incomplete backup � RESTORE HEADERONLY command reveals that the backup file is incomplete. If the backup is appended to an existing file, the file rollbacks to original size.

Restore

Restore fails leaving database in Loading state. Data and log files remain in disks.

DBCC CHECKDB

Fails with connectivity error. No Impact on database