SQL Server AlwaysOn Availability Groups Monitoring and Alerting

By:Alejandro Cobar | Last Updated: 2018-12-03 || Related Tips:More > Availability Groups

Problem

In the part 1 of this tip series , I showed you how to build the inventory of replicas that form an Availability Group. However, that’s just half of the picture, because you must be able to check the status of the databases within Availability Groups and this has to be a constant check to make sure the databases are available.

Solution

I will be presenting a script that gathers basic, but very useful information for each Availability Group database. This information will be obtained from the performance counter objects and with PowerShell we are going to be able to gather this information from all the availability databases from the replicas inventory you already have.

SQL Server Availability Groups Monitoring Script

The Availability Groups inventory is built first, because that way you can run this second script against a smaller subset of SQL Server instances. You can probably do it within one run, but that wouldn’t be as efficient.

Also, there are some performance counter objects that must be obtained from the primary replica and others from the secondary ones.

Information from Primary Replica Log Bytes Flushed/sec: A log flush occurs when data is written from the log cache to the physical transaction log file on disk, every time a transaction is committed. In order to get the rate, you have to obtain a couple of samples and then subtract the values. That’s the reason why the “WAITFOR DELAY” is there, so you can have an accurate delta. This will give you an idea of the activity going on currently at the databases in the Primary Replica. SELECT instance_name, cntr_value
INTO #Logflushes1
FROM sys.dm_os_performance_counters
WHERE object_name = 'SQLServer:Databases'
AND counter_name = 'Log Bytes Flushed/sec';
WAITFOR DELAY '00:00:01';
SELECT instance_name, cntr_value
INTO #Logflushes2
FROM sys.dm_os_performance_counters
WHERE object_name = 'SQLServer:Databases'
AND counter_name = 'Log Bytes Flushed/sec';
SELECT
ag.name AS 'GroupName'
,db_name(drs.database_id) AS 'Database'
,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 0 AND mf.database_id = drs.database_id) AS 'MDFSize'
,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 1 AND mf.database_id = drs.database_id) AS 'LDFSize'
,drs.synchronization_state_desc AS 'State'
,drs.synchronization_health_desc 'Health'
,ags.primary_replica AS 'PrimaryReplica'
,CONVERT(DECIMAL(10,2), log_flushes / 1024.0 / 1024.0) AS 'LogFlushed'
FROM sys.availability_groups ag
JOIN sys.dm_hadr_availability_group_states ags ON ag.group_id = ags.group_id
JOIN sys.dm_hadr_availability_replica_cluster_states cs ON ags.group_id = cs.group_id
JOIN sys.dm_hadr_availability_replica_states rs ON rs.replica_id = cs.replica_id
JOIN sys.dm_hadr_database_replica_states drs ON rs.replica_id=drs.replica_id
JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value log_flushes
FROM #Logflushes1 l1
JOIN #Logflushes2 l2 ON l2.instance_name = l1.instance_name
) log_flushes ON log_flushes.instance_name = DB_NAME(drs.database_id)
WHERE rs.role_desc = 'PRIMARY';
DROP TABLE #Logflushes1;
DROP TABLE #Logflushes2; Information from Secondary Replica Redone Bytes/sec: Amount of log records redone on the secondary database in the last second. Log Bytes Received/sec: Amount of log records received by the secondary replica for the database in the last second. Log Send Queue Size: Amount of log records of the primary database that has not been sent to the secondary databases, in kilobytes (KB). Send Rate: Average rate at which primary replica instance sent data during last active period, in kilobytes (KB)/second. Send Latency: Time (in seconds) it will take the Primary Replica to fully send all the log records to the Secondary Replica. Redo Queue Size: Amount of log records that the Secondary Replica has to process after being received by the Primary Replica. Redo Rate: Rate at which the Secondary Replica is effectively processing the log records received from the Primary Replica. SELECT instance_name, cntr_value
INTO #redo1
FROM sys.dm_os_performance_counters
WHERE object_name = 'SQLServer:Database Replica'
AND counter_name = 'Redone Bytes/sec';
SELECT instance_name, cntr_value
INTO #send1
FROM sys.dm_os_performance_counters
WHERE object_name = 'SQLServer:Database Replica'
AND counter_name = 'Log Bytes Received/sec';
WAITFOR DELAY '00:00:01';
SELECT instance_name, cntr_value
INTO #redo2
FROM sys.dm_os_performance_counters
WHERE object_name = 'SQLServer:Database Replica'
AND counter_name = 'Redone Bytes/sec';
SELECT instance_name, cntr_value
INTO #send2
FROM sys.dm_os_performance_counters
WHERE object_name = 'SQLServer:Database Replica'
AND counter_name = 'Log Bytes Received/sec';
SELECT
DB_NAME(rs.database_id) AS 'Database'
,r.replica_server_name AS 'SecondaryReplica'
,CONVERT(DECIMAL(10,2), rs.log_send_queue_size / 1024.0) AS 'LogSendQueueSize'
,CONVERT(DECIMAL(10,2), send_rate / 1024.0 / 1024.0) AS 'LogSendRate'
,CONVERT(DECIMAL(10,2), log_send_queue_size / CASE WHEN send_rate = 0 THEN 1 ELSE send_rate / 1024.0 END) AS 'SendLatency'
,CONVERT(DECIMAL(10,2), rs.redo_queue_size / 1024.0) AS 'RedoQueueSize'
,CONVERT(DECIMAL(10,2), redo_rate.redo_rate / 1024.0 / 1024.0) AS 'RedoRate'
,CONVERT(DECIMAL(10,2), rs.redo_queue_size / CASE WHEN redo_rate.redo_rate = 0 THEN 1 ELSE redo_rate.redo_rate / 1024.0 END) AS 'RedoLatency'
FROM sys.dm_hadr_database_replica_states rs
JOIN sys.availability_replicas r ON r.group_id = rs.group_id AND r.replica_id = rs.replica_id
JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value redo_rate
FROM #redo1 l1
JOIN #redo2 l2 ON l2.instance_name = l1.instance_name
) redo_rate ON redo_rate.instance_name = DB_NAME(rs.database_id)
JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value send_rate
FROM #send1 l1
JOIN #send2 l2 ON l2.instance_name = l1.instance_name
) send_rate ON send_rate.instance_name = DB_NAME(rs.database_id);
DROP TABLE #send1;
DROP TABLE #send2;
DROP TABLE #redo1;
DROP TABLE #redo2;

PowerShell Script to Automate the Data Collection of SQL Server Availability Group Databases

Below is the PowerShell script that unifies all the code described above. This will let you grab all the information from your environment and store it in a central place for data collection, analysis and monitoring. Note: you will have to update some of the parameters in the PowerShell script below to match your environment settings.

$labServer = "XXX"
$inventoryDB = "XXX"
#Clean the table where the monitoring information will be stored
Invoke-Sqlcmd -Query "TRUNCATE TABLE XXX" -Database $inventoryDB -ServerInstance $server
Write-Host "Fetching all the Primary Replicas from the inventory"
#Fetch all the instances that have a PRIMARY role in an Availability Group
#This list of instances should ideally come from the list we already built in Part 1 of this series
/*
This is an example of the result set that your query must return
######################