SCOM 2012: Check greyed out agents

Greyed out agents are can be a nightmare for a System Center Operations Manager admin. An agent gets greyed out if the Health Service is not communicating correctly with the Management Servers. Normally an alert should be created with the name “Health Service Heartbeat Failure” which indicates this status. But sometimes I see the situation that the alert was created, but also auto-resolved by the system after a while (because of an agent recovery etc.). The problem then is if the agent still stays in an unhealthy state but no new alert gets created. I see that from time to time if the agent is stuck or has resource problems. This situation needs to be solved quickly because during that time no monitoring on the agent side takes place.

So how can this be resolved?

I implemented this solution: The management servers already know which agents are greyed out, so I have created a rule which runs on the “All Management Servers Resource Pool” every 5 min (you can select another interval if you like). It checks which agents are greyed out but are not in maintenance mode and then checks for each agent if there is an open “Health Service Heartbeat Failure” alert. It adds the server to a list which will be populated in one alert with the name “Sample – greyed out agents”, if no alert was found.

The main logic of the rule bases on a Powershell script. Here is the part, with the logic – I have skipped everything around it (log function, SCOM module, etc.).

$TotalCount=0
$list=””
$agentclass = Get-SCOMClass -Name “Microsoft.SystemCenter.Agent”
# Find greyed out agents which are not in maintenance mode
$agentobjects = Get-SCOMMonitoringObject -Class:$agentclass | Where-Object {($_.IsAvailable -eq $false) -and ($_.InMaintenanceMode -eq $False)}
if ($agentobjects -is [Object])
{
    $msg = “`r`nFound greyed out agents which are not in maintenance mode.”;
    Log -msg $msg -debug $debug -debugLog $debugLog;
    # Go through agent list
    foreach ($agent in $agentobjects) 
   {
       $msg =  “`r`n”+ $agent.displayname
       Log -msg $msg -debug $debug -debugLog $debugLog;
       #Go on if watcher state for the agent is unhealthy
       if((Get-SCOMClass -name “Microsoft.SystemCenter.HealthServiceWatcher”| get-scomclassinstance |  Where-Object {$_.Displayname   -eq $agent.DisplayName}).HealthState -ne ‘Success’)
       {
           # Find open Health Service Heartbeat Failure alert for the agent
           $alert=get-scomalert -name ‘Health Service Heartbeat Failure’ | where {($_.ResolutionState -ne 255) -and ($_.MonitoringObjectDisplayName -eq $agent.DisplayName)}
           # No alert for greyed out agent found
           if ($alert -isnot [Object])
           {
               $list+=”`r`n”+$agent.displayname
               $msg=”`r`nThe agent “+ $agent.displayname + ” has no open Health Service Heartbeat Failure alerts. Add to list.”
               Log -msg $msg -debug $debug -debugLog $debugLog;
               $Totalcount++
           }
       }
   } 
}

You can find the rule in a small management pack called Sample.BaseMonitoring, which you can download here.
It is designed for SCOM 2012 SP1. Please test it in your development environment before you add it to production!

Advertisements
Post a comment or leave a trackback: Trackback URL.

Comments

  • afzal  On August 9, 2016 at 8:14 am

    dear

    it is very good management pack,

    but sometime I noticed it would stop creating/or updating the alert. look like this stop working when any server in the all management server resource pool in grayout.

    please advise how can we ensure it is running, even if other MS not the RMS emulator get grayout

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: