You are browsing the archive for 2013 October.

SCOM: Disk space monitoring extension pack

1:03 pm in Uncategorized by Dieter Wijckmans

In the constant quest to keep your environment running, disk space is one of the resources that must stay available to satisfy your organization’s continuously growing hunger for storage.

The price of storage has dropped significantly over the last few years, but unfortunately the demand for storage has grown as well: files are getting bigger and more and more data is kept.

SCOM has used different approaches over the years to make sure you are properly alerted when disk space is running low. In this post I will show you my method of keeping an eye on all the available disk space. This is my point of view, however, and as usual it is open for discussion.

I started this blog post because of a case I received from one of my customers:

  • Disks should be monitored on both free MB left AND % free space left.
  • SCOM only needs to react when BOTH thresholds are breached.
  • Different thresholds apply to critical and non-critical servers.
  • A different kind of ticket needs to be created for critical and non-critical servers.
  • A warning and an alert should be sent out: one to warn up front and another when things get serious.
  • Every day a new ticket should be sent when the condition was not solved the day before.

My initial response was: Great, let’s get Orchestrator in here to handle the better part of the logic. The answer was, as predicted, no.

OK, so let’s break this up into different categories:

  • Detection
  • Notification
  • Reset

Note: I have already created a management pack for this scenario, but I am explaining the scenario thoroughly so you can use this guide for another monitoring scenario as well.

Download the management pack from the gallery:

download-button-fertig11

 

Detection

We are in luck because SCOM already has the ability to monitor on both conditions mentioned above (free MB left AND % free space). This was the case in the logical disk monitor and it is still present today, BUT (yep, there will be a lot of BUTS in this post) this is not the case in the Cluster and Cluster Shared Volumes (CSV) monitors. They use the new kind of disk space monitoring, where the previous single monitor with double thresholds is divided into 2 separate monitors with a rollup monitor on top. In my opinion a good decision.

So at this point we can use the same method for all different kinds of disks: 2 monitors with 1 rollup monitor on top. GREAT.

So let’s start configuring them! Fill in all the different thresholds and you are good to go, right?

In theory yes… but in this case not quite. One of the big hurdles was the fact that a monitor can only fire off one notification as long as it is not reset to healthy. As we need a notification on both warning and error, we have an issue here. The notification process is built by design so that you will only receive an alert once for either warning or error on the monitor.

Because we need both a warning AND an error, we need to create additional monitors to cope with this requirement.

This is in fact how I tackled this issue.

Creating the necessary monitors.

To make sure we can act on both thresholds we need to create 3 monitors: a rollup monitor, a Free Space Monitor (%) and a Free Space Monitor (MB), like the ones that ship out of the box.

So let’s get at it:

Note: I’m using the console to quickly create the management pack, to show that you can solve this issue with a minimum of authoring knowledge. However, I advise you to dig deeper into the different authoring solutions for SCOM.

Note: All the necessary monitors are already in the management pack which I included in this post. I solely mention the process here so you potentially can use this method to do the same thing for another scenario.

Create the Rollup monitor

A rollup monitor does not check a condition itself but reacts to the state of the monitors beneath it. Therefore we have to create it first. To make sure it shows up right under the other monitors we keep the same naming but add the word WARNING at the end.

Open the monitor tab and choose to create a monitor => Aggregate Rollup Monitor…

Fill in the name of the monitor

SNAG-0154

In this case we want the best state of any member to roll up, because we want both MB free AND % free to be breached, and thus in warning state, before we are alerted:

SNAG-0155
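To make the rollup choice a bit more tangible, here is a minimal sketch in Python of what "best state of any member" means compared to "worst state of any member". This is only a model of the logic, not SCOM code; the Health enum and both function names are made up for this illustration.

```python
from enum import IntEnum


class Health(IntEnum):
    """Health states ordered from best (lowest) to worst (highest)."""
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def best_of_rollup(member_states):
    """Roll up the best (least severe) state of any member.

    The rollup only turns unhealthy when ALL members are unhealthy,
    which is the AND behaviour wanted in this scenario.
    """
    return min(member_states)


def worst_of_rollup(member_states):
    """Roll up the worst state of any member (OR behaviour, for contrast)."""
    return max(member_states)


# Only the MB monitor is in warning: best-of keeps the rollup healthy.
only_mb_breached = [Health.WARNING, Health.HEALTHY]
# Both the MB and the % monitor are in warning: best-of now goes to warning.
both_breached = [Health.WARNING, Health.WARNING]
```

With worst-of the rollup would already go to warning when only one of the two monitors is breached, which is exactly what we want to avoid here.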

We would like to have an alert when there’s a warning on both monitors underneath this monitor so we change the severity to Warning.

SNAG-0156

Create the monitors underneath this rollup monitor

To make sure our new rollup monitor is correctly influenced by the monitors underneath, we now need to create the monitors with the conditions MB free and % free.

These are included in the management pack as well. Keep in mind that when you create each monitor you need to select the appropriate rollup monitor as its parent, as shown below:

SNAG-0160

For the performance counter in this case I used these parameters:

Object: $Target/Property[Type="Windows5!Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk"]/ClusterName$

Counter: % Free Space

Instance: $Target/Property[Type="Windows5!Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk"]/ClusterDiskName$$Target/Property[Type="Windows5!Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk"]/ClusterResourceName$

NOTE: Make sure to turn off alerting on these monitors, as we do not want to receive individual alerts but just the alert of the rollup monitor.

If you have created the monitors correctly it should look like this:

 

SNAG-0161

As you can see, the new monitors are now shown right beneath the out-of-the-box monitors.

You can use this approach for basically any scenario where you need to create double tickets for the same issue when they are caused by the same 3-state monitor.

Last important step in configuring the monitors

Because we now have the warning condition set with the appropriate thresholds, we need to do the same thing for the out-of-the-box monitor so that it only shows us an alert when both critical conditions are met.

Therefore we need to override them with the proper thresholds and configuration:

For the rollup monitor we want to make sure it generates an alert when both critical conditions are met. Therefore we set the following overrides to true:

  • Generates alert
  • Enabled
  • Auto-Resolve

For the alerting part we only want to be alerted on the critical state, because otherwise the 2 sets of monitors will interfere with each other. Therefore we need to set Alert on State to “critical health state”. Last but not least, the rollup algorithm needs to be best health state of any member, because again we only want to be notified when both conditions are met.

 

SNAG-0162

The 2 monitors under the aggregate rollup monitor also need to be updated with the correct thresholds and set to not generate alerts. Otherwise we will get useless alerts, because we only want to be alerted when both conditions are met.

SNAG-0163
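Putting the two sets of monitors together, the detection logic can be sketched roughly like this in Python. It is a model of the behaviour, not SCOM code. The threshold numbers are purely hypothetical examples, and I am assuming a threshold counts as breached when the measured value drops below it and that the critical thresholds are stricter than the warning ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    free_mb: float   # breached when free MB drops below this value (assumption)
    free_pct: float  # breached when % free drops below this value (assumption)


def both_breached(free_mb, free_pct, thresholds):
    """True only when the MB threshold AND the % threshold are breached."""
    return free_mb < thresholds.free_mb and free_pct < thresholds.free_pct


def raised_alerts(free_mb, free_pct, warning, critical):
    """Return the alerts the two rollup monitors would raise for one disk.

    The custom WARNING rollup and the overridden out-of-the-box rollup are
    evaluated independently, so a disk in really bad shape gets both alerts.
    """
    alerts = []
    if both_breached(free_mb, free_pct, warning):
        alerts.append("warning")
    if both_breached(free_mb, free_pct, critical):
        alerts.append("critical")
    return alerts


# Hypothetical thresholds, NOT the values shipped in the management pack.
WARNING = Thresholds(free_mb=10240, free_pct=10)
CRITICAL = Thresholds(free_mb=5120, free_pct=5)
```

A disk with 4000 MB free but still 20 % free raises nothing in this model, because only one of the two thresholds is breached.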

Creating the necessary groups.

After we have created the monitors we need a clear distinction between the critical servers and the non-critical servers. Groups give us the opportunity to set different thresholds and create different levels of tickets per category of server.

You can create a group of servers with explicit members and go from there. From a manageability standpoint, however, this is not a good idea, as it requires the discipline to add a server to the group whenever it changes category or is newly installed. This leaves way too much room for error.

Therefore we are going to create groups based on an attribute which is detectable on the servers. In this case I set a registry key on the servers identifying whether it’s a critical server or not. This can easily be done by running a script through SCCM or during the build of the server.

Note: Do this in a management pack separate from the one you use for your monitors, as this management pack, once sealed, can be reused throughout your entire environment.

To create the attribute go to the authoring pane and under management pack objects select the attributes

 SNAG-0120

Create new attribute

SNAG-0122

In this case I name it Critical server.

In the discovery method we need to tell SCOM how the attribute will be detected. In this case I choose to use a regkey.

In the target you select Windows Server and automatically the Target will be put in as Windows Server_Extended

The management pack should be the same management pack as your groups will reside in because we need to operate within the same unsealed management pack.

SNAG-0123

So after we filled in all the parameters it should look like this:

SNAG-0124

Last thing to do is to identify the key which is monitored by SCOM.

In my case it’s HKEY_LOCAL_MACHINE\Category\critical

SNAG-0126

Next up is to create both our groups: critical and non-critical servers.

Create a new group for the critical servers:

SNAG-0128

Check out the Dynamic Members rules

SNAG-0129

Select the Windows Server_Extended class and check whether the property Critical server equals True.

SNAG-0133

The group will now be populated with all servers where this key has the value “true”.

SNAG-0134

The only thing left to do is the opposite: create a group that contains only the servers which do not have this key set to true.

SNAG-0136

SNAG-0137
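The membership rules of both groups boil down to something like the sketch below. This is plain Python to show the idea, not how SCOM evaluates dynamic membership internally. The server names are invented, None stands for a server where the key was not found, and comparing the value case-insensitively is my assumption.

```python
def split_by_criticality(discovered):
    """Split servers into (critical, non_critical) by the discovered attribute.

    `discovered` maps a server name to the registry value that was found,
    or to None when the key does not exist on that server.
    """
    critical, non_critical = [], []
    for name, value in sorted(discovered.items()):
        # Assumption: the value is compared case-insensitively.
        if value is not None and value.strip().lower() == "true":
            critical.append(name)
        else:
            # A missing key or any other value ends up in the non-critical group.
            non_critical.append(name)
    return critical, non_critical


# Hypothetical servers, for illustration only.
servers = {"SQL01": "true", "WEB01": "false", "FILE01": None}
```

Note that a server without the key lands in the non-critical group, so a forgotten registry key means a lower priority ticket rather than no monitoring at all.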

 

Notification

Because we now have all the building blocks to separate warnings and errors for both groups of servers, the only thing left to do is create the notification channels with the desired actions configured.

I ended up with 3 scenarios with their notifications to match the requirements:

Notification 1:

I want to be alerted for a critical alert on the Critical servers and create a high priority ticket through my notification channels.

SNAG-0165  

Notification 2:

I want to be alerted for a critical alert on the non-critical servers and create a normal priority ticket through my notification channels.

SNAG-0166

Notification 3:

I want to be alerted for a warning alert on both the critical servers and the non critical servers and send out a mail through my notification channels.

SNAG-0167
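The three notifications above amount to a small routing table. Here is a rough Python sketch of that decision, purely as a model. In SCOM this is configured through subscriptions and channels, not code, and the function name and return strings are made up for this example.

```python
def route_notification(severity, server_is_critical):
    """Return the action for an alert, following the three notifications above.

    severity: "critical" or "warning" (anything else is not notified).
    server_is_critical: True when the server is in the critical servers group.
    """
    if severity == "critical":
        # Notification 1 and 2: the ticket priority depends on the server group.
        if server_is_critical:
            return "high priority ticket"
        return "normal priority ticket"
    if severity == "warning":
        # Notification 3: the same action for both groups of servers.
        return "mail"
    return None
```

Because the warning notification does not look at the group at all, one subscription covers both the critical and the non-critical servers.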

The next steps to get the tickets out of SCOM should be configured specifically for your environment, but at this point the different scenarios are covered.

Reset

The last thing on the list was to reset the monitors on a daily basis so we keep getting alerts as long as the condition is not resolved. This is accomplished by using my resetmonitorsofspecifictype script which I documented in this blog post: http://scug.be/dieter/2013/10/23/scom-batch-reset-monitors-through-powershell/

 

Conclusion

This blog post covers all the different requirements in this scenario. Better still, we did not have to build any complex logic outside of SCOM but used only technology within SCOM to accomplish our goal.

The last thing I would recommend is to seal the management pack used for the group creation. That way you can reuse it in other unsealed management packs as well to distinguish between critical and non-critical servers.

Again, you can use this approach for other monitors as well.

SCOM: Batch reset monitors through PowerShell

12:42 pm in operations manager, SCOM, SCOM 2012, sysctr by Dieter Wijckmans

Monitors have been a very useful addition to SCOM ever since SCOM 2007 came out. However, for a lot of new SCOM administrators the alerts generated by monitors can sometimes cause headaches.

An alert is raised when the state changes and closed when the state changes back to the healthy condition. This is the really short version…

If you speak to advanced SCOM admins, they will all agree that managing monitor-generated alerts can be tricky from time to time if you work with operators.

If at some point they close an alert in the console that was generated by a monitor while the condition has not changed, the monitor will remain in an unhealthy state until a forced reset is done on the monitor itself.

We all know how many monitors are floating around in our environment, so it’s just a disaster waiting to happen. Therefore it is wise to regularly reset the unhealthy monitors for your core business services until everybody is aware of the fact that they should not close alerts from a monitor…

However I use this setup also for another annoying thing that can have great impact on your environment. Again this is a scenario to rule out a human error.

  • IF an alert is raised by a monitor going into an unhealthy state, a notification is successfully triggered and a ticket is created… So far so good.
  • BUT if someone closes the ticket or the alert without looking at it, the condition remains and no warning will be raised again.
  • As a lot of my customers are using SCOM as a monitoring tool in the backend and only monitor the tickets it generates, they will not be alerted again.

Therefore I created this small PowerShell script in combination with a bat file. It simply resets the health of all unhealthy instances of the monitor you specify. The only thing left to do is create a scheduled task for the bat file and you are good to go.
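To show why a closed alert never comes back by itself, here is a small Python model of the behaviour described above. It is a simplification for illustration and not SCOM code or the actual script. The Monitor class and its methods are invented for this sketch.

```python
class Monitor:
    """Tiny model of a monitor that alerts only on a state change."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.open_alert = False
        self.alerts_raised = 0

    def check(self, condition_bad):
        """Evaluate the condition; alert only when going healthy -> unhealthy."""
        if condition_bad and self.healthy:
            self.healthy = False
            self.open_alert = True
            self.alerts_raised += 1
        elif not condition_bad and not self.healthy:
            # State changes back to healthy: the alert is closed automatically.
            self.healthy = True
            self.open_alert = False

    def close_alert(self):
        """An operator closes the alert by hand; the monitor state stays as is."""
        self.open_alert = False

    def reset(self):
        """Forced reset: the monitor goes back to healthy and its alert closes."""
        self.healthy = True
        self.open_alert = False


monitor = Monitor("Logical Disk Fragmentation Level")
monitor.check(condition_bad=True)   # first alert is raised
monitor.close_alert()               # operator closes it without fixing anything
monitor.check(condition_bad=True)   # still unhealthy, so no state change, no alert
alerts_before_reset = monitor.alerts_raised
monitor.reset()                     # the scheduled reset
monitor.check(condition_bad=True)   # healthy -> unhealthy again, new alert
```

In this model the second check raises nothing because there is no state change, and only the forced reset makes the next check produce a fresh alert.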

The script can be downloaded at the Gallery together with the bat file.

download-button-fertig11

Example: Fragmentation level is high and we want to be alerted again every day as long as the condition remains:

SNAG-0168

Check the monitor properties to retrieve the monitor display name:

SNAG-0169

In this case it is “Logical Disk Fragmentation Level”. Copy and paste the name.

SNAG-0170

Fill in the name in the batch file and run it.

SNAG-0171

The unhealthy monitors will be reset and their alerts are automatically closed in the console.

SNAG-0172

If we check the monitor again, it has now been forced back to the reset state and will fire again the next time it evaluates the condition, if the unhealthy condition is still true.

 SNAG-0173

This way you will receive a new alert every time this script runs, as long as the condition remains. You could also schedule this during the shift change of the helpdesk, so the new shift starts with a clean sheet and a clear view of the current situation in your environment.