
SCOM Quick Query: Logical Disk Space For My Environment

10:31 am in #scom, #sysctr by Jan Van Meirvenne

 

Sometimes I get questions in the style of “What is the current state of my environment in terms of…”. If there is no report in SCOM I can point to, I usually create a quick query on the Data Warehouse and provide the data as an Excel sheet to the requestor. Afterwards, should the question be repeated over and over, I create a report for it and provide self-service information.

To both avoid forgetting these kinds of ‘quick and dirty’ queries and to share my work with you, I will occasionally throw in a post when I have a query worth mentioning.

Here we go for the first one!

If you are not interested in using the extended Logical Disk MP, you can use this query on your DW to quickly get a free space overview of all logical disks in your environment:

select max(time) as time, server, disk, size, free, used from
(
select perf.DateTime as time, e.Path as server, e.DisplayName as disk,
round(cast(ep.PropertyXml.value('(/Root/Property[@Guid="A90BE2DA-CEB3-7F1C-4C8A-6D09A6644650"]/text())[1]', 'nvarchar(max)') as int) / 1024, 0) as size,
round(perf.SampleValue / 1024, 0) as free,
round(cast(ep.PropertyXml.value('(/Root/Property[@Guid="A90BE2DA-CEB3-7F1C-4C8A-6D09A6644650"]/text())[1]', 'nvarchar(max)') as int) / 1024, 0) - round(perf.SampleValue / 1024, 0) as used
from Perf.vPerfRaw perf
inner join vManagedEntity e on perf.ManagedEntityRowId = e.ManagedEntityRowId
inner join vPerformanceRuleInstance pri on pri.PerformanceRuleInstanceRowId = perf.PerformanceRuleInstanceRowId
inner join vPerformanceRule pr on pr.RuleRowId = pri.RuleRowId
inner join vManagedEntityProperty ep on ep.ManagedEntityRowId = e.ManagedEntityRowId
where
pr.ObjectName = 'LogicalDisk'
and pr.CounterName = 'Free Megabytes'
and ep.ToDateTime is null
and perf.DateTime > dateadd(HOUR, -1, GETUTCDATE())
) data
group by data.server, data.disk, data.size, data.free, data.used
order by server, disk

 

Available fields:

Time: the timestamp of the presented data
Server: the server the disk belongs to
Disk: the name of the logical disk
Size: the size of the disk in GB
Free: the free space on the disk in GB
Used: the used space on the disk in GB
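
If you want to hand the result over as an Excel-friendly file without walking the requestor through SQL Management Studio, something like the following works from PowerShell. This is just a minimal sketch: the server, database and file names are placeholders, it assumes the query above is saved to a .sql file, and it uses Invoke-Sqlcmd from the SQL Server PowerShell module.

# Minimal sketch: run the DW query and export the result as a CSV that opens in Excel.
# Server, database and path names are placeholders - adjust them to your environment.
$dwServer  = 'SQLSERVER01'                  # hypothetical Data Warehouse SQL instance
$dwDb      = 'OperationsManagerDW'          # default DW database name
$queryFile = 'C:\Temp\FreeDiskSpace.sql'    # the query above, saved to a file

Invoke-Sqlcmd -ServerInstance $dwServer -Database $dwDb -InputFile $queryFile |
    Select-Object time, server, disk, size, free, used |
    Export-Csv -Path 'C:\Temp\FreeDiskSpace.csv' -NoTypeInformation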

 

Please note that I am not a SQL guru, so if you find a query containing war crimes against best practices, don’t hesitate to let me know!

 

See you in another knowledge dump!

[SCOM 2007 -> 2012] Minimizing the risk of failures by manually testing your database upgrade

6:14 pm in #scom, #sysctr by Jan Van Meirvenne

When you upgrade the last management server (the RMS) to SCOM 2012, the setup will also upgrade the database. As this step is very sensitive to inconsistencies, it is advised to test this update on a copy of your Operational database.

The following steps are done during the database upgrade:

1. Database infrastructure updates (build_mom_db_admin.sql)

– Sets up recovery mode, snapshots and partitions

2. Database upgrade / creation (build_mom_db.sql)

– Populates a new database, or converts a 2007 database

3. Enter localization data (build_momv3_localization.sql)

– Inserts localized message strings

– This is the last step where things can go wrong; it is also the last action you can perform manually

4. Management Pack Import

– This step is performed by the setup itself. It imports new management packs into the database.

5. Post Management Pack import operations (Build_mom_db_postMPimport.sql)

– This step performs some database changes that are needed after the MP import

6. Configuration Service Database setup (Build_om_db_ConfigSvcRole.sql)

– This step sets up the tables and roles needed for the Configuration Service. The SCOM Configuration Service used to use a file-based store containing the current configuration of workflows for the management group. For performance and scalability reasons, this role has been transferred to the database level.

All the SQL files can be found on the SCOM 2012 installation media in the server\amd64 folder.

First, should you have encountered an upgrade failure, restore your database to the original location: you cannot reuse the current version, as it will probably be corrupted beyond repair. Restore the RMS binaries, using the same database and service account information as the original setup. The RMS will restore itself in the management group; the subscriptions and administrator role will be reset, however.

Take another backup copy of the Operational database, restore it to a temporary database and execute the aforementioned SQL scripts in the correct order. You will usually get a more precise error message in SQL Server Management Studio than in the SCOM setup log (stored procedure names, table names…).
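
As a rough sketch of that dry run (instance, database and media path names are placeholders, and Invoke-Sqlcmd from the SQL Server PowerShell module is assumed to be available), running the scripts from PowerShell could look like this. Note that step 4, the management pack import, is performed by the setup itself and cannot be replayed here, so in practice the first three scripts are the ones you can test most reliably.

# Rough sketch: execute the upgrade scripts manually against a restored copy of the
# Operational database. Instance, database and media path are placeholders.
$sqlInstance = 'SQLSERVER01'             # hypothetical SQL Server instance
$tempDb      = 'OperationsManagerTemp'   # the restored copy of the Operational database
$scriptDir   = 'D:\Server\amd64'         # server\amd64 folder on the SCOM 2012 media

$scripts = 'build_mom_db_admin.sql',
           'build_mom_db.sql',
           'build_momv3_localization.sql',
           'Build_mom_db_postMPimport.sql',   # normally runs after the MP import done by setup
           'Build_om_db_ConfigSvcRole.sql'

foreach ($script in $scripts) {
    Write-Host "Executing $script..."
    # A failure here surfaces the offending stored procedure or table right away
    Invoke-Sqlcmd -ServerInstance $sqlInstance -Database $tempDb -InputFile (Join-Path $scriptDir $script)
}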

A common cause for upgrade failure is an inconsistent management pack. Pay close attention to the affected items mentioned in the error message. They usually contain a table or column reference. Look up this reference and try to get hold of a GUID. Query the GUID in the live database using the PowerShell interface (Get-MonitoringClass –Id <GUID> or Get-MonitoringObject –Id <GUID>, for example). This way you can find out which element(s) are causing issues and remove the management pack(s) that contain them.
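
For example, a minimal sketch of that lookup, assuming you run it from the SCOM 2007 R2 Command Shell; the GUID is a placeholder taken from your error message, and GetManagementPack() is the SDK call that tells you which management pack owns a class:

# Minimal sketch: find out what a GUID from the SQL error message refers to.
# Run from the SCOM 2007 R2 Command Shell; the GUID below is a placeholder.
$guid = '00000000-0000-0000-0000-000000000000'

# Is it a class? If so, the owning management pack is your removal candidate.
$class = Get-MonitoringClass -Id $guid
$class | Format-List DisplayName, Name
$class.GetManagementPack().Name

# Or is it a discovered object? FullName and Path reveal what it belongs to.
Get-MonitoringObject -Id $guid | Format-List DisplayName, FullName, Path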

You can uninstall suspect management packs by executing the stored procedure “p_RemoveManagementPack”, providing the management pack GUID as a parameter. If the stored procedure returns a value of 1, check whether there are any dependent management packs you should remove first.
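
A quick sketch of that call against the temporary database. The GUID is a placeholder for the management pack Id, and the parameter name is an assumption on my part, so check the procedure definition in your database first:

# Quick sketch: remove a suspect management pack from the temporary database and
# check the return value. The GUID is a placeholder; the parameter name is an
# assumption - inspect the definition of p_RemoveManagementPack before running it.
$mpGuid = '00000000-0000-0000-0000-000000000000'

$result = Invoke-Sqlcmd -ServerInstance $sqlInstance -Database $tempDb -Query @"
DECLARE @rc int;
EXEC @rc = dbo.p_RemoveManagementPack @ManagementPackId = '$mpGuid';
SELECT @rc AS ReturnValue;
"@

if ($result.ReturnValue -eq 1) {
    Write-Warning 'Return value 1: dependent management packs need to be removed first.'
}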

It will be rinse and repeat from here on: restore the backup of the original database to the temporary database again, remove the problematic management pack using the stored procedure and re-execute the scripts until no more errors occur. After you have listed all troublesome management packs, you can remove them from the live management group, wait for the config churn to settle and retry the upgrade. Usually, the subsequent upgrade steps are pretty safe and you shouldn’t have big issues beyond this point.

Keeping things tidy while authoring (1st edition)

5:24 pm in #scom, #sysctr by Jan Van Meirvenne

I enjoy authoring a management pack, not only because it is a really fun process to go through, but also because of the puzzle of combining the various techniques you can use to make your work of art lean and mean.

Here are some of the things I try to do every time I embark on a new authoring adventure:

1) Use a naming convention

Using a naming convention internally in your management pack greatly helps when skimming through your XML code, associating multiple items (overrides, groups,…) or creating documentation. It is also a big advantage when you are working together with other authors: using the same convention improves readability (provided everybody involved is familiar with the convention) and guarantees a standard style in your code.

For example, at my current customer we have the following naming convention:

<Customer>.<Workload>.MP.<ObjectType>_<Component>_<Name>

The object type is actually an abbreviation for the various possible objects a management pack can contain:

Abbreviation – Object Type
C – Class
D – Discovery
R – Rule
M – Monitor
REC – Recovery
G – Group
SR – Secure Reference
REP – Report

So let’s just say I work at company ‘Foo’ and I am working on a management pack for the in-house application ‘Bar’. I have to create a single class for it, as it has only one component. It requires a service monitor and a rule that alerts on a certain NT event. My objects would be named as follows:

– Management Pack: Foo.Bar.MP
– Class: Foo.Bar.MP.C_Application
– Discovery: Foo.Bar.MP.D_Application
– Monitor: Foo.Bar.MP.M_Application_ServiceStatus
– Rule: Foo.Bar.MP.R_Application_EventAlert


2) Avoid manual-reset monitors

For me, the blight of Operations Manager is the manual-reset monitor. It requires the SCOM operator to be alert and notice that a monitor is unhealthy not because there is a genuine issue, but because nobody pushed the button to clear the state. This kind of monitor is only an option in one scenario: there really is no way to detect when an issue is gone for the monitored application, and the issue must affect the health state of the SCOM object (for uptime or SLA reporting). But even then it might produce inaccurate reports, because you don’t know how long ago the issue was solved when you finally reset that monitor.

If you can’t find a good reason to use a manual-reset monitor, then I suggest you use an alert rule instead. If the reason to go for a manual-reset monitor was to suppress consecutive alerts, then just use a repeat count on the rule (more on that in a jiffy).

3) Enable repeat count on alert rules

One of the biggest risks when going for an alert rule is bad weather: an alert storm. Imagine you have a rule configured to generate an alert on a certain NT event, and suddenly the monitored application goes haywire and starts spitting out error events like crazy. Enable the repeat count feature to create only one alert for this, with a counter indicating the number of occurrences. Is it really useful to have SCOM show 1000 alerts to indicate 1 problem? Well, I did get requests along the lines of: “we measure the seriousness of a problem by the number of SCOM mails… when there are a lot of mails we know we need to take a look, otherwise we don’t.”

This brings me to a small tip for when you are in a workload intake meeting with the app owner (maybe I’ll write a full post on this): ALWAYS RECOMMEND AGAINST E-MAIL BASED MONITORING. This is a trend that lives in IT teams not accustomed to centralized and pro-active monitoring (let the problem come to us). Emphasize the importance of using the SCOM interfaces directly, as this keeps the monitoring platform efficient and clean. Mails should only be used to notify on serious (fix me now or I die) issues. Also, either make the alerts in the subscription self-sustaining (auto-resolve) or train the stakeholders to close the alert themselves after receiving the mail and dealing with the problem.

4) Create a group hierarchy

I love groups. They are the work-horses behind SCOM scoping. For every management pack I make a base group that contains all servers or other objects related to the workload. I can then scope views and overrides to this group as I please, and create a container specific to the workload. To prevent creating a trash-bin group, and also to be able to do more specific scoping (e.g. different components, groups for overrides,…), I often do the following:

Create a base group to contain all other groups for the workload. I use a dynamic expression that dictates that all objects of the class ‘Group’ with a DisplayName containing ‘G_<Application>_’ should be added as a subgroup. This is a set-once-and-forget configuration: as long as I adhere to the naming convention, any group relevant to the application workload is automatically added. However, watch out with this, because you can easily create a circular reference (the base group being configured to contain itself, which does not compute well with SCOM) or end up containing groups you did not intend to add as subgroups.

Example:

G_Foo (the base group)
    – G_Foo_Servers (all servers used in the workload)
    – G_Foo_Bar (all core component objects of the Bar application; this is useful if you want to scope operator access to the application objects but not to the underlying server ones)
    – G_Foo_DisabledDiscovery (this must be seen like a GPO/OU structure: create groups for your overrides and simply control which objects receive the override by drag-and-dropping them in as members)
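
Once the MP is deployed, I like to sanity-check what actually landed in the base group. Here is a minimal sketch with the SCOM 2012 cmdlets; the management server and group names are placeholders following the convention above, and GetRelatedMonitoringObjects() is the SDK method on the group object that returns its members:

# Minimal sketch: verify which objects and subgroups ended up in the base group.
# Management server and group names are placeholders following the naming convention.
Import-Module OperationsManager
New-SCOMManagementGroupConnection -ComputerName 'SCOMMS01'

$baseGroup = Get-SCOMGroup -DisplayName 'G_Foo'
$baseGroup.GetRelatedMonitoringObjects() |
    Select-Object DisplayName, FullName |
    Sort-Object DisplayName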

5) When you expect a high monitored volume of a certain type, use cookdown

Let’s just say our Foo application can host a high number of Bar components on a single system, and that we use a script to run a specific diagnostic command and alert on the results. Do we want to run this script for every Bar instance? That would cause a lot of redundant polls and increased overhead on the system. Why not create a single script data source and let the cookdown principle do its job? Cookdown implies that if multiple monitors and/or rules use the same data source, they will sync up. The data source then executes only once on the monitored system and feeds all attached rules and monitors with data. There are a lot of requirements to enable this functionality, but here are the basics:

– The script must be cookdown compatible: multiple property bags must be created (preferably one for each instance) and the consuming rules and monitors must be able to extract the property bag of their target instance (I usually include a value containing the instance name and use an expression filter to match this to the name attribute of the instance in SCOM). Remember, this script must retrieve and publish all data for all attached instances in one go. A minimal script sketch follows at the end of this section.

– The configuration parameters of the rules and monitors that use the data source must be identical. Every parameter that is passed to the script, as well as the interval settings, must be the same, or SCOM will split the execution up into two or more runtimes.

– You can use the cookdown analyzer to determine whether your setup will support cookdown. In Visual Studio you can do this by right-clicking your MP project and choosing “Perform Cookdown Analysis”.

[Screenshot: the “Perform Cookdown Analysis” option in the Visual Studio project context menu]

This will build your management pack and then let you specify which classes will have few instances. The selected items will not be included in the report as cookdown is not beneficial for low-volume classes.

[Screenshot: selecting the classes that will have few instances]

When you click ‘Generate Report’ an HTML file will be rendered containing a color coded report specifying all modules and whether they support cookdown (green) or not (red).

[Screenshot: the color-coded cookdown analysis report]
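
To illustrate the first requirement above, here is a minimal, hedged sketch of a cookdown-friendly PowerShell data source script. The instance list and diagnostic values are made up; the key points are that a single execution emits one property bag per Bar instance, that each bag carries the instance name so every consuming rule or monitor can filter out its own bag, and that the script only takes parameters shared by all attached workflows.

# Minimal sketch of a cookdown-friendly data source script (PowerShell).
# One execution returns a property bag per instance; the instance enumeration and
# diagnostic values below are placeholders for the real 'Bar' diagnostic command.
param($ComputerName)   # only parameters that are identical for ALL consuming workflows

$api = New-Object -ComObject 'MOM.ScriptAPI'

# Hypothetical: gather the state of all Bar instances in a single pass
$instances = @(
    @{ Name = 'Bar01'; Status = 'OK' },
    @{ Name = 'Bar02'; Status = 'Error' }
)

foreach ($instance in $instances) {
    $bag = $api.CreatePropertyBag()
    $bag.AddValue('InstanceName', $instance.Name)   # used by the expression filter to match the SCOM instance
    $bag.AddValue('Status', $instance.Status)
    $bag   # write the bag to the output stream; SCOM collects every bag emitted this way
}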

6) Don’t forget the description field and product knowledge

Something I have made myself guilty of: not caring about the description or product knowledge fields. These information stores can be a great asset when documenting your implementation. I make the following distinction between descriptions and product knowledge: a description contains what the object is meant to do, and product knowledge describes what the operator is meant to do (although some information will be reused from the description to provide context). These are invaluable features that can speed up the learning process for a new customer, author or operator, informing them on what they are working with and why things are as they are. This does not replace a good design and architecture, however, as the information is decentralized and easily modified without control. Always create some documentation external to your management pack, and store it in a safe and versioned place.

An additional and often overlooked tip: you can also provide a description for your overrides. Use it to prevent having to figure out why you put an override on a certain workload years ago.

Bottom line: always provide sufficient context to why something is present in SCOM!

7) For every negative trigger, try to find a positive one

When you are designing the workload monitoring and you notice that the customer gave an error condition for a certain scenario but not a clearing one… challenge him. The best monitoring solution is one that sustains itself, keeping itself clean and up-to-date.

There is one specific type of scenario where rule-based alerts are unavoidable: informative alerts that indicate a past issue or event (e.g. an unexpected reboot or a bad audit), although you should consider event collection if this is merely used for reporting purposes.

8) Don’t use object-level overrides

Although it seems useful for quick, precise modifications, I find object-level scoped overrides dangerous. Internally, an override on an object contains a GUID representing the Id of the overridden object in the database. When this object disappears for some reason (decommissioning, replacement,…) the override will still be there, but will only target thin air. Unless you documented the override, you will have no clue why it is there, and if you have a bunch of these guys, they can contaminate your management group quickly. Like I said earlier, I like to generalize my overrides using groups: instead of disabling a monitor for one specific object, I create a group called ‘G_Foo_Bar_DisabledMonitoringForDisk’, put my override on that group and then use a dynamic query (a static list also works of course, but it shares some of the same issues as object-level overrides) to provision my group. This allows you to centralize your override management, trace back why stuff is there and quickly extend existing overrides without the risk of duplicating them. As mentioned before, I highly recommend using the override description field to provide additional information.
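
If you inherit a management group that is already littered with object-level overrides, here is a rough sketch of how you could hunt down the orphaned ones with the SCOM 2012 cmdlets. The property and method names (ContextInstance, GetManagementPack()) come from the SDK’s override objects; treat this as a starting point to adapt, not a polished script.

# Rough sketch: list instance-scoped overrides whose target object no longer exists.
# Assumes the SCOM 2012 OperationsManager module and an open management group connection.
Import-Module OperationsManager

Get-SCOMOverride | Where-Object { $_.ContextInstance } | ForEach-Object {
    $target = Get-SCOMClassInstance -Id $_.ContextInstance -ErrorAction SilentlyContinue
    if (-not $target) {
        # The overridden object is gone: this override now targets thin air
        [pscustomobject]@{
            Override       = $_.Name
            ManagementPack = $_.GetManagementPack().Name
            Description    = $_.Description
        }
    }
}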

This was the first edition of my lessons learned regarding SCOM authoring; I’ll try to add new editions the moment I learn some new tricks! Do you have tips of your own, or do you do things differently? Let me know in the comments! Thanks for reading and until next time!