
Standardizing overrides

5:19 pm in #scom, #sysctr by Jan Van Meirvenne

Referring back to an old post about SCOM and MOF: SCOM is an (awesome) tool for monitoring your environment, but without a process framework to tie it to, it is just a glorified logging tool.

During my long-term assignment at a large customer, I was tasked with delegating the ability to define 'monitoring modifications' to the second-line support unit.

At this point, all overrides were controlled by us, and people just had to send us an email to get their modification. There was no guideline and no review: just click-click and done. This had the following risks:

– Wild growth of (mostly undocumented) overrides
– Second line did not have insight into a server’s actual monitoring configuration when handling tickets
– No lifecycle management: are overrides made a year ago still needed?

Since SCOM is set to become the main Wintel monitoring tool, this had to change. I had several meetings with second line to define a process built around transparency, clarity and efficiency. The pilot target of this process-in-development would be disk monitoring.

First, we had to clear a way through the bushes: get rid of the dozens of wild overrides and cast them into a shape that was clearly documented. We decided to define several ‘templates’ for disk monitoring:

– DEFAULT (no overrides)
– LOW (Warning at 5% or 500 MB free space, Critical at 2% or 200 MB free space)
– HIGH (Warning at 15% or 2000 MB free space, Critical at 10% or 1000 MB free space)
– HIGH-2 (still unsure about this name) (Warning at 25% or 10000 MB free space, Critical at 20% or 5000 MB free space)

These were the 'override sets' we defined. Depending on the type of disk (system disk, temp disk, scratch disk, fileserver disk,…) each disk would be assigned one of these profiles. To facilitate this, we made 3 groups, one for each profile (except the default one), and scoped the overrides to them.

This would clear up a lot of confusion: only 3 sets of overrides were possible. Second line also uses these profiles to discuss noisy disks with application owners and advises them on one of these 3 profiles, no exceptions are allowed.
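As an aside, the Operations Manager Shell makes it easy to audit which disks ended up in which profile group. A minimal sketch using the SCOM 2012 cmdlets; the group display names are hypothetical placeholders for your own naming convention:

# List the disks currently assigned to each override profile group.
# The group names below are hypothetical examples.
Import-Module OperationsManager
foreach ($name in 'Disk Profile LOW', 'Disk Profile HIGH', 'Disk Profile HIGH-2') {
    $group = Get-SCOMGroup -DisplayName $name
    # GetRelatedMonitoringObjects() returns the instances contained in the group
    $group.GetRelatedMonitoringObjects() |
        Select-Object @{ n = 'Profile'; e = { $name } }, DisplayName, Path
}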

Because SCOM isn't very granular in what a user can and cannot do, it wasn't possible to let second line modify the groups without giving them full administrator rights.

We decided to create an Excel sheet on SharePoint where second line could request overrides in batch by stating the server, disk(s) and profile to assign. We (the SCOM team) would then process the requests bi-weekly (gradually creating the overrides in the testing environment and then propagating them to production) and log the action in our action list, a SharePoint list used to log SCOM modifications. The action id this creates is then logged in the Excel sheet, and the status of the request is set to fulfilled. The version of the override management pack (which contains only the disk overrides and the profile groups) is increased, and its product knowledge article references the action id as well.

Once the requests are processed, a message is sent to second line, who then validate the overrides. The important thing is that every override is logged in the Excel sheet, and the references in the management pack and the action list make sure it is known when and by whom an override was applied. This allows us to keep a record of all overrides without the need to create a report or view in SCOM, which is often not very exact about which system / disk receives an override.

Finally, I proposed one of the key items of the SMC (Service Monitoring and Control) aspect of MOF: the operational health review. A monthly meeting will be held between second line and the SCOM team to talk about new and ongoing issues with alerting, including overrides. This way, overrides can be reviewed for validity, effectiveness and usefulness. While the SCOM team might know their platform best, it is second line that works with its output every day and experiences first-hand what is working properly and what is not.

This process is still 'in the works', but the benefits are already very noticeable. The next step is to finalize the process in an official document and communicate it to all stakeholders. Afterwards, the process will be extended to other monitoring areas, like CPU and memory.

While this may sound a bit exhaustive to implement, here are some basic best practices I use regarding overrides; they are a good start to keep things clean:

– Do not place overrides on specific objects; instead, make a group which receives the override and put the objects into it. This prevents forgotten overrides and gives you a central point of control for that specific override.

– Document the override: you can use the comment section while creating the override, the product knowledge of the override MP, or just store the information in an Excel sheet or text file (a small export sketch follows this list). Review these lists from time to time and see what's still applicable and what's not.

– Use tools like Effective Configuration Viewer, MPTuner or if you want an enterprise solution, MP Studio to more easily navigate and document the overrides present in your environment.
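For the documentation tip above, here is a minimal export sketch using the SCOM 2012 cmdlets; the management pack display name is a hypothetical placeholder for your own override MP:

# Dump all overrides in a dedicated override management pack to CSV.
# 'Contoso - Disk Overrides' is a hypothetical MP name; substitute your own.
Import-Module OperationsManager
$mp = Get-SCOMManagementPack -DisplayName 'Contoso - Disk Overrides'
Get-SCOMOverride -ManagementPack $mp |
    Select-Object Name, Parameter, Value, Enforced, LastModified |
    Export-Csv -Path .\Overrides.csv -NoTypeInformation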

I hope to have provided some ideas and insight into how to simplify your SCOM management. Enjoy! Jan out.

Objects in SCOM do not disappear when removing an enable-override from a default disabled discovery

3:57 pm in #scom, #sysctr by Jan Van Meirvenne

Symptom:

If you have a discovery which is disabled by default and you remove the override that enabled it, the objects that were discovered are not removed. Even if you run 'Remove-DisabledMonitoringObject' or 'Remove-SCOMDisabledClassInstance' (depending on your version of SCOM), the objects remain.

Cause:

This is a design 'flaw': only objects discovered by discoveries that are enabled by default are scrubbed by the grooming process. If a discovery is disabled by default, any objects it created won't get purged.

Resolution:

The easiest solution is to create a temporary override for the discovery in which you enforce the disabled setting, by setting the enabled-property to false and ticking the 'enforced' checkbox. Once the override is active, you can rerun the PowerShell commands mentioned above to successfully groom the stale objects from the database. After this is done, you can remove the override again.
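In the Operations Manager Shell, the grooming step then looks like this (the enforced override itself is created in the console, as described above):

# With the enforced 'Enabled = false' override in place, groom out the
# stale instances. SCOM 2012 syntax:
Import-Module OperationsManager
Remove-SCOMDisabledClassInstance
# SCOM 2007 R2 equivalent:
# Remove-DisabledMonitoringObject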

Aligning Business & IT using distributed applications

12:59 pm in #scom, #sysctr by Jan Van Meirvenne

There is no bigger struggle within the IT market than making IT transparent and understandable for the business it supports. We often fail to realize that IT has a supporting function within an organization, and that it only exists to make the core business processes easier to execute. This principle isn't always reflected in the way management software is used.

Specifically for monitoring, there are often misunderstandings or disconnections between the data the business owner is interested in and the data the IT administrator wants to see.

Examples:

Business Owner

– Which departments and business processes are impacted when this application goes down?
– How much downtime has my HR department experienced due to IT problems?
– How do I know which business processes own this application?

IT Administrator

– What components fall under this application?
– What happens if this database goes down?
– Did I reach my SLA regarding my database farm?

I believe it is rather easy to satisfy both of their needs using only the distributed application concept of Operations Manager:

Creating the business organigram


A distributed application is normally a skeleton for an application consisting of a set of components, where the underlying components roll up their health to the overall state of the application. This has many similarities to a business: a set of departments and processes which report to an executive board. So by simply porting the company's organigram to a DA, you can very quickly take the first step towards a business-focused approach to application monitoring.

Creating the application models


This will be a bigger hassle. If you really want a proper Business to IT overview, all applications used in the business have to be modeled in SCOM. This can be a small or big project depending on the company size, but Rome wasn’t built in a day either.

Bringing the worlds together

Eventually, you can start making the application DAs components of your business DAs, effectively linking the two worlds together.

By correctly scoping this system and using the correct medium (SharePoint, dashboard, web console,…) to bring it to the respective consumer, you satisfy both the business's and the IT department's needs. Reporting and health management are now possible on 2 levels without any chance of inconsistency. As long as the correct processes are implemented to keep this DA hierarchy up-to-date, this is a long-term solution for aligning IT and Business regarding business continuity. Note that not all applications have to be linked to business objects; some only have meaning to the IT department (e.g. an overview of all servers).
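To illustrate the two-level reporting: once the hierarchy exists, each audience can read the rolled-up state of 'their' DA straight from the Operations Manager Shell. A minimal sketch, with hypothetical display names:

# Read the rolled-up health of a business-level and an application-level DA.
# The display names are hypothetical examples.
Import-Module OperationsManager
Get-SCOMClassInstance -DisplayName 'HR Department', 'Payroll Application' |
    Select-Object DisplayName, HealthState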

Authenticating agents on DMZ-servers reporting to a domain-joined gateway

5:04 pm in #scom, #sysctr by Jan Van Meirvenne

I heard of this issue when one of my colleagues was attaching a customer to our centralized SCOM infrastructure:

SCENARIO

A SCOM gateway is deployed and domain-joined on the customer site. It is provisioned with a root and client certificate to allow communication with the central platform.

However, the customer has DMZ-systems which are not joined to a domain. When they are provisioned with SCOM agents which are pointed to the gateway, they fail to communicate:

Event on the agent:

Log Name:      Operations Manager
Source:        OpsMgr Connector
Date:       
Event ID:      20070
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      XXXXXX.domain.com
Description:
The OpsMgr Connector connected to XXXXXXXXX.com, but the connection was closed immediately after authentication occurred.  The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration.  Check the event log on the server for the presence of 20000 events, indicating that agents which are not approved are attempting to connect.

Events on the gateway:

A device at IP XX.XX.XX.XX attempted to connect but could not be authenticated, and was rejected.

CAUSE

The probable cause is that the gateway can use neither Kerberos nor certificates to authenticate the DMZ-agents, and thus refuses to allow communication with the management group.

SOLUTION

The solution was to import the same root certificate that was used for the gateway into the trusted root certificate store of the DMZ-servers. This allowed a secure link to be established with the gateway server, and communication with the management group to commence.
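For reference, a minimal sketch of that import on a DMZ server. The certificate path is a hypothetical placeholder, and Import-Certificate requires Server 2012 or later (older systems can use certutil instead):

# Import the gateway's root CA certificate into the local machine's
# Trusted Root store. The file path is a hypothetical example.
Import-Certificate -FilePath 'C:\Temp\RootCA.cer' -CertStoreLocation Cert:\LocalMachine\Root
# Equivalent on pre-2012 systems:
# certutil -addstore Root C:\Temp\RootCA.cer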

The SCOM group rename bug demystified

3:20 pm in #scom, #sysctr by Jan Van Meirvenne

This was a long-standing issue I had with SCOM, and I finally got to the bottom of it.

Imagine the following scenario: you spend hours creating a management pack. You made a nice group with a very complex query which luckily performs perfectly. And then you see a typo…

You rename the group, but then notice the following:

– In the group overview of the SCOM console, the group name refuses to change
– In the property window of the group, the group is named correctly

This can sometimes be corrected by editing the management pack XML.

But usually there is nothing wrong with the management pack itself; the problem lies in the way the group name is stored in the database.

There are 2 locations in the Operations Manager database where the display name is stored:

– The DisplayName-field in the BaseManagedEntity table
– The LTValue-field in the LocalizedText table

The LocalizedText table is meant to contain multiple language strings, depending on the regional settings of the machine that is running the Operations Manager console. The LTValue-field holds the localized name of every object in SCOM, and it is here that the correct value is stored. However, this field is only used for the property window of the group, while the list-name (in the overview pane) is linked to the DisplayName-field of the BaseManagedEntity table. Due to a bug in the DBCreateWizard tool (used to generate the SCOM databases), these 2 values are not kept in sync during a rename. The root cause is a missing LanguageCode in the __MOMManagementGroupInfo__ table, which prevents the display-information from being updated correctly.

The fix for this issue is to add the missing language-code, and if you really don’t want to recreate the group with the correct name, manually sync the 2 display-fields in the database. This last action must not be done again after you added the missing language-code, because the rename-issue will not be present anymore for NEW groups that will be created in SCOM.

To fix the missing language code, run the following SQL query against the Operations Manager database (after backing it up):

update __MOMManagementGroupInfo__ set LanguageCode = 'ENU'

To fix any existing 'rename-bugs', run the following query on the same (backed-up) database:

update e
set e.DisplayName = lt.LTValue
from BaseManagedEntity e
inner join ManagedType et on e.BaseManagedTypeId = et.ManagedTypeId
inner join LocalizedText lt on e.FullName = lt.ElementName
where et.BaseManagedTypeId = '4CE499F1-0298-83FE-7740-7A0FBC8E2449'
and lt.LanguageCode = 'ENU'
and lt.DisplayStringId is not NULL
and lt.LTValue != e.DisplayName;

The GUID is the GUID of the InstanceGroup class, which is the base-class for all console-created groups in SCOM.

The best way to prevent this issue is to run the first query as soon as your SCOM environment is deployed. I am happy to finally have the full story on this issue, which has caused a lot of frustration.

Write Actions and the workflow simulator: don’t do it

4:44 pm in #scom, #sysctr by Jan Van Meirvenne

I am currently working on a module which combines 2 submodules, both write actions (VBS scripts). The reason for this is that one part of the action needs special credentials, while another part must run under LocalSystem.

Special RunAs-profile WA --+
                           +--> WA (combined)
Local System WA -----------+

When I attached this to a rule and started a simulation of the workflow, things went haywire.

Where are my modules? Why doesn’t it show anything but the scheduler-actions?

Well, apparently, this is a documented limitation of the simulator: it does not support workflows that contain write action modules.

Stupid me! But at least I learned something from this!

The workaround for me is to perform a trace directly on the workflow while it is running inside the management group, after importing the management pack. This is slower, but at least it works!

SCOM Troubleshooting: constant high CPU usage of MonitoringHost.exe and event 3006

4:36 pm in #scom, #sysctr by Jan Van Meirvenne

Symptoms

On a couple of Exchange 2010 servers monitored by Operations Manager, a constantly high CPU usage is reported for the MonitoringHost.exe process. Restarting the agent, including clearing the cache, does not fix the issue. Meanwhile, the application log is being flooded with events with id 3006 and source EvntAgnt.
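To confirm the flood before touching anything, you can count the recent 3006 events from PowerShell. A quick sketch; the one-hour window is an arbitrary choice:

# Count EvntAgnt 3006 events written to the Application log in the last hour.
Get-WinEvent -FilterHashtable @{
    LogName      = 'Application'
    ProviderName = 'EvntAgnt'
    Id           = 3006
    StartTime    = (Get-Date).AddHours(-1)
} | Measure-Object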

Solution

The solution is to restart the SNMP service on the affected machine. This stops the eventlog-flooding and drops MonitoringHost's CPU usage back to normal levels.
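From an elevated PowerShell prompt that is simply (assuming the default service name 'SNMP'):

# Restart the SNMP service; -Force also restarts any dependent services.
Restart-Service -Name SNMP -Force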

The exact same issue and solution were blogged a long time ago by Cameron Fuller for a Server 2003 system, but the issue can apparently also occur on a Server 2008 R2 SP1 machine.

System Center 2012 SP1 is RTM! How do I upgrade my SCOM environment?

1:39 pm in #scom, #sysctr by Jan Van Meirvenne

Hey all!

I am pleased to announce that System Center 2012 SP1 has reached RTM and is now available for TechNet and MSDN subscribers!

I downloaded the binaries for SCOM, and here is a quick list of steps on how to upgrade your RTM version of SCOM 2012 to SP1.

First, give the release notes a good read, so you know which issues or extra post-installation steps are present.

  • First, take a backup of your SCOM databases. If the upgrade goes wrong your database is toast in most cases, and a restore is inevitable.
  • Get the SP1 binaries on your first management server and launch setup.exe; a splash screen appears.
  • The most obvious option is the one we want: 'Install'! We are greeted by an overview of the upgrade wizard steps (because it detects an RTM installation of SCOM 2012) and an extra warning that we should back up our databases.
  • The next screen is the usual licensing text. After reading it through (all of it, of course), let's continue!
  • Plain and simple: where do you want to install the new binaries? Take a pick and go!
  • Ah, the both dreaded and loved prerequisite scan! And what's this, we are missing something! This means a prerequisite was added for the service pack: we need the HTTP Activation feature, a sub-feature of the .NET Framework 3.5.1 role. (This might be different or not needed on Server 2012.)
  • No problem, we just install the prerequisite (a command-line alternative is sketched after this walkthrough) and recheck. Much better!
  • Next step: providing the SDK service account. I filled in the same one that was used for the RTM installation.
  • A summary? Are we done already? Wow, no effort at all (almost). Get ready, set, go & upgrade! It does take a while, though…
  • But what about my second management server? The console flags it as not upgraded, and non-upgraded management servers cannot communicate with the upgraded management group. That looks nasty! Let's upgrade the second management server asap! You know how the rest goes :)
  • After the 2nd upgrade: much better!
  • Now the only thing left to do is upgrade the agents.

And that's it, you're done!
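As promised, here is a command-line alternative for the missing prerequisite. A minimal sketch for Server 2008 R2 using the ServerManager module (on Server 2012 the feature layout may differ, as noted above):

# Install the HTTP Activation sub-feature of the .NET Framework 3.5.1 role
# (run in an elevated PowerShell session on Server 2008 R2).
Import-Module ServerManager
Add-WindowsFeature NET-HTTP-Activation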

TAKEAWAYS

  • Do not forget to go through the release notes
  • New prerequisite (on 2008 R2): HTTP Activation (under .NET Framework 3.5.1)
  • Non-upgraded management servers cannot communicate with an upgraded management group
  • Overall, the upgrade is straightforward and foolproof

This was my last blogpost for this year. Happy holidays and see you later!

Troubleshooting: Cannot attach new agents to a management group

1:29 pm in #scom, #sysctr by Jan Van Meirvenne

Some time ago I was asked for help by a fellow engineer who was troubleshooting a rather spicy SCOM-issue:

He had removed some agents from the management group because their servers were reimaged with another OS. Afterwards he wanted to re-approve the agents, but instead of accepting them, the RMS refused to let them connect. Even waiting an entire night before approving the agents did not resolve the issue. After some time, the agents could neither be approved nor declined, because the pending actions didn't show up anymore. The RMS produced a lot of sickening events:

  • Event 20000: A device which is not part of this management group has attempted to access this Health Service.
  • Event 21042: Operations Manager has discarded X items in management group xxxxx, which came from xxxx. These items have been discarded because no valid route exists at this time. This can happen when new devices are added to the topology but the complete topology has not been distributed yet. The discarded items will be regenerated.
  • Many more distressing events

I performed the standard diagnostics and recovery attempts (flushing the agent cache, denying the agents and then approving them,…) but to no avail. This is where I thought to myself: 'Let's back up the databases and go dirty'. In the end, these were the steps that helped:

(BTW: This is NOT a supported procedure, so make a backup before touching anything!)

  • Stop all RMS services (Data Access, Configuration and Management)
  • Delete the health service state folder
  • Start all 3 services back up
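In PowerShell terms the sequence looks roughly like this. This is a sketch assuming the default SCOM 2007 R2 service names and install path, so verify both before running it:

# Stop the RMS services, clear the local cache, and start them back up.
# Service names (HealthService, cshost, OMSDK) and the install path are
# the defaults; adjust them to your environment.
Stop-Service HealthService, cshost, OMSDK -Force
Remove-Item 'C:\Program Files\System Center Operations Manager 2007\Health Service State' -Recurse -Force
Start-Service OMSDK, cshost, HealthService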

These first steps did a lot to 'unclog' the RMS: the bad-to-good event ratio became a lot more balanced. This led me to believe that the RMS' cache had become corrupt and needed some cleaning to get everything running again.

However, this didn't automagically fix the issue with adding agents. So I started doing some database actions:

  • Delete any traces of the problematic agents

USE [OperationsManager]
UPDATE dbo.[BaseManagedEntity]
SET
[IsManaged] = 0,
[IsDeleted] = 1,
[LastModified] = getutcdate()
WHERE FullName like '%computername%'
  • Groom out the marked-for-removal items

DECLARE @GroomingThresholdUTC datetime
SET @GroomingThresholdUTC = DATEADD(d,-2,GETUTCDATE())

UPDATE BaseManagedEntity SET LastModified = @GroomingThresholdUTC WHERE [IsDeleted] = 1
UPDATE Relationship SET LastModified = @GroomingThresholdUTC WHERE [IsDeleted] = 1
UPDATE TypedManagedEntity SET LastModified = @GroomingThresholdUTC WHERE [IsDeleted] = 1

EXEC p_DataPurging

  • Remove hidden pending actions

exec p_AgentPendingActionDeleteByAgentName 'agentname.domain.com'

Okay, now everything related to our problematic agents was flushed from the database. And as I expected: another agent restart and some approvals later, the communication with the RMS was flawless and error-free.

Although I do not have an exact idea of how the database and/or RMS got so confused, I suspect that at some point corruption introduced a series of sync issues between the RMS and the database.