
OMS overview: chapter 2 – disaster recovery

7:23 am in #msoms, #sysctr by Jan Van Meirvenne

OMS Blog Series Index

Since there is much to say about each of the Operations Management Suite (OMS) services, I will break this post up into a blog series:

Chapter 1: Introduction to OMS
Chapter 2: Disaster Recovery with OMS (this post)
Chapter 3: Backup with OMS
Chapter 4: Automation with OMS
Chapter 5: Monitoring and Analysis with OMS
Chapter 6: Conclusion and additional resources overview

This series is actually a recap of a live presentation I gave at the last SCUG event, which shared the stage with a session on SCOM – SCCM better-together scenarios presented by SCUG colleagues Tim and Dieter. You can find my slides here; demo content that is shareable (the parts I don't need to pay for) will be made available in the applicable chapters.

Disaster Recovery is one of the bigger money sinks in IT. First you need a big secondary datacenter able to keep your business running in case your primary one bites the dust. Not only do you need to throw money at something you might never need in your entire career, you also have to invest in setting up DR plans for every service you plan to protect. This usually requires both additional design and implementation work, and separate tooling that allows you to perform the DR scenario you envisioned. Especially in the world of hybrid cloud, protecting services across platform boundaries might seem complex: you need to integrate multiple platforms into, preferably, a single DR solution.

Meet Azure Site Recovery

Azure Site Recovery or ASR provides 2 types of DR capabilities:

  • Allowing the replication and failover orchestration of services between 2 physical sites of your own
  • Allowing the replication and failover orchestration of services between a main site that you own and the Azure IaaS platform 

Bear in mind that while the platform is advertised as a DR solution, it is also possible to use it as a migration tool to move workloads to Azure or other on-premises sites.

The big advantage here is that the solution is platform-agnostic, providing scenarios to protect virtually any type of IT infrastructure platform you use. The DR site can be a secondary VMware or Hyper-V (with SCVMM) cloud, or Azure. Vendor lock-in becomes a non-issue this way!

Supported Scenarios

Here is a full overview of the supported flows:

Infrastructure

  • To Azure
  • To an own DR-site

Application

  • SQL Always-On
  • Other application types must be orchestrated by using the recovery plan feature or by doing a side-by-side migration / failover

Architecture

Basically, there are 2 major 'streams' within ASR to facilitate DR operations, but in any case you'll always need a Recovery Vault. The recovery vault is an encrypted container that sits on top of a (selectable) storage account. If the target DR site is Azure, the vault will store the replicated data and use it to deploy Azure IaaS VMs in case of a failover. If the target DR site is another on-premises site, the vault will only store the metadata needed for ASR to protect the main site. The vault is accessed by the on-premises systems using downloadable vault encryption keys. These keys are used during setup and are accompanied by a passphrase the user must enter and securely store on-premises. Without this passphrase the vault becomes inaccessible should systems need to be (re-)attached to it, so it is very important to double-, no, triple-backup this key!

All communication between the different sites is also encrypted using SSL.

In the case that Azure is the target DR site, you must specify an Azure size for each VM you want to protect along with an Azure virtual network to connect it to. This allows you to control the cost impact.

The Microsoft Azure Recovery Services Agent (MARS) and the Microsoft Azure Site Recovery Provider (MASR)

This setup is applicable to any scenario where Hyper-V (with or without SCVMM) is the source site in the DR plan.

The MARS agent needs to be installed on every Hyper-V server which will take part in the DR-scenario (both source and target). This agent will facilitate the replication of the actual VM data from the source Hyper-V servers to the target site (Azure or other Hyper-V server).

The MASR provider needs to be installed on the SCVMM server(s), or in the case of a Hyper-V site to Azure scenario it needs to be co-located on the source Hyper-V server together with the MARS agent. The MASR provider is responsible for orchestrating the replication and failover execution and primarily syncs metadata to ASR (the actual replication data is handled by the MARS agent).

image

image

The Process, Master Target and Configuration Server

This setup is used for any scenario with a VMware (with some additional components described later on), cloud (Azure or other) or physical site as a source. These are the components that facilitate the replication and DR orchestration. Note: they are all IaaS-level components.

Process Server

This component is placed in the source site and is responsible for pushing the mobility service to the protected servers, and to collect replication-data from the same servers. The process server will store, compress, encrypt and forward the data to the Master Target Server running in Azure.

Mobility Service

This is a helper-agent installed on all systems (Windows or Linux) to be protected. It leverages VSS (on Windows) to capture application-consistent snapshots and upload them to the process server. The initial sync is snapshot-based, but the subsequent replication is done by storing writes in-memory and mirroring them to the process server.

Master Target Server

The Master Target Server is an Azure-based system that receives replication data from the source site's process server and stores it in Azure blob storage. As a failover will incur heavy resource demands on this system (rollout of the replicas into Azure IaaS VMs), it is important to choose the correct storage sizing (standard or premium) to ensure a service can fail over within the established RTO.

Configuration Server

This is another Azure-based component that integrates with the other components (Master Target, Mobility Service, Process Server) to both setup and coordinate failover operations.

Failback to VMware (or even failover to a DR VMware site instead) is possible with this topology with some additional components. It is nice to see that Microsoft is really upping the ante in providing a truly heterogeneous DR solution in the cloud!

image

Orchestrating workload failover/migration using the Recovery Plan feature

Of course, while you can protect your entire on-prem environment in one go, this is not an application-aware setup. If you want to make sure your services fail over with respect to their topology (backend -> middleware -> application layer -> front-end), you need to use the recovery plan feature of ASR.

Recovery plans allow you to define an ordered chain of VMs along with actions to be taken at source-site shutdown (pre and post) and target-site startup (pre and post). Such an action can be the execution of an automation runbook hosted by Azure Automation, or a manual action to be performed by an operator (the failover will actually halt until the action is marked as completed).


Source: https://azure.microsoft.com/en-us/documentation/articles/site-recovery-runbook-automation/
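
As an illustration of such an automated action, here is a minimal sketch of a PowerShell Workflow runbook that a recovery plan step could invoke. The $RecoveryPlanContext parameter and the property names used below follow the article linked above; treat them as assumptions and verify them against the current documentation.

workflow Notify-OnFailover
{
    param
    (
        [Object]$RecoveryPlanContext
    )

    # The context object describes the failover that triggered this runbook
    $planName = $RecoveryPlanContext.RecoveryPlanName
    $failoverType = $RecoveryPlanContext.FailoverType
    $direction = $RecoveryPlanContext.FailoverDirection

    Write-Output "Recovery plan '$planName' triggered a $failoverType failover ($direction)."

    # Post-failover actions (updating DNS records, starting dependent services, ...) would go here
}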

Failing Over

When in the end you want to perform an actual failover operation you can perform 3 types of actions:

– Test Failover: this keeps the source service/system online while booting the replica so you can validate it. Keep in mind that you should take possible resource conflicts (DNS, Network, connected systems) into account.

– Planned Failover: this makes sure that the replica is fully in sync with the source service/system before shutting it down, and then boots the replica. This ensures no data loss occurs. This action can be used when migrating workloads or protecting against a foreseen disaster (storm, flood, …); the protected service will be offline during the failover.

– Unplanned Failover: this type only brings the replica online from the last sync. Data loss will be present as a gap between the failure moment and the last sync. This is only for instances where the disaster has already occurred and you need to bring the service online at the DR site ASAP.

A failover can be executed on the per-VM level or via a recovery plan.

Caveats and gotchas

Although the ASR service is production-ready and covers a lot of ground in terms of features, there are some limitations to take into account. Here are some of the bigger ones:

When using Azure as a DR site

– Azure IaaS uses the VHD format for storage disks, limiting the protectable size of a VHD or VHDX (conversion is done automatically) to 1,024 GB. Larger sizes are not supported.
– The amount of per-VM resources (CPU cores, RAM, disks) is limited by what the largest Azure IaaS size provides (e.g. if a VM has 64 attached disks on-prem, you might not be able to protect it if Azure's maximum is 32).

Overall Restrictions

– Attached storage setups like Fibre Channel, pass-through disks or iSCSI are not supported
– Gen2 Linux VMs are not yet supported

This looks nice! But how much does it cost?

The nice thing about using Azure as a DR site is that you only pay a basic fee for the service, plus storage and WAN traffic, and only pay the full price for IaaS compute resources when an actual failover occurs. This embodies the 'pay for what you use' concept that is one of the big benefits of public cloud. Even better: you only start paying the basic fee after 31 days. So if you use ASR as a migration tool (moving workloads to the cloud or another site) you get a pretty cost-effective solution! Bear in mind that used storage and WAN traffic are always billed.

I won’t bother to list the pricing here as it is as volatile in nature as the service itself. You can use the Azure pricing calculator to figure out the costs.

image

If you have one or more System Center licenses, check out the OMS suite pricing calculator instead to assess whether you can benefit from the bundle pricing.

Ok, I’ll bite, but how do I get started?

The service for now is only accessible from the ‘old’ Azure Portal on https://manage.windowsazure.com

Log in with an account that is associated with an Azure subscription, and click the ‘new’-button in the bottom-left corner.

image

Choose ‘Data Services’ -> ‘Recovery Services’ -> ‘Site Recovery Vault’

image

Click ‘Quick Create’ and then enter a unique name and choose the applicable region where you want to host the service. Then, click ‘Create Vault’.

image

This will create the vault, from which you can start the DR setup.

image

When the creation is done, go to ‘Recovery Services’ in the left-side Azure Service bar and then click on the vault you created.

image

The first thing you must do is pick the appropriate scenario you want to execute.

image

This will actually provide you with a tutorial to set up the chosen scenario!

image

To revisit or change this tutorial during operational mode, just click the 'cloud icon' in the ASR interface.

image

I won't cover the further steps, as the tutorials provided by Azure are exhaustive enough. I might add specific tutorials later on in a dedicated post in case I encounter some advanced subjects.

 

Final Thoughts on ASR

While I am surely not a data protection guy, setting this puppy up was a breeze for me! This service, which is now part of OMS, embodies the core advantages of cloud: immediate value, low complexity and cross-platform support. I have already seen several implementations, confirming that this solution is here to stay and will likely be a go-to option for companies looking for a cost-effective DR platform.

Thanks for the long read! And see you next time when we will touch ASR’s sister service Azure Backup! Jan out.

Troubleshooting the Service Manager 2012 ETL processes

3:53 pm in #scsm, #sysctr by Jan Van Meirvenne

This post will aid in troubleshooting the following issues concerning the Service Manager Data Warehouse:
– Slow execution of ETL jobs
– ETL jobs failing to complete
– ETL jobs failing to start

1. Troubleshooting

– Open a remote desktop session to the Service Manager Management Server

– Open the service manager management shell

– Request the data-warehouse jobs

Get-SCDWJob -ComputerName <Your DW Server> | ft Name, Status, CategoryName, IsEnabled

– This will result in a list of data warehouse jobs and their state

image

– If there are jobs with a 'Stopped' status, then resume them:

Start-SCDWJob -JobName <The name of the job to start (e.g. 'DWMaintenance')> -ComputerName <Your DW Server>

– If there are jobs that are not enabled (IsEnabled column is ‘false’) AND the MPSyncJob or DWMaintenance jobs are not running (they disable some jobs at runtime) then re-enable them:

Enable-SCDWJob -JobName <The name of the job to enable (e.g. 'DWMaintenance')> -ComputerName <Your DW Server>

– Run the following script to reset the jobs (it will rerun all jobs in the correct order). This script exists thanks to Travis Wright.

 

# Note: the helper functions below shadow the built-in Start-Job cmdlet for the duration of the script.

$DWComputer = "<Your DW Server>"
$SMExtractJobName = "<Operational Management Group Name>"
$DWExtractJobName = "<DW Management Group Name>"

Import-Module 'C:\Program Files\Microsoft System Center 2012\Service Manager\Microsoft.EnterpriseManagement.Warehouse.Cmdlets.psd1'

function Start-Job ($JobName, $Computer)
{
    $JobRunning = 1
    while ($JobRunning -eq 1)
    {
        $JobRunning = Start-Job-Internal $JobName $Computer
    }
}

function Start-Job-Internal ($JobName, $Computer)
{
    $JobStatus = Get-JobStatus $JobName

    if ($JobStatus -eq "Not Started")
    {
        Write-Host "Starting the $JobName Job..."
        Enable-SCDWJob -JobName $JobName -ComputerName $Computer
        Start-SCDWJob -JobName $JobName -ComputerName $Computer
        Start-Sleep -s 5
    }
    elseif ($JobStatus -eq "Running")
    {
        Write-Host "$JobName Job is already running. Waiting 30 seconds and will call again."
        Start-Sleep -s 30
        return 1
    }
    else
    {
        Write-Host "Exiting since the job is in an unexpected status"
        exit
    }

    $JobStatus = "Running"
    while ($JobStatus -eq "Running")
    {
        Write-Host "Waiting 30 seconds"
        Start-Sleep -s 30
        $JobStatus = Get-JobStatus $JobName
        Write-Host "$JobName Job Status: $JobStatus"

        if ($JobStatus -ne "Running" -and $JobStatus -ne "Not Started")
        {
            Write-Host "Exiting since the job is in an unexpected status"
            exit
        }
    }

    return 0
}

function Get-JobStatus ($JobName)
{
    $Job = Get-SCDWJob -JobName $JobName -ComputerName $Computer
    $JobStatus = $Job.Status
    return $JobStatus
}

# DWMaintenance
Start-Job "DWMaintenance" $DWComputer

# MPSyncJob
Start-Job "MPSyncJob" $DWComputer

# ETL
Start-Job $SMExtractJobName $DWComputer
Start-Job $DWExtractJobName $DWComputer
Start-Job "Transform.Common" $DWComputer
Start-Job "Load.Common" $DWComputer

# Cube processing
Start-Job "Process.SystemCenterConfigItemCube" $DWComputer
Start-Job "Process.SystemCenterWorkItemsCube" $DWComputer
Start-Job "Process.SystemCenterChangeAndActivityManagementCube" $DWComputer
Start-Job "Process.SystemCenterServiceCatalogCube" $DWComputer
Start-Job "Process.SystemCenterPowerManagementCube" $DWComputer
Start-Job "Process.SystemCenterSoftwareUpdateCube" $DWComputer

– If a particular job keeps stalling / failing during or after the script execution, check which job-module is having problems:

Get-SCDWJobModule -JobName <The name of the job experiencing issues> -ComputerName <Your DW Server>

– Check how long the job has been failing / stalling:

Get-SCDWJob -JobName <The name of the job experiencing issues> -NumberOfBatches 10 -ComputerName <Your DW Server>

– Check the 'Operations Manager' event log on the data warehouse server. Look for events with source 'Data Warehouse'. Error or warning events might pinpoint the issue with the job (see the snippet after this list).

– Check the CPU and memory usage of the data warehouse server, and verify whether either is peaking frequently.
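
For the event log check mentioned above, a quick filter like the following (run on the data warehouse server) can surface the relevant entries. The source name is taken from the step above; adjust it if your events use a different source.

# List the most recent Data Warehouse warnings and errors from the Operations Manager log
Get-EventLog -LogName 'Operations Manager' -Source 'Data Warehouse' -EntryType Error, Warning -Newest 25 |
    Format-Table TimeGenerated, EntryType, EventID, Message -AutoSize -Wrap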

 

2. Common possible causes

 

2.1. Resource Pressure

The data warehouse server takes up a lot of resources to process data. Providing sufficient CPU and memory greatly improves job reliability and reduces job duration. Exact requirements depend on each individual setup, but these are some guidelines:

CPU: 4 cores at 2.66 GHz
Memory: 8-16 GB for the server component, 8-32 GB for the databases
Hard drive: 10 GB for the server component, 400 GB for the databases

2.2. Service Failure

The ETL process of the Data Warehouse depends on multiple services to function correctly:

– Microsoft Monitoring Agent

– System Center Data Access

– System Center Management Configuration

– SQL Server SCSMDW

– SQL Server Analysis Services

– SQL Server Agent

– SQL Server

Verify that these services are running correctly (the 'Application' and / or 'Operations Manager' event logs can hold clues as to why a service cannot run correctly).
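
A quick status check of these services can be scripted; this is a minimal sketch using the display names listed above (the wildcard patterns are assumptions, adjust them to your SQL instance names):

# Check the status of the services the ETL process depends on
$displayNames = 'Microsoft Monitoring Agent',
                'System Center Data Access*',
                'System Center Management Configuration*',
                'SQL Server (*)',
                'SQL Server Analysis Services*',
                'SQL Server Agent*'

Get-Service -DisplayName $displayNames -ErrorAction SilentlyContinue |
    Sort-Object DisplayName |
    Format-Table DisplayName, Status -AutoSize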

2.3. Authentication Failure

Various runas-accounts are used to execute the ETL jobs:

– A workflow account that executes program logic on the data warehouse server. This account must have local administrator privileges on the data warehouse server.

– An operational database account that has access to the SCSM databases for data extraction. This account must be owner of all databases.

– A runas-account that has administrator privileges on both the operational and the data warehouse management groups.

Most of these accounts are entered during setup and should not be changed afterwards. If these accounts do not have the required permissions, some or all functionality related to the ETL process can be impacted.

Should error events indicate that a permission issue is the cause, then verify and repair the necessary permissions for these accounts.
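
As a quick sanity check for the first requirement, you can verify on the data warehouse server that the workflow account is a local administrator. A minimal sketch; the account name below is hypothetical, replace it with your own:

# Check whether the workflow account is in the local Administrators group on the DW server
net localgroup Administrators | Select-String 'svc-scsm-workflow'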

SCOM Quick Query: Logical Disk Space For My Environment

10:31 am in #scom, #sysctr by Jan Van Meirvenne

 

Sometimes I get questions in the style of “What is the current state of my environment in terms of…”. If there is no report in SCOM I can point to I usually create a quick query on the Data Warehouse and provide the data as an excel sheet to the requestor. Afterwards, should the question be repeated over and over, I create a report for it and provide self-service information.

To prevent forgetting these kinds of 'quick and dirty' queries, and also to share my work with you, I will occasionally throw in a post when I have a query worth mentioning.

Here we go for the first one!

If you are not interested in using the extended Logical Disk MP, you can use this query on your DW to quickly get a free space overview of all logical disks in your environment:

select max(time) as time, server, disk, size, free, used from
(
select perf.DateTime as time, e.Path as server, e.DisplayName as disk,
round(cast(ep.PropertyXml.value('(/Root/Property[@Guid="A90BE2DA-CEB3-7F1C-4C8A-6D09A6644650"]/text())[1]', 'nvarchar(max)') as int) / 1024, 0) as size,
round(perf.SampleValue / 1024, 0) as free,
round(cast(ep.PropertyXml.value('(/Root/Property[@Guid="A90BE2DA-CEB3-7F1C-4C8A-6D09A6644650"]/text())[1]', 'nvarchar(max)') as int) / 1024, 0) - round(perf.SampleValue / 1024, 0) as used
from Perf.vPerfRaw perf
inner join vManagedEntity e on perf.ManagedEntityRowId = e.ManagedEntityRowId
inner join vPerformanceRuleInstance pri on pri.PerformanceRuleInstanceRowId = perf.PerformanceRuleInstanceRowId
inner join vPerformanceRule pr on pr.RuleRowId = pri.RuleRowId
inner join vManagedEntityProperty ep on ep.ManagedEntityRowId = e.ManagedEntityRowId
where
pr.ObjectName = 'LogicalDisk'
and pr.CounterName = 'Free Megabytes'
and ep.ToDateTime is null
and perf.DateTime > dateadd(HOUR, -1, GETUTCDATE())
) data
group by data.server, data.disk, data.size, data.free, data.used
order by server, disk

 

Available fields:

Time: the timestamp of the presented data
Server: the server the disk belongs to
Disk: The name of the logical disk
Size: the size of the disk in GB
Free: the free space on the disk in GB
Used: the used space on the disk in GB
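
If you want to hand the result over as a spreadsheet, a small wrapper like this can run the query and export it to CSV (a sketch assuming the SQLPS module's Invoke-Sqlcmd is available and that the query above is saved as diskspace.sql; adjust server, database and path names):

# Run the disk space query against the SCOM Data Warehouse and export the result to CSV
Import-Module SQLPS -DisableNameChecking

Invoke-Sqlcmd -ServerInstance '<Your DW SQL instance>' -Database 'OperationsManagerDW' -InputFile 'C:\Temp\diskspace.sql' |
    Export-Csv -Path 'C:\Temp\diskspace.csv' -NoTypeInformation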

 

Please note that I am not a SQL guru, so if you find a query containing war crimes against best practices, don’t hesitate to let me know!

 

See you in another knowledge dump!

SCOM authoring: the aftermath

6:52 am in #scom, #sysctr by Jan Van Meirvenne

 

For the people who attended my SCOM authoring session, thanks once again for your attention. While it was quite an advanced topic I hope it shed some light on how SCOM functions and how it can be optimized regarding creating and maintaining monitoring definitions.

My slide deck can be found here: http://www.slideshare.net/JanVanMeirvenne/scom-authoring

My demo project can be found here: http://1drv.ms/1t0OPIG

Please note that Microsoft announced the retirement of the Visio authoring method for SCOM. Although it was a useful tool, especially when designing management packs alongside customers, I guess it was a bit too much of an odd bird with many limitations. The recommendation is to use MPAuthor or the Visual Studio add-in (links included in the wiki link below).

If you want to learn more on this topic and maybe try some things out for yourself, there are some excellent resources available:

– Microsoft Virtual Academy: http://channel9.msdn.com/Series/System-Center-2012-R2-Operations-Manager-Management-Packs

– Authoring section of the SCOM wiki: http://social.technet.microsoft.com/wiki/contents/articles/20796.the-system-center-2012-r2-operations-manager-survival-guide.aspx#Management_Packs_and_Management_Pack_Authoring

– MSDN Library (contains technical documentation of the modules used while authoring): http://msdn.microsoft.com/en-us/library/ee533840.aspx

If you have questions regarding these topics, don’t hesitate to drop me a comment, tweet @JanVanMeirvenne or mail to jan.vanmeirvenne@ferranti.be

See you in another blogpost!

SCOM DW not being updated with operational data?

7:38 am in #scom, #sysctr by Jan Van Meirvenne

With this blogpost I want to create a 'catch-all' knowledge article containing problems and fixes I learned in the field regarding SCOM DW synchronization issues. I will update this post regularly whenever I encounter a new phenomenon on this subject.

Possible Cause 1: The synchronization objects and settings are missing from the management group

Diagnosis

Run the following powershell-commands in the SCOM powershell interface of the affected management group:

get-SCOMClass -name:Microsoft.SystemCenter.DataWarehouseSynchronizationService|Get-ScomClassInstance

If no objects are returned it means that the workflows responsible for synchronizing data are not running.

Add-PSSnapin microsoft.enterprisemanagement.operationsmanager.client
Set-Location OperationsManagerMonitoring::
New-ManagementGroupConnection <SCOM management server>
Get-DefaultSetting ManagementGroup\DataWarehouse\DataWarehouseDatabaseName
Get-DefaultSetting ManagementGroup\DataWarehouse\DataWarehouseServerName

If the default settings are not set this indicates that the DW registration has been broken.

Causes

The breakage and disappearance of the DW synchronization objects and settings can happen when a SCOM 2007->2012 upgrade fails and you have to recover the RMS. The issue is hard to detect (especially if you do not use reporting much) as no errors are generated.

Solution

The settings and objects need to be regenerated manually using the script below. This will add all necessary objects to SCOM with the correct server and database references. The DW properties will also be added to the default-settings section.

You will have to edit the script and enter the Operations Manager database server name, data warehouse server name and console path. This is a PowerShell script: copy it into a text file, fill in the required information and rename the file to .ps1 to run it under PowerShell.

#Populate these fields with Operational Database and Data Warehouse information
#Note: change these values appropriately
$OperationalDbSqlServerInstance = "<OpsMgrDB server instance. If it is the default instance, only the server name is required>"
$OperationalDbDatabaseName = "OperationsManager"
$DataWarehouseSqlServerInstance = "<OpsMgrDW server instance. If it is the default instance, only the server name is required>"
$DataWarehouseDatabaseName = "OperationsManagerDW"
$ConsoleDirectory = "<OpsMgr console location, by default C:\Program Files\System Center 2012\Operations Manager\Console>"

$dataWarehouseClass = Get-SCOMClass -Name Microsoft.SystemCenter.DataWarehouse
$seviewerClass = Get-SCOMClass -Name Microsoft.SystemCenter.OpsMgrDB.AppMonitoring
$advisorClass = Get-SCOMClass -Name Microsoft.SystemCenter.DataWarehouse.AppMonitoring

$dwInstance = $dataWarehouseClass | Get-SCOMClassInstance
$seviewerInstance = $seviewerClass | Get-SCOMClassInstance
$advisorInstance = $advisorClass | Get-SCOMClassInstance

#Update the singleton property values
$dwInstance.Item($dataWarehouseClass.Item("MainDatabaseServerName")).Value = $DataWarehouseSqlServerInstance
$dwInstance.Item($dataWarehouseClass.Item("MainDatabaseName")).Value = $DataWarehouseDatabaseName
$seviewerInstance.Item($seviewerClass.Item("MainDatabaseServerName")).Value = $OperationalDbSqlServerInstance
$seviewerInstance.Item($seviewerClass.Item("MainDatabaseName")).Value = $OperationalDbDatabaseName
$advisorInstance.Item($advisorClass.Item("MainDatabaseServerName")).Value = $DataWarehouseSqlServerInstance
$advisorInstance.Item($advisorClass.Item("MainDatabaseName")).Value = $DataWarehouseDatabaseName

$dataWarehouseSynchronizationServiceClass = Get-SCOMClass -Name Microsoft.SystemCenter.DataWarehouseSynchronizationService
#$dataWarehouseSynchronizationServiceInstance = $dataWarehouseSynchronizationServiceClass | Get-SCOMClassInstance

$mg = New-Object Microsoft.EnterpriseManagement.ManagementGroup -ArgumentList localhost
$dataWarehouseSynchronizationServiceInstance = New-Object Microsoft.EnterpriseManagement.Common.CreatableEnterpriseManagementObject -ArgumentList $mg,$dataWarehouseSynchronizationServiceClass
$dataWarehouseSynchronizationServiceInstance.Item($dataWarehouseSynchronizationServiceClass.Item("Id")).Value = [guid]::NewGuid().ToString()

#Add the properties to discovery data
$discoveryData = New-Object Microsoft.EnterpriseManagement.ConnectorFramework.IncrementalDiscoveryData
$discoveryData.Add($dwInstance)
$discoveryData.Add($dataWarehouseSynchronizationServiceInstance)
$discoveryData.Add($seviewerInstance)
$discoveryData.Add($advisorInstance)

$momConnectorId = New-Object System.Guid("7431E155-3D9E-4724-895E-C03BA951A352")
$connector = $mg.ConnectorFramework.GetConnector($momConnectorId)
$discoveryData.Overwrite($connector)

#Update global settings. Needs to be done with the PowerShell V1 cmdlets
Add-PSSnapin microsoft.enterprisemanagement.operationsmanager.client
cd $ConsoleDirectory
.\Microsoft.EnterpriseManagement.OperationsManager.ClientShell.NonInteractiveStartup.ps1
Set-DefaultSetting ManagementGroup\DataWarehouse\DataWarehouseDatabaseName $DataWarehouseDatabaseName
Set-DefaultSetting ManagementGroup\DataWarehouse\DataWarehouseServerName $DataWarehouseSqlServerInstance

If the script ran successfully, re-running the commands from the diagnosis section should return valid object and settings information. The synchronization should start within a few moments.

Sources

http://support.microsoft.com/kb/2771934

Agents fail to connect to their management group with error message “The environment is incorrect”

4:20 pm in #scom, #sysctr by Jan Van Meirvenne

Symptoms

– Agents stop receiving new monitoring configuration
– When restarting the agent service, the following events are logged in the Operations Manager event log:

clip_image001

clip_image002

clip_image003

Cause

This indicates that the agent cannot find the SCOM connection information in AD. This is usually because it is not permitted to read it.

Resolution

All connection info is found in the Operations Manager container in the AD root. If you do not see it using “Active Directory Users and Computers” then click “View” and enable “Advanced Features”.

image

(Screenshot taken from http://elgwhoppo.com/2012/07/25/scom-2012-ad-integration-not-populating-in-ad/)

The container will contain a subcontainer for each management group using AD integration. The subcontainer holds a set of SCP objects containing the connection information for each management server, and 2 security groups per management server: PrimarySG… and SecondarySG…. These groups are populated with computer objects using the LDAP queries you provided in the AD integration wizard of the SCOM console. So, for example, if your LDAP query specifies only servers ending with a 1, only the objects matching that criterion will be put in the group.

These security groups should normally both have read access on their respective SCP object (e.g. for management server 'foobar', the groups 'PrimarySG_Foobar…' and 'SecondarySG_Foobar…' should have read access on the SCP object for this management server).

If the security is correct, the agent can only see the SCP objects to which it should connect in normal and failover situations.

If these permissions are not correct then you can safely adjust them manually (only provide read access). The agents will almost immediately pick up the SCP once they have permission. If this is not the case, restart the agent service.
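
If you want to inspect these objects without clicking through the AD tools, a quick listing like the one below can help. This is a sketch assuming the ActiveDirectory PowerShell module is available and that the container is named 'OperationsManager' at the domain root; verify the exact container and management group names in your environment.

# List the SCP objects and security groups created by SCOM AD integration
Import-Module ActiveDirectory

$mgName = '<Your Management Group>'   # replace with your management group name
$domainDn = (Get-ADRootDSE).defaultNamingContext

Get-ADObject -SearchBase "CN=$mgName,CN=OperationsManager,$domainDn" -Filter * |
    Select-Object Name, ObjectClass |
    Sort-Object ObjectClass, Name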

Fixing missing stored procedures for dashboards in the datawarehouse

1:45 pm in #scom, #sysctr by Jan Van Meirvenne

When you are creating a dashboard in SCOM, you might receive strange errors stating that a certain stored procedure was not found in the database. This is an issue I often encounter, and it indicates that something went wrong during the import of the visualization management packs. These management pack bundles contain the SQL scripts that create these procedures. By extracting the MPBs and manually executing each SQL script, you can fix this issue rather easily. However, this is not a supported fix and you should make a backup of your DW in case something explodes!

Extract the following MPB files using this script:

– Microsoft.SystemCenter.Visualization.Internal.mpb
(found on the installation media under ‘ManagementPacks’)

– Microsoft.SystemCenter.Visualization.Library.mpb
(use the one from the patch folder if you applied one: %programfiles%\System Center 2012\Operations Manager\Server\Management Packs for Update Rollups)

First, execute the single SQL script that came from the Internal MPB (scripts with 'drop' in the name can be ignored) against the Data Warehouse.

Secondly, execute each SQL script from the Library MPB on the DW (again, ignore the drop scripts). The order does not matter.
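
Once the scripts are extracted to a folder, running the Library ones against the Data Warehouse can also be scripted. A minimal sketch, assuming the SQLPS module's Invoke-Sqlcmd is available; the folder path and instance name are placeholders:

# Run the extracted visualization scripts against the Data Warehouse, skipping the 'drop' scripts
Import-Module SQLPS -DisableNameChecking

Get-ChildItem 'C:\Temp\VisualizationLibrary\*.sql' |
    Where-Object { $_.Name -notmatch 'drop' } |
    ForEach-Object {
        Write-Host "Executing $($_.Name)..."
        Invoke-Sqlcmd -ServerInstance '<DW SQL instance>' -Database 'OperationsManagerDW' -InputFile $_.FullName
    }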

The dashboards should now be fully operational. Enjoy SCOM at its finest!

[SCOM 2007–>2012] Minimizing the risk of failures by manually testing your database upgrade

6:14 pm in #scom, #sysctr by Jan Van Meirvenne

When you upgrade the last management server (the RMS) to SCOM 2012, the setup will also upgrade the database. As this step is very sensitive to inconsistencies, it is advised to test this update on a copy of your Operational database.

The following steps are done during the database upgrade:

1. Database infrastructure updates (build_mom_db_admin.sql)

Sets up recovery mode, snapshots, partitions

2. Database upgrade / creation (build_mom_db.sql)

– Populates a new database, or converts a 2007 database

3. Enter localization data (build_momv3_localization.sql)

– Inserts localized message strings

– This is the last step where things can go wrong; it is also the last action you can perform manually

4. Management Pack Import

– This step is performed by the setup itself. It imports new management packs into the database.

5. Post Management Pack import operations (Build_mom_db_postMPimport.sql)

– This step performs some database changes that are needed after the MP import

6. Configuration Service Database setup (Build_om_db_ConfigSvcRole.sql)

– This step sets up the tables and roles needed for the configuration service. The SCOM configuration service used to be a file-based store containing the current configuration of workflows for the management group. For performance and scalability reasons, this role has been transferred to the database level.

All the SQL files can be found on the SCOM 2012 installation media in the server\amd64 folder.

First, should you have encountered an upgrade failure, restore your database to the original location; you cannot reuse the current version as it will probably be corrupted beyond repair. Reinstall the RMS binaries, using the identical database and service account information of the original setup. The RMS will restore itself into the management group. The subscriptions and administrator role will be reset, however.

Take another backup copy of the Operational database, restore it to a temporary database and execute the aforementioned SQL scripts in the correct order. You will usually get a more precise error message in SQL Management Studio than in the SCOM setup log (stored procedure names, table names, …).

A common cause for upgrade failure is an inconsistent management pack. Pay close attention to the affected items mentioned in the error message. They usually contain a table or column reference. Look up this reference and try to get hold of a GUID. Query the GUID in the live database using the PowerShell interface (Get-MonitoringClass -Id <GUID> or Get-MonitoringObject -Id <GUID>, for example). This way you can find out which element(s) are causing issues and remove the management pack(s) that contain them.

You can uninstall suspect management packs by executing the stored procedure "p_RemoveManagementPack", providing the management pack GUID as a parameter. If the stored procedure returns a value of 1, check whether there are any depending management packs you should remove first.
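
If you prefer to run this from PowerShell against the temporary database copy, a sketch could look like the one below. The database and instance names are placeholders, and the positional call to p_RemoveManagementPack is an assumption to verify against your database.

# Remove a suspect management pack from the temporary database copy
Import-Module SQLPS -DisableNameChecking

$query = @"
DECLARE @rc int;
EXEC @rc = dbo.p_RemoveManagementPack '<ManagementPack GUID>';
SELECT @rc AS ReturnValue;
"@

Invoke-Sqlcmd -ServerInstance '<SQL instance>' -Database '<Temporary database copy>' -Query $query
# A ReturnValue of 1 means depending management packs must be removed first (see above)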

It will be rinse and repeat from here on: restore the backup of the original database again to the temporary database, remove the problematic management pack using the stored procedures and re-execute the scripts until no more errors occur. After you have listed up all troublesome management packs, you can remove them from the live management group, wait for the config churn to settle and retry the upgrade. Usually, the subsequent upgrade steps are pretty safe and you shouldn’t have big issues beyond this point.

Keeping things tidy while authoring (1st edition)

5:24 pm in #scom, #sysctr by Jan Van Meirvenne

I enjoy authoring a management pack. Not only because it is a really fun process to go through, but also to puzzle with the various techniques that you can use to make your work of art lean and mean.

Here are some of the things I try to do every time I embark on a new authoring adventure:

1) Use a naming convention

Using a naming convention internally in your management pack greatly helps when skimming through your XML code, associating multiple items (overrides, groups,…) or  creating documentation. It is also a big advantage when you are working together with other authors: using the same convention improves readability (provided everybody involved is knowledgeable with the convention) and guarantees a standard style in your code.

For example, at my current customer we have the following naming convention:

<Customer>.<Workload>.MP.<ObjectType>_<Component>_<Name>

The object type is actually an abbreviation for the various possible objects a management pack can contain:

Abbreviation – Object Type
C – Class
D – Discovery
R – Rule
M – Monitor
REC – Recovery
G – Group
SR – Secure Reference
REP – Report

So let's just say I work at company 'Foo' and I am working on a management pack for the in-house application 'Bar'. I have to create a single class for it, as it has only one component. It requires a service monitor and a rule that alerts on a certain NT event. My objects would be named as follows:

– Management Pack: Foo.Bar.MP
– Class: Foo.Bar.MP.C_Application
– Discovery: Foo.Bar.MP.D_Application
– Monitor: Foo.Bar.MP.M_Application_ServiceStatus
– Rule: Foo.Bar.MP.R_Application_EventAlert


2) Avoid manual-reset monitors

For me, the blight of Operations Manager is manual-reset monitors. They require the SCOM operator to be alert and notice that a monitor is unhealthy not because there is a genuine issue, but because nobody pushed the button to clear the state. This type of monitor is only an option in one scenario: there really is no way to detect when an issue is gone for the monitored application, and the issue must affect the health state of the SCOM object (for uptime or SLA reporting). But even then it might produce inaccurate reports, because you don't know how long ago the issue was solved when you finally reset the monitor.

If you can't find a good reason to use a manual-reset monitor, I suggest you use an alert rule instead. If the reason to go for a manual-reset monitor was to suppress consecutive alerts, then just use a repeat count on the rule (more on that in a jiffy).

3) Enable repeat count on alert rules

One of the biggest risks when going for an alert rule is bad weather: an alert storm. Imagine you have a rule configured to generate an alert on a certain NT event, and suddenly the monitored application goes haywire and starts spitting out error events like crazy. Enable the repeat count feature to create only one alert, with a counter indicating the number of occurrences. Is it really useful to have SCOM show 1000 alerts to indicate 1 problem? Well, I have had requests stating: we measure the seriousness of a problem by the number of SCOM mails… when there are a lot of mails we know we need to take a look, and otherwise we don't.

This brings me to a small tip when dealing with a workload intake meeting with the app owner (maybe I'll write a full post on this): ALWAYS RECOMMEND AGAINST E-MAIL BASED MONITORING. This is a trend that lives in IT teams not accustomed to centralized and pro-active monitoring (let the problem come to us). Emphasize the importance of using the SCOM interfaces directly, as this keeps the monitoring platform efficient and clean. Mails should only be used to notify on serious (fix me now or I die) issues. Also, either make the alerts in the subscription self-sustaining (auto-resolve) or train the stakeholders to close the alert themselves after receiving the mail and dealing with the problem.

4) Create a group hierarchy

I love groups. They are the work-horses behind SCOM scoping. For every management pack I make a base group that contains all servers or other objects related to the workload. I can then scope views and overrides to this group as I please, and create a container specific for the workload. To prevent creating a trash bin group and also to be able to do more specific scoping (eg different components, groups for overrides,…) I often do the following:

I create a base group to contain all other groups for the workload. I use a dynamic expression dictating that all objects of the class 'Group' with a DisplayName containing 'G_<Application>_' should be added as a subgroup. This is a set-once-and-forget configuration: as long as I adhere to the naming convention, any group relevant to the application workload is automatically added. However, watch out with this, because you can easily cause either a circular reference (the base group being configured to contain itself, which does not compute well with SCOM) or include groups you did not intend to add as subgroups.

Example:

G_Foo (the base group)
    – G_Foo_Servers (all servers used in the workload)
    – G_Foo_Bar (all core component objects of the Bar application; this is useful if you want to scope operator access to the application objects but not the underlying server ones)
    – G_Foo_DisabledDiscovery (this must be seen like a GPO/OU structure: create groups for your overrides and simply control which objects receive the override by drag-and-dropping them in as members)

5) When you expect a high monitored volume of a certain type, use cookdown

Let's just say our Foo application can host a high number of Bar components on a single system, and that we use a script to run a specific diagnostic command and alert on the results. Do we want to run this script for every Bar instance? That would cause a lot of redundant polls and increased overhead on the system. Why not create a single script data source and let the cookdown principle do its job? Cookdown implies that if multiple monitors and/or rules use the same data source, they will sync up. This results in the data source executing only once on the monitored system, feeding all attached rules and monitors with data. There are a lot of requirements to enable this functionality, but here are the basics:

– The script must be cookdown compatible. This means that multiple property bags must be created (preferably one for each instance) and that the consuming rules and monitors must be able to extract the property bag of their target instance (I usually include a value containing the instance name and use an expression filter to match this to the name attribute of the instance in SCOM). Remember, this script must retrieve and publish all data for all attached instances in one go; a sketch of such a script follows at the end of this section.

– The configuration parameters of the depending rules and monitors that use the data source must be identical. Every parameter that is passed to the script, as well as the interval settings, must be the same or SCOM will split up the execution into 2 or more runtimes.

– You can use the cookdown analyzer to determine whether your setup will support cookdown. In Visual Studio you can do this by right-clicking your MP project and choosing 'Perform Cookdown Analysis'.

image

This will build your management pack and then let you specify which classes will have few instances. The selected items will not be included in the report as cookdown is not beneficial for low-volume classes.

image

When you click ‘Generate Report’ an HTML file will be rendered containing a color coded report specifying all modules and whether they support cookdown (green) or not (red).

image
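
As promised, here is a sketch of a cookdown-friendly data source script. The 'Foo'/'Bar' workload and the Get-BarStatus command are hypothetical placeholders; the point is that the script runs once and emits one property bag per instance, so every rule or monitor sharing the data source can pick out its own bag.

# Cookdown-friendly data source: one execution, one property bag per Bar instance
param($SourceId, $ManagedEntityId)   # passed in by the module configuration

$api = New-Object -ComObject 'MOM.ScriptAPI'

foreach ($instance in Get-BarStatus)   # hypothetical command returning all Bar components at once
{
    $bag = $api.CreatePropertyBag()
    $bag.AddValue('InstanceName', $instance.Name)   # matched by the consuming rules/monitors
    $bag.AddValue('Status', $instance.Status)
    $api.AddItem($bag)
}

$api.ReturnItems()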

6) Don’t forget the description field and product knowledge

Something I have made myself guilty of: not caring about the description or product knowledge fields. These information stores can be a great asset when documenting your implementation. I make the following distinction between descriptions and product knowledge: a description contains what the object is meant to do, and product knowledge describes what the operator is meant to do (although some information will be reused from the description to provide context). These are invaluable features that can speed up the learning process for a new customer, author or operator, informing them on what they are working with and why things are as they are. This does not replace a good design and architecture however, as the information is decentralized and easily modified without control. Always create some documentation external to your management pack, and store it in a safe and versioned place.

An additional and often overlooked tip: you can also provide a description for your overrides. Use it to prevent having to figure out why you put an override on a certain workload years ago.

Bottom line: always provide sufficient context to why something is present in SCOM!

7) For every negative trigger, try to find a positive one

When you are designing the workload monitoring and you notice the customer gave an error condition for a certain scenario but not a clearing one… challenge them. The best monitoring solution is one that sustains itself: keeping itself clean and up-to-date.

There is one specific type of scenario where rule-based alerts are unavoidable: informative alerts that indicate a past issue or event (e.g. unexpected reboot, bad audit), although you should consider event collection if this is merely used for reporting purposes.

8) Don’t use object-level overrides

Although it seems useful for quick, precise modifications, I find object-level scoped overrides dangerous. Internally, an override on an object contains a GUID representing the ID of the overridden object in the database. When this object disappears for some reason (decommissioning, replacement, …) the override will still be there, but will only target thin air. Unless you documented the override, you will have no clue why it is there, and if you have a bunch of these guys they can contaminate your management group quickly. Like I said earlier, I like to generalize my overrides using groups: instead of disabling a monitor for one specific object, I create a group called 'G_Foo_Bar_DisabledMonitoringForDisk', put my override on that group and then use a dynamic query (static works too of course, but comes with some of the same issues as object-level overrides) to provision my group. This allows you to centralize your override management, trace back why stuff is there, and quickly extend existing overrides without the risk of duplicating them. Like I said previously, I highly recommend using the override description fields to provide additional information.

This was a first edition of my lessons learned regarding SCOM authoring, I’ll try to add new editions the moment I learn some new tricks! Do you have tips of your own, or do you do things differently? Let me know in the comments! Thanks for reading and until next time!

 

Applying update rollup 5 to Operations Manager 2012 Service Pack 1

3:19 pm in #scom, #sysctr by Jan Van Meirvenne

Hey guys, this is a quick procedure based on the official KB guide @ http://support.microsoft.com/kb/2904680

This guide assumes that you are running with administrative rights.

  • Validate the sources
    • Download all updates from Microsoft Update
    • Extract the appropriate cab files
      For each downloaded update, enter its downloaded folder and extract the cab file. If more than one cab file is present then extract each file containing your language code (usually ENU) and / or processor architecture (x64 or x86). The other cab files can be deleted.
    • Create folder structure
      • Create a folder structure like this:

      • Move the extracted MSP files in their appropriate folder
  • Apply the update to
    • Management Servers
      • Perform these steps for each management server
        • Log in on the system
        • Use a file explorer to go to the remote or local location of the patch file structure you created
        • Go to ‘Server’, right-click the MSP file and click ‘Apply’
        • When the installation completes, you will be prompted for a reboot, do this as soon as you can.
    • Gateway Servers
      • Perform these steps for each gateway server
        • Log in on the system
        • Use a file explorer to go to the remote or local location of the patch file structure you created
        • Go to ‘Server’, right-click the MSP file and click ‘Apply’
        • When the installation completes, you might be prompted for a reboot, do this as soon as you can.
    • ACS
      • Perform these steps for each ACS collector server
        • Log in on the system
        • Use a file explorer to go to the remote or local location of the patch file structure you created
        • Go to ‘ACS’, right-click the MSP file and click ‘Apply’
    • Web console servers
      • Perform these steps for each web console server
        • Log in on the system
        • Use a file explorer to go to the remote or local location of the patch file structure you created
        • Go to ‘WebConsole’, right-click the MSP file and click ‘Apply’
        • Use a file explorer to navigate to ‘C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG’ and open the web.config file
        • Add the following line under the system.web node in the xml structure:

<machineKey validationKey="AutoGenerate,IsolateApps" decryptionKey="AutoGenerate,IsolateApps" validation="3DES" decryption="3DES"/>
  • Operations Console servers
    • Perform the following steps on all systems with an Operations Console
      • Log in on the system
      • Use a file explorer to go to the remote or local location of the patch file structure you created
      • Go to ‘Console’, right-click the appropriate (x86 or x64)MSP file and click ‘Apply’
  • Update the databases
  • Data Warehouse
  • Connect to the Operations Manager Data Warehouse using SQL management studio
  • Use the studio to load the following query file from an Operations Manager management server:
< Base Installation Location of SCOM>\Server\SQL Script for Update Rollups\UR_Datawarehouse.sql

  • Execute the loaded query against the Operations Manager Data Warehouse database
  • Operational
    • Connect to the Operations Manager database using SQL management studio
    • Use the studio to load the following query file from an Operations Manager management server:
< Base Installation Location of SCOM>\Server\SQL Script for Update Rollups\update_rollup_mom_db.sql

  • Execute the loaded query against the Operations Manager database (a scripted alternative for both database update steps is sketched at the end of this post)
  • Import the management packs
    • Open the Operations Console and connect to the management group you are updating
    • In the left hand pane, go to ‘Administration’, right-click on ‘Management Packs’ and choose ‘Import Management Packs’
    • In the dialog, click ‘Add’ and select ‘Add from disk…’. When prompted to use the online catalog, choose ‘no’.
    • In the file select dialog, navigate to the following location on a management server, select all files and click 'Open':
<Root SCOM Installation Directory>\Server\Management Packs for Update Rollups

  • The import dialog will complain about a missing dependency management pack called 'Microsoft.SystemCenter.IntelliTraceCollectorInstallation.mpb'. You must add this dependency using the same method as the previous management packs. This file can be found under the 'ManagementPacks' folder on the SCOM installation media.
  • When all management packs are ready to import, click ‘Import’ to import them.
  • When prompted for security consent, click 'Yes'
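
The two database update steps above can also be scripted. This is a minimal sketch assuming the SQLPS module's Invoke-Sqlcmd is available on a machine that can reach both databases; adjust instance names, database names and the SCOM installation path.

# Run the Update Rollup SQL scripts against the Data Warehouse and Operational databases
Import-Module SQLPS -DisableNameChecking

$sqlDir = '<Base Installation Location of SCOM>\Server\SQL Script for Update Rollups'

Invoke-Sqlcmd -ServerInstance '<DW SQL instance>' -Database 'OperationsManagerDW' -InputFile (Join-Path $sqlDir 'UR_Datawarehouse.sql')
Invoke-Sqlcmd -ServerInstance '<Operational SQL instance>' -Database 'OperationsManager' -InputFile (Join-Path $sqlDir 'update_rollup_mom_db.sql')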