Overview

Opsview (fault monitoring software) is the preferred, established platform for systems monitoring within Yale ITS. One of its features is out-of-the-box integration with Service Desk functions, including RT/Jira/ServiceNow.

Opsview is a commercial fork of Nagios. Their codebases are extremely similar and have many of the same integration points.

There is a Confluence Space dedicated to Yale's implementation of Opsview here.

Opsview Integration Architecture

The opsview-servicedesk-connector package for Red Hat Enterprise Linux (RHEL) provides this integration via opsview-notifications. Core Nagios events hit an external notifier, which spools them to disk; the notifications daemon then pulls them into a relational database. The same daemon also sends this data offsite to Service Now. Once Service Now items have attributes set or changed (ticket number, ownership change, etc.), the relevant details are brought back down to Opsview via the same notifications daemon.

To interface with Service Now, the connector's YML configuration requires a username, password, and instance URL.

Opsview Vendor Documentation

Opsview maintains a fairly deep wiki online. The following are the relevant documentation pages for their Service Desk Connector as it relates to Service Now integration.

Overview
Architecture
Installation
Configuration

Service Now data from Service Desk Connector, stock

Title: http://yalesandbox.service-now.com/nav_to.do?uri=ecc_queue.do?sys_id=f89666ae71a42000a8d1496cf964454c
<notification><priority>1</priority><short_description>HTTPS - 443 - ADDR1 - Dumb is CRITICAL on host trapeze3.its.yale.edu</short_description><comments>Service: HTTPS - 443 - ADDR1 - Dumb
Host: trapeze3.its.yale.edu
Address: trapeze3.its.yale.edu
State: CRITICAL
Date/Time: Tue Mar 6 14:26:18 EST 2012

Additional Info:

CRITICAL - Socket timeout after 10 seconds</comments><category>Network</category><checktime>1331061975</checktime><correlation_id>trapeze3.its.yale.edu;HTTPS - 443 - ADDR1 - Dumb;1331061544</correlation_id><state>CRITICAL</state><servicename>HTTPS - 443 - ADDR1 - Dumb</servicename><hostname>trapeze3.its.yale.edu</hostname><contact_type>Opsview</contact_type></notification>
Title: http://yalesandbox.service-now.com/nav_to.do?uri=ecc_queue.do?sys_id=549626ae71a42000a8d1496cf96445d0
<notification><priority>1</priority><short_description>physical2.virtual.yale.edu is DOWN</short_description><comments>Host: physical2.virtual.yale.edu
Address: physical2.virtual.yale.edu
State: DOWN
Date/Time: Tue Mar 6 14:25:44 EST 2012

Additional Info:

CRITICAL - physical2.virtual.yale.edu: rta nan, lost 100%</comments><category>Network</category><checktime>1331061942</checktime><correlation_id>physical2.virtual.yale.edu;;1330856341</correlation_id><state>DOWN</state><hostname>physical2.virtual.yale.edu</hostname><contact_type>Opsview</contact_type></notification>
Title: http://yalesandbox.service-now.com/nav_to.do?uri=ecc_queue.do?sys_id=e8966eead4a0200054d3e703207fe9e4
<notification><priority>1</priority><short_description>Load Average is CRITICAL on host meg.its.yale.edu</short_description><comments>Service: Load Average
Host: meg.its.yale.edu
Address: meg.its.yale.edu
State: CRITICAL
Date/Time: Tue Mar 6 14:23:59 EST 2012

Additional Info:

CRITICAL - load average: 13.19, 12.05, 10.57</comments><category>Network</category><checktime>1331061769</checktime><correlation_id>meg.its.yale.edu;Load Average;1331061048</correlation_id><state>CRITICAL</state><servicename>Load Average</servicename><hostname>meg.its.yale.edu</hostname><contact_type>Opsview</contact_type></notification>
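
For anyone who needs to inspect these payloads outside of Service Now, here is a minimal Python sketch that flattens a notification into a dictionary. It assumes, based only on the three samples above, that correlation_id has the form host;service;timestamp and that the service part is empty for host-level alerts. The function name parse_notification and the keys is_host_alert and correlation_parts are illustrative and not part of the connector.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the first sample payload above (comments element shortened).
SAMPLE = (
    "<notification><priority>1</priority>"
    "<short_description>HTTPS - 443 - ADDR1 - Dumb is CRITICAL on host "
    "trapeze3.its.yale.edu</short_description>"
    "<comments>State: CRITICAL</comments>"
    "<category>Network</category><checktime>1331061975</checktime>"
    "<correlation_id>trapeze3.its.yale.edu;HTTPS - 443 - ADDR1 - Dumb;"
    "1331061544</correlation_id><state>CRITICAL</state>"
    "<servicename>HTTPS - 443 - ADDR1 - Dumb</servicename>"
    "<hostname>trapeze3.its.yale.edu</hostname>"
    "<contact_type>Opsview</contact_type></notification>"
)


def parse_notification(xml_text):
    """Flatten a <notification> payload into a dict of child tag -> text."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: (child.text or "") for child in root}
    # Assumption: correlation_id looks like "host;service;timestamp".
    # Split from both ends so a ';' inside a service name does not break it.
    host, rest = fields["correlation_id"].split(";", 1)
    service, timestamp = rest.rsplit(";", 1)
    fields["correlation_parts"] = (host, service, timestamp)
    # Host-level alerts (such as the DOWN sample) carry an empty service part.
    fields["is_host_alert"] = service == ""
    return fields
```

Calling parse_notification(SAMPLE)["state"] returns the state string from the payload, and the same function should work on the host-level sample, which has no servicename element.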

ServiceNow-Side Configuration

Concerns

There is a lot of state change within Opsview. On human inspection some of this state change turns out to be spurious, yet Opsview always treats it as a genuine issue. Yale should tread carefully when opening this Opsview data flow up to Service Now, so as to avoid a firehose condition.

Moreover, this integration should be used as a lightning rod to push for amending the monitoring stack where appropriate, in order to dial back the amount of state change.
-nick, 20120305

  1. We can work on dialing down the noise by tracking top talkers
  2. Need to confirm that renotifications don't generate duplicate incidents
  3. Vet the process of auto-acknowledging states when an incident is opened in Service Now
  4. Nail down the process for targeting top talkers and minimizing false positives where possible (timeout increases, consecutive-hit increases, etc.)
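
As a rough illustration of the first two items, the sketch below counts notifications per host/service pair to surface top talkers, and collapses repeats that share a correlation_id. It assumes notifications are available as dictionaries with the field names seen in the sample payloads, and that correlation_id stays the same across renotifications, which is exactly the point that still needs to be confirmed. The function names are hypothetical.

```python
from collections import Counter


def top_talkers(notifications, n=5):
    """Return the n (hostname, servicename) pairs with the most notifications."""
    counts = Counter(
        (note["hostname"], note.get("servicename", "")) for note in notifications
    )
    return counts.most_common(n)


def dedupe_by_correlation(notifications):
    """Keep only the first notification seen for each correlation_id.

    Assumption: a renotification reuses the correlation_id of the original.
    """
    seen = set()
    unique = []
    for note in notifications:
        if note["correlation_id"] not in seen:
            seen.add(note["correlation_id"])
            unique.append(note)
    return unique
```

Running top_talkers over a day of spooled notifications would give a first list of checks to tune with longer timeouts or more consecutive hits.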

Tickets

Opsview

Service Now

  • Inquire about the least privilege needed for the bind account
  • Inquire about which addresses the monitoring stations must be permitted egress to

Next Steps

  • Consider keywords reserved for Service Now: not all events would enter SN, only those we select via tags
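
The keyword idea could look roughly like the following allow-list check. This is only a sketch: the sample payloads do not carry keyword data, so the keywords field, the SN_KEYWORDS set, and the "servicenow" tag name are all hypothetical.

```python
# Hypothetical keyword set; the real tag names would be chosen by Yale ITS.
SN_KEYWORDS = frozenset({"servicenow"})


def should_forward(notification, allowed=SN_KEYWORDS):
    """Return True if the notification carries at least one allow-listed keyword.

    Assumption: keywords arrive as a list under a 'keywords' key; the stock
    payload shown earlier on this page has no such field.
    """
    keywords = set(notification.get("keywords", []))
    return bool(keywords & set(allowed))
```

Events without any matching keyword would stay in Opsview only, which keeps the default closed rather than open.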

Final considerations

  • Per Lou: Critical alerts from Opsview should be set to a priority level of "High" in SN, and Warnings should be set to "Average".
  • Field mappings on the SN screen: "CLIENT=OPsView", "CONTACT=Operator", "NOTIFY=NONE", "CONTACT TYPE=Tier 2", "Incident Type=Service Event", and "Priority=3-High".
  • left-nav item for unassigned opsview tickets (or bookmark/view) for ops staff
  • Downtimed state change => handled => open ticket?
  • Yellows/warnings + tickets as they affect workflow
    • OVS-3236
    • Assign warnings to the proper group; assign criticals to DC Ops
  • Acknowledgements
    • Auto-acknowledgement at Opsview and how this impacts Yale workflow
    • This means that the Unhandled column in Opsview goes away or becomes meaningless. That is important if the column is a catalyst for starting Ops work (procedure, call, etc.).
  • Getting keyword data transmitted
  • Oddballs (non-ITS, like YUHS)
    • Oddballs can stay behind if needed; they don't block ITS from moving. This is all driven by notification profiles, where we can positively match keywords.
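
The priority and field mappings listed under Final considerations can be written down as a small lookup, sketched below. The dictionary keys are the labels as written in the list, not verified Service Now column names, and the function name incident_fields is hypothetical. States the list does not mention (for example DOWN) are deliberately left without a priority rather than guessed.

```python
# State-to-priority mapping as stated in the list above (per Lou).
PRIORITY_BY_STATE = {"CRITICAL": "High", "WARNING": "Average"}

# Fixed values copied from the field-mappings bullet. The keys are the labels
# used on this page, not confirmed Service Now column names.
FIXED_FIELDS = {
    "CLIENT": "OPsView",
    "CONTACT": "Operator",
    "NOTIFY": "NONE",
    "CONTACT TYPE": "Tier 2",
    "Incident Type": "Service Event",
}


def incident_fields(state):
    """Build the field values for an incident opened from an Opsview state."""
    fields = dict(FIXED_FIELDS)  # copy so the shared defaults stay untouched
    priority = PRIORITY_BY_STATE.get(state)
    if priority is not None:
        fields["Priority"] = priority
    return fields
```

Keeping the mapping in one table makes it easy to adjust once the handling of warnings and downtimed hosts is settled.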