Opsview Integration
Overview
Opsview (fault monitoring software) is the preferred, established platform for systems monitoring within Yale ITS. One of its features is out-of-the-box integration with Service Desk functions, including RT/Jira/ServiceNow.
Opsview is a commercial fork of Nagios. Their codebases are extremely similar and have many of the same integration points.
There is a Confluence Space dedicated to Yale's implementation of Opsview here.
Opsview Integration Architecture
The opsview-servicedesk-connector package for Red Hat Enterprise Linux (RHEL) provides this integration via opsview-notifications. Core Nagios events hit an external notifier. These spool to disk and are pulled into a relational database by the notifications daemon. The daemon also sends the same data offsite to Service Now. Once Service Now items have attributes set or changed (ticket number, ownership change, etc.), the relevant details are brought back down to Opsview via the same notifications daemon.
To interface with Service Now, the connector's YML configuration requires a username, password, and instance URL.
Opsview Vendor Documentation
Opsview maintains a fairly deep wiki online. The following documentation pages are relevant to their Service Desk Connector as it relates to Service Now integration.
Overview
Architecture
Installation
Configuration
Service Now data from Service Desk Connector, stock
<notification><priority>1</priority><short_description>HTTPS - 443 - ADDR1 - Dumb is CRITICAL on host trapeze3.its.yale.edu</short_description><comments>Service: HTTPS - 443 - ADDR1 - Dumb Host: trapeze3.its.yale.edu Address: trapeze3.its.yale.edu State: CRITICAL Date/Time: Tue Mar 6 14:26:18 EST 2012 Additional Info: CRITICAL - Socket timeout after 10 seconds</comments><category>Network</category><checktime>1331061975</checktime><correlation_id>trapeze3.its.yale.edu;HTTPS - 443 - ADDR1 - Dumb;1331061544</correlation_id><state>CRITICAL</state><servicename>HTTPS - 443 - ADDR1 - Dumb</servicename><hostname>trapeze3.its.yale.edu</hostname><contact_type>Opsview</contact_type></notification>
<notification><priority>1</priority><short_description>physical2.virtual.yale.edu is DOWN</short_description><comments>Host: physical2.virtual.yale.edu Address: physical2.virtual.yale.edu State: DOWN Date/Time: Tue Mar 6 14:25:44 EST 2012 Additional Info: CRITICAL - physical2.virtual.yale.edu: rta nan, lost 100%</comments><category>Network</category><checktime>1331061942</checktime><correlation_id>physical2.virtual.yale.edu;;1330856341</correlation_id><state>DOWN</state><hostname>physical2.virtual.yale.edu</hostname><contact_type>Opsview</contact_type></notification>
<notification><priority>1</priority><short_description>Load Average is CRITICAL on host meg.its.yale.edu</short_description><comments>Service: Load Average Host: meg.its.yale.edu Address: meg.its.yale.edu State: CRITICAL Date/Time: Tue Mar 6 14:23:59 EST 2012 Additional Info: CRITICAL - load average: 13.19, 12.05, 10.57</comments><category>Network</category><checktime>1331061769</checktime><correlation_id>meg.its.yale.edu;Load Average;1331061048</correlation_id><state>CRITICAL</state><servicename>Load Average</servicename><hostname>meg.its.yale.edu</hostname><contact_type>Opsview</contact_type></notification>
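The stock payloads above are flat XML, so they are easy to inspect offline. The following is a minimal sketch, not the connector's own code: it parses one notification into a dict and splits the correlation_id on its semicolons. The names parse_notification, parse_correlation_id, and CorrelationId are hypothetical, and SAMPLE is a trimmed copy of the host-down payload shown above.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class CorrelationId(NamedTuple):
    """Hypothetical container for the three parts of a correlation_id."""
    hostname: str
    servicename: Optional[str]  # None for host-level notifications
    timestamp: int


def parse_correlation_id(raw: str) -> CorrelationId:
    # The samples use "host;service;timestamp". Split from both ends so a
    # service name containing ';' would not break the parse.
    host, rest = raw.split(";", 1)
    service, stamp = rest.rsplit(";", 1)
    return CorrelationId(host, service or None, int(stamp))


def parse_notification(xml_text: str) -> dict:
    # Every child of <notification> is a simple text element.
    root = ET.fromstring(xml_text)
    fields = {child.tag: (child.text or "") for child in root}
    fields["correlation"] = parse_correlation_id(fields["correlation_id"])
    return fields


# Trimmed copy of the host-down sample (the comments field is omitted).
SAMPLE = (
    "<notification><priority>1</priority>"
    "<short_description>physical2.virtual.yale.edu is DOWN</short_description>"
    "<category>Network</category><checktime>1331061942</checktime>"
    "<correlation_id>physical2.virtual.yale.edu;;1330856341</correlation_id>"
    "<state>DOWN</state><hostname>physical2.virtual.yale.edu</hostname>"
    "<contact_type>Opsview</contact_type></notification>"
)

if __name__ == "__main__":
    note = parse_notification(SAMPLE)
    print(note["state"], note["correlation"])
```

Host-level notifications carry an empty service component, which the sketch maps to None so that host and service alerts can be told apart.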
ServiceNow-Side Configuration
- set up user s_nagios (SN user) in https://yalesandbox.service-now.com; give all soap* roles and the itil role to s_nagios
- add a business rule to process ECC queue entries from Opsview (Opsview Business Rule)
- add an entry to the sys_impex_map table to do the transformation (Opsview Import-Export Rule)
Concerns
There is a lot of state change within Opsview. Sometimes a state change is considered spurious upon human inspection, while Opsview always deems it a genuine issue. Yale should tread carefully in opening this Opsview dataflow up to Service Now, so as to avoid a firehose condition.
Moreover, this integration should be used as a lightning rod to initiate and push for amending the monitoring stack where appropriate to dial back the amount of state change.
-nick, 20120305
- We can work on dialing down yelling by tracking Top Talkers
- Need to confirm that renotifies don't generate dupe incidents
- Vet process of auto-acking of states during incident open @ service now
- Nail process on how to target top talkers + minimize false positives where possible (timeout increases, consecutive hit increases, etc)
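Tracking top talkers amounts to counting notifications per host/service pair. A minimal sketch follows, assuming notifications have already been reduced to dicts with hostname and (optionally) servicename keys, as in the stock XML payloads. The function name top_talkers and the sample events are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple


def top_talkers(
    notifications: Iterable[Mapping[str, str]], limit: int = 10
) -> List[Tuple[Tuple[str, str], int]]:
    """Return the noisiest (hostname, servicename) pairs, loudest first."""
    counts = Counter(
        # Host-level notifications have no servicename; use "" for those.
        (n["hostname"], n.get("servicename", ""))
        for n in notifications
    )
    return counts.most_common(limit)


if __name__ == "__main__":
    # Illustrative events only; not real monitoring data.
    events = [
        {"hostname": "meg.its.yale.edu", "servicename": "Load Average"},
        {"hostname": "physical2.virtual.yale.edu"},
        {"hostname": "meg.its.yale.edu", "servicename": "Load Average"},
    ]
    for (host, service), count in top_talkers(events):
        print(count, host, service or "(host)")
```

The pairs at the head of this list are the natural candidates for timeout increases or higher consecutive-hit thresholds.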
Tickets
Opsview
- OVS-3105 servicedesk notifications db on external mysqld and not localhost?
- RESOLVED
the three lines equate to a normal MySQL DSN, a username and finally a password
- RESOLVED
- OVS-3107 curious about how opsview treats state when service now incident record isn't created
- CLARIFIED
If an incident record is not created, an acknowledgement will be received in Opsview, but no ID will be associated. If an error is reported by ServiceNow (for example no authority to create a ticket, or internal error) then no acknowledgement will be sent to Opsview.
- CLARIFIED
- OVS-3110 notifications not received by service now ; need help debugging
- WONTFIX
At this time I cannot make any definitive comment on privileges - we originally developed the integration against another customer's implementation and it worked with no problems. We didn't make a note of the access we had at the time (to be fair, we didn't have access to see what our access was).
The code we currently have only raises new tickets rather than updating or deleting existing ones - the only call I see to the soap service in the code is 'insert' (which probably equates to the 'create' role)
- WONTFIX
- OVS-3111 userland configurability of fields sent to service now during servicedesk-connector use?
- MIGHTFIX
Since this module was originally written for another customer, we made the code generic enough for their needs on changing fields, but at this time not all fields can be amended.
Do you have specific fields you want to amend? If so, I can track those in the code.
- MIGHTFIX
- OVS-3113 support for running servicedesk-connector @ slaves in cluster as opposed to master
- RFE
This is not currently possible but I have raised a dev ticket to add this functionality in as NTFY-28
- RFE
- OVS-3114 describe use of Contact Variables in Notification Methods
- RFE
At the moment the ServiceNow integration doesn't make use of the extra variables.
For other notification methods where a template is used, such as for an email, the variables can be used to augment the data in whatever format you need. For ServiceNow, since xml is generated on the fly there is no template to amend, hence no use of the contact variables currently.
What is your purpose behind asking? What did you want to achieve?
I have raised NTFY-30 for this issue
- RFE
- OVS-3115 notifications never make their way to notifications db, service now
- OPEN
- OVS-3116 support for admin purging of events at spool for servicedesk connector
- RFE
At the moment we do not have a script for achieving this; I'll raise a dev ticket for it as there should be a supported method.
In the meantime the only solution I have is to:
1.) shut down opsview-notifyd
2.) connect to the notifications db and remove the entries from the spool table
3.) restart opsview-notifyd
NTFY-29 raised
- RFE
- OVS-3121 ability to make servicedesk-connector comments on services not persistent
- RESOLVED
This can be amended by patching /opt/opsview/notifications/lib/Opsview/Notifications/Output.pm as follows ... and then restarting opsview-notifyd.
I have raised a dev ticket to fix this in the future
I'll resolve this for now, but we'll let you know when this is fixed correctly
- RESOLVED
- OVS-3123 help clarify when service now incident number is handed back to opsview
- RESOLVED
The docs at http://wiki.service-now.com/index.php?title=Tivoli_Enterprise_Console_Integration are really a guide to setting up a new soap interaction. I believe with SN you cannot easily create the rules until you have at least one submission to create the rules against.
In usual operation, the info is submitted to SN, the business rules are processed, a ticket created (or not, depending on the rules) and a response to the soap request is returned
I have also updated the doc page. Let me know if you need any more info http://docs.opsview.com/doku.php?id=servicedesk-connector-latest:configuration#service_now
- RESOLVED
- OVS-3125 length of comment field for acknowledgement + qualified service now incident url in comment via connector incident open?
- RESOLVED/NEEDSYALEPATCH
The ack text has a hard limit of 250 characters, so as long as the total length of the comment is under this, then it is fine to include.
The place to amend would be in /opt/opsview/notifications/lib/Opsview/Notifications/Output.pm at around line 111.
We are happy to accept patches if you make the change, but at the moment we are busy beavering away at Opsview 4 so I am not sure when we would be able to code this up ourselves.
- RESOLVED/NEEDSYALEPATCH
- OVS-3135 queries on servicedesk-connector + service now wrt recovery and flap
- PARTIAL
1.) Would it be possible to put the incident in a RESOLVED state once Opsview reports it is green?
At the moment no, but we have already got a dev ticket open for this: https://secure.opsview.com/jira/browse/NTFY-26
2.) We want to make sure if a machine or service is flapping that we do not open a ticket for every flap. What happens prior to flap, during flap, on flap recovery, etc? Can flap be adjusted?
Servicedesk-connector should provide a 'correlation id' to ServiceNow, which is an ID made up of hostname, service name and partial timestamp. ServiceNow should be able to recognise this and collapse matching notifications into the same ticket, so there shouldn't be multiple alerts raised from a flapping state.
- PARTIAL
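The collapse-by-correlation-id behaviour described in OVS-3135 can be illustrated in isolation. This is a minimal sketch of the idea only, assuming the receiving side keys open incidents on the correlation id. It is not ServiceNow's implementation or the business rule itself, and the Incident and IncidentStore names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Incident:
    """Hypothetical stand-in for a ticket on the receiving side."""
    correlation_id: str
    updates: List[str] = field(default_factory=list)


class IncidentStore:
    """Keeps one open incident per correlation id."""

    def __init__(self) -> None:
        self._open: Dict[str, Incident] = {}

    def ingest(self, correlation_id: str, comment: str) -> Tuple[Incident, bool]:
        # created is True only for the first notification with this id;
        # renotifies and flaps with the same id land on the same incident.
        created = correlation_id not in self._open
        incident = self._open.setdefault(correlation_id, Incident(correlation_id))
        incident.updates.append(comment)
        return incident, created

    def open_count(self) -> int:
        return len(self._open)


if __name__ == "__main__":
    store = IncidentStore()
    cid = "meg.its.yale.edu;Load Average;1331061048"
    store.ingest(cid, "CRITICAL - load average: 13.19, 12.05, 10.57")
    store.ingest(cid, "renotification")
    print(store.open_count())
```

Whether renotifies actually reuse the same correlation id is exactly the point that still needs confirming against the live connector.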
Service Now
- inquire about least privilege needed for bind account
- inquire about addresses to permit egress to from monitoring stations
Next Steps
- Consider keywords for Service Now only; not all events would enter SN, only those we select via tags
Final considerations
- per Lou: Critical Alerts from Opsview should be set to a priority level of "High" in SN and Warnings be set as "Average".
- field mappings: "CLIENT=OPsView", "CONTACT=Operator", "NOTIFY=NONE", "CONTACT TYPE=Tier 2", "Incident Type=Service Event", and "Priority=3-High" on the SN screen.
- left-nav item for unassigned opsview tickets (or bookmark/view) for ops staff
- Downtimed state change => handled => open ticket?
- Yellows/warnings + tickets as they affect workflow
- OVS-3236
- Assign warns to proper group ; assign crit to DC Ops
- Acknowledgements
- auto-ack @ Opsview and how this impacts Yale workflow
- this means that the Unhandled column in Opsview goes away, becomes meaningless, etc. Important if that's a catalyst for starting Ops work (procedure, call, etc).
- Getting keyword data transmitted
- OVS-3237
- PS engagement underway, no ETC
- Oddballs (non-ITS like YUHS)
- Oddballs can stay behind if needed; they don't block ITS from moving. This is all driven with notification profiles where we can positively match keywords.
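The priority and field mappings listed under the final considerations can be written down as data. The following is a minimal sketch under stated assumptions: the key names mirror the labels quoted above rather than real ServiceNow column names, the priority labels follow the "High"/"Average" wording (the actual SN choice values, such as "3-High", may differ), and build_incident_fields is a hypothetical helper. States with no agreed mapping, such as DOWN, are deliberately left without a priority.

```python
from typing import Dict

# Critical -> High and Warning -> Average, per the note from Lou above.
PRIORITY_BY_STATE: Dict[str, str] = {
    "CRITICAL": "High",
    "WARNING": "Average",
}

# Fixed values from the field-mappings note; keys are screen labels,
# not confirmed ServiceNow column names.
STATIC_FIELDS: Dict[str, str] = {
    "CLIENT": "OPsView",
    "CONTACT": "Operator",
    "NOTIFY": "NONE",
    "CONTACT TYPE": "Tier 2",
    "Incident Type": "Service Event",
}


def build_incident_fields(state: str) -> Dict[str, str]:
    """Combine the static fields with a priority when the state is mapped."""
    fields = dict(STATIC_FIELDS)
    priority = PRIORITY_BY_STATE.get(state.strip().upper())
    if priority is not None:
        fields["Priority"] = priority
    return fields


if __name__ == "__main__":
    print(build_incident_fields("CRITICAL"))
```

Keeping the mapping in one table makes it straightforward to add the warning-group versus DC Ops assignment once that routing is settled.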