Comment: Migrated to Confluence 4.0

...

Code Block
title: http://yalesandbox.service-now.com/nav_to.do?uri=ecc_queue.do?sys_id=e8966eead4a0200054d3e703207fe9e4
<notification><priority>1</priority><short_description>Load Average is CRITICAL on host meg.its.yale.edu</short_description><comments>Service: Load Average
Host: meg.its.yale.edu
Address: meg.its.yale.edu
State: CRITICAL
Date/Time: Tue Mar 6 14:23:59 EST 2012

Additional Info:

CRITICAL - load average: 13.19, 12.05, 10.57</comments><category>Network</category><checktime>1331061769</checktime><correlation_id>meg.its.yale.edu;Load Average;1331061048</correlation_id><state>CRITICAL</state><servicename>Load Average</servicename><hostname>meg.its.yale.edu</hostname><contact_type>Opsview</contact_type></notification>
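A minimal sketch of reading such a payload on the receiving side, assuming the ECC queue record carries the XML exactly as shown above. The `parse_notification` helper and the `SAMPLE` constant are illustrative names, not part of the servicedesk-connector or ServiceNow, and the `comments` text is shortened here for brevity.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the payload shown above (comments body trimmed).
SAMPLE = (
    "<notification><priority>1</priority>"
    "<short_description>Load Average is CRITICAL on host meg.its.yale.edu"
    "</short_description>"
    "<comments>CRITICAL - load average: 13.19, 12.05, 10.57</comments>"
    "<category>Network</category><checktime>1331061769</checktime>"
    "<correlation_id>meg.its.yale.edu;Load Average;1331061048</correlation_id>"
    "<state>CRITICAL</state><servicename>Load Average</servicename>"
    "<hostname>meg.its.yale.edu</hostname>"
    "<contact_type>Opsview</contact_type></notification>"
)


def parse_notification(xml_text):
    """Return a dict mapping each child element name to its stripped text."""
    root = ET.fromstring(xml_text)
    if root.tag != "notification":
        raise ValueError("unexpected root element: %s" % root.tag)
    return {child.tag: (child.text or "").strip() for child in root}


fields = parse_notification(SAMPLE)
print(fields["hostname"], fields["state"], fields["correlation_id"])
```

Keeping the parsed fields in a flat dict makes it easy to compare what Opsview sends against the field mappings wanted on the Service Now side.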

ServiceNow-Side Configuration

Concerns

There is a lot of state change within Opsview. Sometimes a state change looks spurious on human inspection, yet Opsview always treats it as a genuine issue. Yale should tread carefully when opening this Opsview dataflow up to Service Now, so as to avoid a firehose condition.

Moreover, this integration should be used as a lightning rod to initiate and push for amending the monitoring stack where appropriate, to dial back the amount of state change.
-nick, 20120305

  1. We can work on dialing down yelling by tracking Top Talkers
  2. Need to confirm that renotifies don't generate dupe incidents
  3. Vet process of auto-acking of states during incident open @ Service Now
  4. Nail process on how to target top talkers + minimize false positives where possible (timeout increases, consecutive hit increases, etc)
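Items 1 and 2 can be prototyped offline against a dump of notifications. The sketch below assumes each notification has already been reduced to a dict with `hostname`, `servicename` and `correlation_id` keys (the element names in the payload above). The helper names and the `host1`/`HTTP` sample row are hypothetical.

```python
from collections import Counter


def top_talkers(notifications, n=3):
    """Count notifications per (host, service) pair and return the n noisiest."""
    counts = Counter((x["hostname"], x["servicename"]) for x in notifications)
    return counts.most_common(n)


def first_per_correlation_id(notifications):
    """Keep only the first notification seen for each correlation_id,
    which is how a renotify would be prevented from opening a dupe incident."""
    seen = set()
    kept = []
    for x in notifications:
        if x["correlation_id"] in seen:
            continue
        seen.add(x["correlation_id"])
        kept.append(x)
    return kept


events = [
    {"hostname": "meg.its.yale.edu", "servicename": "Load Average",
     "correlation_id": "meg.its.yale.edu;Load Average;1331061048"},
    # Renotify for the same problem: same correlation_id.
    {"hostname": "meg.its.yale.edu", "servicename": "Load Average",
     "correlation_id": "meg.its.yale.edu;Load Average;1331061048"},
    # Hypothetical second host, for illustration only.
    {"hostname": "host1", "servicename": "HTTP",
     "correlation_id": "host1;HTTP;1331061100"},
]

print(top_talkers(events, n=1))
print(len(first_per_correlation_id(events)))
```

This only models the intended behaviour; whether Service Now actually collapses renotifies this way still has to be confirmed in the sandbox.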

Tickets

Opsview

Next Steps

  • Bind creds @ Service Now
  • Generate dummy data in SN sbx ECC queue to verify conduit function
    • http://docs.opsview.com/doku.php?id=servicedesk-connector-latest:configuration#testing
    • No Format
      $ NAGIOS_LASTHOSTCHECK=1234567891 NAGIOS_LASTHOSTUP=1234567890 NAGIOS_HOSTOUTPUT="Test failure" NAGIOS_HOSTADDRESS=10.11.12.13 NAGIOS_LONGDATETIME="Dec 1 2009" NAGIOS_HOSTALIAS="temp host 1" NAGIOS_LASTHOSTDOWN=0 NAGIOS_LASTHOSTSTATECHANGE=0 NAGIOS_NOTIFICATIONTYPE=PROBLEM NAGIOS_HOSTSTATE=DOWN NAGIOS_HOSTNAME=host1 /opt/opsview/notifications/bin/opsview_notifications servicenow
      Notification submitted
    • Egress packet filter in the way. Eval scalable means to open up.
  • Verify endpoints for Service Now for egress packet filter purposes (Yale firewalls outbound)
    • Is 199.91.136.0/21 too much, too little, etc?
  • Bootstrap Notification Methods
    • RFE

      At the moment we do not have a script for achieving this; I'll raise a dev ticket for it as there should be a supported method.
      In the meantime the only solution I have is to

      -shut down opsview-notifyd
      -connect to the notifications db and remove the entries from the spool table
      -restart opsview-notifyd

      NTFY-29 raised

  • OVS-3121 ability to make servicedesk-connector comments on services not persistent
    • RESOLVED

      This can be amended by patching /opt/opsview/notifications/lib/Opsview/Notifications/Output.pm as follows ... and then restarting opsview-notifyd.
      I have raised a dev ticket to fix this in the future
      I'll resolve this for now, but we'll let you know when this is fixed correctly

  • OVS-3123 help clarify when Service Now incident number is handed back to Opsview; verification of fields being fixed, unchangeable
  • OVS-3125 length of comment field for acknowledgement + qualified service now incident url in comment via connector incident open?
    • RESOLVED/NEEDSYALEPATCH

      The ack text has a hard limit of 250 characters, so as long as the total length of the comment is under this, then it is fine to include.
      The place to amend would be in /opt/opsview/notifications/lib/Opsview/Notifications/Output.pm at around line 111

      We are happy to accept patches if you make the change, but at the moment we are busy beavering away at Opsview 4 so I am not sure when we would be able to code this up ourselves.

  • OVS-3135 queries on servicedesk-connector + service now wrt recovery and flap
    • PARTIAL

      1.) Would it be possible to put the incident in a RESOLVED state once Opsview reports it is green?

      At the moment no, but we have already got a dev ticket open for this: https://secure.opsview.com/jira/browse/

      OVS-3111

      NTFY-26

      2.) We want to make sure if a machine or service is flapping that we do not open a ticket for every flap. What happens prior to flap, during flap, on flap recovery, etc? Can flap be adjusted?

      Servicedesk-connector should provide a 'correlation id' to ServiceNow, which is an ID made up of hostname, service name and partial timestamp. ServiceNow should be able to recognise it and collapse matching notifications into the same ticket, so there shouldn't be multiple alerts raised from a flapping state.
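The correlation id described in the OVS-3135 reply can be taken apart as follows. This is a sketch that assumes the `host;service;timestamp` layout seen in the sample payload (`meg.its.yale.edu;Load Average;1331061048`). What the timestamp component represents is not confirmed here, and `parse_correlation_id` and `collapse` are illustrative names.

```python
from typing import NamedTuple


class CorrelationId(NamedTuple):
    hostname: str
    servicename: str
    timestamp: int


def parse_correlation_id(raw):
    """Split 'host;service;timestamp', tolerating ';' inside the service name."""
    host, rest = raw.split(";", 1)
    service, stamp = rest.rsplit(";", 1)
    return CorrelationId(host, service, int(stamp))


def collapse(notifications):
    """Group notifications by correlation id, modelling one ticket per id."""
    tickets = {}
    for n in notifications:
        tickets.setdefault(n["correlation_id"], []).append(n)
    return tickets


cid = parse_correlation_id("meg.its.yale.edu;Load Average;1331061048")

# Two notifications with different check times but the same correlation id.
flapping = [
    {"checktime": 1331061769,
     "correlation_id": "meg.its.yale.edu;Load Average;1331061048"},
    {"checktime": 1331062069,  # hypothetical later check
     "correlation_id": "meg.its.yale.edu;Load Average;1331061048"},
]
tickets = collapse(flapping)
print(cid, len(tickets))
```

If a flap produces a new timestamp component on each state change, the ids would differ and the notifications would not collapse, which is worth checking during sandbox testing.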

Service Now

  • inquire about least privilege needed for bind account
  • inquire about which addresses the monitoring stations must be permitted egress to
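To put the 199.91.136.0/21 question from the Opsview list in concrete terms, the sketch below uses Python's standard `ipaddress` module to show how large that range is and to test candidate endpoint addresses against it. Whether the range actually matches Service Now's endpoints is exactly what still has to be asked, and `egress_allowed` is an illustrative helper.

```python
import ipaddress

# The candidate egress range from the notes above.
SN_RANGE = ipaddress.ip_network("199.91.136.0/21")


def egress_allowed(addr, allowed=(SN_RANGE,)):
    """Return True if addr falls inside any of the permitted networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in allowed)


print(SN_RANGE.num_addresses)     # size of the range
print(SN_RANGE[0], SN_RANGE[-1])  # first and last address in the range
print(egress_allowed("199.91.140.5"))
print(egress_allowed("199.91.144.1"))
```

A /21 spans 2048 addresses (199.91.136.0 through 199.91.143.255), so the firewall rule would be fairly broad if only a handful of endpoints are actually used.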

Next Steps

  • Consider keywords for Service Now only; not all events would enter SN, only those we say to per tags

Final considerations

  • per Lou: Critical Alerts from Opsview should be set to a priority level of "High" in SN and Warnings be set as "Average".
  • field mappings: "CLIENT=OPsView", "CONTACT=Operator", "NOTIFY=NONE", "CONTACT TYPE=Tier 2", "Incident Type=Service Event", and "Priority=3-High" on the SN screen.
  • left-nav item for unassigned opsview tickets (or bookmark/view) for ops staff
  • Downtimed state change => handled => open ticket?
  • Yellows/warnings + tickets as they affect workflow
    • OVS-3236
    • Assign warns to proper group ; assign crit to DC Ops
  • Acknowledgements
    • auto-ack @ Opsview and how this impacts Yale workflow
    • this means that Unhandled column in Opsview goes away, becomes meaningless, etc. Important if that's a catalyst for starting Ops work (procedure, call, etc).
  • Getting keyword data transmitted
  • Oddballs (non-ITS like YUHS)
    • Oddballs can stay behind if needed; they don't block ITS from moving. This is all driven with notification profiles where we can positively match keywords.
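The routing rules in these considerations can be written down as a small decision function. This is a sketch only: the real logic would live in Opsview notification profiles and Service Now rules, not in Python. The keyword name `servicenow`, the `warning_group` parameter and the `route` helper are all hypothetical; the priority labels and the DC Ops assignment come from the bullets above.

```python
# Priority labels per Lou's guidance above.
PRIORITY_BY_STATE = {"CRITICAL": "High", "WARNING": "Average"}


def route(state, keywords, forward_keywords, warning_group=None):
    """Decide whether an event enters Service Now and how it is filed.

    Returns None when the event has no matching keyword (e.g. non-ITS
    oddballs left behind) or when the state is not one we ticket.
    """
    if not set(keywords) & set(forward_keywords):
        return None
    if state not in PRIORITY_BY_STATE:
        return None
    # Criticals go to DC Ops; the "proper group" for warnings is undecided,
    # so it is passed in by the caller.
    group = "DC Ops" if state == "CRITICAL" else warning_group
    return {"priority": PRIORITY_BY_STATE[state], "assignment_group": group}


forward = {"servicenow"}  # hypothetical keyword marking SN-bound events
crit = route("CRITICAL", ["servicenow", "its"], forward)
warn = route("WARNING", ["servicenow"], forward, warning_group="TBD")
skipped = route("CRITICAL", ["yuhs"], forward)
print(crit, warn, skipped)
```

Returning `None` for unmatched events keeps the default closed: nothing reaches Service Now unless a keyword explicitly says so.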