User import issues in ServiceNow
We have had repeated incidents of the ServiceNow IST User import failing. The root cause seems to be the midserver Java daemon getting corrupted.
In June 2014, Backeberg added a nagios monitor, tracked on R1199, to proactively look for the situation where the IST User import has gone wonky.
The result of a wonky IST User import is that new users stop getting imported to ServiceNow, and existing users do not receive updates to their user data.
The nagios monitor is called HTTPS - 443 - Service-now monitor_ist_user_import. Green means everything is okay with the last few business days' worth of IST User imports.
Red means something went wrong. The most common thing that goes wrong is that the PROD midserver loses its marbles, and the correct procedure is to restart the midserver service.
System change
Note that in August 2014, Backeberg made a System Change, CHG0011643, to automatically restart the PROD midserver service every second day. Since that change, we have had only one import error. Before that change, import errors were happening about once every two weeks.
Things nagios opsview says when things go wrong
HTTP CRITICAL: Status line output matched 200 - string 'OK' not found on 'https://yale.service-now.com:443/monitor_ist_user_import.do' - 829 bytes in 0.961 second response time
If you go to ServiceNow, while logged in, and review the page, you'll see:
MonitorProgressWorker: Errors Logged
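Going by the alert text above, the check seems to come down to two conditions: the page returns HTTP 200, and the body contains the string OK. Here is a minimal sketch of that logic. It is not the real opsview check definition, and the function name classify_import_page is made up for illustration.

```shell
# Hypothetical helper: mirrors the two conditions in the alert text
# (HTTP status 200 and the string 'OK' in the page body).
# This is an illustration, not the actual opsview configuration.
classify_import_page() {
    status="$1"
    body="$2"
    if [ "$status" != "200" ]; then
        echo "CRITICAL: unexpected HTTP status $status"
        return 2
    fi
    case "$body" in
        *OK*)
            echo "OK: import monitor page reports OK"
            return 0
            ;;
        *)
            echo "CRITICAL: string 'OK' not found"
            return 2
            ;;
    esac
}
```

Calling classify_import_page 200 "MonitorProgressWorker: Errors Logged" reports CRITICAL, because the body has no OK in it. The return value 2 follows the usual nagios plugin convention for CRITICAL.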
The actual check in nagios is called
Restarting the midserver01 service
There is a machine called vm-snprdmid-01.its.yale.edu. On that machine, there is a start-stop script called /etc/init.d/midprd01
When the midserver daemon loses its marbles, Backeberg hasn't found any tell-tale signs when looking at the Linux host. The Java process is running, things look okay, but Oracle imports passing through fail 100% of the time. Forcing an import attempt as a ServiceNow admin is the only reliable way to tell there is an issue with the midserver, and a red in opsview for this service is a reliable proxy for failing Oracle imports in ServiceNow.
To restart the Java daemon, run:
sudo /etc/init.d/midprd01 restart
Then wait a few minutes for the Java stack to reach steady state. After that, the ServiceNow imports in PROD should succeed. You can manually force the IST import if you want, but that requires PROD ServiceNow access, and most people do not have that access. As of September 2014, Soren Sonson and Dan Franko are the only Yale employees with PROD access.
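The restart-then-wait step can be wrapped in a small function if you'd rather not watch the clock. This is only a sketch: the 300 second pause is an assumed stand-in for "a few minutes", and the RUNNER variable is a hypothetical knob so you can dry-run it with echo instead of sudo.

```shell
# Sketch only. INIT_SCRIPT is the start-stop script named on this page.
# SETTLE_SECONDS is an assumption standing in for "a few minutes".
# RUNNER is hypothetical: "sudo" for real use, "echo" for a dry run.
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/midprd01}"
SETTLE_SECONDS="${SETTLE_SECONDS:-300}"
RUNNER="${RUNNER:-sudo}"

restart_midserver() {
    # Restart the midserver service; give up if the restart itself fails.
    $RUNNER "$INIT_SCRIPT" restart || return 1
    echo "Waiting ${SETTLE_SECONDS}s for the Java stack to settle"
    sleep "$SETTLE_SECONDS"
}
```

Running RUNNER=echo SETTLE_SECONDS=0 restart_midserver prints the command that would be run without touching the service.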
If you are a member of Unix Systems, you can stop here, and this is good enough to resolve the issue. Future imports should succeed, and after a maximum of one business week, the red error on opsview will clear itself. Read on for the details of why that is. It's safe to just ACK the red in opsview if you've restarted the midserver service.
Forcing the IST import
The IST import will run on its own without intervention. The schedule as of June 2014 is set to fire at 1:11AM. When it completes, it kicks off a series of child imports that do not do as much good if the IST import doesn't run first.
To force the IST import, log in to the yale.service-now.com side-door as an admin-equivalent user.
You should first test that you've waited long enough since restarting the midserver.
Navigate to Data Sources then IST-User then in Related Links, click Test Load 20 Records. This will jump you to a progress indicator, and you should promptly get a green Complete Success message saying that 20 records were processed. If you get a long orange process followed by a timeout, you probably haven't waited long enough since restarting the midserver. If you think you have, it's possible something else is going on, like the database credentials needing to be changed, or the database service being down. Consult the Yale DBAs group to confirm that neither access nor the service is the issue with IST1. The settings are in the same place you went to click Test Load 20 Records.
If the Test Load 20 Records import succeeds, navigate to Scheduled Imports then IST-User then click the Execute Now button.
The actual IST import process takes more than an hour. It pulls 200,000+ records out of Oracle, copies them to a temp table, then processes them one-by-one. It takes a while. Probably more than two hours, but I don't remember. I always just click it and walk away.
Clearing the nagios error
This also requires admin rights in ServiceNow PROD.
The way we designed the report from ServiceNow, which then gets scraped by opsview, it looks back over the last business week of imports. Thus if an import fails on Monday and flips the ServiceNow import monitor to red, you will continue seeing a red error in opsview even after you have restored the service.
The actual report from ServiceNow is a poll on the Progress Workers table.
Thus if you really want to clear the alert, you can pick the entry that failed and toggle the State (or just delete the bad entry entirely). This will clear the error in opsview. Then you will know if your import fails again in the same business week.
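To see why one bad entry keeps the monitor red all week, here is a toy model of the look-back. It is a sketch under assumptions, not the real ServiceNow report: the function name week_status and the state names "complete" and "error" are placeholders, and the input is simply one state per line for the last business week.

```shell
# Toy model of the weekly look-back, not the real ServiceNow report.
# Reads one import state per line on stdin; the state names are placeholders.
# Any single "error" entry in the window turns the whole result red.
week_status() {
    if grep -q '^error$'; then
        echo "RED"
    else
        echo "GREEN"
    fi
}
```

Feeding it complete, error, complete gives RED. Removing or changing the error line gives GREEN, which is what toggling the State or deleting the bad entry does to the real report.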