Background:
Early in the ServiceNow implementation we loaded user data from some bad sources. The data is bad in that it contains netids that collide with newer netids.
We have received at least five ServiceNow tickets with this kind of complaint since ServiceNow go-live. The basic pattern is:
- A newly created netid conflicts with an unexplained netid already in ServiceNow
- The legitimate netid is the one that was just created
- The bad netid is the old one, and it has a create date older than Feb. 24, 2012
- The legitimate netid is active=true
- The bad netid is active=true
- Often both accounts have employer ids
An easy solution is to look for the duplicates and clean them out. Backeberg is concerned that netid collisions are still lurking, because the original imports contained bad accounts that have not collided yet.
We may need a better algorithmic approach to clean those out.
Here is one way to export a subset of the user table...
- Go to ServiceNow, ideally not the PROD instance, and log in as an admin user.
- In the application filter (top left corner), type sys_user.list to list the user table.
- This will find potentially 200k or so entries, and you cannot export that many accounts.
- Define a filter that shows a subset of users. Here's one: "NetID starts with b"
- Apply that filter, which narrowed the list to 7715 hits for me
- Right click on the results page, along the column headers.
- You will get a new menu, with an Export submenu
- Pick CSV out of that list
- Choose to download the export
How to parse the export csv with a subset of the user table...
zuse:Downloads db692$ head sys_user_from_test_starts_with_b.csv
"user_name","employee_number","name","title","source","u_facit","u_organization.u_department.u_division.u_id","u_organization.u_department.id","u_organization.u_id","active","mobile_phone","sys_created_on"
"b1912a","10793883","Brian Marshall","","","false","","","","false","","2012-02-23 11:43:23"
"ba2","10089420","Beatrice Abetti","","","false","D00609","G00806","872108","false","","2012-02-23 09:23:43"
"ba22","10097002","Barbara Amato-Kuslan","","","false","D03474","G01951","960003","true","","2012-02-23 09:24:22"
You want to look for the subset with colliding netids...
zuse:Downloads db692$ cat sys_user_from_test_starts_with_b.csv | awk -F, '{print $1}' | uniq -d | wc
34 34 271
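Note that `uniq -d` only reports duplicates sitting on adjacent lines, so the pipeline above depends on the export already being ordered by user_name. Here is a minimal sketch that does not rely on that ordering. The function name `dup_netids` is hypothetical, and it assumes the netid is the first quoted field, contains no commas, and should be compared case-insensitively.

```shell
#!/bin/sh
# dup_netids FILE: print each netid that appears more than once in column 1.
# Assumption: column 1 is the netid, double-quoted, with no embedded commas.
dup_netids() {
    # Skip the header row, strip the quotes from the first field,
    # lowercase everything, then sort so that uniq sees repeats side by side.
    awk -F, 'NR > 1 { gsub(/"/, "", $1); print $1 }' "$1" |
        tr '[:upper:]' '[:lower:]' |
        sort |
        uniq -d
}

# Example (file name taken from the notes above):
# dup_netids sys_user_from_test_starts_with_b.csv | wc -l
```

Because this version lowercases and sorts first, its count may differ from the 34 reported above.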
My concerns about this approach
If you go through the set of colliding netids, they seem to be sequential. This makes me concerned that there are lingering bad accounts in our sys_user table that have not yet collided, but will collide when new netids are created as new people develop relationships with Yale.
Here's some work I did...
cat sys_user_from_test_starts_with_b.csv | awk -F, '{print $1}' | uniq -d > duplicate_netids_starts_with_b.csv
cat duplicate_netids_starts_with_b.csv | while read line ; do grep $line sys_user_from_test_starts_with_b.csv ; done
I can redirect with >> that last output, and I end up with a file that seems to show what I suspected. Forwarding this to Bill and having a discussion.
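One caveat with the `grep $line` loop: grep matches the netid anywhere in the row, so a shorter netid will also match longer ones that contain it. The sketch below compares the first field exactly instead. The function name `rows_for_netids` is hypothetical, and it assumes the netid is the first comma-separated field in both files (quoted or not).

```shell
#!/bin/sh
# rows_for_netids LIST EXPORT: print the EXPORT rows whose first field
# is exactly one of the netids in LIST.
# Assumption: LIST is non-empty (the NR == FNR test needs a first file with lines).
rows_for_netids() {
    awk -F, '
        # First file: remember every netid in the list, without quotes.
        NR == FNR { id = $1; gsub(/"/, "", id); wanted[id] = 1; next }
        # Second file: print rows whose unquoted first field is in the list.
        { id = $1; gsub(/"/, "", id); if (id in wanted) print }
    ' "$1" "$2"
}

# Example (file names taken from the notes above):
# rows_for_netids duplicate_netids_starts_with_b.csv sys_user_from_test_starts_with_b.csv
```

This reads each file once, rather than rescanning the export for every duplicate.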
Moving on to the process of determining not-yet-collided junk netid accounts
Bill has gathered the legitimate set of netids from the authoritative source. I need to do set operations comparing that list of real netids against my dumps from ServiceNow.
So Backeberg installed MySQL on his Mac and is loading the data into it, so he can use SQL against these sets to isolate records and find patterns more readily.
Downloaded MySQL 64-bit dmg. Installed the two packages. Started up the service after reading the readme.
I had to clean up Bill's dump from the prod NetID source, because it was chock full of whitespace...
sed "s/\ //g" valid-IST1-netids.csv > valid-IST1-netids_stripped_whitespace.csv
cat valid-IST1-netids_stripped_whitespace.csv | sed "2,50d" > valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv
I had to clean up my dump from ServiceNow to get just the fields I wanted...
cat a_through_z.csv | awk -F'","' '{print $1","$10",\""$12}' > a_through_z_netids_active_createdate_with_garbage.csv
cat a_through_z_netids_active_createdate_with_garbage.csv | sed 's/^"//g' > a_through_z_netids_active_createdate.csv
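The same extraction can be done in one pass, without the intermediate "with_garbage" file. This is a sketch only: the function name `extract_fields` is hypothetical, and it assumes every row has exactly the 12 quoted columns shown in the export header, with no field containing the sequence `","`.

```shell
#!/bin/sh
# extract_fields EXPORT: print "netid,active,created" for each row.
# Assumption: 12 double-quoted columns; netid is column 1, active is
# column 10, sys_created_on is column 12.
extract_fields() {
    awk -F'","' '{
        first = $1;  sub(/^"/, "", first)   # drop the quote that opens the row
        last  = $12; sub(/"$/, "", last)    # drop the quote that closes the row
        print first "," $10 "," last
    }' "$1"
}

# Example (file names taken from the notes above):
# extract_fields a_through_z.csv > a_through_z_netids_active_createdate.csv
```

Unlike the two-step version, this leaves the date unquoted. As far as I know that should still load, since `OPTIONALLY ENCLOSED BY` does not require the quotes, but I have not run it against this data. The header row passes through as well, so the `delete ... WHERE netid="user_name"` cleanup is still needed.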
mysql -uroot (no password initially).
CREATE DATABASE service_now_duplicates;
use service_now_duplicates;
create table legit_netids ( netid VARCHAR(15) );
create table netids_from_service_now(netid VARCHAR(15),active ENUM('true','false'),creation DATETIME);
create table netids_in_sn_not_in_idm (netid VARCHAR(15));
create table uniq_between_both_sets (netid VARCHAR(15),source VARCHAR(25));
LOAD DATA INFILE '/Users/db692/service_now_duplicates/round_four/valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv' INTO TABLE service_now_duplicates.legit_netids;
LOAD DATA INFILE '/Users/db692/service_now_duplicates/round_four/a_through_z_netids_active_createdate.csv' INTO TABLE service_now_duplicates.netids_from_service_now FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"';
delete from netids_from_service_now WHERE netid="user_name";
delete from legit_netids WHERE netid="USERNAME";
Now we're loaded and ready to start doing some set logic.
This shows me the duplicate netids...
select netid from netids_from_service_now group by netid HAVING count(1)>1;
This shows me that there was a bogus load of users on 4-22-2012. What's more interesting is that these accounts ARE SET TO Active, which the original bad accounts were not. Also interesting is that every account I've checked so far does not have a cn set (called 'name' in sys_user).
select netid,active,creation from netids_from_service_now where DATE(creation) = '2012-04-22';
There were 3818 hits for that set, and 6251 hits on entries that are NOT in both sets. There ARE a few hits for legit netids that were NOT YET in Service Now as of when I dumped the dataset, but most of those 6251 hits are accounts that are in Service Now and should not be. I need to figure out a way to discern which is which. I may just import the 6251 hits into a NEW table, and try joining that against the other two tables to isolate the bad accounts we can safely blow away.
INSERT INTO netids_in_sn_not_in_idm select netids_from_service_now.netid FROM netids_from_service_now LEFT JOIN (legit_netids) ON ( netids_from_service_now.netid != legit_netids.netid);
If that JOIN on non-common fields ever completes, I should have the EXACT SET from the original mistakes that I can blow away in PROD service now. This query has been running for almost two days on my local Mac. I figured out my Mac was putting itself to sleep after inactivity (even when plugged in), so I've fixed that, and the query will complete at some point.
I have another idea...
cat a_through_z_just_netid.csv valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv | sort > combined_netids_collections.csv
uniq -u combined_netids_collections.csv > netids_only_appearing_once.csv
wc netids_only_appearing_once.csv
6251 6252 38792 netids_only_appearing_once.csv
So that ran pretty fast and just got me NetIDs that are NOT common between the two sets. Unfortunately it also doesn't tell me which set has the unique member, whereas if my MySQL query ever completes it should do that for me.
But if I can import just the uncommon set into MySQL, and then JOIN that against the authoritative NetID list, I should be able to find the netids I can safely delete from PROD. Giving that a try...
LOAD DATA INFILE "/Users/db692/service_now_duplicates/round_four/netids_only_appearing_once.csv" INTO TABLE uniq_between_both_sets;
UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid ;
While that UPDATE is running, I came up with another problem, which is that not all NetIDs are normalized to lower case, and that's breaking my original uniq command before my load. Fixing that in round five...
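The `uniq -u` pass reports netids that appear only once, but not which list each one came from. A sketch of how `comm` could answer that directly, with lower-casing folded in, follows. The function name `label_unique` and the `SN` label are hypothetical; `AUTH` mirrors the label used in the UPDATE above. It assumes each input file holds one bare netid per line.

```shell
#!/bin/sh
# label_unique SN_LIST IDM_LIST: print "netid,SOURCE" for netids that are
# in only one of the two lists.
#   SN   = only in the ServiceNow dump (candidate junk account)
#   AUTH = only in the authoritative list (not yet loaded into ServiceNow)
label_unique() {
    sn_sorted=$(mktemp)
    idm_sorted=$(mktemp)
    # Lowercase, sort and de-duplicate each list; comm requires sorted input.
    tr '[:upper:]' '[:lower:]' < "$1" | LC_ALL=C sort -u > "$sn_sorted"
    tr '[:upper:]' '[:lower:]' < "$2" | LC_ALL=C sort -u > "$idm_sorted"
    # comm -23 keeps lines only in the first file, comm -13 only in the second.
    LC_ALL=C comm -23 "$sn_sorted" "$idm_sorted" | sed 's/$/,SN/'
    LC_ALL=C comm -13 "$sn_sorted" "$idm_sorted" | sed 's/$/,AUTH/'
    rm -f "$sn_sorted" "$idm_sorted"
}

# Example (file names taken from the notes above):
# label_unique a_through_z_just_netid.csv valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv
```

This is the set difference the long-running JOIN was aiming for. A join on `!=` pairs each ServiceNow row with nearly every row of the other table, which is probably why it ran for so long. One difference from `uniq -u`: because each list is de-duplicated first, a netid that is duplicated inside ServiceNow but missing from the authoritative list is still reported here.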
Round five
cp ../round_four/valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv .
cp ../round_four/a_through_z_netids_active_createdate.csv .
cat a_through_z_netids_active_createdate.csv | tr '[:upper:]' '[:lower:]' > a_through_z_netids_active_createdate_LOWERED.csv
cut -d, -f1 a_through_z_netids_active_createdate_LOWERED.csv > a_through_z_netids_just_netid.lowered.csv
cat valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv a_through_z_netids_just_netid.lowered.csv > combined_collection_of_netids.csv
vi combined_collection_of_netids.csv
sort combined_collection_of_netids.csv > combined_collection_of_netids.sorted.csv
uniq -u combined_collection_of_netids.sorted.csv > just_unique_netids.csv
wc just_unique_netids.csv
6233 6234 38688 just_unique_netids.csv
That's a different result than we got in round four. Round four said 6251.
I'm going to cancel the SQL job and reimport. Because this is a valid method, and because the earlier uniqueness comparisons ran on mixed-case data, I'm also going to cancel the original JOIN job that has not completed.
DROP TABLE uniq_between_both_sets;
create table uniq_between_both_sets (netid VARCHAR(15),source VARCHAR(25));
LOAD DATA INFILE "/Users/db692/service_now_duplicates/round_five/just_unique_netids.csv" INTO TABLE uniq_between_both_sets;
UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid AND legit_netids.netid LIKE "aa%";
The last time I ran that, it came back fairly promptly with about twenty entries, which I spot-checked and found accurate. This time it's taking much longer and I don't know why. Perhaps the tables haven't been read back into RAM properly after I killed the other job. Either way, I expected this to be much faster. Killing it and reducing the problem size further.
I had the query wrong. Tried it like:
mysql> UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid AND uniq_between_both_sets.netid LIKE "aa7%";
Query OK, 5 rows affected (2.26 sec)
Rows matched: 5 Changed: 5 Warnings: 0
And got the results I was expecting. Trying it again with a broader query.
mysql> UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid AND uniq_between_both_sets.netid LIKE "aa%";
Query OK, 2 rows affected (5.52 sec)
Rows matched: 7 Changed: 2 Warnings: 0
Trying it again with a broader query.
Status
As of the end of June 1, 2012, Backeberg has isolated just under 1000 duplicate netid entries. It's fairly straightforward to wipe these out, but we're going to put that off until after I'm out of ServiceNow training next week. I've built a script to do the wipeout; Bill has asked that I make a no-op version that collects a log, and run that in pre-prod.
June 12, 2012: that job is running.
June 15, 2012: I checked on the job; still running. Bill helped me find a bug in my transform (I didn't coalesce on common rows; he said he makes the same mistake occasionally). I tried stopping the job, but it wouldn't stop. Based on the burn rate, it should finish on its own Monday or Tuesday, by which time I'll probably be out on leave.