Background:
Early in the ServiceNow implementation, we somehow used some bad sources of user data. Bad in the sense that those sources contain netids that are now colliding with newer netids.
We have received at least five ServiceNow tickets with this kind of complaint since go-live. The basic pattern is:
- A quite new netid conflicts with a mystery netid already in ServiceNow
- The legitimate netid is the one that was just created
- The bad netid is the old one, and it has a create date older than Feb. 24, 2012
- The legitimate netid is active=true
- The bad netid is active=false
- Often both accounts have employer ids
An easy solution is to look for the duplicates, and go sanitize those out. Backeberg is concerned that there will be lurking netid collisions because the original imports had bad accounts.
We may need a superior algorithmic approach to go clean those out.
Here is one way to export a subset of the user table...
- Go to ServiceNow, ideally not the PROD instance, and log in as an admin user.
- Navigate to application filter (top left corner), type in sys_user.list to dump the user table.
- This will find potentially 200k or so entries, and you cannot export that many accounts.
- Define a filter, which shows you a subset of users. Here's a way: "NetID starts with b"
- Apply that filter, which limited me down to 7715 hits
- Right-click along the column headers on the results page.
- You will get a new menu, with an Export submenu
- Pick CSV out of that list
- Choose to download the export
How to rebuild sys_user locally
I had a collection of .csv files, each with fewer than 50k entries, but it's nicer to work with the full set when running queries.
cat users_starts_numeric.csv users_starts_with_a-c.csv users_starts_with_d-i.csv users_starts_with_j-l.csv users_starts_with_m-r.csv users_starts_with_s-z.csv > all_users.csv
# catch the header lines, and delete all but the first one
grep -n user_name all_users.csv
sed "175824d;127985d;81587d;41395d;125d" all_users.csv > all_users_one_header.csv
cut -d, -f1 all_users_one_header.csv > all_users_just_usernames.csv
How to parse the export csv with a subset of the user table...
zuse:Downloads db692$ head sys_user_from_test_starts_with_b.csv
"user_name","employee_number","name","title","source","u_facit","u_organization.u_department.u_division.u_id","u_organization.u_department.id","u_organization.u_id","active","mobile_phone","sys_created_on"
"b1912a","10793883","Brian Marshall","","","false","","","","false","","2012-02-23 11:43:23"
"ba2","10089420","Beatrice Abetti","","","false","D00609","G00806","872108","false","","2012-02-23 09:23:43"
"ba22","10097002","Barbara Amato-Kuslan","","","false","D03474","G01951","960003","true","","2012-02-23 09:24:22"
You want to look for the subset with colliding netids...
zuse:Downloads db692$ cat sys_user_from_test_starts_with_b.csv | awk -F, '{print $1}' | uniq -d | wc
      34      34     271
My concerns about this approach
If you go through the set of colliding netids, they seem to be sequential. This makes me concerned that there are lingering bad accounts in our sys_user table that have not yet collided, but will collide when new netids are created as new people develop relationships with Yale.
Here's some work I did...
cat sys_user_from_test_starts_with_b.csv | awk -F, '{print $1}' | uniq -d > duplicate_netids_starts_with_b.csv
cat duplicate_netids_starts_with_b.csv | while read line ; do grep $line sys_user_from_test_starts_with_b.csv ; done
I can redirect that last output with >>, and I end up with a file that seems to show what I suspected. Forwarding this to Bill and having a discussion.
Moving on to the process of determining not-yet-collided junk netid accounts
Bill has gathered the legitimate set of netids from the authoritative source. I need to do some set operations comparing the list of real netids against what I have in my dumps from ServiceNow.
So Backeberg installed MySQL on his Mac and is loading the data into MySQL so he can use SQL against these sets to isolate records and find patterns more readily.
Downloaded MySQL 64-bit dmg. Installed the two packages. Started up the service after reading the readme.
I had to clean up Bill's dump from the prod NetID source, because it was chock full of whitespace...
sed "s/\ //g" valid-IST1-netids.csv > valid-IST1-netids_stripped_whitespace.csv cat valid-IST1-netids_stripped_whitespace.csv | sed "2,50d" > valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv
I had to clean up my dump from ServiceNow to get just the fields I wanted...
cat a_through_z.csv | awk -F'","' '{print $1","$10",\""$12}' > a_through_z_netids_active_createdate_with_garbage.csv
cat a_through_z_netids_active_createdate_with_garbage.csv | sed 's/^"//g' > a_through_z_netids_active_createdate.csv
mysql -uroot   (no password initially)

CREATE DATABASE service_now_duplicates;
USE service_now_duplicates;
CREATE TABLE legit_netids ( netid VARCHAR(15) );
CREATE TABLE netids_from_service_now ( netid VARCHAR(15), active ENUM('true','false'), creation DATETIME );
CREATE TABLE netids_in_sn_not_in_idm ( netid VARCHAR(15) );
CREATE TABLE uniq_between_both_sets ( netid VARCHAR(15), source VARCHAR(25) );
LOAD DATA INFILE '/Users/db692/service_now_duplicates/round_four/valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv' INTO TABLE service_now_duplicates.legit_netids;
LOAD DATA INFILE '/Users/db692/service_now_duplicates/round_four/a_through_z_netids_active_createdate.csv' INTO TABLE service_now_duplicates.netids_from_service_now FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"';
DELETE FROM netids_from_service_now WHERE netid="user_name";
DELETE FROM legit_netids WHERE netid="USERNAME";
Now we're loaded and ready to start doing some set logic.
This shows me the duplicate netids...
select netid from netids_from_service_now group by netid HAVING count(1)>1;
This shows me that there was a bogus load of users on 4-22-2012. What's more interesting is that these accounts ARE set to active, which the original bad accounts were not. Also interesting: every account I've checked so far does not have a cn set (called 'name' in sys_user).
select netid,active,creation from netids_from_service_now where DATE(creation) = '2012-04-22';
There were 3818 hits for that set, and 6251 hits on entries that are NOT in both sets. There ARE a few hits for legit netids that were NOT YET in Service Now as of when I dumped the dataset, but most of those 6251 hits are accounts that are in Service Now and should not be. I need to figure out a way to discern which is which. I may just import the 6251 hits into a NEW table, and try joining that against the other two tables to isolate the bad accounts we can safely blow away.
INSERT INTO netids_in_sn_not_in_idm select netids_from_service_now.netid FROM netids_from_service_now LEFT JOIN (legit_netids) ON ( netids_from_service_now.netid != legit_netids.netid);
If that JOIN on non-common fields ever completes, I should have the EXACT SET from the original mistakes that I can blow away in PROD ServiceNow. This query has been running for almost two days on my local Mac. I figured out my Mac was putting itself to sleep after inactivity (even when plugged in), so I've fixed that, and the query should complete at some point.
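In hindsight, that join condition looks like the problem: joining on != matches nearly every (ServiceNow netid, legit netid) pair, so it behaves like a cross join, which would explain the multi-day runtime, and as written I don't think it will actually isolate the non-matching netids. The conventional way to express "in ServiceNow but not in the authoritative list" is a LEFT JOIN on equality plus an IS NULL test; here is a sketch against the same tables (not the query I ran):

-- Sketch of the standard anti-join: keep each ServiceNow netid that has
-- no equal partner in legit_netids.
INSERT INTO netids_in_sn_not_in_idm
SELECT sn.netid
FROM netids_from_service_now AS sn
LEFT JOIN legit_netids AS ln ON sn.netid = ln.netid
WHERE ln.netid IS NULL;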
I have another idea...
cat a_through_z_just_netid.csv valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv | sort > combined_netids_collections.csv
uniq -u combined_netids_collections.csv > netids_only_appearing_once.csv
wc netids_only_appearing_once.csv
    6251    6252   38792 netids_only_appearing_once.csv
So that ran pretty fast and just got me NetIDs that are NOT common between the two sets. Unfortunately it also doesn't tell me which set has the unique member, whereas if my MySQL query ever completes it should do that for me.
But if I can import just the uncommon set into MySQL, and then JOIN that against the authoritative NetID list, I should be able to find the netids I can safely delete from PROD. Giving that a try...
LOAD DATA INFILE "/Users/db692/service_now_duplicates/round_four/netids_only_appearing_once.csv" INTO TABLE uniq_between_both_sets;
UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid;
While that UPDATE is running, I ran into another problem: not all NetIDs are normalized to lower case, and that's breaking my original uniq comparison before the load. Fixing that in round five...
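As an aside, the case normalization could also be pushed into MySQL instead of the shell: LOAD DATA can read a field into a user variable and lowercase it in a SET clause. A sketch (not what I actually did in round five), reusing the round-four dump file and table from above:

-- Sketch: lowercase NetIDs on the way in, so later comparisons are not
-- thrown off by mixed case. (Alternative to the original round-four load,
-- not in addition to it.)
LOAD DATA INFILE '/Users/db692/service_now_duplicates/round_four/a_through_z_netids_active_createdate.csv'
INTO TABLE netids_from_service_now
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(@netid, active, creation)
SET netid = LOWER(@netid);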
Round five
cp ../round_four/valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv .
cp ../round_four/a_through_z_netids_active_createdate.csv .
cat a_through_z_netids_active_createdate.csv | tr '[:upper:]' '[:lower:]' > a_through_z_netids_active_createdate_LOWERED.csv
cut -d, -f1 a_through_z_netids_active_createdate_LOWERED.csv > a_through_z_netids_just_netid.lowered.csv
cat valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv a_through_z_netids_just_netid.lowered.csv > combined_collection_of_netids.csv
vi combined_collection_of_netids.csv
sort combined_collection_of_netids.csv > combined_collection_of_netids.sorted.csv
uniq -u combined_collection_of_netids.sorted.csv > just_unique_netids.csv
wc just_unique_netids.csv
    6233    6234   38688 just_unique_netids.csv
That's a different result than we got in round four. Round four said 6251.
Going to cancel the SQL job and reimport. And really, because this is a valid method (and because we had mixed case in our uniqueness comparisons), I'm also going to cancel the original job that still has not completed.
DROP TABLE uniq_between_both_sets;
CREATE TABLE uniq_between_both_sets ( netid VARCHAR(15), source VARCHAR(25) );
LOAD DATA INFILE "/Users/db692/service_now_duplicates/round_five/just_unique_netids.csv" INTO TABLE uniq_between_both_sets;
UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid AND legit_netids.netid LIKE "aa%";
The last time I ran that, it came back fairly promptly with about twenty entries, which I spot checked and found accurate. This time it's taking much longer and I have not a clue why. Perhaps things haven't been read into RAM properly after I killed the other job, or something. But anyway, I would have expected this to be much faster. Killing it and reducing the problem size further.
I had the query wrong. Tried it like:
mysql> UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid AND uniq_between_both_sets.netid LIKE "aa7%";
Query OK, 5 rows affected (2.26 sec)
Rows matched: 5  Changed: 5  Warnings: 0
And got the results I was expecting. Trying it again with a broader query.
mysql> UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid AND uniq_between_both_sets.netid LIKE "aa%";
Query OK, 2 rows affected (5.52 sec)
Rows matched: 7  Changed: 2  Warnings: 0
Trying it again with a broader query.
mysql> UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid AND uniq_between_both_sets.netid LIKE "a%";
Query OK, 23 rows affected (1 min 57.18 sec)
Rows matched: 30  Changed: 23  Warnings: 0
Trying it again with a broader query. That was 1.2 seconds per row in uniq_between_both_sets, and there are 6233 rows in the table. I estimate the answer will come back in a little more than two hours of chunking through the tables.
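As an aside, the per-row cost is almost certainly because none of these tables has an index on netid (they were created without any), so every row in uniq_between_both_sets forces a full scan of legit_netids. Indexing the join columns should make this kind of UPDATE close to instantaneous; a sketch of what that would look like (I did not do this before running the full update):

-- Sketch: index the join columns so the UPDATE can do index lookups
-- instead of scanning legit_netids once per row.
ALTER TABLE legit_netids ADD INDEX idx_legit_netid (netid);
ALTER TABLE uniq_between_both_sets ADD INDEX idx_uniq_netid (netid);
ALTER TABLE netids_from_service_now ADD INDEX idx_sn_netid (netid);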
mysql> UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid;
This completed in less time than I expected, and got interesting results...
mysql> select DATE(netids_from_service_now.creation),count(1) FROM netids_from_service_now, uniq_between_both_sets where netids_from_service_now.netid = uniq_between_both_sets.netid GROUP BY DATE(netids_from_service_now.creation) WITH ROLLUP;
+----------------------------------------+----------+
| DATE(netids_from_service_now.creation) | count(1) |
+----------------------------------------+----------+
| 2004-05-01                             |        1 |
| 2012-02-19                             |        1 |
| 2012-02-22                             |       15 |
| 2012-02-23                             |     2266 |
| 2012-02-28                             |        1 |
| 2012-02-29                             |       12 |
| 2012-03-02                             |        1 |
| 2012-03-07                             |        2 |
| 2012-03-19                             |        1 |
| 2012-03-22                             |        3 |
| 2012-03-29                             |        4 |
| 2012-03-30                             |        1 |
| 2012-04-16                             |        1 |
| 2012-04-18                             |       22 |
| 2012-04-21                             |       16 |
| 2012-04-22                             |     3514 |
| 2012-04-24                             |        1 |
| 2012-04-25                             |        1 |
| 2012-04-26                             |       15 |
| 2012-04-27                             |        3 |
| 2012-05-01                             |        1 |
| 2012-05-02                             |        2 |
| 2012-05-03                             |        2 |
| 2012-05-04                             |        5 |
| 2012-05-07                             |        1 |
| 2012-05-09                             |        1 |
| 2012-05-14                             |        1 |
| 2012-05-16                             |        1 |
| 2012-05-17                             |        1 |
| 2012-05-18                             |        3 |
| 2012-05-22                             |       14 |
| 2012-05-23                             |        1 |
| 2012-05-24                             |        2 |
| 2012-05-25                             |        4 |
| 2012-05-28                             |        4 |
| 2012-05-31                             |       29 |
| NULL                                   |     5953 |
+----------------------------------------+----------+
37 rows in set (1 min 24.52 sec)
The two big days, in February and April, I already knew about from intermediate spot checking. What I'm surprised about is how broad a collection of hits I have on accounts that are in ServiceNow but are not valid in Central Auth. These are small enough collections that I should be able to pick a few of them by hand and see if I can figure out what is going on.
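A sketch of the kind of follow-up query for that hand inspection (using the two big load days identified above; output not recorded here):

-- Sketch: pull the straggler accounts outside the two big load days,
-- with their active flag and creation timestamp, for spot checking.
SELECT sn.netid, sn.active, sn.creation
FROM netids_from_service_now AS sn
JOIN uniq_between_both_sets AS u ON u.netid = sn.netid
WHERE DATE(sn.creation) NOT IN ('2012-02-23', '2012-04-22')
ORDER BY sn.creation;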
Status
As of the end of June 1, 2012, Backeberg has isolated just under 1000 duplicate netid entries. It's fairly straightforward to wipe these out. I've built a script to do the wipeout; Bill has asked that I make a no-op version that collects a log, and run that in pre-prod. That run started June 12, 2012. On June 15, 2012 I checked on the job; it was still running. Bill helped me find a bug in my transform (I didn't coalesce on common rows; he said he makes the same mistake occasionally). I tried stopping the job, but it wouldn't stop. Based on the burn rate, it should finish on its own Monday or Tuesday, by which time I'll probably be out on leave.
I've also approached the problem of finding the bad account entries in ServiceNow that have not YET collided with a legitimate account, but will in the future as new NetIDs are created. Said differently, these are the difference between the overall set of users in ServiceNow and the set of users in Central Auth. I've been using a collection of MySQL and command-line tricks to tease out these entries, and I think my round-five results are accurate and that the approach will produce an answer in a reasonable amount of time. Then we will probably put those results through the same scrutiny as the original duplicates data set.
The correct Service Now script to wipe bad user entries
/**
 * For variables go to: http://wiki.service-now.com/index.php?title=Import_Sets
 **/
var netid = source.u_netid; // NetID supplied by the import set row
var rec = new GlideRecord('sys_user');
/* rec.addQuery('user_name', '=', source.u_netid); */
rec.addQuery('user_name', netid);                           // match the colliding NetID
rec.addQuery('sys_created_on', '<', '2012-02-24 12:00:00'); // only records from the old, bad imports
rec.addInactiveQuery();                                     // only inactive accounts
rec.query();
if (rec.next()) {
    rec.deleteRecord();
}