Background:
Early in the ServiceNow implementation, we used some bad sources of user data. Bad in that the data contains NetIDs that collide with newer NetIDs.
...
- Go to ServiceNow (ideally not the PROD instance) and log in as an admin user.
- Navigate to the application filter (top left corner) and type sys_user.list to dump the user table.
- This will find roughly 200k entries, and you cannot export that many accounts at once.
- Define a filter that shows you a subset of users. Here's one way: "NetID starts with b"
- Apply that filter, which limited me down to 7715 hits.
- Right-click on the results page, along the column headers.
- You will get a new menu with an Export submenu.
- Pick CSV out of that list.
- Choose to download the export.
How to rebuild sys_user locally
I had a collection of .csv files, each with fewer than 50k entries, but it's nicer to work with the full set when running queries.
Code Block |
---|
# Concatenate the per-range exports into one file
cat users_starts_numeric.csv users_starts_with_a-c.csv users_starts_with_d-i.csv users_starts_with_j-l.csv users_starts_with_m-r.csv users_starts_with_s-z.csv > all_users.csv
# Find the header lines, then delete all but the first one
grep -n user_name all_users.csv
sed "175824d;127985d;81587d;41395d;125d" all_users.csv > all_users_one_header.csv
# Keep just the first column (the NetID)
cut -d, -f1 all_users_one_header.csv > all_users_just_usernames.csv
|
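The line numbers fed to sed above came from the grep output and will differ between exports. A position-independent alternative, sketched here on the assumption that every header line starts with user_name (as the grep suggests), keeps only the first header:

```shell
# Print every line, but skip any header line after the first one.
awk '/^"?user_name/ { if (seen++) next } { print }' all_users.csv > all_users_one_header.csv
```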
How to parse the export csv with a subset of the user table...
...
Code Block |
---|
sed "s/ //g" valid-IST1-netids.csv > valid-IST1-netids_stripped_whitespace.csv
cat valid-IST1-netids_stripped_whitespace.csv | sed "2,50d" > valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv
|
I had to clean up my dump from Service Now to get just the fields I wanted...
Code Block |
---|
cat a_through_z.csv | awk -F'","' '{print $1","$10",\""$12}' > a_through_z_netids_active_createdate_with_garbage.csv
cat a_through_z_netids_active_createdate_with_garbage.csv | sed 's/^"//g' > a_through_z_netids_active_createdate.csv
|
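Because the export quotes every field, splitting on '","' leaves a stray quote on the first field, which is why the sed pass is needed. For reference, the same cleanup can be done in one awk pass (a sketch, assuming the same field positions):

```shell
# Split on "," so interior fields carry no quotes; strip the
# leading quote from field 1 in the same pass.
awk -F'","' '{ f1 = $1; sub(/^"/, "", f1); print f1 "," $10 ",\"" $12 }' a_through_z.csv > a_through_z_netids_active_createdate.csv
```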
Code Block |
---|
-- connect with: mysql -uroot  (no password set initially)
CREATE DATABASE service_now_duplicates;
USE service_now_duplicates;
CREATE TABLE legit_netids ( netid VARCHAR(15) );
CREATE TABLE netids_from_service_now ( netid VARCHAR(15), active ENUM('true','false'), creation DATETIME );
CREATE TABLE netids_in_sn_not_in_idm ( netid VARCHAR(15) );
CREATE TABLE uniq_between_both_sets ( netid VARCHAR(15), source VARCHAR(25) );
LOAD DATA INFILE '/Users/db692/service_now_duplicates/round_four/valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv' INTO TABLE service_now_duplicates.legit_netids;
LOAD DATA INFILE '/Users/db692/service_now_duplicates/round_four/a_through_z_netids_active_createdate.csv' INTO TABLE service_now_duplicates.netids_from_service_now FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"';
-- drop the header rows that got loaded as data
DELETE FROM netids_from_service_now WHERE netid="user_name";
DELETE FROM legit_netids WHERE netid="USERNAME";
|
Now we're loaded and ready to start doing some set logic.
This shows me the duplicate netids...
Code Block |
---|
select netid from netids_from_service_now group by netid HAVING count(1)>1; |
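The same duplicate check can be run straight from the CSV before anything is loaded into MySQL (a sketch; assumes the NetID is the first comma-separated field, as in the cleaned export above):

```shell
# NetIDs that appear on more than one row of the export
cut -d, -f1 a_through_z_netids_active_createdate.csv | sort | uniq -d
```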
This shows me that there was a bogus load of users on 4-22-2012. What's more interesting is that these accounts ARE set to active, which the original bad accounts were not. Also interesting: every account I've checked so far does not have a cn set (called 'name' in sys_user).
Code Block |
---|
select netid,active,creation from netids_from_service_now where DATE(creation) = '2012-04-22'; |
There were 3818 hits for that set, and 6251 hits on entries that are NOT in both sets. There ARE a few hits for legit netids that were NOT YET in Service Now as of when I dumped the dataset, but most of those 6251 hits are accounts that are in Service Now and should not be. I need to figure out a way to discern which is which. I may just import the 6251 hits into a NEW table, and try joining that against the other two tables to isolate the bad accounts we can safely blow away.
Code Block |
---|
INSERT INTO netids_in_sn_not_in_idm select netids_from_service_now.netid FROM netids_from_service_now LEFT JOIN (legit_netids) ON ( netids_from_service_now.netid != legit_netids.netid);
|
If that JOIN on non-common fields ever completes, I should have the EXACT set from the original mistakes that I can blow away in PROD Service Now. This query has been running for almost two days on my local Mac. I figured out that my Mac was putting itself to sleep after inactivity (even when plugged in); I've fixed that, and the query will complete at some point.
I have another idea...
Code Block |
---|
cat a_through_z_just_netid.csv valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv | sort > combined_netids_collections.csv
uniq -u combined_netids_collections.csv > netids_only_appearing_once.csv
wc netids_only_appearing_once.csv
6251 6252 38792 netids_only_appearing_once.csv
|
So that ran pretty fast and got me just the NetIDs that are NOT common between the two sets. Unfortunately, it doesn't tell me which set the unique member came from, whereas my MySQL query, if it ever completes, should do that for me.
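For what it's worth, comm can make the same comparison and also report which file each unique line came from, which uniq -u can't. A sketch, assuming one NetID per line (the filenames here are hypothetical stand-ins for the two dumps):

```shell
# comm requires sorted input
sort sn_netids.csv > sn.sorted
sort idm_netids.csv > idm.sorted
comm -23 sn.sorted idm.sorted    # only in the Service Now dump
comm -13 sn.sorted idm.sorted    # only in the IdM (authoritative) list
```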
But if I can import just the uncommon set into MySQL, and then JOIN that against the authoritative NetID list, I should be able to find the netids I can safely delete from PROD. Giving that a try...
Code Block |
---|
LOAD DATA INFILE '/Users/db692/service_now_duplicates/round_four/netids_only_appearing_once.csv' INTO TABLE uniq_between_both_sets;
UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid;
|
While that UPDATE is running, I ran into another problem: not all NetIDs are normalized to lower case, and that's breaking my original uniq command before the load. Fixing that in round five...
Round five
Code Block |
---|
cp ../round_four/valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv .
cp ../round_four/a_through_z_netids_active_createdate.csv .
cat a_through_z_netids_active_createdate.csv | tr '[:upper:]' '[:lower:]' > a_through_z_netids_active_createdate_LOWERED.csv
cut -d, -f1 a_through_z_netids_active_createdate_LOWERED.csv > a_through_z_netids_just_netid.lowered.csv
cat valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv a_through_z_netids_just_netid.lowered.csv > combined_collection_of_netids.csv
vi combined_collection_of_netids.csv
sort combined_collection_of_netids.csv > combined_collection_of_netids.sorted.csv
uniq -u combined_collection_of_netids.sorted.csv > just_unique_netids.csv
wc just_unique_netids.csv
6233 6234 38688 just_unique_netids.csv |
That's a different result than we got in round four, which said 6251. Going to cancel the SQL job and reimport. Because this is a valid method (and because mixed case was skewing our uniqueness comparisons), I'm also going to cancel the original JOIN job that never completed.
Code Block |
---|
DROP TABLE uniq_between_both_sets;
create table uniq_between_both_sets (netid VARCHAR(15),source VARCHAR(25));
LOAD DATA INFILE "/Users/db692/service_now_duplicates/round_five/just_unique_netids.csv" INTO TABLE uniq_between_both_sets;
UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid AND legit_netids.netid LIKE "aa%";
|
The last time I ran that, it came back fairly promptly with about twenty entries, which I spot-checked and found accurate. This time it's taking much longer and I have no clue why. Perhaps the tables haven't been read into RAM properly after I killed the other job. Anyway, I would have expected this to be much faster. Killing it and reducing the problem size further.
I had the query wrong. Tried it like:
Code Block |
---|
mysql> UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid AND uniq_between_both_sets.netid LIKE "aa7%";
Query OK, 5 rows affected (2.26 sec)
Rows matched: 5 Changed: 5 Warnings: 0
|
And got the results I was expecting. Trying it again with a broader query.
Code Block |
---|
mysql> UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid AND uniq_between_both_sets.netid LIKE "aa%";
Query OK, 2 rows affected (5.52 sec)
Rows matched: 7 Changed: 2 Warnings: 0
|
Trying it again with a broader query.
Code Block |
---|
mysql> UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid AND uniq_between_both_sets.netid LIKE "a%";
Query OK, 23 rows affected (1 min 57.18 sec)
Rows matched: 30 Changed: 23 Warnings: 0
|
Trying it again with the filter removed entirely. The last run worked out to about 1.2 seconds per row in uniq_between_both_sets, and there are 6233 rows in the table, so I estimate the answer will come back after a little more than two hours of chunking through the tables.
Code Block |
---|
mysql> UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid;
|
This completed in less time than I expected, and got interesting results...
Code Block |
---|
mysql> select DATE(netids_from_service_now.creation),count(1) FROM netids_from_service_now, uniq_between_both_sets where netids_from_service_now.netid = uniq_between_both_sets.netid GROUP BY DATE(netids_from_service_now.creation) WITH ROLLUP;
+----------------------------------------+----------+
| DATE(netids_from_service_now.creation) | count(1) |
+----------------------------------------+----------+
| 2004-05-01 | 1 |
| 2012-02-19 | 1 |
| 2012-02-22 | 15 |
| 2012-02-23 | 2266 |
| 2012-02-28 | 1 |
| 2012-02-29 | 12 |
| 2012-03-02 | 1 |
| 2012-03-07 | 2 |
| 2012-03-19 | 1 |
| 2012-03-22 | 3 |
| 2012-03-29 | 4 |
| 2012-03-30 | 1 |
| 2012-04-16 | 1 |
| 2012-04-18 | 22 |
| 2012-04-21 | 16 |
| 2012-04-22 | 3514 |
| 2012-04-24 | 1 |
| 2012-04-25 | 1 |
| 2012-04-26 | 15 |
| 2012-04-27 | 3 |
| 2012-05-01 | 1 |
| 2012-05-02 | 2 |
| 2012-05-03 | 2 |
| 2012-05-04 | 5 |
| 2012-05-07 | 1 |
| 2012-05-09 | 1 |
| 2012-05-14 | 1 |
| 2012-05-16 | 1 |
| 2012-05-17 | 1 |
| 2012-05-18 | 3 |
| 2012-05-22 | 14 |
| 2012-05-23 | 1 |
| 2012-05-24 | 2 |
| 2012-05-25 | 4 |
| 2012-05-28 | 4 |
| 2012-05-31 | 29 |
| NULL | 5953 |
+----------------------------------------+----------+
37 rows in set (1 min 24.52 sec)
|
The two big days, in February and April, I already knew about from intermediate spot checking. What surprises me is how broad a collection of hits I have on accounts that are in Service Now but are not valid in Central Auth. These are small enough collections that I should be able to pick a few by hand and see if I can figure out what is going on.
Status
As of the end of June 1, 2012, Backeberg has isolated just under 1000 duplicate NetID entries. It's fairly straightforward to wipe these out, but we're going to put that off until after I'm out of ServiceNow training next week. I've built a script to do the wipeout; Bill has asked that I make a no-op version that collects a log, and run that in pre-prod. That's running June 12, 2012. June 15, 2012: I checked on the job; still running. Bill helped me find a bug in my transform (I didn't coalesce on common rows; he said he makes the same mistake occasionally). I tried stopping the job, but it wouldn't stop. Based on the burn rate, it should finish on its own Monday or Tuesday, by which time I'll probably be out on leave.
I've also approached the problem of finding the bad account entries in Service Now that have not YET collided with a legitimate account, but will in the future as new NetIDs are created. Said differently, these are the difference between the overall set of users in Service Now and the set of users in Central Auth. I've been using a collection of MySQL and command-line tricks to tease out these entries, and I think my fifth-round results are accurate, and that my data analysis approach will get a result in a reasonable amount of time. Then we will probably put those results through the same scrutiny as my original duplicates data set.
The correct Service Now script to wipe bad user entries
Code Block |
---|
/**
 * For variables go to: http://wiki.service-now.com/index.php?title=Import_Sets
 **/
var netid = source.netid;
var rec = new GlideRecord('sys_user');
/* rec.addQuery('user_name','=',source.u_netid); */
rec.addQuery('user_name', source.u_netid);
rec.addQuery('sys_created_on', '<', '2012-02-24 12:00:00');
rec.addInactiveQuery();
rec.query();
if (rec.next()) {
  rec.deleteRecord();
}
|