...
```
mysql -uroot    # no password set initially

CREATE DATABASE service_now_duplicates;
USE service_now_duplicates;

CREATE TABLE legit_netids (netid VARCHAR(15));
CREATE TABLE netids_from_service_now (netid VARCHAR(15), active ENUM('true','false'), creation DATETIME);
CREATE TABLE netids_in_sn_not_in_idm (netid VARCHAR(15));
CREATE TABLE uniq_between_both_sets (netid VARCHAR(15), source VARCHAR(25));

LOAD DATA INFILE '/Users/db692/service_now_duplicates/round_four/valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv'
  INTO TABLE service_now_duplicates.legit_netids;
LOAD DATA INFILE '/Users/db692/service_now_duplicates/round_four/a_through_z_netids_active_createdate.csv'
  INTO TABLE service_now_duplicates.netids_from_service_now
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"';

-- drop the CSV header rows that came in with the imports
DELETE FROM netids_from_service_now WHERE netid = 'user_name';
DELETE FROM legit_netids WHERE netid = 'USERNAME';
```
...
There were 3818 hits for that set, and 6251 hits on entries that are NOT in both sets. There ARE a few hits for legit netids that were NOT YET in Service Now as of when I dumped the dataset, but most of those 6251 hits are accounts that are in Service Now and should not be. I need to figure out a way to discern which is which. I may just import the 6251 hits into a NEW table, and try joining that against the other two tables to isolate the bad accounts we can safely blow away.
```
INSERT INTO netids_in_sn_not_in_idm
SELECT netids_from_service_now.netid
FROM netids_from_service_now
LEFT JOIN legit_netids
  ON (netids_from_service_now.netid != legit_netids.netid);
```
If that JOIN on non-matching fields ever completes, it should give me the set of original mistakes that I can blow away in PROD ServiceNow. The catch is that a join on != pairs nearly every row in one table with nearly every row in the other, so it behaves like a cross product, which explains why it is so slow (and it would insert each netid many times over). This query has been running for almost two days on my local Mac. I did figure out that my Mac was putting itself to sleep after inactivity (even when plugged in); I've fixed that, so the query should complete at some point.
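For comparison, the usual way to express "ServiceNow netids with no match in the authoritative list" is an exclusion join, which only has to find (or fail to find) one match per row instead of comparing every pair. A sketch of what I believe the equivalent query would look like against these same tables:

```
-- Sketch only (not the query that actually ran above): keep ServiceNow netids
-- that have no match in the authoritative list, one row per netid.
INSERT INTO netids_in_sn_not_in_idm
SELECT DISTINCT sn.netid
FROM netids_from_service_now AS sn
LEFT JOIN legit_netids AS l ON sn.netid = l.netid
WHERE l.netid IS NULL;
```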
I have another idea...
Code Block |
---|
cat a_through_z_just_netid.csv valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv | sort > combined_netids_collections.csv
uniq -u combined_netids_collections.csv > netids_only_appearing_once.csv
wc netids_only_appearing_once.csv
6251 6252 38792 netids_only_appearing_once.csv
|
So that ran pretty fast and gives me just the NetIDs that are NOT common between the two sets. Unfortunately it doesn't tell me which set each unique NetID came from, whereas the MySQL query, if it ever completes, should do that for me.
But if I can import just the uncommon set into MySQL, and then JOIN that against the authoritative NetID list, I should be able to find the netids I can safely delete from PROD. Giving that a try...
```
LOAD DATA INFILE "/Users/db692/service_now_duplicates/round_four/netids_only_appearing_once.csv"
  INTO TABLE uniq_between_both_sets;

UPDATE uniq_between_both_sets, legit_netids
  SET source = "AUTH"
  WHERE uniq_between_both_sets.netid = legit_netids.netid;
```
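Once that UPDATE finishes, anything in uniq_between_both_sets whose source is still NULL should be a NetID that exists only in ServiceNow, i.e. the deletable set (the source column simply stays NULL for non-matches, since the table has no default). A sketch of the follow-up query I have in mind:

```
-- Sketch: after tagging the authoritative matches, the untagged rows are the
-- netids that appear only in ServiceNow and are candidates for deletion.
SELECT netid
FROM uniq_between_both_sets
WHERE source IS NULL;
```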
While that UPDATE was running, I realized another problem: not all the NetIDs are normalized to lower case, which breaks the uniq comparison I did before the load. Fixing that in round five...
Round five
```
cp ../round_four/valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv .
cp ../round_four/a_through_z_netids_active_createdate.csv .

# lower-case the ServiceNow dump, then keep just the netid column
cat a_through_z_netids_active_createdate.csv | tr '[:upper:]' '[:lower:]' > a_through_z_netids_active_createdate_LOWERED.csv
cut -d, -f1 a_through_z_netids_active_createdate_LOWERED.csv > a_through_z_netids_just_netid.lowered.csv

cat valid-IST1-netids_stripped_whitespace_start_with_a_through_z.csv a_through_z_netids_just_netid.lowered.csv > combined_collection_of_netids.csv
vi combined_collection_of_netids.csv
sort combined_collection_of_netids.csv > combined_collection_of_netids.sorted.csv
uniq -u combined_collection_of_netids.sorted.csv > just_unique_netids.csv
wc just_unique_netids.csv
6233 6234 38688 just_unique_netids.csv
```
That's a different result than round four, which said 6251; presumably the case normalization accounts for the difference. I'm going to cancel the running SQL UPDATE and reimport. And since this uniq-based method is valid (and the earlier comparison was done against mixed-case data), I'm also going to cancel the original JOIN job that still hasn't completed.
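Side note: I think the lower-casing could also be pushed into MySQL at load time with a SET clause, rather than pre-processing files with tr. A sketch of the idea, using the round-five file as the example (untested; assumes a one-column file of netids):

```
-- Sketch: lower-case each netid as it is loaded, leaving source NULL.
LOAD DATA INFILE "/Users/db692/service_now_duplicates/round_five/just_unique_netids.csv"
  INTO TABLE uniq_between_both_sets
  (@raw_netid)
  SET netid = LOWER(@raw_netid);
```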
```
DROP TABLE uniq_between_both_sets;
CREATE TABLE uniq_between_both_sets (netid VARCHAR(15), source VARCHAR(25));

LOAD DATA INFILE "/Users/db692/service_now_duplicates/round_five/just_unique_netids.csv"
  INTO TABLE uniq_between_both_sets;

UPDATE uniq_between_both_sets, legit_netids
  SET source = "AUTH"
  WHERE uniq_between_both_sets.netid = legit_netids.netid
    AND legit_netids.netid LIKE "aa%";
```
The last time I ran that, it came back fairly promptly with about twenty entries, which I spot-checked and found accurate. This time it's taking much longer and I don't have a clue why; perhaps the tables haven't been read back into RAM properly after I killed the other job. Either way, I would have expected this to be much faster. Killing it and reducing the problem size further.
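In hindsight, a likely culprit is that neither table has an index on netid, so the UPDATE has to scan legit_netids over and over. If I wanted to speed these joins up, something along these lines should help (a sketch, assuming the tables as created above):

```
-- Sketch: index the join column so the UPDATE can do keyed lookups
-- instead of repeatedly scanning legit_netids.
ALTER TABLE legit_netids ADD INDEX idx_netid (netid);
ALTER TABLE uniq_between_both_sets ADD INDEX idx_netid (netid);
```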
It turns out I had the query wrong: the LIKE filter was on legit_netids instead of uniq_between_both_sets. I tried it again with a narrower prefix:
```
mysql> UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid AND uniq_between_both_sets.netid LIKE "aa7%";
Query OK, 5 rows affected (2.26 sec)
Rows matched: 5  Changed: 5  Warnings: 0
```
And got the results I was expecting. Trying it again with a broader query.
```
mysql> UPDATE uniq_between_both_sets, legit_netids SET source="AUTH" WHERE uniq_between_both_sets.netid = legit_netids.netid AND uniq_between_both_sets.netid LIKE "aa%";
Query OK, 2 rows affected (5.52 sec)
Rows matched: 7  Changed: 2  Warnings: 0
```
(The "Rows matched: 7, Changed: 2" makes sense: five of those seven rows had already been set to AUTH by the aa7% run.) Trying it again with a broader query.
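While these incremental runs are going, a quick sanity check on progress is to count how many rows have been tagged so far versus how many are still untagged. A sketch:

```
-- Sketch: how many netids have been tagged AUTH so far, and how many
-- are still NULL (i.e. candidates for deletion from ServiceNow).
SELECT source, COUNT(*) AS netids
FROM uniq_between_both_sets
GROUP BY source;
```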
Status
June 1, 2012: Backeberg has isolated just under 1000 duplicate netid entries. It's fairly straightforward to wipe these out, but we're going to put that off until after I'm out of ServiceNow training next week. I've built a script to do the wipeout; Bill has asked that I make a no-op version that collects a log, and run that in pre-prod.

June 12, 2012: The no-op run is going in pre-prod.

June 15, 2012: I checked on the job; still running. Bill helped me find a bug in my transform (I didn't coalesce on common rows; he said he makes the same mistake occasionally). I tried stopping the job, but it wouldn't stop. Based on the burn rate, it should finish on its own Monday or Tuesday, by which time I'll probably be out on leave.