As part of the migration from Oracle ACS1 to SQL Server, it was necessary to build a tool that could quickly copy the content of ACS1 tables to converted versions of the same table on SQL Server. While there are ways to copy Oracle to Oracle and SQL Server to SQL Server, copying data between the different types of servers is not supported by any standard tools. We also needed the ability to compare data across systems and perform custom modifications.

Initially, the tool was used to regularly copy the authoritative data from ACS1 to a development SQL Server. As applications migrate to the new database, it will be used in reverse to synchronize the ACS1 tables with changes made to the newly authoritative SQL Server.

Although Yale production applications are written in Java, this particular operation is better done with .NET and Powershell. Java is a strongly typed, compiled language. SQL types are less precise, and using an interpreted language with dynamic type conversion allows a program where any conversion is made directly between Oracle and SQL Server, without an intermediate compiled program forcing the data into a specific binary value distinct from either database.

This Powershell script can copy one table to another without discovering the type of any data. It uses bulk transfer operations to create generic .NET Table objects containing .NET Row objects. The Oracle .NET database driver decides what .NET type is used to hold the values for each column. Then the SQL Server database driver is told to write this generic set of .NET objects to the table in its database. At that point it might convert the type of the data, but it will be converting a type chosen by Oracle to a type of its own choice. Nothing in my script, or in Powershell, has touched the data or influenced any of these decisions.

The script is given basic parameters for each database: a connection string with some userid authentication. The userid has to have read access to the input database and write access to the output database.

The script creates some generic objects. A DataAdapter, for example, is created from the database connection and a text string containing a SELECT statement to read data from the input table. Normally this is just a “SELECT * FROM tablename”, which is good enough unless there is some special reason to do more. DataAdapter is really an abstract class. The actual code that does everything comes from the database driver supplied by the vendor of the input database. The Oracle driver goes to the Oracle database to determine the number and types of the columns and then decides how it will read the data.
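For illustration, the input DataAdapter can be built in Powershell roughly like this (a sketch that assumes the Oracle.ManagedDataAccess driver and an already opened connection $inconn, as in the example under Use below; the table name is a placeholder, and the script's actual code may differ):

Code Block
# Sketch: build an Oracle DataAdapter from an open connection and a SELECT string
$select = 'SELECT * FROM EXAMPLE.TABLE'   # placeholder table name
$dataAdapter = New-Object Oracle.ManagedDataAccess.Client.OracleDataAdapter($select, $inconn)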

My script creates an empty DataSet object, which is really a Collection that will hold Table objects returned by the DataAdapter. The DataSet is then passed to the DataAdapter.Fill() method which reads the table into memory, creating a .NET Table object containing .NET Row objects.

$this.DataAdapter.Fill($this.dataSet)

In most cases it then simply passes the first Table in the DataSet to a BulkCopy object provided by the .NET database driver of the output database. BulkCopy.WriteToServer() is a single operation that inserts rows into the output table more efficiently than a sequence of individual INSERT statements. It writes out the entire table that was just read in. While BulkCopy can append data to an existing table, if you are copying entire tables this way it is simpler to empty the output table (with TRUNCATE) and then do the BulkCopy to replace the old table contents. BulkCopy disables triggers during the write, and delays enforcing some constraints.

If you need to avoid disrupting other programs, you can do this whole sequence under a Transaction.
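A rough sketch of the whole read, truncate, and bulk-write sequence under a transaction, assuming the $outconn connection from the example under Use below and the $dataAdapter built above (the table name is a placeholder; the script's own code may differ):

Code Block
# Sketch: read the input table into memory
$dataSet = New-Object System.Data.DataSet
[void]$dataAdapter.Fill($dataSet)

# Do the TRUNCATE and the bulk write inside one transaction on the output database
$tx = $outconn.BeginTransaction()

$truncate = $outconn.CreateCommand()
$truncate.Transaction = $tx
$truncate.CommandText = 'TRUNCATE TABLE EXAMPLE.TABLE'
[void]$truncate.ExecuteNonQuery()

$bulkCopy = New-Object System.Data.SqlClient.SqlBulkCopy($outconn, [System.Data.SqlClient.SqlBulkCopyOptions]::Default, $tx)
$bulkCopy.DestinationTableName = 'EXAMPLE.TABLE'
$bulkCopy.WriteToServer($dataSet.Tables[0])

$tx.Commit()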

Replacing the entire output table contents with a new copy works in all cases and requires no code. The script has, however, the ability to selectively update only changed rows. This is not typically faster than replacing the entire table, but the script can then report how many rows were changed, added, or deleted. The one restriction is that the tables must have a primary key that can be used in the compare.

If the two database drivers choose the same binary .NET types to use for all columns, the script can use a bulk binary comparison between the two Row objects. If not, then the cells have to be compared one at a time.

This is one case where you can see the benefit of an interpreted, weakly typed language like Powershell. If Oracle and SQL Server choose different .NET data types, but the two types are convertible and comparable, then Powershell will dynamically figure out how to convert them to a common type to see if they are equal. It will, for example, compare any numeric type to any other numeric type.
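For example, the same value might come back from Oracle as a System.Decimal and from SQL Server as a System.Int64; a Powershell comparison still works without any explicit conversion:

Code Block
# Sketch: Powershell compares across .NET numeric types without explicit conversion
$oracleValue = [decimal]42   # e.g. an Oracle NUMBER read as System.Decimal
$sqlValue    = [int64]42     # e.g. a SQL Server BIGINT read as System.Int64
$oracleValue -eq $sqlValue   # True; Powershell converts to a common type before comparing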

This is where providing a more specific SELECT to build the DataAdapter may be useful. When the two databases use different character sets, it may be appropriate to tell one or both databases what common character set to convert varchar data to.

Script in Git: https://git.yale.edu/gilbert/DBScripts/blob/main/CopyTableRows.ps1

Microsoft has a bcp.exe program to copy a table from one SQL Server database to another, and if you link one database server to another you can copy both the definition and the contents of a table from one server to another using “SELECT INTO” or copy data from one table to another using “INSERT SELECT”.

However, things become more complicated if you are moving tables or data from one type of database to another. The requirement to migrate the Oracle ACS1 database to SQL Server produced a Powershell script that can be more generally used to synchronize information between databases and support Sandbox development and reporting.

Since our applications are written in Java, why not use it? Powershell is based on the .NET framework, and while both Java and .NET have a similar plug-in design for database drivers, .NET has full-table operations that simplify copying and migration.

The SQL for CREATE TABLE is much the same in all databases. The exact types of columns may need adjustment because specific type names mean different things in different systems; that is a manual process you have to work out on your own. We assume you have generated an equivalent definition of each table in the target output system: the columns have the same names, and the types are similar enough that data can be automatically converted from one to the other. NUMERIC, DECIMAL, TINYINT, SMALLINT, INT, and BIGINT are different sizes of numbers; CHAR, VARCHAR, VARCHAR2, and NVARCHAR are character strings with different lengths and possibly different character sets. Dates and timestamps will similarly convert, even though different systems may use different starting dates and precisions.

There is no automatic way to handle data types like Blobs, Clobs, XML, etc. If you have columns of these types, you may need to write special conversion rules anyway.

While we want to copy the exact data between production systems, when we are moving data to a TEST or developer Sandbox, we may want to audit or transform the contents. For testing, we may want to assign a single common dummy SSN or Date of Birth to everyone, since the actual values are not needed for testing and we then reduce the sensitivity of the copied tables.

During initial development the script can be used to periodically refresh the important tables from the source database to another database. However, as we start to migrate applications, different components will move at different times. The tool can synchronize some Oracle tables to SQL Server (to test new code) and some SQL Server tables back to Oracle (to support unconverted legacy applications and reports after the production application has moved to SQL Server).

The script can also be used for ad hoc reporting. For example, an analyst can capture point-in-time snapshots of data, compare the before and after, and generate reports on the changes over time.

Read and Write entire Tables

.NET defines some abstract classes of objects that perform generic functions. Oracle and SQL Server provide drivers that implement these template classes.

Two .NET templates that Java JDBC does not have are the DataAdapter and BulkCopy.

You provide a text string containing a SELECT statement to a DataAdapter. It discovers the names and types of the columns from metadata supplied by the database, then reads the entire result into objects in memory. The .NET program can then work on the entire result as a Collection of objects.

The Row objects act like “Entities”. If you change data in a Row, or delete the Row, or add a new Row, the objects in memory track the changes you made. When you are done, you can tell the DataAdapter to alter the database table to match the changes you just made to the memory object, and the DataAdapter generates whatever INSERT, UPDATE, and DELETE operations are needed. (Of course, for this to work you either have to lock the result or be the only program making changes to it during the period).
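A minimal sketch of this write-back path, assuming the System.Data.SqlClient driver and the $outconn connection from the example under Use below (the table and column names are placeholders, and the script's actual compare-and-update code is more involved):

Code Block
# Sketch: let a DataAdapter push in-memory row changes back to the output table.
# The CommandBuilder generates the INSERT/UPDATE/DELETE statements; this requires a primary key.
$outAdapter = New-Object System.Data.SqlClient.SqlDataAdapter('SELECT * FROM EXAMPLE.TABLE', $outconn)
$builder    = New-Object System.Data.SqlClient.SqlCommandBuilder($outAdapter)

$outSet = New-Object System.Data.DataSet
[void]$outAdapter.Fill($outSet)

$outSet.Tables[0].Rows[0]['SOME_COLUMN'] = 'new value'   # change a Row in memory
[void]$outAdapter.Update($outSet.Tables[0])              # writes only the changed rows back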

Alternatively, BulkCopy inserts new rows into a table, but it is optimized to transfer all the rows in a single operation rather than doing a sequence of INSERT statements, each of which has to be sent and then have its return checked. BulkCopy can fill an empty table or append new rows after existing data. Triggers and Constraints are also disabled or optimized as you choose.

The objects in memory are the same no matter what database system you are using: .NET defines templates for a DataSet, which is a generic collection of DataTables, which are generic collections of DataRows, each of which is a collection of Items.
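For reference, the hierarchy can be walked like this ($dataSet is the object filled in the earlier sketch; the column name is a placeholder):

Code Block
# Sketch: navigate the generic object hierarchy, independent of the source database
$table = $dataSet.Tables[0]    # the first DataTable in the DataSet
$row   = $table.Rows[0]        # the first DataRow in that table
$value = $row['SOME_COLUMN']   # one Item, addressed by column name
$table.Rows.Count              # number of rows read into memory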

Normally, however, the script does not have to deal with anything below the level of the entire table.

An example of the normal use of the script, copying all the data in an Oracle table to a SQL Server table (which may have an older copy of the data in it), is to:

  1. Connect to the two databases with userids and passwords. The Oracle userid has to be able to read the table and the SQL Server userid has to be able to write to the table.

  2. Have Oracle create a DataAdapter for the input table based on a “SELECT *”. This discovers the definition of the input table in the Oracle database including its column names and types.

  3. Have SQL Server create a BulkCopy for the output table.

  4. Create an empty DataSet object in memory.

  5. Call the DataAdapter.fill() method passing the empty DataSet. This reads the table into memory generating one DataTable with DataRows containing the contents of the table. Oracle decides what native binary data types to use to hold each of the values in each row.

  6. Start a Transaction on the output database.

  7. TRUNCATE the output table to delete all the old data.

  8. Call BulkCopy.WriteToServer() passing the DataTable that Oracle generated. At this point the classes that implement the SQL Server BulkCopy object match the DataTable and DataRows created by Oracle to the column types in SQL Server and do any necessary type conversion, then send all the data out to the database to repopulate the table.

  9. Commit the Transaction to swap the old and new version of the output table.

  10. You can now close the connections or copy another table.

There are Copy-DbTable function variants and options that combine steps 2 through 9. Any table can be emptied and repopulated this way.

If the tables have a unique key, then there is an option to read both the input and output tables into memory, compare the rows to see which ones are different, and then use an output DataAdapter instead of a BulkCopy to only generate INSERT, UPDATE, or DELETE for the changed rows. This code had to be written to see if it was better, but it turned out that TRUNCATE and rewrite is just as fast and a lot simpler.

If the default version of the data provided by the input database doesn’t work, the best option is to write a custom SELECT statement that transforms columns as they are read in. This avoids writing loops in Powershell. However, there are examples of using Powershell to examine values of selected columns of every row.
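As a sketch of that kind of per-row loop, using the $dataSet from the earlier sketches (the column name and replacement value are placeholders chosen to match the Sandbox example mentioned earlier; the script's own examples may differ):

Code Block
# Sketch: examine or rewrite one column of every row before the table is written out
foreach ($row in $dataSet.Tables[0].Rows) {
    if ($row['SSN'] -ne [System.DBNull]::Value) {
        $row['SSN'] = '000-00-0000'   # e.g. replace real SSNs before loading a Sandbox copy
    }
}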

The normal case is just to let the database drivers do the work, and since they are the same drivers and this is the same .NET framework that every application would have to use, there is no performance loss using interpreted Powershell instead of a compiled language.

Use

Download the script from Git.

The input and output databases must have the same table name, with the same column names and compatible types.

Create a connection to the input and output databases. Then call the functions in the script to copy tables.

The following code truncates (empties) EXAMPLE.TABLE in the output database, then replaces its contents with the rows of the same table in the input database.

Code Block
. .\CopyTableRows.ps1

# Connect to an Oracle input database. For protection we separate the Userid (who should have read-only access) from the password
$inConnectionString = 'User Id={0};Password={1};Data Source=server.its.yale.edu/xxx' -f 'inuser', 'inpassword'
$inconn = New-Object Oracle.ManagedDataAccess.Client.OracleConnection($inConnectionString)
$inconn.Open()   # Get a backing connection from the Oracle pool manager to reactivate the OracleConnection object

# Connect to a SqlSrv output database using the current Windows login as the credential
$outconn = New-Object System.Data.SqlClient.SqlConnection('Server=localhost;Integrated Security=true;Database=xxx;')
$outconn.Open()

Copy-DbTable -InputTableName 'EXAMPLE.TABLE' -InConn $inconn -OutConn $outconn -truncate
# Repeat as needed

$inconn.Close()
$outconn.Close()

Considerations

Performance

Suppose you are copying a table every night and typically it only has a few dozen rows changed since the last copy. Is it better to compare the two tables and only make the necessary changes?

We had to write the script and try both a full copy and an incremental update to find out. It turns out that there is very little difference between them: reading the whole input table and rewriting the whole output table takes about as long as reading both tables from the two sources in order to compare them.

Therefore, the choice may depend on special factors about the table. Are there constraints, indexes, triggers, or something else that makes it better to make the smallest number of changes possible? If not, then the full copy works as well as anything else.

Primary Key

In order to compare the input and output rows, the input and output tables must have Primary Keys. If you use the -truncate option and empty the output table and replace all the rows, then there is no key requirement.

Copy-History

Important “base” tables may have an associated History table. When any row is changed or deleted, the previous data in the row is copied to a new row at the end of the corresponding History table. This can happen automatically if there is a Trigger on the base table that is called with each UPDATE or DELETE.

The best practice is for rows in the History table to have an “identity” column, where a new sequential integer is generated each time a row is added. However, we have found that existing Oracle applications may not have any unique row identifier. It appears that in both Oracle and SQL Server new rows are appended to “the end of the table” and, when reading rows, they are by default returned chronologically. Relying on this ordering is regarded as bad practice in SQL textbooks.

If you use -truncate then you can treat history tables like any other table.

The Copy-History function provides experimental support for identifying the new records added onto the end of an Oracle history table and appending copies of just those records onto the end of a SQL Server history table. The copy works in only one direction, because only Oracle has a SQL operation to skip over the old rows.

To copy a History table from SQL Server to Oracle, you must specify -truncate and copy the whole table from the beginning.

Earliest DATE

While both Oracle and SQL Server support DATE values going back millennia, the .NET SqlDateTime class has a minimum date of 1/1/1753. This tripped us up because a few Yale tables set unknown dates to the day that Yale was founded, Oct 5, 1701. That value could not be converted by Microsoft utilities and cannot be copied by this script. To get around this, a custom SELECT statement is created in the calling program and is passed with the optional parameter -SelectSQL. This SELECT statement contains expressions of the form:

GREATEST(CHANGE_EMAIL_ADDRESS_DATE,DATE '1776-07-04') AS CHANGE_EMAIL_ADDRESS_DATE

This makes the earliest date allowed July 4, 1776, but you could choose any other date after 1753 if you prefer.
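A sketch of how such a SELECT might be passed, assuming -SelectSQL is accepted alongside the parameters shown under Use (the table name and OTHER_COLUMN are placeholders; the date column is the documented example):

Code Block
# Sketch: supply a custom SELECT so dates earlier than 1753 are floored before the copy
$select = "SELECT GREATEST(CHANGE_EMAIL_ADDRESS_DATE, DATE '1776-07-04') AS CHANGE_EMAIL_ADDRESS_DATE, OTHER_COLUMN FROM EXAMPLE.TABLE"
Copy-DbTable -InputTableName 'EXAMPLE.TABLE' -InConn $inconn -OutConn $outconn -truncate -SelectSQL $select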