Script in Git: https://git.yale.edu/gilbert/DBScripts/blob/main/CopyTableRows.ps1
As part of the migration from Oracle ACS1 to SQL Server, it was necessary to build a tool that could quickly copy the content of ACS1 tables to converted versions of the same tables on SQL Server. The Oracle and SQL Server tables have the same table names and column names, and the same, or at least compatible, column types. There are ways to copy Oracle to Oracle and SQL Server to SQL Server, but you have to write a program to copy tables across different database systems. While Talend does data copying in production, a developer needs something they can run on demand. The same program can also be run in a way that compares the two tables and reports differences.
While ACS1 is in production, the tool is used to copy the real data to a development SQL Server. As we migrate some functions to SQL Server, specific groups of tables will be copied back from the now authoritative SQL Server to Oracle to support anyone with an old Oracle-based reporting tool. Eventually the Oracle database will go away.
Because Yale applications are written in Java, it is natural to consider writing the function in that language. Although JDBC is portable and supports both types of databases, Java is still a strongly typed language which may add a third point of conversion (Oracle to Java to SQL Server rather than just Oracle to SQL Server).
Powershell is an interpreted language built on top of the .NET framework. However, in normal processing Powershell only sets up the operation and does not itself participate in the data copy. It provides each database with a connection string and authenticates with a userid having appropriate privileges to read and write data.

Microsoft has a bcp.exe program to copy a table from one SQL Server database to another, and there are SQL statements (SELECT INTO a new table, or INSERT from a SELECT) that will bulk copy data and even replicate table column definitions. However, the documentation makes it clear that if you want to move data from another type of database (like Oracle) you need some custom programming.
Java has its JDBC standard into which you plug drivers for various systems. .NET has essentially the same idea. However, while Java works on a row by row basis, .NET has some full table read and write operations that speed things up and transfer responsibility from the programmer to the drivers themselves. The simplest language in which to configure a table by table copy is Powershell.
While some Powershell programming was involved in writing the script, and may be needed to customize it if you require more than the basic operation, in most cases you just have to load the script and then write one copy command per table.
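In practice, loading the script and copying a few tables might look like the following sketch. The Copy-Table function name and its parameters are illustrative only; check CopyTableRows.ps1 itself for the actual names.

```powershell
# Dot-source the script so its functions are available in this session
. .\CopyTableRows.ps1

# Hypothetical usage: one copy command per table
Copy-Table -Table 'STUDENTS'
Copy-Table -Table 'COURSES'
Copy-Table -Table 'ENROLLMENTS'
```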
The input and output tables are assumed to be roughly identical. They would have the same column names and types that allow the data to be automatically converted. Specifically, the database systems and their client drivers will automatically convert any type of number to any other type of number big enough to hold the largest data value. Columns with types like NUMERIC, DECIMAL, TINYINT, SMALLINT, INT, or BIGINT are compatible with each other, and different types of character strings (CHAR, VARCHAR, VARCHAR2, NVARCHAR) can be converted automatically although you may want to explicitly TRIM or declare character sets in the process. Dates and timestamps will similarly convert even though different systems may use different starting dates and precisions.
In most cases the script can simply do a SELECT * FROM the input table and write the data to the output table. If you have special requirements, you can provide a custom SELECT statement. For example, in the special case where you are copying data from a production database to a test system (especially a developer’s Sandbox) you may want to avoid copying the real values of any sensitive information. You can then specify a SELECT that gives everyone in the output table the same Social Security Number or Date of Birth if the real values are not needed for testing.
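As an example, a masking SELECT for a Sandbox copy might look like the following. The table and column names here are made up for illustration; the point is that every output row gets the same harmless value for the sensitive columns.

```sql
SELECT STUDENT_ID,
       NAME,
       '000-00-0000'     AS SSN,            -- same dummy value for every row
       DATE '1900-01-01' AS DATE_OF_BIRTH   -- real birth dates never leave production
FROM   STUDENTS
```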
The tool was built to migrate an Oracle database to SQL Server. Initially the Oracle data is copied to a development Sandbox. Later it is copied to pre-production. As the system rolls out, different tables may migrate at different times. Some authoritative Oracle tables will be copied to shadow SQL Server tables to support testing, while other already migrated SQL Server tables will be copied back to Oracle to support legacy applications or reports that have not yet been converted.
The script can also be used to capture a whole database at points in time, then compare two snapshots to report what changed over time.
The base version of the script has no special support for data types like Blobs, Clobs, XML, etc. If you have columns of these types, you may need to write special conversion rules anyway.
Read and Write entire Tables
.NET defines some abstract classes of objects that perform generic functions. Oracle and SQL Server provide drivers that implement these template classes. When the script configures Oracle as the input connection and asks for a DataAdapter object, Oracle-compiled code provides it. Similarly, the SQL Server output connection provides a SQL Server BulkCopy object.
The only parameter to the DataAdapter is a SELECT statement to read data from a table. In most cases it is just “SELECT * FROM tablename”. The DataAdapter communicates with the database, reads metadata about the named table, and discovers the column names and types for you.
The script has to create one generic .NET object, an empty DataSet. This is an empty collection that can hold a .NET DataTable object.
The script can then pass the empty .NET DataSet to the Oracle DataAdapter.Fill(DataSet) method. Oracle then reads the entire table into memory, creating one DataTable object with one DataRow object for each row in the table.
The table was read by Oracle and stored in objects of whatever type Oracle chose based on the definition of the table and its columns in its database. The Powershell script does not need to know how many columns or rows there are, nor the names or type of anything. The most it has to do is select the first DataTable in the DataSet collection and then pass it to the BulkCopy.WriteToServer(DataTable) method.
With this call, the SQL Server BulkCopy object looks at the DataTable and DataRows that Oracle created. It also can read the metadata from the SQL Server database to discover how the output table is defined. It develops a strategy for using the objects in memory to populate the output table.
As long as the Oracle and SQL Server tables have the same number of columns, with the same names and compatible types, the data can be copied as is or with whatever transformations the database driver regards as necessary. For example, an Oracle NUMERIC or DECIMAL column can be converted to a SQL Server INT or BIGINT column, but that is up to the databases and does not involve the script programming.
All the real work is done by compiled code provided by the database vendors. Essentially the script simply introduces Oracle to SQL Server, tells them to copy the data, and then they make all the decisions until the operation has completed.
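The whole handoff can be sketched in a few lines of Powershell. This is a minimal sketch only: it assumes already-open connections in $oraConn (Oracle's managed .NET driver) and $sqlConn (System.Data.SqlClient), and a hypothetical table name MYTABLE.

```powershell
# Oracle's driver provides the DataAdapter and decides how to read the table
$adapter = New-Object Oracle.ManagedDataAccess.Client.OracleDataAdapter(
               "SELECT * FROM MYTABLE", $oraConn)
$dataset = New-Object System.Data.DataSet
[void]$adapter.Fill($dataset)       # Oracle reads the entire table into memory

# SQL Server's driver provides the BulkCopy and decides how to write it back
$bulkcopy = New-Object System.Data.SqlClient.SqlBulkCopy($sqlConn)
$bulkcopy.DestinationTableName = "MYTABLE"
$bulkcopy.WriteToServer($dataset.Tables[0])   # bulk-load the in-memory rows
```

Everything between the Fill and the WriteToServer is handled by the vendors' compiled code; the script never touches an individual row.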
BulkCopy.WriteToServer by default temporarily disables triggers and optimizes the evaluation of constraints. As the name suggests, it moves the data in bulk rather than doing one INSERT per row.
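If you do want triggers to fire and constraints to be checked during the bulk load, the .NET API lets you opt back in when constructing the object. A sketch, assuming an open SqlConnection in $sqlConn:

```powershell
# By default SqlBulkCopy neither fires triggers nor checks constraints;
# combine option flags to turn those behaviors back on
$options  = [System.Data.SqlClient.SqlBulkCopyOptions]::FireTriggers -bor
            [System.Data.SqlClient.SqlBulkCopyOptions]::CheckConstraints
$bulkcopy = New-Object System.Data.SqlClient.SqlBulkCopy($sqlConn, $options, $null)
```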
BulkCopy can append data to a table that is already filled, but when copying an entire table, the script will typically TRUNCATE the output table before refilling it with BulkCopy. If other programs may be accessing the database at the same time, the entire operation can be performed in a Transaction.
In the normal case, the script is trivial. Additional code has been added, but it is optional and used only to meet special needs. For example, the script can be run in a compare mode that reports the differences between the two tables instead of copying the data.
The two .NET template classes involved are the DataAdapter and BulkCopy.
A DataAdapter will execute a SELECT and read the entire result into memory. A program can then change specific values in the collection of Row objects in memory. The objects track what has been changed. A single operation tells the DataAdapter to generate whatever INSERT, UPDATE, and DELETE operations are needed to transfer the changes previously made in memory to the database table.
If you give the DataAdapter a “SELECT *” statement, it will read metadata from the database and discover the names and types of each column. It then creates in memory native types for the client system that can hold any data that might be stored in that type of database column.
BulkCopy inserts new rows into a table, but it is optimized to transfer the data in a single operation rather than doing a sequence of INSERT statements each of which has to be sent and then the return has to be checked. It can add new data, or if the table has been emptied it can load the table with entirely new data. Triggers and Constraints are also disabled or optimized.
.NET also defines templates for a DataSet, which is a generic collection of DataTables, which are generic collections of DataRows. However, while the DataAdapter and BulkCopy are implemented by the database drivers from the vendors (Oracle or SQL Server for example), the DataSet, DataTable, and DataRow are .NET classes independent of and generic to all databases.
An example of the normal use of the script, copying all the data in an Oracle table to a SQL Server table (which may already hold an older copy of the data), is to:

1. Connect to the two databases with userids and passwords. The Oracle userid has to be able to read the table, and the SQL Server userid has to be able to write to it.
2. Have Oracle create a DataAdapter for the input table based on a "SELECT *". This discovers the definition of the input table in the Oracle database, including its column names and types.
3. Have SQL Server create a BulkCopy for the output table.
4. Create an empty DataSet object in memory.
5. Call the DataAdapter.Fill() method, passing the empty DataSet. This reads the table into memory, generating one DataTable with DataRows containing the contents of the table. Oracle decides what native binary data types to use to hold each of the columns.
6. Start a Transaction on the output database.
7. TRUNCATE the output table to delete all the old data.
8. Call BulkCopy.WriteToServer(), passing the DataTable that Oracle generated. SQL Server converts whatever data Oracle put in memory to whatever type it needs to send the data to the database and repopulate its contents.
9. Commit the Transaction to swap the old and new versions of the output table.
10. You can now close the connections or copy another table.
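The sequence above can be sketched end to end as follows. The connection strings, userids, and table name are placeholders, and the driver DLL path is an assumption; the real script wraps all of this in reusable functions.

```powershell
# Load Oracle's managed ODP.NET driver (path is a placeholder)
Add-Type -Path "Oracle.ManagedDataAccess.dll"

# Connect to both databases
$oraConn = New-Object Oracle.ManagedDataAccess.Client.OracleConnection(
               "User Id=reader;Password=xxx;Data Source=ACS1")
$sqlConn = New-Object System.Data.SqlClient.SqlConnection(
               "Server=devbox;Database=ACS1;Integrated Security=True")
$oraConn.Open(); $sqlConn.Open()

# Read the whole input table into memory
$adapter = New-Object Oracle.ManagedDataAccess.Client.OracleDataAdapter(
               "SELECT * FROM MYTABLE", $oraConn)
$dataset = New-Object System.Data.DataSet
[void]$adapter.Fill($dataset)

# Empty and refill the output table inside one transaction
$txn = $sqlConn.BeginTransaction()
$cmd = $sqlConn.CreateCommand()
$cmd.Transaction = $txn
$cmd.CommandText = "TRUNCATE TABLE MYTABLE"
[void]$cmd.ExecuteNonQuery()

$bulkcopy = New-Object System.Data.SqlClient.SqlBulkCopy(
                $sqlConn, [System.Data.SqlClient.SqlBulkCopyOptions]::Default, $txn)
$bulkcopy.DestinationTableName = "MYTABLE"
$bulkcopy.WriteToServer($dataset.Tables[0])
$txn.Commit()

# Close the connections (or reuse them for another table)
$oraConn.Close(); $sqlConn.Close()
```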
There are alternate Copy-Table functions that combine steps 2 through 9. If the tables have a unique key, then there is an option to read both the input and output tables into memory, perform a generic object compare by primary key of the corresponding rows in each table, and then either report the differences or make only the individual changes needed to synchronize the output with the input. For this to work, the data types of the two tables have to be identical so the values of the data in each row can be compared at the binary level.
Alternately, the script can be run to compare the data in each cell of each table. Running nested loops in an interpreted language like Powershell takes a lot of time, but this is just CPU on the desktop computer, which costs nothing and can run in the background. The advantage here is that Powershell, as an interpreted language with weak, dynamic typing, will convert compatible types as needed to determine whether the values are equivalent even if they are not identical at the binary level.
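A cell-by-cell compare can be sketched as below. This assumes $inTable and $outTable are DataTable objects already filled from the two databases with their rows in the same order; the real script matches rows by primary key rather than by position.

```powershell
# Count cells that differ between two DataTables (sketch only)
$diffs = 0
for ($i = 0; $i -lt $inTable.Rows.Count; $i++) {
    foreach ($col in $inTable.Columns) {
        # Powershell's -ne applies its own loose type conversion, so an
        # Oracle NUMBER and a SQL Server INT holding the same value compare
        # as equal even when their binary representations differ
        if ($inTable.Rows[$i][$col.ColumnName] -ne $outTable.Rows[$i][$col.ColumnName]) {
            $diffs++
        }
    }
}
"$diffs cells differ"
```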
The simplest and fastest thing is to TRUNCATE the output and copy all the data.
If you also want a report, you can compare and only update changed rows, and then the script will tell you the number of rows modified, added, or deleted since the last copy. If you just want the report, turn on the Read-Only flag and it will not change the “output” table.
The script can be modified to add custom conversions if the default behavior is inadequate. However, before writing a lot of code, you can write a custom SELECT statement for the input DataAdapter, and in that SQL you can CAST or CONVERT the data in certain columns to a more appropriate type. This is simpler and runs much faster than Powershell loops. The script can also compare the rows of the two tables to see which ones are different and make only the specific changes needed; however, it turns out that copying all the data is usually as fast or faster than comparing the tables. There is also a special function to append new data to the end of a table that doesn't have a primary key.
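For example, a custom SELECT for the DataAdapter might coerce columns on the Oracle side before the data ever reaches .NET. The table and column names here are illustrative only:

```sql
SELECT STUDENT_ID,
       CAST(BALANCE AS NUMBER(12,2)) AS BALANCE,  -- force a precision SQL Server can hold
       TRIM(NAME)                    AS NAME      -- strip CHAR padding during the copy
FROM   STUDENTS
```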
If you need some special transformation of the data, the best option is to create a string containing a specific SELECT statement using SQL functions to transform the data (SELECT … NULL AS SOCIAL_SECURITY_NUMBER …) and pass it to the Powershell command. If you need to do a more elaborate transformation, then you have to write some Powershell loop. Examples are provided in the code.
However, in the normal case the Powershell script does not need any loops and does not look at individual rows, let alone individual cells. The copy is done by compiled code provided by Oracle and SQL Server. They choose the .NET data types and do any conversion needed. The script performs so little work itself that the copy runs as fast in Powershell as it would in any compiled language.
Use
Download the script from Git.
...
Considerations
Performance
...
Suppose you are copying a table every night and typically it only has a few dozen rows changed since the last copy. Is it better to compare the two tables and only make the necessary changes?
The only way to settle this was to write the script and try both a full copy and an incremental update. It turns out that there is very little difference between the time the two approaches take: reading a whole table and writing a whole table takes about as long as reading the table twice from the two sources.
Therefore, the choice may depend on special factors about the table. Are there constraints, indexes, triggers, or something else that makes it better to make the smallest number of changes possible? If not, then the full copy works as well as anything else.
Primary Key
In order to compare the input and output rows, the input and output tables must have Primary Keys. If you use the -truncate option and empty the output table and replace all the rows, then there is no key requirement.
...