How do I use Data Sync?

Objective

Understand the purpose of Data Sync and how you might use it on Spinup.

Introduction

Data Sync is based on a managed AWS service that transfers data between different storage backends. Spinup currently supports S3 buckets and NFS shares.

When you create a Data Sync in Spinup, you pick the source and destination targets. For example, you can sync from one S3 bucket to another, from one NFS share to another, or between S3 and NFS in either direction. You can also choose a specific subfolder if you don’t want to sync all the files.

Once created, the Data Sync has to be explicitly started (it doesn’t run automatically on creation) and will show some progress details. It will run unattended in AWS until all of the data is copied. You can run it multiple times after that if you want to keep the source and destination in sync, and it will only copy the differences.

Important Things To Remember With Data Sync

  • There is a charge based on the amount of data that is being transferred, but no cost for the Data Sync resource itself

  • The source and destination targets need to already be in Spinup and have to be in the same risk level

  • Data Sync has to be explicitly started
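
The requirements above can be read as a small pre-flight check on the two targets. The sketch below is illustrative only: the `Target` class, its fields, and `check_sync_targets` are hypothetical names, not part of Spinup, and the risk levels in the example are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical description of a storage resource that lives in Spinup."""
    name: str
    kind: str        # "s3" or "nfs", the two backends the article mentions
    space: str
    risk_level: str  # e.g. "low" or "high"


def check_sync_targets(source: Target, destination: Target) -> list[str]:
    """Return a list of problems; an empty list means the pair looks usable."""
    problems = []
    for target in (source, destination):
        if target.kind not in ("s3", "nfs"):
            problems.append(f"{target.name}: unsupported storage type {target.kind!r}")
    # Spaces may differ, but the risk levels have to match.
    if source.risk_level != destination.risk_level:
        problems.append(
            f"risk levels differ: {source.name} is {source.risk_level}, "
            f"{destination.name} is {destination.risk_level}"
        )
    return problems


# Example values; the risk levels are assumptions made for illustration.
source = Target("covid-19-data", "s3", "Data_Repository", "low")
destination = Target("copy", "s3", "Research", "low")
print(check_sync_targets(source, destination))
```

An empty list means both targets use a supported backend and share a risk level; anything else is a reason the pair would not be offered together in the form.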


How do I use it?

  1. In Spinup, create a new Data Sync resource in the same space as your destination.

  2. Fill out the details in the form:

    1. Give it a name (in this case I’ll be syncing a Covid-related dataset).

    2. Select the source where you have your data.
      The drop-down will show you all existing S3 buckets (or NFS volumes) that you have access to in Spinup. They can be in the current space or in a different one (for example, a space that someone else has shared with you), but they need to be in the same risk level as the current space (e.g. both low-risk or both high-risk).

      In my case, the covid-19-data bucket in the Data_Repository space holds the data set I need, and a new S3 bucket named copy in the Research space is where I want to copy it. So I’ll pick covid-19-data from the drop-down and leave the Source Subdirectory as / since I want to sync all the files.

    3. Scroll down and select your destination.
      In my case I want all the copied files to be placed into a /covid subfolder (otherwise you can just leave it as /). Note that by default the objects will be created in the Standard S3 tier.

  3. Click Create and wait until the Data Sync resource is ready. This does not start the actual sync; we'll do that next.
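
To make the effect of the source and destination subdirectories concrete, here is a minimal sketch of how a file path could map from one side to the other. It assumes the relative layout under the source subdirectory is kept as-is under the destination subdirectory. The function name `destination_path` and the file names are hypothetical.

```python
def destination_path(source_path: str, source_subdir: str = "/", dest_subdir: str = "/") -> str:
    """Map a path in the source to where it would land in the destination."""
    src_root = source_subdir.strip("/")
    path = source_path.strip("/")
    if src_root:
        # Only files below the chosen source subdirectory take part in the sync.
        if not path.startswith(src_root + "/"):
            raise ValueError(f"{source_path!r} is outside {source_subdir!r}")
        path = path[len(src_root) + 1:]
    dest_root = dest_subdir.strip("/")
    return "/" + "/".join(part for part in (dest_root, path) if part)


# Source Subdirectory "/" and a /covid destination subfolder, as in the walkthrough.
print(destination_path("reports/2020/cases.csv", "/", "/covid"))
# A narrower source subdirectory with the destination left at "/".
print(destination_path("/raw/cases.csv", "/raw", "/"))
```

With the settings from the walkthrough, every file keeps its relative path and simply gains the /covid prefix in the destination.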

Parameters

  • Overwrite Mode - always updates files in the destination that have changed in the source location

  • Preserve Deleted Files - will not delete files in the destination if they have been deleted from the source

These options are currently set by default and cannot be modified.
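
The combined effect of these two options can be sketched as a planning step that compares a source listing with a destination listing. This is only a model of the described behavior, not how the service is implemented: the article does not say how changes are detected, so the `fingerprint` values below are a stand-in, and `plan_sync` and the file names are hypothetical.

```python
def plan_sync(source: dict[str, str], destination: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {path: fingerprint} listings and decide what a run would do."""
    plan = {"copy": [], "overwrite": [], "skip": [], "keep": []}
    for path, fingerprint in sorted(source.items()):
        if path not in destination:
            plan["copy"].append(path)       # new in the source
        elif destination[path] != fingerprint:
            plan["overwrite"].append(path)  # Overwrite Mode: changed in the source
        else:
            plan["skip"].append(path)       # unchanged, nothing to transfer
    # Preserve Deleted Files: paths missing from the source stay in the destination.
    plan["keep"] = sorted(path for path in destination if path not in source)
    return plan


source = {"a.csv": "v2", "b.csv": "v1", "c.csv": "v1"}
destination = {"a.csv": "v1", "b.csv": "v1", "old.csv": "v1"}
print(plan_sync(source, destination))
```

In this example `c.csv` is copied, `a.csv` is overwritten, `b.csv` is skipped, and `old.csv` stays in the destination even though it no longer exists in the source.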

  4. Go to the Sync tab on the left and click the Start Sync button. This will immediately initiate a sync from the source to the destination.

Under the Active Sync section you can see details about the currently running sync operation. The status can sometimes stay at LAUNCHING for 10-20 minutes, even for small data sets. This is the initial phase, when AWS sets up the servers needed to perform the sync. You can also stop a running sync at any time by clicking the Stop Sync button.
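
If you would rather script the waiting than watch the page, the idea is a simple polling loop that tolerates a long LAUNCHING phase. The sketch below is generic: `get_status` is a hypothetical callable you would have to supply, and apart from LAUNCHING the status names are assumptions rather than values confirmed by this article.

```python
import time
from typing import Callable, Iterable


def wait_for_sync(
    get_status: Callable[[], str],
    done_statuses: Iterable[str] = ("SUCCESS", "ERROR"),  # assumed terminal names
    poll_seconds: float = 60.0,
    max_polls: int = 240,
) -> str:
    """Poll until the sync reports a terminal status, then return it."""
    done = set(done_statuses)
    for _ in range(max_polls):
        status = get_status()
        if status in done:
            return status
        # Anything else (LAUNCHING, for example) means the sync is still in progress.
        time.sleep(poll_seconds)
    raise TimeoutError(f"sync still running after {max_polls} polls")


# Simulated status sequence standing in for a real status lookup.
fake_statuses = iter(["LAUNCHING", "LAUNCHING", "TRANSFERRING", "SUCCESS"])
final_status = wait_for_sync(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

A poll interval of about a minute is a reasonable starting point given that the launch phase alone can take 10-20 minutes.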

After a while, you will see that the sync has completed successfully (in my case I only have 449 MB, so it finished very quickly). If you run this Data Sync multiple times, you can see details about each execution under “Sync History”.

A quick check in my copy S3 bucket confirms that all of the files from my source bucket have now been copied into the /covid folder.
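
For larger data sets, the same check can be done by comparing file listings instead of browsing. The sketch below works on plain lists of paths. How you obtain them is up to you (for S3, one option is the `list_objects_v2` paginator in boto3). The function name `missing_from_destination` and the file names are hypothetical.

```python
def missing_from_destination(
    source_paths: list[str],
    destination_paths: list[str],
    dest_subdir: str = "/",
) -> list[str]:
    """Return the source paths that have no counterpart under dest_subdir."""
    root = dest_subdir.strip("/")
    present = {path.strip("/") for path in destination_paths}
    missing = []
    for path in source_paths:
        # Expected location: destination subfolder plus the unchanged relative path.
        expected = "/".join(part for part in (root, path.strip("/")) if part)
        if expected not in present:
            missing.append(path)
    return missing


source_listing = ["cases.csv", "reports/summary.csv"]
destination_listing = ["covid/cases.csv", "covid/reports/summary.csv"]
print(missing_from_destination(source_listing, destination_listing, "/covid"))
```

An empty result means every source file was found under the destination subfolder. This only compares names, so it does not prove the contents are identical.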

  5. At this point you may want to delete this Data Sync resource if you only needed to do a one-time sync, or keep it around if you plan to do additional syncs.

The Data Sync resource itself costs nothing; you are charged only when you run it. Deleting the Data Sync does not affect any of the data in the source or destination.