Migrating Data From Shamu to Arc

This guide describes several tools for migrating data from the legacy Shamu HPC environment to the new Arc HPC environment. In some cases, it is easier and more efficient to bundle your data into an archive before moving it to Arc. That option is discussed below.


Shamu & Arc Server Access

Both Shamu and Arc are accessed using Secure Shell (SSH). Users log in with their UTSA Active Directory (AD) account credentials.

Shamu Connection Information

Server Hostname: login.shamu.utsa.edu

SSH Port: 1209

Note: To connect successfully using SSH, port 1209 must be specified. For example, to connect to Shamu from a Linux host, issue the command:

$ ssh -p 1209 login.shamu.utsa.edu

Arc Connection Information

Server Hostname: arc.utsa.edu

SSH Port: 22

Port 22 is the default SSH port and does not have to be specified.

$ ssh arc.utsa.edu
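
To avoid retyping the port, the connection details above can be stored as host aliases in an OpenSSH client configuration file. The following is a minimal sketch, not an official UTSA configuration: the aliases "shamu" and "arc" and the username abc123 are placeholders, and the settings are written to a scratch file rather than to ~/.ssh/config so that nothing existing is overwritten. The -G option asks ssh to print the settings it would use without opening a connection.

```shell
# Write two hypothetical host aliases to a scratch config file.
# In practice these stanzas would live in ~/.ssh/config.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
Host shamu
    HostName login.shamu.utsa.edu
    Port 1209
    User abc123

Host arc
    HostName arc.utsa.edu
    User abc123
EOF

# Show the resolved settings for the "shamu" alias without connecting.
shamu_settings=$(ssh -F "$cfg" -G shamu | grep -E '^(hostname|port|user) ')
echo "$shamu_settings"
```

With equivalent stanzas in ~/.ssh/config, "ssh shamu" would use port 1209 automatically. scp and rsync run over ssh, so they would pick up the same aliases.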


Bundling Files into an Archive

Transferring large collections of files and directories between systems can be cumbersome. It is often easier to bundle them into a single archive file before moving them to a remote system. The archive can also be compressed, which reduces the amount of data to transfer, and transferring one archive is usually much more efficient than transferring many individual files, especially small ones. The "tar" utility in Linux can be used to archive and compress collections of files and directories on Shamu.

Creating a Compressed Tar Archive

The "tar" command can bundle collections of files and directories and place them into a single archive file.

As an example, suppose you wanted to bundle "mydirectory" with all of its files and subdirectories:

/home/user
       └── mydirectory
           ├── file1
           ├── file2
           ├── file3
           └── mysubdirectory
               ├── subdirectory-file1
               ├── subdirectory-file2
               └── subdirectory-file3

To create an archive of the directory structure and contents, use the tar command:
$ tar -czvf <archive filename> <files/directories to include>

Where:
  • c = Create the tar archive
  • z = Compress the tar archive
  • v = Verbose output, which prints the file names as they are added to the tar file
  • f = Name of the tar archive
Example:
$ ll
total 0
drwxrwxr-x. 3 user group  67 Aug 13 14:16 mydirectory
$ tar -czvf mytarfile.tgz mydirectory
mydirectory/
mydirectory/file1
mydirectory/file2
mydirectory/file3
mydirectory/mysubdirectory/
mydirectory/mysubdirectory/subdirectory-file1
mydirectory/mysubdirectory/subdirectory-file2
mydirectory/mysubdirectory/subdirectory-file3
$ ll
total 4
drwxrwxr-x. 3 user group  67 Aug 13 14:16 mydirectory
-rw-rw-r--. 1 user group 299 Aug 13 14:25 mytarfile.tgz
$

The file "mytarfile.tgz" is now a compressed archive of "mydirectory". That single file can be transferred to another system, such as Arc, where all the files and directories contained in the archive can then be extracted.
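
Because the archive will be copied between systems, it can help to record a checksum before the transfer and compare it afterwards. The sketch below shows one way to do that with the standard sha256sum utility. It rebuilds the example tree in a scratch directory so it can be run as-is, and the checksum file name is an arbitrary choice.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Recreate the example tree from this guide.
mkdir -p mydirectory/mysubdirectory
for f in file1 file2 file3; do
    echo "sample data" > "mydirectory/$f"
    echo "sample subdirectory data" > "mydirectory/mysubdirectory/subdirectory-$f"
done

# Bundle and compress the tree (no -v, to keep the output short).
tar -czf mytarfile.tgz mydirectory

# Record the checksum next to the archive.
sha256sum mytarfile.tgz > mytarfile.tgz.sha256

# After copying both files to the destination, run this in the
# directory that holds them. It reports "mytarfile.tgz: OK" when the
# archive arrived intact.
sha256sum -c mytarfile.tgz.sha256
```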

Viewing the Contents of a Compressed Tar Archive

To see what's contained in a tar file, use this command:

$ tar -tzvf mytarfile.tgz
drwxrwxr-x user/group       0 2021-08-13 14:16 mydirectory/
-rw-rw-r-- user/group      13 2021-08-13 14:10 mydirectory/file1
-rw-rw-r-- user/group      13 2021-08-13 14:10 mydirectory/file2
-rw-rw-r-- user/group      13 2021-08-13 14:10 mydirectory/file3
drwxrwxr-x user/group       0 2021-08-13 14:17 mydirectory/mysubdirectory/
-rw-rw-r-- user/group      26 2021-08-13 14:17 mydirectory/mysubdirectory/subdirectory-file1
-rw-rw-r-- user/group      26 2021-08-13 14:17 mydirectory/mysubdirectory/subdirectory-file2
-rw-rw-r-- user/group      26 2021-08-13 14:17 mydirectory/mysubdirectory/subdirectory-file3
$

Where:
  • t = List the contents of the tar file
  • z = The tar archive is compressed
  • v = Verbose output, which lists the permissions, owner, size, and time stamp of each entry along with its name
  • f = Name of the tar archive
This is useful for a couple of reasons:

First, it lets you inspect the contents of the archive without having to extract it.

Second, it shows how the files will be extracted into the current directory structure. Notice in the listing of the tar archive "mytarfile.tgz", all of the contents will be located within the directory "mydirectory". This directory will be created if it doesn't already exist. If it does already exist, the files will still be placed there and could overwrite any files with the same name as those in the archive.

If the tar file is just a list of files or if there are files at the top of the directory structure, it's good to know that they will be extracted and placed in your current directory.

For example, consider this other tar file with these contents:

$ tar -tzvf othertarfile.tgz
-rw-rw-r-- user/group      13 2021-08-13 14:10 file1
-rw-rw-r-- user/group      13 2021-08-13 14:10 file2
$

Extracting this tar file will place file1 and file2 in the directory where you run the tar extraction command.
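
The check described above can be automated. The following sketch lists the distinct top-level entries of an archive, which shows at a glance whether everything will land inside one directory or be placed directly in the current directory. The function name top_level_entries is hypothetical, and the two demo archives are rebuilt in a scratch directory so the example runs on its own.

```shell
# Print the distinct top-level names in a compressed tar archive.
# A single name here (e.g. "mydirectory") means the archive
# extracts into one directory.
top_level_entries() {
    tar -tzf "$1" | cut -d/ -f1 | sort -u
}

# Build two small demo archives in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p mydirectory/mysubdirectory
echo "sample" > mydirectory/file1
echo "sample" > mydirectory/mysubdirectory/subdirectory-file1
echo "sample" > file1
echo "sample" > file2
tar -czf mytarfile.tgz mydirectory
tar -czf othertarfile.tgz file1 file2

top_level_entries mytarfile.tgz     # prints: mydirectory
top_level_entries othertarfile.tgz  # prints: file1 and file2, one per line
```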

Extracting a Tar Archive

Once a tar file has been transferred to a new system or location, its contents can be extracted with the following command:

$ tar -xzvf mytarfile.tgz

Where:
  • x = Extract tar archive
  • z = The tar archive is compressed
  • v = Verbose output, which prints the file names as they are extracted from the tar file
  • f = Name of the tar archive
Note that the files are extracted relative to the directory in which you issue the tar command.
$ pwd
/home/user
$ tar -zxvf /tmp/mytarfile.tgz
mydirectory/
mydirectory/file1
mydirectory/file2
mydirectory/file3
mydirectory/mysubdirectory/
mydirectory/mysubdirectory/subdirectory-file1
mydirectory/mysubdirectory/subdirectory-file2
mydirectory/mysubdirectory/subdirectory-file3
$ ll
total 0
drwxrwxr-x. 3 user group 67 Aug 13 14:16 mydirectory
$
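
If changing directories first is inconvenient, tar's -C option sets the directory that extraction is relative to. The sketch below uses a scratch directory and a demo archive so it can be run safely. The directory name "restore" is a placeholder.

```shell
# Build a small demo archive in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p mydirectory/mysubdirectory
echo "sample" > mydirectory/file1
echo "sample" > mydirectory/mysubdirectory/subdirectory-file1
tar -czf mytarfile.tgz mydirectory

# Extract somewhere other than the current directory.
# The target directory must exist before tar runs.
mkdir -p restore
tar -xzf mytarfile.tgz -C restore

# The archive's top-level directory now sits under restore/.
find restore -type f | sort
```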

See the man page for tar for additional information and details on the command.


Moving Files from Shamu to Arc

You can transfer files between any two Linux-based systems using either scp or rsync. Files can be moved directly from Shamu to Arc using the methods below.

Transfer Using RSYNC

The rsync utility synchronizes files between source and destination systems. Unlike scp, rsync copies only the changed portions of individual files. It is therefore efficient to use rsync when you only need to update a small fraction of a large dataset at the destination. A basic rsync command looks like this:
[abc123@login01 abc123]$ rsync myfile abc123@arc.utsa.edu:/work/abc123/.

This command copies the file "myfile" to the destination directory on Arc.

To copy a directory and all subdirectories to a remote location, use this command:
[abc123@login01 abc123]$ rsync -avtr mydirectory abc123@arc.utsa.edu:/work/abc123/.

Where:
  • a = Archive mode, which preserves symbolic links, permissions, time stamps, and other metadata (it also implies r and t)
  • v = Verbose output
  • r = Recursive copy, i.e., copy this directory and all subdirectories
  • t = Preserve time stamps
The options on this rsync command are useful when synchronizing your data to a remote location. The first time you run this command, a copy of the file and directory structure is created at the remote location. After the initial copy, if some files then change on the source side and the command is run again, only those changes are copied to the remote location.

Also, if the rsync data transfer is interrupted for some reason, you can just re-run the command to copy over and sync those items that did not get previously copied or updated.
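
The incremental behavior can be tried locally before pointing rsync at Arc. The sketch below syncs between two scratch directories (standing in for Shamu and Arc), changes one file, and then uses -n (dry run) to preview what a second run would transfer. The directory names are placeholders. With a remote destination, the last argument would instead take the form abc123@arc.utsa.edu:/work/abc123/.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p mydirectory/mysubdirectory destination
echo "version 1" > mydirectory/file1
echo "version 1" > mydirectory/file2
echo "version 1" > mydirectory/mysubdirectory/subdirectory-file1

# Initial copy. No trailing slash on the source, so the directory
# itself is created as destination/mydirectory.
rsync -a mydirectory destination/

# Change one file on the source side.
echo "version 2, with more data" > mydirectory/file2

# Dry run: -n lists what would be sent without copying anything.
# Only file2 should appear in the file list.
rsync -avn mydirectory destination/

# Real run: brings the destination up to date.
rsync -a mydirectory destination/
cat destination/mydirectory/file2
```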

See the man page for rsync for additional information and details on the command.

A wrapper script for rsync is available on Arc that will copy a user's entire home and/or work directory from Shamu to the new Arc environment. The script is named migrate-shamu2arc and incorporates the most common parameters as discussed above, including an option not to overwrite any existing files on Arc. The syntax of the script is as follows:
[abc123@login01 abc123]$ migrate-shamu2arc [help] home|work|both

If you would like to use this script, you can download it here: migrate-shamu2arc. Save it to your home directory on Arc and adjust the file permissions to allow it to be executed:

[abc123@login01 abc123]$ chmod u+x migrate-shamu2arc

The script is meant to be run from the Arc HPC environment.
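
The script's contents are not reproduced here. Purely as an illustration of the "do not overwrite existing files" behavior, the sketch below uses rsync's --ignore-existing option between two scratch directories. Whether migrate-shamu2arc uses this particular option is an assumption, and the directory and file names are placeholders.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p source destination
echo "from Shamu" > source/results.txt
echo "from Shamu" > source/notes.txt
# This file already exists on the destination side with different content.
echo "edited on Arc" > destination/results.txt

# --ignore-existing skips any file that is already present at the
# destination, so results.txt is left alone and only notes.txt is copied.
# The trailing slash on source/ copies its contents, not the directory itself.
rsync -av --ignore-existing source/ destination/

cat destination/results.txt   # prints: edited on Arc
cat destination/notes.txt     # prints: from Shamu
```

When pulling from Shamu while logged in to Arc, a hand-written rsync command would also need to name Shamu's SSH port, for example: rsync -av --ignore-existing -e "ssh -p 1209" abc123@login.shamu.utsa.edu:/work/abc123/ /work/abc123/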

Transfer Using SCP

The Linux scp (secure copy) utility is a component of the OpenSSH suite. Assuming your Arc username is abc123, a simple scp transfer that pushes a file named "filetest" from your /work/abc123 directory on Shamu to the corresponding work directory on Arc would look like this:
[abc123@login01 abc123]$ scp ./filetest abc123@arc.utsa.edu:/work/abc123/. 
Warning: Permanently added 'arc.utsa.edu,129.115.106.109' (ECDSA) to the list of known hosts. 
Password: 
filetest                                                                 100% 9 3.9KB/s 00:00 
[abc123@login01 abc123]$

You can use wildcards in the scp command as shown below:
[abc123@login01 abc123]$ scp *.txt abc123@arc.utsa.edu:/work/abc123/.

This will copy all files that end in ".txt" to the Arc environment.

When copying a directory with many files, use tar to create a compressed archive of the directory, then transfer the archive as a single file:
[abc123@login01 abc123]$ tar -czvf ./mydata.tgz mydata # create compressed archive
[abc123@login01 abc123]$ scp ./mydata.tgz abc123@arc.utsa.edu:/work/abc123/. # transfer archive
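
When scratch space for an intermediate archive is limited, the archive can instead be streamed through a pipe. The sketch below demonstrates the pattern locally between two scratch directories. The commented line shows how the same pipeline might look with ssh in the middle, which is an untested assumption about your environment rather than a documented procedure. The file names are placeholders.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p mydata/run1 destination
echo "sample" > mydata/run1/output.dat
echo "sample" > mydata/readme.txt

# "-f -" tells tar to write the archive to standard output (left side)
# or read it from standard input (right side), so no archive file is
# ever written to disk.
tar -czf - mydata | tar -xzf - -C destination

# Remote form of the same idea (not run here):
#   tar -czf - mydata | ssh abc123@arc.utsa.edu 'tar -xzf - -C /work/abc123'

find destination -type f | sort
```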

See the man page for scp for additional information and details on the command.
Topic revision: r11 - 28 Oct 2024, AdminUser