Migrating Data From Shamu to Arc
This guide describes several tools for migrating data from the legacy Shamu HPC environment to the new Arc HPC environment. In some cases, it is easier and more efficient to bundle your data before moving it to Arc. Those options are discussed below.
Shamu & Arc Server Access
Both Shamu and Arc are accessed using Secure Shell (SSH). Users log in with their UTSA Active Directory (AD) account credentials.
Server Hostname: login.shamu.utsa.edu
SSH Port: 1209
Note: To connect using SSH, port 1209 must be specified. For example, to connect to Shamu from a Linux host, issue the command:
$ ssh -p 1209 login.shamu.utsa.edu
Server Hostname: arc.utsa.edu
SSH Port: 22
Port 22 is the default SSH port and does not have to be specified.
$ ssh arc.utsa.edu
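To avoid typing the port on every connection, the settings above can be stored as host aliases in an OpenSSH client configuration file. The following is a minimal sketch, not an official setup: the alias names "shamu" and "arc", the username abc123, and the use of a scratch file are all illustrative assumptions.

```shell
# Write the aliases to a scratch file first so they can be reviewed
# before being copied into ~/.ssh/config.
ssh_config_file="$(mktemp)"

cat > "$ssh_config_file" <<'EOF'
# "shamu" and "arc" are hypothetical alias names; replace abc123 with your username.
Host shamu
    HostName login.shamu.utsa.edu
    Port 1209
    User abc123

Host arc
    HostName arc.utsa.edu
    User abc123
EOF

# Show the result for review.
cat "$ssh_config_file"
```

With these entries in place, "ssh -F <scratch file> shamu" (or simply "ssh shamu" once the entries are copied into ~/.ssh/config) should connect on port 1209 without the -p option.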
Bundling Files into an Archive
Transferring large collections of files and directories between systems can be cumbersome. It is often easier and more manageable to bundle them into a single archive file before moving them to a remote system. The archive file can also be compressed, which reduces the amount of data to transfer, and transferring a single archive file can be much more efficient than transferring many individual files, especially small ones.
The "tar" utility in Linux can be used to archive and compress collections of files and directories on Shamu.
Creating a Compressed Tar Archive
The "tar" command can bundle collections of files and directories and place them into a single archive file.
As an example, suppose you wanted to bundle "mydirectory" with all of its files and subdirectories:
/home/user
└── mydirectory
    ├── file1
    ├── file2
    ├── file3
    └── mysubdirectory
        ├── subdirectory-file1
        ├── subdirectory-file2
        └── subdirectory-file3
To create an archive of the directory structure and contents, use the tar command:
$ tar -czvf <archive filename> <files/directories to include>
Where:
- c = Create the tar archive
- z = Compress the tar archive
- v = Verbose output, which prints the file names as they are added to the tar file
- f = Name of the tar archive
Example:
$ ll
total 0
drwxrwxr-x. 3 user group 67 Aug 13 14:16 mydirectory
$ tar -czvf mytarfile.tgz mydirectory
mydirectory/
mydirectory/file1
mydirectory/file2
mydirectory/file3
mydirectory/mysubdirectory/
mydirectory/mysubdirectory/subdirectory-file1
mydirectory/mysubdirectory/subdirectory-file2
mydirectory/mysubdirectory/subdirectory-file3
$ ll
total 4
drwxrwxr-x. 3 user group 67 Aug 13 14:16 mydirectory
-rw-rw-r--. 1 user group 299 Aug 13 14:25 mytarfile.tgz
$
The file "mytarfile.tgz" is now a compressed archive of "mydirectory". That single file can now be transferred to another system, such as Arc, where all the files and directories contained in the archive can be extracted.
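To try the archive step without touching real data, the example can be reproduced in a scratch directory. This is a sketch under stated assumptions: the scratch location comes from mktemp, the file contents are placeholders, and -v is dropped so that only the entry count is printed.

```shell
# Work in a throwaway directory (assumption: mktemp is available).
workdir="$(mktemp -d)"
cd "$workdir"

# Recreate the sample layout used in this guide, with placeholder contents.
mkdir -p mydirectory/mysubdirectory
for n in 1 2 3; do
    echo "sample data" > "mydirectory/file$n"
    echo "sample subdirectory data" > "mydirectory/mysubdirectory/subdirectory-file$n"
done

# Create the compressed archive (same options as above, minus -v).
tar -czf mytarfile.tgz mydirectory

# Count the entries in the archive as a quick sanity check before transfer.
entry_count="$(tar -tzf mytarfile.tgz | wc -l)"
echo "archive entries: $entry_count"
```

Comparing the entry count with the number of files and directories you expected is a cheap way to catch a mistyped path before starting a long transfer.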
Viewing the Contents of a Compressed Tar Archive
To see what's contained in a tar file, use this command:
$ tar -tzvf mytarfile.tgz
drwxrwxr-x user/group 0 2021-08-13 14:16 mydirectory/
-rw-rw-r-- user/group 13 2021-08-13 14:10 mydirectory/file1
-rw-rw-r-- user/group 13 2021-08-13 14:10 mydirectory/file2
-rw-rw-r-- user/group 13 2021-08-13 14:10 mydirectory/file3
drwxrwxr-x user/group 0 2021-08-13 14:17 mydirectory/mysubdirectory/
-rw-rw-r-- user/group 26 2021-08-13 14:17 mydirectory/mysubdirectory/subdirectory-file1
-rw-rw-r-- user/group 26 2021-08-13 14:17 mydirectory/mysubdirectory/subdirectory-file2
-rw-rw-r-- user/group 26 2021-08-13 14:17 mydirectory/mysubdirectory/subdirectory-file3
$
Where:
- t = List the contents of the tar file
- z = The tar archive is compressed
- v = Verbose output, which prints details (permissions, owner, size, and date) for each file in the tar file
- f = Name of the tar archive
This is useful for a couple of reasons:
First, it lets you inspect the contents of the archive without having to extract it.
Second, it shows how the files will be extracted into the current directory structure. Notice in the listing of the tar archive "mytarfile.tgz", all of the contents will be located within the directory "mydirectory". This directory will be created if it doesn't already exist. If it does already exist, the files will still be placed there and could overwrite any files with the same name as those in the archive.
If the tar file is just a list of files or if there are files at the top of the directory structure, it's good to know that they will be extracted and placed in your current directory.
For example, consider this other tar file with these contents:
$ tar -tzvf othertarfile.tgz
-rw-rw-r-- user/group 13 2021-08-13 14:10 file1
-rw-rw-r-- user/group 13 2021-08-13 14:10 file2
$
Extracting this tar file will place file1 and file2 in the directory where you run the tar extraction command.
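One way to check in advance where an archive will place its files is to print only the distinct top-level names it contains. The sketch below builds two small sample archives and applies a helper function; the helper name top_level_entries is hypothetical, introduced here for illustration.

```shell
# Work in a throwaway directory with two small sample archives.
workdir="$(mktemp -d)"
cd "$workdir"

mkdir mydirectory
echo "sample data" > mydirectory/file1
echo "sample data" > file1
echo "sample data" > file2
tar -czf mytarfile.tgz mydirectory      # everything nested under one directory
tar -czf othertarfile.tgz file1 file2   # files at the top level

# Hypothetical helper: list the distinct top-level names inside an archive.
top_level_entries() {
    tar -tzf "$1" | cut -d/ -f1 | sort -u
}

echo "mytarfile.tgz extracts to:"
top_level_entries mytarfile.tgz
echo "othertarfile.tgz extracts to:"
top_level_entries othertarfile.tgz
```

A single directory name in the output means the archive unpacks into its own directory; several names mean the files will land directly in whatever directory you extract from.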
Extracting a Compressed Tar Archive
Once a tar file has been transferred to a new system or location, its contents can be extracted with the following command:
$ tar -xzvf mytarfile.tgz
Where:
- x = Extract tar archive
- z = The tar archive is compressed
- v = Verbose output, which prints the file names as they are extracted from the tar file
- f = Name of the tar archive
Note that the files are extracted relative to the directory where you issue the tar command.
$ pwd
/home/user
$ tar -zxvf /tmp/mytarfile.tgz
mydirectory/
mydirectory/file1
mydirectory/file2
mydirectory/file3
mydirectory/mysubdirectory/
mydirectory/mysubdirectory/subdirectory-file1
mydirectory/mysubdirectory/subdirectory-file2
mydirectory/mysubdirectory/subdirectory-file3
$ ll
total 0
drwxrwxr-x. 3 user group 67 Aug 13 14:16 mydirectory
$
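If you would rather not change into the target directory first, tar's -C option extracts into a directory you name. Extracting into a new, empty directory also avoids overwriting existing files. The sketch below uses a scratch directory and a hypothetical target named "restored".

```shell
# Build a small sample archive in a throwaway directory.
workdir="$(mktemp -d)"
cd "$workdir"
mkdir -p mydirectory/mysubdirectory
echo "sample data" > mydirectory/file1
echo "sample data" > mydirectory/mysubdirectory/subdirectory-file1
tar -czf mytarfile.tgz mydirectory

# Extract into a separate, empty directory instead of the current one.
# "restored" is an illustrative name; the directory must exist before extraction.
mkdir restored
tar -xzf mytarfile.tgz -C restored

# List what was extracted.
find restored -type f | sort
```

After inspecting the extracted files, they can be moved into place by hand.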
See the man page for tar for additional information and details on the command.
Moving Files from Shamu to Arc
You can transfer files between any two Linux-based systems using either scp or rsync. Files can be moved directly from Shamu to Arc using the methods below.
Transfer Using RSYNC
The rsync utility synchronizes files between source and destination systems. Unlike scp, rsync copies only the changed portions of individual files, so it is efficient when you only need to update a small fraction of a large dataset at the destination. The basic syntax of rsync is as follows:
[abc123@login01 abc123]$ rsync myfile abc123@arc.utsa.edu:/work/abc123/.
This command copies the file "myfile" to the destination directory on Arc.
To copy a directory and all subdirectories to a remote location, use this command:
[abc123@login01 abc123]$ rsync -avtr mydirectory abc123@arc.utsa.edu:/work/abc123/.
Where:
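- a = Archive mode, which copies recursively and preserves symbolic links, permissions, time stamps, and other metadata
- v = Verbose output
- r = Recursive copy, i.e., copy this directory and all subdirectories (already implied by -a)
- t = Preserve time stamps (already implied by -a)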
The options on this rsync command are useful when synchronizing your data to a remote location. The first time you run this command, a copy of the file and directory structure is created at the remote location. After the initial copy, if some files then change on the source side and the command is run again, only those changes are copied to the remote location.
Also, if the rsync data transfer is interrupted for some reason, you can just re-run the command to copy over and sync those items that did not get previously copied or updated.
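The re-run behavior can be tried locally before a real transfer. The sketch below assumes rsync is installed and uses two scratch directories in place of Shamu and Arc; on Shamu, the destination would instead be a remote path such as abc123@arc.utsa.edu:/work/abc123/. The -n (dry run) option previews what would be copied without changing anything.

```shell
# Stand-ins for the source and destination systems (local scratch directories).
workdir="$(mktemp -d)"
mkdir -p "$workdir/source/mydirectory" "$workdir/dest"
echo "version 1" > "$workdir/source/mydirectory/file1"
echo "version 1" > "$workdir/source/mydirectory/file2"

# Initial copy: everything is transferred.
rsync -a "$workdir/source/mydirectory" "$workdir/dest/"

# Change one file on the source side.
echo "version 2, updated" > "$workdir/source/mydirectory/file1"

# Dry run: report what would be copied without copying anything.
rsync -avn "$workdir/source/mydirectory" "$workdir/dest/"

# Real run: bring the destination up to date.
rsync -a "$workdir/source/mydirectory" "$workdir/dest/"
```

In the dry-run output, only the changed file should be listed, which is a useful check before synchronizing a large directory.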
See the man page for rsync for additional information and details on the command.
A wrapper script for rsync is available on Arc that will copy a user's entire home and/or work directory from Shamu to the new Arc environment. The script is named migrate-shamu2arc and incorporates the most common parameters discussed above, including an option not to overwrite any existing files on Arc. The syntax of the script is as follows:
[abc123@login01 abc123]$ migrate-shamu2arc [help] home|work|both
If you would like to use this script, you can download it here: migrate-shamu2arc. Save it to your home directory on Arc and adjust the file permissions to allow it to be executed:
[abc123@login01 abc123]$ chmod u+x migrate-shamu2arc
The script is meant to be run from the Arc HPC environment.
Transfer Using SCP
The Linux scp (secure copy) utility is a component of the OpenSSH suite. Assuming your Arc username is abc123, a simple scp transfer that pushes a file named "filetest" from your /work/abc123 directory on Shamu to the corresponding work directory on Arc would look like this:
[abc123@login01 abc123]$ scp ./filetest abc123@arc.utsa.edu:/work/abc123/.
Warning: Permanently added 'arc.utsa.edu,129.115.106.109' (ECDSA) to the list of known hosts.
Password:
filetest 100% 9 3.9KB/s 00:00
[abc123@login01 abc123]$
You can use wildcards in the scp command as shown below:
[abc123@login01 abc123]$ scp *.txt abc123@arc.utsa.edu:/work/abc123/.
This will copy all files that end in ".txt" to the Arc environment.
When copying a directory with multiple files, use tar to create a compressed archive of the directory, then transfer the archive as a single file:
[abc123@login01 abc123]$ tar -czvf ./mydata.tgz mydata # create compressed archive
[abc123@login01 abc123]$ scp ./mydata.tgz abc123@arc.utsa.edu:/work/abc123/. # transfer archive
See the man page for scp for additional information and details on the command.
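A transfer can also be started from the Arc side by pulling files from Shamu. In that direction, Shamu's SSH port must be given to scp with the capital -P option (lowercase -p has a different meaning in scp). The following is a sketch only: the function name pull_from_shamu is hypothetical, and abc123 stands in for your username.

```shell
# Hypothetical helper, intended to be run on Arc: copy one remote path from Shamu.
#   $1 = path on Shamu, $2 = local destination on Arc
pull_from_shamu() {
    # -P 1209 selects Shamu's SSH port; replace abc123 with your username.
    scp -P 1209 "abc123@login.shamu.utsa.edu:$1" "$2"
}

# Example call (not run here because it needs network access and a password):
#   pull_from_shamu /work/abc123/filetest /work/abc123/.
```

Wrapping the command in a function keeps the port and hostname in one place when several files need to be pulled.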