Migrating Data From Arc to the Isilon archival storage
This guide provides several tools for migrating data from the Arc HPC environment to the Isilon archival storage. In some cases, it may be easier and more efficient to bundle your data prior to moving to the Isilon. Those options are discussed below.
Access to Arc and Isilon storage
Secure Shell (SSH) is used to access the Arc HPC environment. Users log in with their UTSA Active Directory (AD) account credentials.
Within Arc, the Isilon archival storage is available on the Login nodes via the /vault/research directory.
Note that the Isilon storage is only available on the two Arc Login nodes. This provides a convenient location for users to transfer their files back and forth between Arc and the Isilon.
The compute nodes are not able to access the Isilon storage. The Isilon is intended for archival storage use only. It is not designed to support data IO associated with HPC jobs.
Server Hostname: arc.utsa.edu
SSH Port: 22
Port 22 is the default SSH port and does not have to be specified.
$ ssh arc.utsa.edu
Outside of the Arc HPC environment, the Isilon archival storage is available on the UTSA network or over a VPN connection at:
\\smb.utsarr.net\research.
For example, you can open File Explorer on your Windows device to access the Isilon storage. Similarly, on a Mac, you can use Finder to access the Isilon.
Detailed information regarding the access and usage of the Isilon storage is available here:
KA Accessing Isilon_update_V5_2.pdf
Bundling Files into an Archive
Transferring large collections of files and directories between systems can be cumbersome. In some cases it can be easier and much more manageable to bundle them all together into a single archive file before moving them to a remote system. Also, the archive file can be compressed, which reduces the amount of data to transfer, and transferring a single archive file can be much more efficient than transferring large numbers of individual files, especially when they are small.
The "tar" utility in Linux can be used to archive and compress collections of files and directories on Arc.
Creating a Compressed Tar Archive
The "tar" command can bundle collections of files and directories and place them into a single archive file.
As an example, suppose you wanted to bundle "mydirectory" with all of its files and subdirectories:
/home/user
└── mydirectory
    ├── file1
    ├── file2
    ├── file3
    └── mysubdirectory
        ├── subdirectory-file1
        ├── subdirectory-file2
        └── subdirectory-file3
To create an archive of the directory structure and contents, use the tar command:
$ tar -czvf <archive filename> <files/directories to include>
Where:
- c = Create the tar archive
- z = Compress the tar archive
- v = Verbose output, which prints the file names as they are added to the tar file
- f = Name of the tar archive
Example:
$ ll
total 0
drwxrwxr-x. 3 user group 67 Aug 13 14:16 mydirectory
$ tar -czvf mytarfile.tgz mydirectory
mydirectory/
mydirectory/file1
mydirectory/file2
mydirectory/file3
mydirectory/mysubdirectory/
mydirectory/mysubdirectory/subdirectory-file1
mydirectory/mysubdirectory/subdirectory-file2
mydirectory/mysubdirectory/subdirectory-file3
$ ll
total 4
drwxrwxr-x. 3 user group 67 Aug 13 14:16 mydirectory
-rw-rw-r--. 1 user group 299 Aug 13 14:25 mytarfile.tgz
$
The file "mytarfile.tgz" is now a compressed archive of "mydirectory". That single file can now be transferred to another system or directory location, such as /vault/research, and then all the files and directories contained in the archive can be extracted there.
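Before transferring an archive, it can be worth confirming that it holds everything you expect. The following is a minimal sketch, assuming a throwaway scratch directory created with mktemp in place of /home/user and a cut-down version of the example tree. It compares the number of entries on disk with the number of entries listed in the archive.

```shell
# Hypothetical scratch area standing in for /home/user.
workdir=$(mktemp -d)
cd "$workdir"

# Recreate a small version of the example tree.
mkdir -p mydirectory/mysubdirectory
echo "example data" > mydirectory/file1
echo "example data" > mydirectory/mysubdirectory/subdirectory-file1

# Create the compressed archive (no -v, to keep the output quiet).
tar -czf mytarfile.tgz mydirectory

# Count the entries on disk and in the archive; the two should match.
on_disk=$(find mydirectory | wc -l)
in_archive=$(tar -tzf mytarfile.tgz | wc -l)
echo "on disk: $on_disk, in archive: $in_archive"
```

A matching count is only a quick sanity check. It does not prove that the file contents are intact.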
Viewing the Contents of a Compressed Tar Archive
To see what's contained in a tar file, use this command:
$ tar -tzvf mytarfile.tgz
drwxrwxr-x user/group 0 2021-08-13 14:16 mydirectory/
-rw-rw-r-- user/group 13 2021-08-13 14:10 mydirectory/file1
-rw-rw-r-- user/group 13 2021-08-13 14:10 mydirectory/file2
-rw-rw-r-- user/group 13 2021-08-13 14:10 mydirectory/file3
drwxrwxr-x user/group 0 2021-08-13 14:17 mydirectory/mysubdirectory/
-rw-rw-r-- user/group 26 2021-08-13 14:17 mydirectory/mysubdirectory/subdirectory-file1
-rw-rw-r-- user/group 26 2021-08-13 14:17 mydirectory/mysubdirectory/subdirectory-file2
-rw-rw-r-- user/group 26 2021-08-13 14:17 mydirectory/mysubdirectory/subdirectory-file3
$
Where:
- t = List the contents of the tar file
- z = The tar archive is compressed
- v = Verbose output, which prints the permissions, owner, size, and date of each file along with its name
- f = Name of the tar archive
This is useful for a couple of reasons:
First, it lets you inspect the contents of the archive without having to extract it.
Second, it shows how the files will be extracted into the current directory structure. Notice in the listing of the tar archive "mytarfile.tgz", all of the contents will be located within the directory "mydirectory". This directory will be created if it doesn't already exist. If it does already exist, the files will still be placed there and could overwrite any files with the same name as those in the archive.
If the tar file is just a list of files or if there are files at the top of the directory structure, it's good to know that they will be extracted and placed in your current directory.
For example, consider this other tar file with these contents:
$ tar -tzvf othertarfile.tgz
-rw-rw-r-- user/group 13 2021-08-13 14:10 file1
-rw-rw-r-- user/group 13 2021-08-13 14:10 file2
$
Extracting this tar file will place file1 and file2 in the directory where you run the tar extraction command.
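One way to tell the two cases apart before extracting is to print only the top-level names stored in an archive. This is a sketch under stated assumptions: the scratch directory and the helper function name top_level are made up for illustration, and the two archives are rebuilt here to mirror the examples above.

```shell
# Hypothetical scratch area; the archive names mirror the examples above.
workdir=$(mktemp -d)
cd "$workdir"
mkdir mydirectory
echo "example data" > mydirectory/file1
echo "example data" > file1
echo "example data" > file2
tar -czf mytarfile.tgz mydirectory
tar -czf othertarfile.tgz file1 file2

# Hypothetical helper: print the distinct top-level names in an archive.
top_level() {
    tar -tzf "$1" | cut -d/ -f1 | sort -u
}

top_level mytarfile.tgz      # one name: everything extracts under that directory
top_level othertarfile.tgz   # several names: they land in the current directory
```

If the helper prints more than one name, consider extracting the archive inside an empty directory so the files do not mix with existing ones.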
Extracting a Compressed Tar Archive
Once a tar file has been transferred to a new system or location, its contents can be extracted with the following command:
$ tar -xzvf mytarfile.tgz
Where:
- x = Extract tar archive
- z = The tar archive is compressed
- v = Verbose output, which prints the file names as they are extracted from the tar file
- f = Name of the tar archive
Note that the files are extracted relative to the directory where you issue the tar command.
$ pwd
/home/user
$ tar -zxvf /tmp/mytarfile.tgz
mydirectory/
mydirectory/file1
mydirectory/file2
mydirectory/file3
mydirectory/mysubdirectory/
mydirectory/mysubdirectory/subdirectory-file1
mydirectory/mysubdirectory/subdirectory-file2
mydirectory/mysubdirectory/subdirectory-file3
$ ll
total 0
drwxrwxr-x. 3 user group 67 Aug 13 14:16 mydirectory
$
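If you would rather not change directories first, tar's -C option extracts into a directory you name. The sketch below assumes a scratch directory from mktemp and a hypothetical target directory called "restore". Neither comes from the Arc documentation.

```shell
# Hypothetical scratch area with a small archive to extract.
workdir=$(mktemp -d)
cd "$workdir"
mkdir mydirectory
echo "example data" > mydirectory/file1
tar -czf mytarfile.tgz mydirectory

# Extract into a separate, freshly created directory instead of the current one.
mkdir restore
tar -xzf mytarfile.tgz -C restore

# The extracted tree now lives under restore/.
ls restore/mydirectory
```

Extracting into an empty directory also avoids overwriting existing files that share a name with files in the archive.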
See the man page for tar for additional information and details on the command.
Moving Files from Arc to Isilon
You can transfer files between any two locations on a Linux system using either rsync or cp. Because the Isilon archival storage is mounted on the Arc Login nodes, files can be moved directly from Arc to the Isilon using the methods below.
Transfer Using RSYNC
The rsync utility synchronizes files and directories between a source and a destination path. Unlike cp, rsync skips files that are already up to date at the destination and copies only those that are new or have changed. It is therefore efficient when you only need to update a small fraction of a large dataset at the destination location. A basic rsync command looks like this:
[login001: abc123]$ rsync /work/abc123/myfile /vault/research/COX/abc123/.
This command copies the file "myfile" to the destination directory on the Isilon.
To copy a directory and all subdirectories to a remote location, use this command:
[login001: abc123]$ rsync -avtr /work/abc123/mydirectory /vault/research/COX/abc123
Where:
- a = Archive mode, which preserves symbolic links, permissions, and other metadata (it also implies r and t)
- v = Verbose output
- r = Recursive copy, i.e., copy this directory and all subdirectories
- t = Preserve time stamps
The options on this rsync command are useful when synchronizing your data to a remote location. The first time you run this command, a copy of the file and directory structure is created at the remote location. After the initial copy, if some files then change on the source side and the command is run again, only those changes are copied to the remote location.
A trailing "/" on the source directory determines whether rsync copies the directory itself or only the contents of the directory.
For example, in the above command, the "mydirectory" directory and its contents will be located here after the copy completes: /vault/research/COX/abc123/mydirectory
/vault/research/COX/abc123
└── mydirectory
    ├── myfile1
    ├── myfile2
    └── ...
However, if you place a trailing "/" at the end of the source directory, as shown here, only the contents of "mydirectory" will be copied to the destination.
[login001: abc123]$ rsync -avtr /work/abc123/mydirectory/ /vault/research/COX/abc123
The result of the copy will be:
/vault/research/COX/abc123
├── myfile1
├── myfile2
└── ...
If an rsync transfer is interrupted for any reason, you can simply re-run the same rsync command. It will sync the items that were not copied or updated the previous time.
See the man page for rsync for additional information and details on the command.
Transfer Using CP
The Linux cp (copy) command is the built-in copy utility for the operating system. A simple cp transfer that copies a file named "filetest" from your current directory on Arc (for example, /work/abc123) to your /vault/research/COX/abc123 directory on the Isilon would look like this:
[login001: abc123]$ cp ./filetest /vault/research/COX/abc123
You can use wildcards with the cp command as shown below:
[login001: abc123]$ cp *.txt /vault/research/COX/abc123
This will copy all files in the current directory that end in ".txt" to the Isilon storage.
When copying a directory with multiple files, use tar to create a compressed archive of the directory, then transfer the archive as a single file:
[login001: abc123]$ tar -czvf ./mydata.tgz mydata # create compressed archive
[login001: abc123]$ cp ./mydata.tgz /vault/research/COX/abc123 # transfer archive
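After copying an archive, you may want to confirm that the copy matches the original before deleting anything on Arc. The following is a minimal sketch, assuming a mktemp directory named "vault" as a stand-in for /vault/research/COX/abc123 and a made-up "mydata" directory. It uses cmp, which compares two files byte by byte.

```shell
# Hypothetical scratch directories; "vault" stands in for
# /vault/research/COX/abc123.
workdir=$(mktemp -d)
vault=$(mktemp -d)
cd "$workdir"
mkdir mydata
echo "example data" > mydata/results.txt

# Bundle the directory and copy the archive to the destination.
tar -czf mydata.tgz mydata
cp mydata.tgz "$vault"

# cmp -s is silent and exits 0 only when the two files are identical.
if cmp -s mydata.tgz "$vault/mydata.tgz"; then
    echo "copy verified"
else
    echo "copy differs from the source" >&2
fi
```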
See the man page for cp for additional information and details on the command.