
Arc User-Guide

  1. Arc is the primary High Performance Computing (HPC) system at The University of Texas at San Antonio (UTSA). It can be used for running data-intensive, memory-intensive, and compute-intensive jobs from a wide range of disciplines and is equipped with:
    • 174 compute/GPU nodes and 2 login nodes in total; the majority of these nodes have Intel Cascade Lake CPUs and some have AMD EPYC CPUs
    • 30 GPU nodes, each with two 20-core CPUs (40 cores total), 384 GB of RAM, and one NVIDIA V100 GPU accelerator
    • 5 GPU nodes, each with two 20-core CPUs (40 cores total), 384 GB of RAM, and two NVIDIA V100 GPU accelerators
    • 2 GPU nodes, each with two CPUs, four NVIDIA V100 GPUs, and 384 GB of RAM
    • 2 GPU nodes, each with two AMD EPYC CPUs, one NVIDIA A100 80 GB GPU, and 1 TB of RAM
    • 2 large-memory nodes, each with four 20-core CPUs (80 cores total) and 1.5 TB of RAM
    • 1 large-memory node with two AMD EPYC CPUs and 2 TB of RAM
    • 1 node with two AMD EPYC CPUs and 1 TB of RAM
    • 5 nodes, each with two AMD EPYC CPUs, one NEC vector engine, and 1 TB of RAM
    • 100 Gb/s InfiniBand connectivity

    • Two Lustre filesystems, /home and /work, where /home has 110 TB of capacity and /work has 1.1 PB of capacity

    • A cumulative total of 250 TB of local scratch space (approximately 1.5 TB of /scratch on most compute/GPU nodes)

    • Multiple partitions (or queues) having different characteristics and constraints
      • amdonly: 1 node
      • amdbigmem: 1 node
      • amdgpu: 2 nodes
      • amdvector: 5 nodes
      • bigmem: 2 nodes
      • compute1: 65 nodes
      • compute2: 25 nodes
      • computedev: 5 nodes
      • gpu1v100: 28 nodes
      • gpu2v100: 5 nodes
      • gpu4v100: 2 nodes
      • gpudev: 2 nodes
      • two privately owned partitions consisting of 24 nodes
      • one privately owned partition equipped with 3 DGX A100 80 GB GPUs
      • two privately owned partitions equipped with Dell XE8640 servers with 4x H100 GPUs
    • Arc is accessible over SSH using two-factor authentication with DUO. The hostname for Arc is arc.utsa.edu and the SSH port number is 22. To use DUO, you must first register online at passphrase.utsa.edu. A minimal login example is shown below.
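      As a minimal sketch (abc123 is the placeholder username used throughout this guide; replace it with your own), logging in from a Linux or macOS terminal looks like this, with DUO prompting for the second factor after your password:

        ssh abc123@arc.utsa.edu
        # enter your passphrase, then approve the DUO push or enter a DUO passcode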

  2. Arc Fair-Use Policies
    • Running Jobs
      • Compute nodes are not shared among multiple users: when a user is allocated a compute node, they are the only user allowed to access it. This is done for both security and performance reasons, since multiple users sharing the same node can suffer degraded performance due to resource contention. While jobs from different users are no longer scheduled on the same node, users are encouraged to take advantage of tools such as GNU Parallel to co-schedule their multiple independent tasks on the compute nodes allocated to them. Please see Section 10 of the user-guide for further details on running multiple tasks concurrently on one or more nodes from a single Slurm job.
      • Each user is limited to 10 active jobs at a given point in time and to a maximum of 20 compute nodes across those jobs. As each compute node is dual-socket with a 20-core processor on each socket, up to 800 cores could potentially be in use by a single user at a given point in time. A sample job script that stays within these limits is shown after this list.
      • Each job is limited to a run-time of no more than 72 hours. Users are encouraged to consider implementing checkpoint-restart capabilities in their home-grown applications; the research computing support group will be happy to provide guidance on implementing checkpoint-restart mechanisms in users' code. Some third-party software packages, like the FLASH astrophysics code, already have built-in checkpoint-restart capabilities. Such capabilities can be enabled by setting the required environment variables. Users are encouraged to review the documentation of their software to confirm whether checkpoint-restart functionality is available in the software of their choice. Section 16 of this user-guide has further information on using checkpointing and restart.
      • Exceptions: If you require access to nodes for a longer period of time, or need access to more nodes than are allowed by default, please submit a service request ticket with an exemption request at the following URL: https://support.utsa.edu/myportal. We will need a brief description of the activity, the number of cores and nodes required, and the time duration for which you are requesting the exemption. We also ask that you explore options for checkpointing your code before submitting the service request.
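        The fair-use limits above map directly onto Slurm batch directives. The following is a minimal sketch only, not an official template: the partition name, job name, node count, and executable are illustrative and should be adjusted to your workload.

          #!/bin/bash
          #SBATCH --job-name=myjob           # illustrative job name
          #SBATCH --partition=compute1       # one of the partitions listed in Section 1
          #SBATCH --nodes=2                  # each user is limited to 20 nodes in total
          #SBATCH --ntasks-per-node=40       # 40 cores per dual-socket compute node
          #SBATCH --time=72:00:00            # jobs may not exceed 72 hours of run-time

          ./my_program                       # hypothetical executable; use srun for MPI programs

        Submit the script with "sbatch jobscript.slurm" and monitor it with "squeue -u abc123".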
    • Data Storage (Disk Usage)
      • Work Directory (/work/abc123) – as detailed in our Wiki, this directory is where you should place any input/output files as well as logs for your running jobs. This directory is NOT backed up and is not intended for long-term storage.
      • Work Directory Data Retention – all files in the Work directory that have not been accessed in the last 30 days will be likely candidates for deletion.
      • Home Directory (/home/abc123) – this directory is backed up but should only be used for installing and compiling code. Storage of datasets is permitted here, but a hard quota of 100 GB is enforced (a quota-check example is shown after this list).
      • Vault Directory (/vault/research/abc123) – each user on Arc is provided 1 TB of archival storage located in /vault. This storage space is accessible from Arc, as well as from Windows or Mac computers. This data is backed up and the backups are replicated to UT Arlington for an extra layer of protection. If additional storage space is needed on the "vault" system, please submit a service request at the following URL: https://support.utsa.edu/myportal
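        Since /home and /work are Lustre filesystems, your usage against the quotas described above can typically be checked with the Lustre lfs utility. This is a sketch under the assumption that quotas are enforced through Lustre's quota mechanism (abc123 is the placeholder username):

          lfs quota -h -u abc123 /home      # usage against the 100 GB /home quota
          lfs quota -h -u abc123 /work      # usage on /work, if quotas are configured there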

  3. Requesting an Account on Arc
    • If you are interested in requesting an account on Arc, please visit the support portal and search for "HPC account"

      Please note that sharing of User Credentials is strictly prohibited. Any violation of this policy could lead to suspension of your account on Arc.


  4. Prerequisite: Arc runs a Linux operating system, so basic knowledge of Linux is required to work efficiently on Arc from the command line.
    If you need help with learning Linux, the following link provides a quick overview of Linux and basic Linux commands: Express Linux Tutorial. A few everyday commands are shown below for quick reference.
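    For quick reference, a few everyday Linux commands (the file and directory names used here are illustrative):

      pwd                      # print the current working directory
      ls -l                    # list files in the current directory with details
      cd /work/abc123          # change to your work directory
      cp input.txt backups/    # copy a file into a directory
      man ls                   # read the manual page for a command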

  5. Logging into Arc, Submitting Jobs, and Monitoring Jobs on Arc

  6. File transfer
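    Files can be copied to and from Arc with standard SSH-based tools such as scp and rsync; DUO authentication applies to these transfers as well. A minimal sketch, with illustrative file and directory names:

      # copy a local file to your work directory on Arc
      scp mydata.csv abc123@arc.utsa.edu:/work/abc123/
      # synchronize a local directory to Arc, showing progress and resuming partial transfers
      rsync -avP results/ abc123@arc.utsa.edu:/work/abc123/results/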

  7. Modules for Managing User Environment on Arc

  8. Running C, C++, Fortran, Python, and R applications in Serial Mode
    • Both batch and interactive modes of running serial applications are covered; a minimal sketch of an interactive run is shown below
    • Code and scripts used in the examples shown in the document are available from this GitHub repository
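    As a minimal sketch of the interactive mode (the partition name, time limit, and module name are illustrative and may differ on Arc):

      # request an interactive shell on one compute node for one hour
      srun -p compute1 -N 1 -n 1 -t 01:00:00 --pty bash
      # on the compute node, load an environment and run the program serially
      module load python               # assumed module name; check "module avail"
      python3 my_script.py             # hypothetical script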

  9. Running Parallel Programs
    • Code and scripts used in the examples shown in the document are available from this GitHub repository
    • OpenMP, MPI, and CUDA examples are covered in this document
    • C, C++ and Fortran are the base languages used
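    As a compile-and-run sketch only (the source file names are illustrative, and the compiler and MPI modules that must be loaded first depend on Arc's module tree):

      # OpenMP: compile, then run with one thread per core on a 40-core node
      gcc -fopenmp -O2 -o hello_omp hello_omp.c
      export OMP_NUM_THREADS=40
      ./hello_omp

      # MPI: compile with the MPI wrapper, then launch the ranks under Slurm
      mpicc -O2 -o hello_mpi hello_mpi.c
      srun -n 80 ./hello_mpi           # e.g., 80 ranks across two 40-core nodes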

  10. Running Multiple Copies of Executables Concurrently from the Same Job
    • Running multiple executables concurrently from the same job is covered
    • Using GNU Parallel for running parameter-sweep applications is covered
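    As a sketch of the GNU Parallel pattern for a parameter sweep inside a single Slurm job (the executable name and parameter range are illustrative, and GNU Parallel may need to be loaded as a module first):

      # run ./simulate once for each parameter value, keeping 40 tasks running at a time
      parallel -j 40 ./simulate {} ::: $(seq 1 100)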

  11. Accessing and Running Code on Vector Engines in Arc

  12. Additional Python and R Usage Information

  13. Using Some of the Popular Software Packages that are Installed System-Wide

  14. Using Containers (Singularity and Docker) on Arc

  15. Open On Demand Virtual Desktop

  16. Visualization Using Paraview on Arc

  17. Setting Java Environment for Applications with Java Dependencies

  18. Application Checkpointing and Restart on Arc

  19. Checking Currently Installed Software on Arc
    • To check the list of the currently available software packages on Arc, please use the "module spider" or "module avail" command from a compute node
    • Details on using the module commands for managing the shell environment on Arc are available here
    • By default, a module named XALT [1] is loaded into everyone's shell environment. XALT is a tool that allows the Arc HPC support staff to collect and understand job-level information about the libraries and executables that end-users access during their jobs. This assists us in tracking user executables and library usage on the cluster. If you experience an issue that may involve XALT, the module can be removed using the module unload command.
    • The list of software packages that are available on Arc as of August 23, 2021 can be found here
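    For reference, the module commands mentioned above look like the following (the package name searched for is illustrative, and the exact name of the XALT module can be confirmed with "module list"):

      module avail                 # list packages visible in the current module tree
      module spider python         # search all module trees for a package
      module list                  # show currently loaded modules, including XALT
      module unload xalt           # remove XALT from the environment if it causes issues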

  20. Technical Support
    • For technical support, you can submit a support request for Arc at the following link: https://support.utsa.edu/myportal. Instructions for submitting support requests can be found here.
    • The Research Computing Support Group is available from 8:00 AM to 5:00 PM on all business days to assist with service requests.
    • Our time-to-response on new tickets is 4 business hours, and the time-to-resolution varies depending upon the complexity of the issue.
      • Please open a new ticket for every new topic
      • Once a ticket is closed, you are welcome to reopen it if the exact topic addressed in the ticket remains unresolved
    • For after-hours emergency support, please contact Tech Cafe at 210-458-5555.

  21. Training and Workshops

References
  1. "User Environment Tracking and Problem Detection with XALT," K. Agrawal, M. R. Fahey, R. McLay, and D. James, In Proceedings of the First International Workshop on HPC User Support Tools, HUST '14, Nov. 2014. dx.doi.org/10.1109/HUST.2014.6.
Attachments
  • Deep Learning Model on CIFAR10 dataset using PyTorch on GPU nodes.pdf – PyTorch on GPUs
  • Express_Linux_Tutorial-SizeOptimized.pdf – Quick Linux tutorial (saved as a "reduced size" PDF to stay below the 10 MB size limit)
  • Installation and Working of Deep Learning Libraries (TensorFlow) on Remote Linux Systems (Stampede2 and Arc).pdf – TensorFlow
  • RUNNING MATLAB "Hello, World" Example on Remote Linux Systems (1).pdf – Sample MATLAB job
  • Running_Jobs_On_Arc.pdf – Running jobs on Arc
  • migrate-shamu2arc – Bash wrapper script for rsync to migrate user home and/or work data from Shamu to Arc
  • running_c_cpp_fortran_python_r.pdf – Running C, C++, Fortran, Python, and R applications in serial mode
  • running_executables_and_gnu_parallel.pdf – Executables and GNU Parallel
  • running_parallel_programs_on_Arc.pdf – Running parallel programs on Arc