Using NEC SX-Aurora Vector Engines On Arc
This guide provides an overview of the NEC SX-Aurora TSUBASA Vector Engine (VE) nodes available on the Arc HPC Cluster.
ARC Vector Engine Components
Vector Engine Compiler Node (vc001)
The ARC HPC environment includes 1 x VE Compiler node. This node is licensed to enable users to compile code for the Vector Engine nodes.
After compiling your VE code, you can run your compiled programs on any of the 5 x Vector Engine nodes.
Alongside the compilers, the following NEC binary utilities are also available on the compiler node:
- nld (Linker)
- nar (Archiver)
- nranlib (Index generator for archives)
Vector Engine Compute Nodes (v001 - v005)
The Arc HPC environment includes 5 x AMD Compute Nodes, each with a NEC Vector Engine Card, providing high memory bandwidth and computational throughput with improved power efficiency.
VE Compute Node specifications:
- Server node Cores: 2 x physical AMD CPUs, each with 8 cores and simultaneous multithreading (SMT), providing a total of 32 logical cores
- Server node RAM: 1TB
- Vector Engine Card RAM: 48GB
- Vector Engine Card Cores: 8
- Vector Engine Card Memory Bandwidth: 1.53TB/s
The Vector Engine nodes are named: v001 - v005
Vector Engine Slurm Partitions
The Arc HPC cluster contains over a dozen Slurm partitions, each representing a unique set of resources to help meet the scientific computing needs of our users. Two partitions have been set up specifically for use with the Vector Engine nodes:
- amdvcompiler - Partition containing node vc001 that is used to compile VE code
Access to this partition is limited to 2 hours per job. Since this is the only node available for performing compilations, the shorter time limit ensures fair access for all users.
- amdvector - Partition containing nodes v001 - v005 that are used to run VE code
Jobs in this partition have a 3-day time limit and run on one of the five VE nodes.
Accessing Vector Engine Nodes
The AMD Vector nodes are accessed like any other node in the cluster.
Interactive shell with SRUN
You can use the following commands to get an interactive shell on one of the nodes.
This command will provide a bash shell on one of the Vector Engine nodes:
[login001: abc123]$ srun -p amdvector -n 1 -t 01:30:00 --pty bash
This command will provide a bash shell on the Vector Compiler node:
[login001: abc123]$ srun -p amdvcompiler -n 1 -t 01:30:00 --pty bash
Submitting Batch Jobs with SBATCH
You can also submit a batch job:
[login001: abc123]$ sbatch my_ve_jobscript.slurm
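A minimal jobscript for the amdvector partition might look like the following sketch. The job name, time limit, and program name (mmultest, from the compilation example later in this guide) are illustrative assumptions, not site requirements:

```bash
#!/bin/bash
#SBATCH -J ve_test            # job name (illustrative)
#SBATCH -p amdvector          # run on the Vector Engine compute nodes
#SBATCH -n 1                  # a single task
#SBATCH -t 01:00:00           # well under the partition's 3-day limit

# Load the NEC Vector environment (updates PATH and sources
# nlcvars.sh and necmpivars.sh, as described below)
module load vector

# Run a program previously compiled on vc001
./mmultest 3000 2000 5000
```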
Loading the Vector Linux Module
After accessing a node in an interactive shell, you can load the NEC Vector module to update your PATH and other environment variables. This enables you to invoke the compiler and various other tools without having to specify the full path. This command can also be included in your SBATCH script to ensure the proper environment variables are available for your job.
When the vector module is loaded, two shell scripts are sourced as part of the process: nlcvars.sh and necmpivars.sh. These scripts provide additional environment variable definitions and can be called with additional parameters if needed; the module sources them without any parameters. A brief explanation of the additional parameters is included in the output when loading the vector module.
[vc001: abc123]$ module load vector/2.8-1
The vector module version 2.8-1 is loaded.
--- Sourcing: /opt/nec/ve/nlc/2.3.0/bin/nlcvars.sh
/opt/nec/ve/nlc/2.3.0/bin/nlcvars.sh can be called with alternate parameters:
Usage: source nlcvars.sh [ARGUMENT]...
i64 specify the default integer type is 64-bit
mpi specify MPI is used
(default: no use of MPI)
--- Sourcing: /opt/nec/ve/mpi/2.21.0/bin64/necmpivars.sh
/opt/nec/ve/mpi/2.21.0/bin64/necmpivars.sh can be called with alternate parameters:
necmpivars.sh can take additional parameters, however the
"necmpivars.sh [gnu|intel] [version]" format should only be
used at runtime in order to use VH MPI shared libraries other
than those specified by RUNPATH embedded in a MPI program
executable by the MPI compile command. In other cases,
should be used without arguments.
The "version" parameter is a directory name in the following directory:
/opt/nec/ve/mpi/2.21.0/lib64/vh/gnu (if gnu is specified)
/opt/nec/ve/mpi/2.21.0/lib64/vh/intel (if intel is specified)
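For example, if your program uses 64-bit default integers, you can re-source nlcvars.sh yourself with the i64 argument (using the path shown in the module output above):

[vc001: abc123]$ source /opt/nec/ve/nlc/2.3.0/bin/nlcvars.sh i64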
Vector Engine Coding
Compiling Vector Engine Code
Due to licensing restrictions, Vector Engine code can only be compiled on the Vector Engine Compiler node, vc001.
There are several compilers available on vc001, including ncc for C, nc++ for C++, and nfort for Fortran. Each compiler offers numerous options and suboptions. A brief explanation of the compiler options is provided by invoking the command with the --help option.
[login001: abc123]$ srun --pty -t 02:00:00 -n 1 -p amdvcompiler bash
[vc001: abc123]$ module load vector
[vc001: abc123]$ ncc --help
For example, this matrix multiplication C program, mmultest.c, is compiled with several options for vector optimization:
[vc001: abc123]$ ncc mmultest.c -O4 -report-all -fdiag-vector=2 -o mmultest
ncc: opt(1592): mmultest.c, line 11: Outer loop unrolled inside inner loop.: j
ncc: vec( 101): mmultest.c, line 14: Vectorized loop.
ncc: vec( 126): mmultest.c, line 15: Idiom detected.: Sum
ncc: vec( 128): mmultest.c, line 15: Fused multiply-add operation applied.
ncc: opt(1592): mmultest.c, line 24: Outer loop unrolled inside inner loop.: i
ncc: vec( 101): mmultest.c, line 26: Vectorized loop.
- -O sets the optimization level: 0 = disabled, 4 = aggressive optimization
- -fdiag-vector=n sets the vector diagnostics level (0: no output, 1: information, 2: detail; default: -fdiag-vector=1)
- -report-all outputs the code generation, diagnostic, format, inline, option, and vector lists
Running Your Compiled Programs
Your compiled programs can be run on any of the Vector Engine nodes: v001 - v005.
Continuing with the matrix multiplication C program example, mmultest.c, from the previous section, you can use SRUN to access the amdvector Slurm partition and then run your code.
[login001: abc123]$ srun --pty -t 02:00:00 -n 1 -p amdvector bash
[v001: abc123]$ module load vector
[v001: abc123]$ ./mmultest 3000 2000 5000
The elapsed time to multiply a [3000 x 2000] matrix with a [2000 x 5000] matrix is 2.41 seconds.
If you had compiled your code with the -O0 option, which disables vector optimization, the same matrix multiplication takes much longer to run:
[v001: abc123]$ ./mmultest 3000 2000 5000
The elapsed time to multiply a [3000 x 2000] matrix with a [2000 x 5000] matrix is 1266.63 seconds.
Additional Resources & Documentation
Getting Started: Aurora Vectorization Training
This site contains training slides and exercises that can be used for self-study: SX-Aurora TSUBASA Vectorization Training
Getting Started: SX-Aurora TSUBASA Performance Tuning Guide
The SX-Aurora TSUBASA Performance Tuning Guide is a recommended resource for providing a comprehensive background and foundation for utilizing the Vector Engines.
An archived copy of the document is available here: AuroraVE_TuningGuide.pdf
An online version is also available here: AuroraVE_TuningGuide
References: VEOS - Vector Engine Operating System Functionality for VE Programs
VEOS Documentation Library
References: SDK - NEC Software Development Kit for Vector Engines
NEC SDK Documentation Library
References: NEC MPI - NEC Message Passing Interface for Vector Engines
NEC MPI Documentation Library
Additional resources can be found at this link: SX-Aurora Documentation Library
- 21 Sep 2022