Using NEC SX-Aurora Vector Engines On Arc

This guide provides an overview of the NEC SX-Aurora TSUBASA Vector Engine (VE) nodes available on the Arc HPC Cluster.

Arc Vector Engine Components

Vector Engine Compiler Node (vc001)

The Arc HPC environment includes 1 x VE Compiler node. This node is licensed to enable users to compile code for the Vector Engine nodes. After compiling your VE code, you can run your compiled programs on any of the 5 x Vector Engine nodes.

Cross Compilers:
  • nfort
  • ncc
  • nc++
  • nld (Linker)
  • nar (Archiver)
  • nranlib (Index generator for archives)

MPI wrappers:
  • mpinfort
  • mpincc
  • mpinc++

Vector Engine Compute Nodes (v001 - v005)

The Arc HPC environment includes 5 x AMD Compute Nodes, each with an NEC Vector Engine Card that provides increased memory bandwidth and computational ability with increased power efficiency.

VE Compute Node specifications:
  • Server node Cores: 2 x Physical AMD CPUs, each with 8 cores and hyperthreading, providing a total of 32 logical Cores
  • Server node RAM: 1TB
  • Vector Engine Card RAM: 48GB
  • Vector Engine Card Cores: 8
  • Vector Engine Card Memory Bandwidth: 1.53TB/s

The Vector Engine nodes are named: v001 - v005

Vector Engine Slurm Partitions

The Arc HPC cluster contains over a dozen Slurm partitions, each representing a unique set of resources to help meet the scientific computing needs of our users. Two partitions have been set up specifically for use with the Vector Engine nodes:

  • amdvcompiler - Partition containing node vc001, which is used to compile VE code
    Jobs in this partition are limited to 2 hours. Because vc001 is the only node available for compiling, the time limit is kept short so that all users get fair access.

  • amdvector - Partition containing nodes v001 - v005, which are used to run VE code
    Jobs in this partition have a 3 day time limit and can run on any of the VE nodes.

Accessing Vector Engine Nodes

The AMD Vector nodes are accessed like any other node in the cluster.

Interactive shell with SRUN

You can use the following commands to get an interactive shell on one of the nodes.

This command will provide a bash shell on one of the Vector Engine nodes:

[login001: abc123]$ srun -p amdvector -n 1 -t 01:30:00 --pty bash

This command will provide a bash shell on the Vector Compiler node:

[login001: abc123]$ srun -p amdvcompiler -n 1 -t 01:30:00 --pty bash

Submitting Batch jobs with SBATCH

You can also submit a batch job:

[login001: abc123]$ sbatch my_ve_jobscript.slurm
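
As a minimal sketch of what my_ve_jobscript.slurm might contain, the commands below write a small job script for the amdvector partition. The job name, output file name, and resource requests are hypothetical placeholders, and the script assumes an executable such as the mmultest example from later in this guide has already been compiled on vc001 and sits in the directory you submit from.

```shell
# Write a hypothetical job script; adjust the resource requests to your needs.
cat > my_ve_jobscript.slurm <<'EOF'
#!/bin/bash
# Job name and output file are placeholders (%j expands to the Slurm job ID)
#SBATCH --job-name=mmultest
#SBATCH --output=mmultest_%j.out
# Run on one of the VE compute nodes, v001 - v005
#SBATCH --partition=amdvector
#SBATCH --ntasks=1
# Requested wall time; must stay within the 3 day partition limit
#SBATCH --time=01:30:00

# Set up PATH and related environment variables for the NEC tools
module load vector

# Run a program that was compiled earlier on vc001 (assumed to exist)
./mmultest 3000 2000 5000
EOF
```

The resulting file can then be submitted with the sbatch command shown above. The job's output should appear in the file named by --output once the job has run.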

Loading the Vector Linux Module

After accessing a node in an interactive shell, you can load the NEC Vector module to update your PATH and other environment variables. This enables you to invoke the compiler and various other tools without having to specify the full path. This command can also be included in your SBATCH script to ensure the proper environment variables are available for your job.

When the vector module is loaded, two shell scripts are sourced as part of the process: one from /opt/nec/ve/nlc/2.3.0/bin/ and one from /opt/nec/ve/mpi/2.21.0/bin64/. Those scripts provide additional environment variable definitions and accept optional parameters, if needed. The module sources the scripts without any parameters. A brief explanation of the optional parameters is included in the output when loading the vector module.

[vc001: abc123]$ module load vector/2.8-1
The vector  module version 2.8-1  is loaded.

--- Sourcing: /opt/nec/ve/nlc/2.3.0/bin/
Additional Notes:
/opt/nec/ve/nlc/2.3.0/bin/ can be called with alternate parameters:

Usage: source [ARGUMENT]...

  i64  specify the default integer type is 64-bit
       (default: 32-bit)
  mpi  specify MPI is used
       (default: no use of MPI)

--- Sourcing: /opt/nec/ve/mpi/2.21.0/bin64/
Additional Notes:
/opt/nec/ve/mpi/2.21.0/bin64/ can be called with alternate parameters: can take additional parameters, however the
" [gnu|intel] [version]" format should only be
used at runtime in order to use VH MPI shared libraries other
than those specified by RUNPATH embedded in a MPI program
executable by the MPI compile command.  In other cases,
"source /opt/nec/ve/mpi/2.21.0/bin/"
should be used without arguments.

The "version" parameter is a directory name in the following directory:
  /opt/nec/ve/mpi/2.21.0/lib64/vh/gnu (if gnu is specified)
  /opt/nec/ve/mpi/2.21.0/lib64/vh/intel (if intel is specified)
[vc001: abc123]$

Vector Engine Coding

Compiling Vector Engine Code

Due to licensing restrictions, Vector Engine code can only be compiled on the Vector Engine Compiler node, vc001.

There are several compilers available on vc001, including ncc for C, nfort for Fortran, and nc++ for C++. Each compiler has numerous options and suboptions. A brief explanation of the compiler options is provided by invoking the command with the --help option.

[login001: abc123]$ srun  --pty -t 02:00:00 -n 1 -p amdvcompiler bash
[vc001: abc123]$ module load vector
[vc001: abc123]$ ncc --help

For example, this matrix multiplication C program, mmultest.c, is compiled with several options for vector optimization:

[vc001: abc123]$ ncc mmultest.c -O4 -report-all -fdiag-vector=2 -o mmultest
ncc: opt(1592): mmultest.c, line 11: Outer loop unrolled inside inner loop.: j
ncc: vec( 101): mmultest.c, line 14: Vectorized loop.
ncc: vec( 126): mmultest.c, line 15: Idiom detected.: Sum
ncc: vec( 128): mmultest.c, line 15: Fused multiply-add operation applied.
ncc: opt(1592): mmultest.c, line 24: Outer loop unrolled inside inner loop.: i
ncc: vec( 101): mmultest.c, line 26: Vectorized loop.
[vc001: abc123]$

  • -On sets the optimization level n, from 0 (disabled) to 4 (aggressive optimization)
  • -fdiag-vector=n sets the vector diagnostics level n (0: no output, 1: information, 2: detail; default: -fdiag-vector=1)
  • -report-all outputs the code generation list, diagnostic list, format list, inline list, option list and vector list.
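
The same compile step could also be run non-interactively as a batch job in the amdvcompiler partition. The sketch below writes such a job script. The file names, job name, and the 15 minute time request are hypothetical, and the ncc command line is simply the one from the example above.

```shell
# Write a hypothetical compile job script for the VE compiler node.
cat > compile_mmultest.slurm <<'EOF'
#!/bin/bash
# Job name and log file are placeholders (%j expands to the Slurm job ID)
#SBATCH --job-name=build_mmultest
#SBATCH --output=build_mmultest_%j.log
# vc001 is the only node licensed for compiling VE code
#SBATCH --partition=amdvcompiler
#SBATCH --ntasks=1
# Requested wall time; must stay within the 2 hour partition limit
#SBATCH --time=00:15:00

# Put ncc and the other NEC tools on the PATH
module load vector

# Same options as the interactive example: aggressive optimization
# plus detailed vectorization diagnostics
ncc mmultest.c -O4 -report-all -fdiag-vector=2 -o mmultest
EOF
```

Submit it with sbatch compile_mmultest.slurm. Slurm writes both standard output and standard error to the --output file when no separate error file is requested, so any compiler messages printed during the job should end up in that log.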

Running Your Compiled Programs

Your compiled programs can be run on any of the Vector Engine nodes: v001 - v005.

Continuing with the matrix multiplication C program example, mmultest.c, from the previous section, you can use SRUN to access the amdvector Slurm partition and then run your code.

[login001: abc123]$ srun  --pty -t 02:00:00 -n 1 -p amdvector bash
[v001: abc123]$ module load vector
[v001: abc123]$ ./mmultest 3000 2000 5000
The elapsed time to multiply a [3000 x 2000] matrix with a [2000 x 5000] matrix is 2.41 seconds.

If you compile the same code with the -O0 option, which disables optimization, the same matrix multiplication takes much longer to run:

[v001: abc123]$ ./mmultest 3000 2000 5000
The elapsed time to multiply a [3000 x 2000] matrix with a [2000 x 5000] matrix is 1266.63 seconds.
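
To put the two timings side by side, the short sketch below computes the ratio between them. The variable names are illustrative only, and the numbers are the two elapsed times reported above.

```shell
# Elapsed times reported above for the [3000 x 2000] x [2000 x 5000] multiplication
t_O4=2.41      # seconds, binary built with -O4
t_O0=1266.63   # seconds, binary built with -O0

# The shell only does integer arithmetic, so let awk compute the ratio
speedup=$(awk -v slow="$t_O0" -v fast="$t_O4" 'BEGIN { printf "%.0f", slow / fast }')
echo "The -O4 build is roughly ${speedup}x faster than the -O0 build"
```

For these two measurements the ratio comes out at roughly 526. Timings for other programs and problem sizes will differ.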

Additional Resources & Documentation

Getting Started: Aurora Vectorization Training

This site contains training slides and exercises that can be used for self study: SX-Aurora TSUBASA Vectorization Training

Getting Started: SX-Aurora TSUBASA Performance Tuning Guide

The SX-Aurora TSUBASA Performance Tuning Guide is a recommended resource for providing a comprehensive background and foundation for utilizing the Vector Engines.

An archived copy of the document is available here: AuroraVE_TuningGuide.pdf

An online version is also available here: AuroraVE_TuningGuide

References: VEOS - Vector Engine Operating System Functionality for VE Programs

VEOS Documentation Library

References: SDK - NEC Software Development Kit for Vector Engines

NEC SDK Documentation Library

References: NEC MPI - NEC Message Passing Interface for Vector Engines

NEC MPI Documentation Library

Other Resources

Additional resources can be found at this link: SX-Aurora Documentation Library

-- AdminUser - 21 Sep 2022
Topic revision: r9 - 21 Sep 2022, AdminUser