Using PyTorch With Multiple GPU Devices Across Multiple Nodes


The following article explains how to train a PyTorch-based deep learning model on multiple GPU devices across multiple nodes of an HPC cluster:

https://tuni-itc.github.io/wiki/Technical-Notes/Distributed_dataparallel_pytorch/

The sample code in the article contains two bugs: the first line below uses args.gpus where args.gpu is intended, and the second passes model_sync where model is intended. Replace these lines:

model = AE(input_shape=784).cuda(args.gpus)
model = torch.nn.parallel.DistributedDataParallel(model_sync, device_ids=[args.gpu], find_unused_parameters=True)

with:

model = AE(input_shape=784).cuda(args.gpu)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)
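To show where the corrected lines sit, here is a minimal sketch rather than the article's actual training script. The AE class below is a hypothetical stand-in for the article's autoencoder, and wrap_model and run_single_process_demo are illustrative helper names that do not appear in the article. The demo assumes a single process (world size 1) with a file-based rendezvous, and falls back to the CPU with the gloo backend when no GPU is present, so it can be tried outside the cluster.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn


class AE(nn.Module):
    """Hypothetical stand-in for the article's autoencoder (assumed layout)."""

    def __init__(self, input_shape):
        super().__init__()
        self.encoder = nn.Linear(input_shape, 128)
        self.decoder = nn.Linear(128, input_shape)

    def forward(self, x):
        return torch.relu(self.decoder(torch.relu(self.encoder(x))))


def wrap_model(gpu):
    """Move the model to one device, then wrap that same object in DDP."""
    if torch.cuda.is_available():
        torch.cuda.set_device(gpu)
        # Use the per-process device index here (args.gpu in the article).
        model = AE(input_shape=784).cuda(gpu)
        # Pass the model that was just moved, not some other variable.
        return nn.parallel.DistributedDataParallel(
            model, device_ids=[gpu], find_unused_parameters=True
        )
    # CPU fallback for illustration only: device_ids must be omitted.
    model = AE(input_shape=784)
    return nn.parallel.DistributedDataParallel(
        model, find_unused_parameters=True
    )


def run_single_process_demo():
    """Create a one-process group, wrap the model, run one forward pass."""
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    with tempfile.TemporaryDirectory() as tmp:
        init_file = os.path.join(tmp, "ddp_init")
        dist.init_process_group(
            backend=backend,
            init_method=f"file://{init_file}",
            rank=0,
            world_size=1,
        )
        try:
            ddp_model = wrap_model(gpu=0)
            device = next(ddp_model.parameters()).device
            batch = torch.randn(4, 784, device=device)
            return tuple(ddp_model(batch).shape)
        finally:
            dist.destroy_process_group()
```

In a real multi-node job, each process would instead call init_process_group with its own rank and the full world size, and pass its local GPU index to wrap_model.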

Topic revision: r1 - 08 Apr 2022, AdminUser
