Here is an article on how to train a PyTorch-based deep learning model on multiple GPU devices across multiple nodes of an HPC cluster:

https://tuni-itc.github.io/wiki/Technical-Notes/Distributed_dataparallel_pytorch/

There are two bugs in the article's sample code: the model is moved to args.gpus (the number of GPUs per node) instead of args.gpu (the local GPU index), and the DistributedDataParallel wrapper references an undefined variable, model_sync. Make sure to change the following lines of code:

model = AE(input_shape=784).cuda(args.gpus) 
model = torch.nn.parallel.DistributedDataParallel( model_sync, device_ids=[args.gpu], find_unused_parameters=True )

to:

model = AE(input_shape=784).cuda(args.gpu)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)
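
For context, here is a minimal sketch of how the corrected lines fit into the per-process setup the article describes. The names args.gpu (local GPU index), args.gpus (GPUs per node), args.nr (node index), and args.world_size follow the article's naming; the AE stand-in class and the environment-variable defaults are assumptions made for illustration, so treat this as an outline rather than a drop-in replacement for the article's full script.

import os

import torch
import torch.distributed as dist
import torch.nn as nn


class AE(nn.Module):
    # Minimal stand-in for the article's autoencoder; only the input_shape
    # keyword matters for this sketch (784 = a flattened 28x28 MNIST image).
    def __init__(self, input_shape):
        super().__init__()
        self.encoder = nn.Linear(input_shape, 128)
        self.decoder = nn.Linear(128, input_shape)

    def forward(self, x):
        return self.decoder(torch.relu(self.encoder(x)))


def setup_and_wrap(args):
    # env:// initialization reads the rendezvous address from the environment;
    # on a real cluster these would be set by the job script (defaults here
    # are assumptions for a single-node test run).
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")

    # Each process drives exactly one GPU; the global rank is derived from the
    # node index and the local GPU index, following the article's convention.
    rank = args.nr * args.gpus + args.gpu
    dist.init_process_group(backend="nccl", init_method="env://",
                            world_size=args.world_size, rank=rank)
    torch.cuda.set_device(args.gpu)

    # The corrected lines: move the model to this process's GPU (args.gpu, the
    # local index, not args.gpus, the per-node count) and wrap the same object
    # (model, not the undefined model_sync).
    model = AE(input_shape=784).cuda(args.gpu)
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[args.gpu], find_unused_parameters=True
    )
    return model

Each worker process spawned via torch.multiprocessing.spawn would call setup_and_wrap with its own local GPU index, after which training proceeds as in the article.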