Improving all-reduce collective operations for imbalanced process arrival patterns

Abstract

Two new algorithms for the all-reduce operation, optimized for imbalanced process arrival patterns (PAPs), are presented: (1) the sorted linear tree and (2) the pre-reduced ring. In addition, a new method of online PAP detection is introduced, including the estimation of process arrival times and their distribution among the cooperating processes. The paper provides the idea, pseudo-code, implementation details, a benchmark for performance evaluation, and a real-world machine learning use case. The experimental results are described and analyzed, showing that the proposed solution scales well and improves performance compared with the commonly used ring and Rabenseifner algorithms.
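
For context, the ring algorithm mentioned above as a baseline splits the buffer into one chunk per process and performs a reduce-scatter pass followed by an allgather pass around a logical ring. The C/MPI sketch below is only a minimal illustration of that classic ring scheme (summing doubles, with a hypothetical helper ring_allreduce_sum and a buffer length assumed divisible by the number of processes); it is not code from the paper and does not implement the proposed sorted linear tree or pre-reduced ring algorithms.

/* Minimal sketch of the classic ring all-reduce, used here purely as an
 * illustration of the baseline; NOT the paper's sorted linear tree or
 * pre-reduced ring.  Assumptions: sum of doubles, count % size == 0. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void ring_allreduce_sum(double *buf, int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int chunk = count / size;                 /* assumption: divides evenly */
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    double *tmp = malloc(chunk * sizeof(double));

    /* Phase 1: reduce-scatter.  After size-1 steps, rank r holds the
     * fully reduced chunk with index (r + 1) % size. */
    for (int step = 0; step < size - 1; step++) {
        int send_idx = (rank - step + size) % size;
        int recv_idx = (rank - step - 1 + size) % size;
        MPI_Sendrecv(buf + send_idx * chunk, chunk, MPI_DOUBLE, right, 0,
                     tmp, chunk, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < chunk; i++)
            buf[recv_idx * chunk + i] += tmp[i];   /* local reduction */
    }

    /* Phase 2: allgather.  Circulate the reduced chunks so that every
     * rank ends up with the complete result. */
    for (int step = 0; step < size - 1; step++) {
        int send_idx = (rank + 1 - step + size) % size;
        int recv_idx = (rank - step + size) % size;
        MPI_Sendrecv(buf + send_idx * chunk, chunk, MPI_DOUBLE, right, 0,
                     buf + recv_idx * chunk, chunk, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
    free(tmp);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int count = 8 * size;                     /* divisible by size by construction */
    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        buf[i] = rank + 1.0;                  /* rank r contributes r + 1 */

    ring_allreduce_sum(buf, count, MPI_COMM_WORLD);

    if (rank == 0)                            /* expected sum: 1 + 2 + ... + size */
        printf("buf[0] = %g (expected %g)\n", buf[0], size * (size + 1) / 2.0);

    free(buf);
    MPI_Finalize();
    return 0;
}

The ring scheme needs 2(size-1) steps and is bandwidth-efficient, but every step waits for the slowest neighbor, so a single late-arriving process stalls the whole pipeline; that sensitivity to imbalanced arrival is the behavior the paper's PAP-aware variants are designed to reduce.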

Citations

  • CrossRef: 6
  • Web of Science: 0
  • Scopus: 6

Full text

Downloaded 94 times.
Publication version: Accepted or Published Version
License: Creative Commons CC-BY

Details

Category: Articles
Type: article in a journal distinguished in the JCR
Published in: JOURNAL OF SUPERCOMPUTING, vol. 74, pages 3071-3092, ISSN: 0920-8542
Language: English
Publication year: 2018
Bibliographic description: Proficz J.: Improving all-reduce collective operations for imbalanced process arrival patterns // JOURNAL OF SUPERCOMPUTING. Vol. 74, iss. 7 (2018), pp. 3071-3092
DOI: 10.1007/s11227-018-2356-z
Bibliography:
  1. CIFAR-10 and CIFAR-100 datasets. https://www.cs.toronto.edu/~kriz/cifar.html. Accessed 4 Jan 2018
  2. MPI 3.1 collective communication. http://mpi-forum.org/docs/mpi-3.1/mpi31-report/node95.htm. Accessed 26 Jan 2018
  3. MPICH high-performance portable MPI. https://www.mpich.org/. Accessed 7 Sep 2017
  4. Open MPI: open source high performance computing. https://www.open-mpi.org/. Accessed 27 Aug 2017
  5. POSIX threads programming. https://computing.llnl.gov/tutorials/pthreads/. Accessed 5 Jan 2018
  6. The standardization forum for the Message Passing Interface (MPI). http://mpi-forum.org/. Accessed 24 Jan 2018
  7. Tiny-dnn: header only, dependency-free deep learning framework in C++. https://github.com/tiny-dnn/tiny-dnn. Accessed 4 Jan 2018
  8. Czarnul P, Kuchta J, Matuszek M, Proficz J, Rościszewski P, Wójcik M, Szymański J (2017) MERPSYS: an environment for simulation of parallel application execution on large scale HPC systems. Simul Model Pract Theory 77:124-140
  9. Dean J, Corrado G, Monga R, Chen K, Devin M, Le QV, Mao M, Ranzato M, Senior A, Tucker P, Yang K, Ng AY (2012) Large scale distributed deep networks. In: Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Curran Associates, Inc., pp 1223-1231
  10. Faraj A, Yuan X, Lowenthal D (2006) STAR-MPI: self tuned adaptive routines for MPI collective operations. In: Proceedings of the 20th Annual International Conference on Supercomputing, pp 199-208
  11. Faraj A, Patarasuk P, Yuan X (2008) A study of process arrival patterns for MPI collective operations. Int J Parallel Program 36(6):543-570
  12. Hockney RW (1994) The communication challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Comput 20(3):389-398
  13. Krawczyk H, Krysztop B, Proficz J (2000) Suitability of the time controlled environment for race detection in distributed applications. Future Gener Comput Syst 16(6):625-635
  14. Krawczyk H, Nykiel M, Proficz J (2015) Tryton supercomputer capabilities for analysis of massive data streams. Pol Marit Res 22(3):99-104
  15. Marendic P, Lemeire J, Vucinic D, Schelkens P (2016) A novel MPI reduction algorithm resilient to imbalances in process arrival times. J Supercomput 72:1973-2013
  16. Marendić P, Lemeire J, Haber T, Vučinić D, Schelkens P (2012) An investigation into the performance of reduction algorithms under load imbalance. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol 7484, pp 439-450
  17. Patarasuk P, Yuan X (2008) Efficient MPI_Bcast across different process arrival patterns. In: IPDPS Miami 2008: Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM, p 1
  18. Proficz J, Czarnul P (2016) Performance and power-aware modeling of MPI applications for cluster computing. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol 9574, pp 199-209
  19. Rabenseifner R (2004) Optimization of collective reduction operations. In: Lecture Notes in Computational Science, vol 3036, pp 1-9
  20. Thakur R, Rabenseifner R, Gropp W (2005) Optimization of collective communication operations in MPICH. Int J High Perform Comput Appl 19(1):49-66
Verified by: Gdańsk University of Technology

seen 153 times
