Performance evaluation of unified memory and dynamic parallelism for selected parallel CUDA applications - Publication - MOST Wiedzy

Abstract

The aim of this paper is to evaluate the performance of two newer CUDA mechanisms, unified memory and dynamic parallelism, for real parallel applications compared to standard CUDA API versions. In order to gain insight into the performance of these mechanisms, we implemented three applications with control and data flow typical of SPMD, geometric SPMD and divide-and-conquer schemes, which were then used for tests and experiments. Specifically, the tested applications include verification of Goldbach's conjecture, 2D heat transfer simulation and adaptive numerical integration. We experimented with various ways in which dynamic parallelism can be deployed into an existing implementation and optimized further. Subsequently, we compared the best dynamic parallelism and unified memory versions to their standard API counterparts. Usage of dynamic parallelism resulted in improved performance for the heat simulation, performance better than a static but worse than an iterative version for numerical integration, and worse results for Goldbach's conjecture verification. In most cases, unified memory results in a decrease in performance. On the other hand, both mechanisms can contribute to simpler and more readable code. For dynamic parallelism, this applies to algorithms in which it can be applied naturally. Unified memory generally makes it easier for a programmer to enter the CUDA programming paradigm, as it resembles the traditional memory allocation/usage pattern.
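As context for the comparison described above, the following minimal sketch (not taken from the paper; the kernel, array size and launch configuration are illustrative assumptions) contrasts the standard CUDA allocation pattern with the unified-memory pattern that the abstract says resembles traditional host-side allocation:

```cuda
// Sketch: standard CUDA API vs. unified memory for the same kernel.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Standard API: separate host and device buffers, explicit copies.
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    free(h);

    // Unified memory: one managed pointer valid on host and device;
    // the runtime migrates pages on demand, no explicit cudaMemcpy.
    float *u;
    cudaMallocManaged(&u, bytes);
    for (int i = 0; i < n; i++) u[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(u, n);
    cudaDeviceSynchronize();  // finish GPU work before touching u on the host
    printf("u[0] = %f\n", u[0]);
    cudaFree(u);
    return 0;
}
```

The unified-memory half follows the familiar allocate/use/free pattern of host code, which is the readability benefit the abstract refers to; the paper's measurements concern how this convenience trades off against the explicit-copy version's performance.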

Citations

  • 14 CrossRef
  • 0 Web of Science
  • 17 Scopus

Cite as

Full text

download publication
downloaded 1253 times
Publication version
Accepted or Published Version
License
Creative Commons: CC-BY

Keywords

Details

Category:
Journal publication
Type:
article in a journal distinguished in JCR
Published in:
JOURNAL OF SUPERCOMPUTING vol. 72, pages 5378 - 5401,
ISSN: 0920-8542
Language:
English
Year of publication:
2017
Bibliographic description:
Jarząbek Ł., Czarnul P.: Performance evaluation of unified memory and dynamic parallelism for selected parallel CUDA applications // JOURNAL OF SUPERCOMPUTING. - Vol. 72, no. 12 (2017), pp. 5378-5401
DOI:
10.1007/s11227-017-2091-x
Verification:
Politechnika Gdańska

viewed 483 times
