BC-MPI: running an MPI application on multiple clusters with BeesyCluster connectivity
Publication: The paper proposes a new package, BC-MPI, which enables running an MPI application on multiple clusters with different MPI implementations. It uses dedicated MPI implementations for communication within clusters and the MPI_THREAD_MULTIPLE mode for communication between clusters in additional threads of the MPI application. Furthermore, a BC-MPI application can be automatically compiled and run by the BeesyCluster middleware. BeesyCluster enables...
Checkpointing of Parallel MPI Applications using MPI One-sided API with Support for Byte-addressable Non-volatile RAM
Publication: The increasing size of computational clusters results in an increasing probability of failures, which in turn requires application checkpointing in order to survive those failures. Traditional checkpointing requires data to be copied from application memory into a persistent storage medium, which increases application execution time as it is usually done in a separate step. In this paper we propose to use emerging byte-addressable...
Performance and Power-Aware Modeling of MPI Applications for Cluster Computing
Publication: The paper presents modeling of performance and power consumption when running parallel applications on modern cluster-based systems. The model includes so-called basic blocks representing either computations or communication. The latter includes both point-to-point and collective communication. Real measurements were performed using MPI applications and routines run on three different clusters with both InfiniBand and Gigabit Ethernet...
Towards Easy-to-Use Checkpointing of MPI Applications within CLUSTERIX.
Publication: The literature mentions many libraries and systems, at both kernel and user level, that support saving and restoring process state. For parallel applications, however, this remains a difficult task. The paper presents our programmer-assisted approach to saving and restoring the state of MPI applications, which will be used in the environment of the CLUSTERIX project, i.e. an integrated group of clusters...
Object serialization and remote exception pattern for distributed C++/MPI application
Publication: MPI is a commonly used standard in the development of scientific applications. It focuses on interlanguage operability and is not very object-oriented. The paper proposes a general pattern enabling the design of distributed, object-oriented applications. It also presents sample implementations of the pattern and performance tests.
New user-guided and ckpt-based checkpointing libraries for parallel MPI applications
Publication: The paper presents design and implementation details as well as performance results of two new checkpointing libraries developed by the authors for parallel MPI applications. The first library, the so-called user-guided one, requires the programmer to provide functions that pack and unpack the process state, but offers an easy-to-use API that relies on MPI constants. It uses MPI-2 I/O functions or a dedicated master process...
A Solution to Image Processing with Parallel MPI I/O and Distributed NVRAM Cache
Publication: The paper presents a new approach to parallel image processing using byte-addressable, non-volatile memory (NVRAM). We show that our custom-built MPI I/O implementation of selected functions that use a distributed cache that incorporates NVRAMs located in cluster nodes can be used for efficient processing of large images. We demonstrate performance benefits of such a solution compared to a traditional implementation without NVRAM...
Portable parallel simulator using MPI for 2D and 3D domains: design and performance testing
Publication: In the paper we present design and implementation details of our modular simulation code that uses MPI, including overlapping of computations and communication. We emphasize the modularity of our implementation, which allows easy adaptation of the code to other applications. We present the relationship between speed-up and the sizes and shapes of three-dimensional domains with various ratios of the number of nodes updated by a processor...
Three levels of fail-safe mode in MPI I/O NVRAM distributed cache
Publication: The paper presents the architecture and design of three versions of fail-safe data storage in a distributed cache using NVRAM in cluster nodes. In the first one, cache consistency is assured through additional buffering of write requests. The second one is based on additional write log managers running on different nodes. The third one benefits from synchronization with a Parallel File System (PFS) for saving data into a new file which...
Investigation into MPI All-Reduce Performance in a Distributed Cluster with Consideration of Imbalanced Process Arrival Patterns
Publication: The paper presents an evaluation of all-reduce collective MPI algorithms for an environment based on a geographically-distributed compute cluster. The testbed was split into two sites: CI TASK at Gdansk University of Technology and ICM at the University of Warsaw, located about 300 km from each other, both connected by a fast optical fiber Ethernet-based 100 Gbps network (900 km part of the PIONIER backbone). Each site hosted a set...
Efektywna warstwa pośrednicząca dla obliczeń typu master-slave w środowisku C++/MPI [An efficient middleware layer for master-slave computations in a C++/MPI environment]
Publication: The paper shows how, for a high-performance algorithm written in C++ in the master-slave model and satisfying certain constraints, one can write and use a communication layer that completely separates the code responsible for communication from the code responsible for the problem domain. It presents a specification of the requirements that a hypothetical distributed system and the communication layer should meet, as well as the requirements...
Performance Assessment of Using Docker for Selected MPI Applications in a Parallel Environment Based on Commodity Hardware
Publication: In the paper, we perform detailed performance analysis of three parallel MPI applications run in a parallel environment based on commodity hardware, using Docker and bare-metal configurations. The testbed applications are representative of the most typical parallel processing paradigms: master–slave, geometric Single Program Multiple Data (SPMD) as well as divide-and-conquer and feature characteristic computational and communication...
Teaching High–performance Computing Systems – A Case Study with Parallel Programming APIs: MPI, OpenMP and CUDA
Publication: High performance computing (HPC) education has become essential in recent years, especially since parallel computing on high performance computing systems enables modern machine learning models to grow in scale. This significant increase in the computational power of modern supercomputers relies on a large number of cores in modern CPUs and GPUs. As a consequence, parallel program development based on parallel thinking has become...
A Parallel MPI I/O Solution Supported by Byte-addressable Non-volatile RAM Distributed Cache
Publication: While many scientific, large-scale applications are data-intensive, fast and efficient I/O operations have become of key importance for HPC environments. We propose an MPI I/O extension based on in-system distributed cache with data located in Non-volatile Random Access Memory (NVRAM) available in each cluster node. The presented architecture makes effective use of NVRAM properties such as persistence and byte-level access behind...
A Fail-Safe NVRAM Based Mechanism for Efficient Creation and Recovery of Data Copies in Parallel MPI Applications
Publication: The paper presents a fail-safe NVRAM-based mechanism for creation and recovery of data copies during parallel MPI application runtime. Specifically, we target a cluster environment in which each node has an NVRAM installed in it. Our previously developed extension to the MPI I/O API can take advantage of NVRAM regions in order to provide an NVRAM-based cache-like mechanism to significantly speed up I/O operations and allow preloading...
Zastosowanie bajtowo adresowanej pamięci NVRAM do zwiększenia wydajności wybranych aplikacji równoległych wykorzystujących MPI I/O [Application of byte-addressable NVRAM to increase the performance of selected parallel applications using MPI I/O]
Publication: Many current studies address the growing problem of file operation performance in cluster environments. At the same time, according to recent reports on the development of computer memory technologies, byte-addressable, non-volatile random-access memory chips should appear on the market in the near future. This dissertation shows that such memory can be used to increase the performance of selected...
European MPI Users' Group Conference (European PVM/MPI Users' Group Conference)
Conferences -
Paweł Czarnul dr hab. inż.
People: Paweł Czarnul received a habilitation (D.Sc.) degree in technical sciences in the discipline of computer science in 2015, and a Ph.D. degree in technical sciences in the field of computer science (with distinction), awarded by the Council of the Faculty of Electronics, Telecommunications and Informatics of Gdańsk University of Technology, in 2003. His research interests include: parallel and distributed processing, including parallel programming on compute clusters,...
Process arrival pattern aware algorithms for acceleration of scatter and gather operations
Publication: Imbalanced process arrival patterns (PAPs) are ubiquitous in many parallel and distributed systems, especially in HPC ones. The collective operations, e.g. in MPI, are designed for equal process arrival times (PATs), and are not optimized for deviations in their appearance. We propose eight new PAP-aware algorithms for the scatter and gather operations. They are binomial or linear tree adaptations introducing additional process...
Dissociative multi-photon ionization of isolated uracil and uracil-adenine complexes
Publication: Recent multi-photon ionization (MPI) experiments on uracil revealed a fragment ion at m/z 84 that was proposed as a potential marker for ring opening in the electronically excited neutral molecule. The present MPI measurements on deuterated uracil identify the fragment as C3H4N2O+ (uracil+ less CO), a plausible dissociative ionization product from the theoretically predicted open-ring isomer. Equivalent measurements on thymine...
Distributed NVRAM Cache – Optimization and Evaluation with Power of Adjacency Matrix
Publication: In this paper we build on our previously proposed MPI I/O NVRAM distributed cache for high performance computing. In each cluster node it incorporates NVRAMs which are used as an intermediate cache layer between an application and a file for fast read/write operations supported through wrappers of MPI I/O functions. In this paper we propose optimizations of the solution including handling of write requests with a synchronous mode,...
Parallelization of Compute Intensive Applications into Workflows based on Services in BeesyCluster
Publication: The paper presents an approach for modeling, optimization and execution of workflow applications based on services that incorporates both service selection and partitioning of input data for parallel processing by parallel workflow paths. A compute-intensive workflow application for parallel integration is presented, along with the impact of input data partitioning on scalability. The paper shows a comparison of the theoretical...
Parallel simulations of electrophysiological phenomena in myocardium on large 32 and 64-bit Linux clusters.
Publication: The paper investigates and simulates electrophysiological phenomena in the myocardium using MPI-based parallel software developed for this purpose. Code improvements leading to good scalability were implemented and examined, and performance tests were carried out on the latest 32- and 64-bit Linux clusters. The work is an attempt at a parallel implementation of a well-known approach...
Strategie obsługi wyjątków w aplikacjach rozproszonych. [Exception handling strategies in distributed applications.]
Publication: The paper considers the use of the exception handling mechanism in distributed systems. Various exception handling strategies are presented for different processing models and their corresponding programming environments. A new concept of a remote exception receiver is adopted, and its implementation using the MPI library and RMI is presented.
Improving Clairvoyant: reduction algorithm resilient to imbalanced process arrival patterns
Publication: The Clairvoyant algorithm proposed in “A novel MPI reduction algorithm resilient to imbalances in process arrival times” was analyzed, commented on and improved. The comments concern handling certain edge cases in the original pseudocode and description, i.e., adding another state of a process, improved cache friendliness, more precise complexity estimations and some other issues improving the robustness of the algorithm implementation....
Protokoły łączności do transmisji strumieni multimedialnych na platformie KASKADA [Communication protocols for the transmission of multimedia streams on the KASKADA platform]
Publication: The KASKADA platform, understood as a multimedia stream processing system, provides a range of services that support public safety and the assessment of medical examinations. The performance of the KASKADA platform depends to a significant degree on the efficiency of its communication methods, including the exchange of multimedia data, which form the basis of processing. The aim of the work was to design a communication subsystem...
Multi-agent large-scale parallel crowd simulation
Publication: This paper presents the design, implementation and performance results of a new modular, parallel, agent-based and large scale crowd simulation environment. A parallel application, written in C with MPI, was implemented and run in this parallel environment for simulation and visualization of an evacuation scenario at Gdansk University of Technology, Poland and further in the area of districts of Gdansk. The application uses a...
Kosmiczne zastosowania zaawansowanych technologii informatycznych [Space applications of advanced information technologies]
Online Courses: Modern technologies for using high-performance computing systems: supercomputers with a cluster architecture, illustrated by environments for massive data processing (Big Data), cloud computing, and the classic message-passing approach (MPI: Message Passing Interface) for batch processing.
All-gather Algorithms Resilient to Imbalanced Process Arrival Patterns
Publication: Two novel algorithms for the all-gather operation resilient to imbalanced process arrival patterns (PAPs) are presented. The first one, Background Disseminated Ring (BDR), is based on the regular parallel ring algorithm often supplied in MPI implementations and exploits an auxiliary background thread for early data exchange from faster processes to accelerate the performed all-gather operation. The other algorithm, Background Sorted...
Parallelisation of genetic algorithms for solving university timetabling problems
Publication: Genetic algorithms are an important method for solving optimization problems. The paper focuses on the design of a parallel genetic algorithm that produces university timetables satisfying both hard and soft constraints. The reader is introduced to some known parallelization approaches, and the authors' approach, which uses MPI, is also presented. The adopted management structure is based...
Simulation of parallel similarity measure computations for large data sets
Publication: The paper presents our approach to implementation of similarity measure for big data analysis in a parallel environment. We describe the algorithm for parallelisation of the computations. We provide results from a real MPI application for computations of similarity measures as well as results achieved with our simulation software. The simulation environment allows us to model parallel systems of various sizes with various components...
Use of ICT infrastructure for teaching HPC
Publication: In this paper we look at modern ICT infrastructure as well as curriculum used for conducting a contemporary course on high performance computing taught over several years at the Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Poland. We describe the infrastructure in the context of teaching parallel programming at the cluster level using MPI, node level using OpenMP and CUDA. We present...
Workflow application for detection of unwanted events
Publication: The paper presents a distributed application for detecting potentially dangerous events in input video streams. Recognition of an unwanted event triggers alarms, sends notifications to the appropriate services, and causes the video to be recorded. The application model consists of nodes with cameras and nodes that fetch data streams, process data, send notifications and store data. The implemented application...
BeesyCluster as Front-End for High Performance Computing Services
Publication: The paper presents the BeesyCluster system as a middleware allowing invocation of services on high performance computing resources within the NIWA Centre of Competence project. Access is possible through both WWW and SOAP Web Service interfaces. The former allows inexperienced users to invoke both simple and complex services exposed through easy-to-use servlets. The latter is meant for integration of external applications with...
NVRAM as Main Storage of Parallel File System
Publication: The main problem of modern cluster environments used to be a lack of computational power provided by CPUs and GPUs, but recently they have suffered more and more from insufficient performance of input and output operations. Apart from better network infrastructure and more sophisticated processing algorithms, many solutions are based on emerging memory technologies. This paper presents an evaluation of using non-volatile random-access memory as a...
Parallelization of Selected Algorithms on Multi-core CPUs, a Cluster and in a Hybrid CPU+Xeon Phi Environment
Publication: In the paper we present parallel implementations as well as execution times and speed-ups of three different algorithms run in various environments such as on a workstation with multi-core CPUs and a cluster. The parallel codes, implementing the master-slave model in C+MPI, differ in computation to communication ratios. The considered problems include: a genetic algorithm with various ratios of master processing time to communication...
A multithreaded CUDA and OpenMP based power‐aware programming framework for multi‐node GPU systems
Publication: In the paper, we have proposed a framework that allows programming a parallel application for a multi-node system, with one or more GPUs per node, using an OpenMP+extended CUDA API. OpenMP is used for launching threads responsible for management of particular GPUs, and extended CUDA calls allow managing CUDA objects and data and launching kernels. The framework hides inter-node MPI communication from the programmer who can benefit from...
KernelHive: a new workflow-based framework for multilevel high performance computing using clusters and workstations with CPUs and GPUs
Publication: The paper presents a new open-source framework called KernelHive for multilevel parallelization of computations among various clusters, cluster nodes, and finally, among both CPUs and GPUs for a particular application. An application is modeled as an acyclic directed graph with a possibility to run nodes in parallel and automatic expansion of nodes (called node unrolling) depending on the number of computation units available....
Minimizing Distribution and Data Loading Overheads in Parallel Training of DNN Acoustic Models with Frequent Parameter Averaging
Publication: In the paper we investigate the performance of parallel deep neural network training with parameter averaging for acoustic modeling in Kaldi, a popular automatic speech recognition toolkit. We describe experiments based on training a recurrent neural network with 4 layers of 800 LSTM hidden states on a 100-hour corpus of annotated Polish speech data. We propose an MPI-based modification of the training program which minimizes the...
Parallel Programming for Modern High Performance Computing Systems
Publication: In view of the growing presence and popularity of multicore and manycore processors, accelerators, and coprocessors, as well as clusters using such computing devices, the development of efficient parallel applications has become a key challenge to be able to exploit the performance of such systems. This book covers the scope of parallel programming for modern high performance computing systems. It first discusses selected and...
MERPSYS: An environment for simulation of parallel application execution on large scale HPC systems
Publication: In this paper we present a new environment called MERPSYS that allows simulation of parallel application execution time on cluster-based systems. The environment offers a modeling application using the Java language extended with methods representing message passing type communication routines. It also offers a graphical interface for building a system model that incorporates various hardware components such as CPUs, GPUs, interconnects...
Multi-GPU-powered UNRES package for physics-based coarse-grained simulations of structure, dynamics, and thermodynamics of protein systems at biological size- and timescales
Publication: Coarse-grained models are nowadays extensively used in biomolecular simulations owing to the tremendous extension of size- and time-scale of simulations. The physics-based UNRES (UNited RESidue) model of proteins developed in our laboratory has only two interaction sites per amino-acid residue (united peptide groups and united side chains) and implicit solvent. However, owing to rigorous physics-based derivation, which enabled...
Testing for conformance of parallel programming pattern languages
Publication: This paper reports on the project being run by TUG and IMAG, aimed at reducing the volume of tests required to exercise parallel programming language compilers and libraries. The idea is to use the ISO STEP standard scheme for conformance testing of software products. A detailed example illustrating the ongoing work is presented.
Improving all-reduce collective operations for imbalanced process arrival patterns
Publication: Two new algorithms for the all-reduce operation optimized for imbalanced process arrival patterns (PAPs) are presented: (1) sorted linear tree and (2) pre-reduced ring. A new way of online PAP detection, including process arrival time estimations and their distribution between cooperating processes, is also introduced. The idea, pseudo-code, implementation details, benchmark for performance evaluation and a real case example...
Energy Consumption Modeling in SPMD and DAC Applications
Publication: In this chapter, we present a study of energy consumption during execution of SPMD and DAC applications, the same applications whose execution time we modeled in the previous two chapters. We measured the average power usage at a single node of the GALERA+ cluster during application execution and then modeled the total energy consumption of the application. Next we simulated the applications using MERPSYS and we compared the...
Modeling SPMD Application Execution Time
Publication: Parallel applications in the Single Program Multiple Data (SPMD) paradigm split huge amounts of data among multiple processors that work in parallel on small data packets. As the individual data packets are not independent, the processors must interact with each other to exchange results of the calculations with their adjacent partners and take these results into account in their own computations. An example of SPMD is geometric parallelism...