Search results for: MULTI-GPU SCALABILITY
-
Scalability of surrogate-assisted multi-objective optimization of antenna structures exploiting variable-fidelity electromagnetic simulation models
Publication: Multi-objective optimization of antenna structures is a challenging task due to the high computational cost of evaluating the design objectives as well as the large number of adjustable parameters. Design speedup can be achieved by means of surrogate-based optimization techniques. In particular, a combination of variable-fidelity electromagnetic (EM) simulations, design space reduction techniques, response surface approximation (RSA) models,...
-
Multi-GPU UNRES for scalable coarse-grained simulations of very large protein systems
Publication: Graphical Processor Units (GPUs) are nowadays widely used in all-atom molecular simulations because of the advantage of efficient partitioning of atom pairs between the kernels to compute the contributions to energy and forces, thus enabling the treatment of very large systems. Extension of time- and size-scale of computations is also sought through the development of coarse-grained (CG) models, in which atoms are merged into extended...
-
Performance and Energy Aware Training of a Deep Neural Network in a Multi-GPU Environment with Power Capping
Publication: In this paper we demonstrate that it is possible to obtain considerable improvement of performance and energy aware metrics for training of deep neural networks using a modern parallel multi-GPU system, by enforcing selected, non-default power caps on the GPUs. We measure the power and energy consumption of the whole node using a professional, certified hardware power meter. For a high performance workstation with 8 GPUs, we were...
-
Communication and Load Balancing Optimization for Finite Element Electromagnetic Simulations Using Multi-GPU Workstation
Publication: This paper considers a method for accelerating finite-element simulations of electromagnetic problems on a workstation using graphics processing units (GPUs). The focus is on finite-element formulations using higher order elements and tetrahedral meshes that lead to sparse matrices too large to be dealt with on a typical workstation using direct methods. We discuss the problem of rapid matrix generation and assembly, as well as...
-
A multithreaded CUDA and OpenMP based power‐aware programming framework for multi‐node GPU systems
Publication: In the paper, we have proposed a framework that allows programming a parallel application for a multi-node system, with one or more GPUs per node, using an OpenMP+extended CUDA API. OpenMP is used for launching threads responsible for management of particular GPUs, and extended CUDA calls allow managing CUDA objects and data and launching kernels. The framework hides inter-node MPI communication from the programmer who can benefit from...
-
Multi-GPU-powered UNRES package for physics-based coarse-grained simulations of structure, dynamics, and thermodynamics of protein systems at biological size- and timescales
Publication: Coarse-grained models are nowadays extensively used in biomolecular simulations owing to the tremendous extension of size- and time-scale of simulations. The physics-based UNRES (UNited RESidue) model of proteins developed in our laboratory has only two interaction sites per amino-acid residue (united peptide groups and united side chains) and implicit solvent. However, owing to rigorous physics-based derivation, which enabled...
-
Characterizing the Scalability of Graph Convolutional Networks on Intel® PIUMA
Publication: Large-scale Graph Convolutional Network (GCN) inference on traditional CPU/GPU systems is challenging due to a large memory footprint, sparse computational patterns, and irregular memory accesses with poor locality. Intel’s Programmable Integrated Unified Memory Architecture (PIUMA) is designed to address these challenges for graph analytics. In this paper, a detailed characterization of GCNs is presented using the Open-Graph Benchmark...
-
The impact of the AC922 Architecture on Performance of Deep Neural Network Training
Publication: Practical deep learning applications require more and more computing power. New computing architectures emerge, specifically designed for the artificial intelligence applications, including the IBM Power System AC922. In this paper we confront an AC922 (8335-GTG) server equipped with 4 NVIDIA Volta V100 GPUs with selected deep neural network training applications, including four convolutional and one recurrent model. We report...
-
Energy-Aware Scheduling for High-Performance Computing Systems: A Survey
Publication: High-performance computing (HPC), according to its name, is traditionally oriented toward performance, especially the execution time and scalability of the computations. However, due to the high cost and environmental issues, energy consumption has already become a very important factor that needs to be considered. The paper presents a survey of energy-aware scheduling methods used in a modern HPC environment, starting with the...
-
Performance evaluation of parallel background subtraction on GPU platforms
Publication: Implementation of the background subtraction algorithm on parallel GPUs is presented. The algorithm processes video streams and extracts foreground pixels. The work focuses on optimizing parallel algorithm implementation by taking into account specific features of the GPU architecture, such as memory access, data transfers and work group organization. The algorithm is implemented in both OpenCL and CUDA. Various optimizations of...
-
Overview of Scalability and Reliability Problem in SDN Networks
Publication: In the paper, an overview of scalability and reliability in SDN (Software Defined Networks) networks is presented. Problems and limitations in guaranteeing scalability and reliability in SDN networks are indicated. Known methods for assuring scalability and reliability in SDN networks are described. Projects from research communities for resolving issues with scalability and reliability in SDN networks have...
-
Finite element matrix generation on a GPU
Publication: This paper presents an efficient technique for fast generation of sparse systems of linear equations arising in computational electromagnetics in a finite element method using higher order elements. The proposed approach employs a graphics processing unit (GPU) for both numerical integration and matrix assembly. The performance results obtained on a test platform consisting of a Fermi GPU (1x Tesla C2075) and a CPU (2x twelve-core...
-
Monolithic vs. Microservice Architecture: A Performance and Scalability Evaluation
Publication: Context. Since its proclamation in 2012, microservices-based architecture has gained widespread popularity due to its advantages, such as improved availability, fault tolerance, and horizontal scalability, as well as greater software development agility. Motivation. Yet, refactoring a monolith to microservices by smaller businesses and expecting that the migration will bring benefits similar to those reported by top global companies,...
-
Tuning matrix-vector multiplication on GPU
Publication: A matrix times vector multiplication (matvec) is a cornerstone operation in iterative methods of solving large sparse systems of equations, such as the conjugate gradients method (cg), the minimal residual method (minres), and the generalized minimal residual method (gmres), and it exerts an influence on the overall performance of those methods. An implementation of matvec is particularly demanding when one executes computations on a GPU (Graphics...
-
GPU based implementation of Temperature-Vegetation Dryness Index for AVHRR3 Satellite Data
Publication: The paper presents an implementation of the TVDI (Temperature-Vegetation-Dryness Index) algorithm on a GPU (Graphics Processing Unit). Calculation of this index is based on LST (Land Surface Temperature) and NDVI (Normalized Difference Vegetation Index). The discussed results are based on multi-spectral imagery retrieved from AVHRR3 sensors for the area of Poland. All phases of the TVDI implementation on the GPU are modified with respect to the CUDA platform....
-
GPU-accelerated finite element method
Publication: In this paper the results of the acceleration of computations involved in analysing electromagnetic problems by means of the finite element method (FEM), obtained with graphics processors (GPU), are presented. A 4.7-fold acceleration was achieved thanks to the massive parallelization of the most time-consuming steps of FEM, namely finite-element matrix-generation and the solution of a sparse system of linear equations with the...
-
Task Allocation and Scalability Evaluation for Real-Time Multimedia Processing in a Cluster Environment
Publication: An allocation algorithm for stream processing tasks is proposed (Modified Best Fit Descendent, MBFD). A comparison with another solution (BFD) is provided. Tests of the algorithms in an HPC environment are described and the results are presented. A proper scalability metric is proposed and used for the evaluation of the allocation algorithm.
-
Dynamic GPU power capping with online performance tracing for energy efficient GPU computing using DEPO tool
Publication: GPU accelerators have become essential to the recent advance in computational power of high-performance computing (HPC) systems. The power demand of current HPC systems, which reaches approximately 20–30 megawatts, has resulted in increasing CO2 emissions and energy costs and necessitates increasingly complex cooling systems. This is a very real challenge. To address this, new mechanisms of software power control could be employed. In this...
-
Acceleration of the DGF-FDTD method on GPU using the CUDA technology
Publication: We present a parallel implementation of the discrete Green's function formulation of the finite-difference time-domain (DGF-FDTD) method on a graphics processing unit (GPU). The compute unified device architecture (CUDA) parallel computing platform is applied in the developed implementation. For the sake of example, arrays of Yagi-Uda antennas were simulated with the use of DGF-FDTD on GPU. The efficiency of parallel computations...
-
Parallelization of large vector similarity computations in a hybrid CPU+GPU environment
Publication: The paper presents design, implementation and tuning of a hybrid parallel OpenMP+CUDA code for computation of similarity between pairs of a large number of multidimensional vectors. The problem has a wide range of applications, and consequently its optimization is of high importance, especially on currently widespread hybrid CPU+GPU systems targeted in the paper. The following are presented and tested for computation of all vector...
-
Parallel implementation of the DGF-FDTD method on GPU Using the CUDA technology
Publication: The discrete Green's function (DGF) formulation of the finite-difference time-domain method (FDTD) is accelerated on a graphics processing unit (GPU) by means of the Compute Unified Device Architecture (CUDA) technology. In the developed implementation of the DGF-FDTD method, a new analytic expression for dyadic DGF derived based on scalar DGF is employed in computations. The DGF-FDTD method on GPU returns solutions that are compatible...
-
Accuracy, Memory and Speed Strategies in GPU-based Finite-Element Matrix-Generation
Publication: This paper presents strategies on how to optimize GPU-based finite-element matrix-generation that occurs in the finite-element method (FEM) using higher order curvilinear elements. The goal of the optimization is to increase the speed of evaluation and assembly of large finite-element matrices on a single GPU (Graphics Processing Unit) while maintaining the accuracy of numerical integration at the desired level. For this reason,...
-
Parallel Background Subtraction in Video Streams Using OpenCL on GPU Platforms
Publication: Implementation of the background subtraction algorithm using the OpenCL platform is presented. The algorithm processes a live stream of video frames from a surveillance camera in on-line mode. Processing is performed using a host machine and a parallel computing device. The work focuses on optimizing an OpenCL algorithm implementation for GPU devices by taking into account specific features of the GPU architecture, such as memory access,...
-
Implementation of FDTD-compatible Green's function on heterogeneous CPU-GPU parallel processing system
Publication: This paper presents an implementation of the FDTD-compatible Green's function on a heterogeneous parallel processing system. The developed implementation simultaneously applies the computational power of the central processing unit (CPU) and the graphics processing unit (GPU) to the computational tasks best suited to each architecture. Recently, a closed-form expression for this discrete Green's function (DGF) was derived, which facilitates...
-
Modelling and simulation of GPU processing in the MERPSYS environment
Publication: In this work, we evaluate an analytical GPU performance model based on Little's law, that expresses the kernel execution time in terms of latency bound, throughput bound, and achieved occupancy. We then combine it with the results of several research papers, introduce equations for data transfer time estimation, and finally incorporate it into the MERPSYS framework, which is a general-purpose simulator for parallel and distributed...
-
GPU-Accelerated Finite-Element Matrix Generation for Lossless, Lossy, and Tensor Media [EM Programmer's Notebook]
Publication: This paper presents an optimization approach for limiting memory requirements and enhancing the performance of GPU-accelerated finite-element matrix generation applied in the implementation of the higher-order finite-element method (FEM). It emphasizes the details of the implementation of the matrix-generation algorithm for the simulation of electromagnetic wave propagation in lossless, lossy, and tensor media. Moreover, the impact...
-
Jacobi and Gauss-Seidel preconditioned complex conjugate gradient method with GPU acceleration for finite element method
Publication: In this paper, two implementations of iterative solvers for solving complex symmetric and sparse systems resulting from the finite element method applied to the wave equation are discussed. The problem under investigation is a dielectric resonator antenna (DRA) discretized by FEM with vector elements of the second order (LT/QN). The solvers use the preconditioned conjugate gradient (pcg) method implemented on Graphics Processing Unit (GPU)...
-
Efficient parallel implementation of crowd simulation using a hybrid CPU+GPU high performance computing system
Publication: In the paper we present a modern efficient parallel OpenMP+CUDA implementation of crowd simulation for hybrid CPU+GPU systems and demonstrate its higher performance over CPU-only and GPU-only implementations for several problem sizes including 10 000, 50 000, 100 000, 500 000 and 1 000 000 agents. We show how performance varies for various tile sizes and what CPU–GPU load balancing settings shall be preferred for various domain...
-
A memory efficient and fast sparse matrix vector product on a GPU
Publication: This paper proposes a new sparse matrix storage format which allows an efficient implementation of a sparse matrix vector product on a Fermi Graphics Processing Unit (GPU). Unlike previous formats, it has both a low memory footprint and good throughput. The new format, which we call Sliced ELLR-T, has been designed specifically for accelerating the iterative solution of a large sparse and complex-valued system of linear equations arising...
-
GPU Acceleration of Multilevel Solvers for Analysis of Microwave Components With Finite Element Method
Publication: The letter discusses a fast implementation of the conjugate gradient iterative method with an E-field multilevel preconditioner applied to solving real symmetric and sparse systems obtained with the vector finite element method. In order to accelerate computations, a graphics processing unit (GPU) was used and a significant speed-up (2.61-fold) was achieved compared to a central processing unit (CPU) based approach. These results...
-
Block Conjugate Gradient Method with Multilevel Preconditioning and GPU Acceleration for FEM Problems in Electromagnetics
Publication: In this paper, a GPU-accelerated block conjugate gradient solver with multilevel preconditioning is presented for solving large sparse systems of equations with multiple right-hand sides (RHSs), which arise in the finite-element analysis of electromagnetic problems. We demonstrate that blocking reduces the time to solution significantly and allows for better utilization of the computing power of GPUs, especially when the system matrix...
-
Multi-core and Multiprocessor Implementation of Numerical Integration in Finite Element Method
Publication: The paper presents techniques for accelerating the numerical integration process which appears in the Finite Element Method. The acceleration is achieved by taking advantage of multi-core and multiprocessor devices. It is shown that a multi-core implementation with OpenMP and GPU acceleration using the CUDA architecture allow one to achieve speedups by factors of 5 and 10 on a CPU and GPUs, respectively.
-
Investigation of Parallel Data Processing Using Hybrid High Performance CPU + GPU Systems and CUDA Streams
Publication: The paper investigates parallel data processing in a hybrid CPU+GPU(s) system using multiple CUDA streams for overlapping communication and computations. This is crucial for efficient processing of data, in particular incoming data stream processing that would naturally be forwarded using multiple CUDA streams to GPUs. Performance is evaluated for various compute time to host-device communication time ratios, numbers of CUDA streams,...
-
A GPU Solver for Sparse Generalized Eigenvalue Problems with Symmetric Complex-Valued Matrices Obtained Using Higher-Order FEM
Publication: The paper discusses a fast implementation of the stabilized locally optimal block preconditioned conjugate gradient (sLOBPCG) method, using a hierarchical multilevel preconditioner to solve non-Hermitian sparse generalized eigenvalue problems with large symmetric complex-valued matrices obtained using the higher-order finite-element method (FEM), applied to the analysis of a microwave resonator. The resonant frequencies of the low-order...
-
GPU-Accelerated 3D Mesh Deformation for Optimization Based on the Finite Element Method
Publication: This paper discusses a strategy for speeding up the mesh deformation process in the design-by-optimization of high-frequency components involving electromagnetic field simulations using the 3D finite element method (FEM). The mesh deformation is assumed to be described by a linear elasticity model of a rigid body; therefore, each time the shape of the device is changed, an auxiliary elasticity finite-element problem must be solved....
-
A Task-Scheduling Approach for Efficient Sparse Symmetric Matrix-Vector Multiplication on a GPU
Publication: In this paper, a task-scheduling approach to efficiently calculating sparse symmetric matrix-vector products and designed to run on Graphics Processing Units (GPUs) is presented. The main premise is that, for many sparse symmetric matrices occurring in common applications, it is possible to obtain significant reductions in memory usage and improvements in performance when the matrix is prepared in certain ways prior to computation....
-
Single and Dual-GPU Generalized Sparse Eigenvalue Solvers for Finding a Few Low-Order Resonances of a Microwave Cavity Using the Finite-Element Method
Publication: This paper presents two fast generalized eigenvalue solvers for sparse symmetric matrices that arise when electromagnetic cavity resonances are investigated using the higher-order finite element method (FEM). To find a few low-order resonances, the locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm with null-space deflation is applied. The computations are expedited by using one or two graphical processing...
-
Auto-tuning methodology for configuration and application parameters of hybrid CPU + GPU parallel systems based on expert knowledge
Publication: Auto-tuning of configuration and application parameters makes it possible to achieve significant performance gains in many contemporary compute-intensive applications. Feasible search spaces of parameters tend to become too big to allow for exhaustive search in the auto-tuning process. Expert knowledge about the utilized computing systems becomes useful to prune the search space and new methodologies are needed in the face of emerging heterogeneous...
-
Performance Evaluation of Selected Parallel Object Detection and Tracking Algorithms on an Embedded GPU Platform
Publication: Performance evaluation of selected complex video processing algorithms, implemented on a parallel, embedded GPU platform Tegra X1, is presented. Three algorithms were chosen for evaluation: a GMM-based object detection algorithm, a particle filter tracking algorithm and an optical flow based algorithm devoted to people counting in a crowd flow. The choice of these algorithms was based on their computational complexity and parallel...
-
Tuning a Hybrid GPU-CPU V-Cycle Multilevel Preconditioner for Solving Large Real and Complex Systems of FEM Equations
Publication: This letter presents techniques for tuning an accelerated preconditioned conjugate gradient solver with a multilevel preconditioner. The solver is optimized for a fast solution of sparse systems of equations arising in computational electromagnetics in a finite element method using higher-order elements. The goal of the tuning is to increase the throughput while at the same time reducing the memory requirements in order to allow...
-
Benchmarking Scalability and Security Configuration Impact for A Distributed Sensors-Server IOT Use Case
Publication: The Internet of Things has been getting more and more attention and has found numerous practical applications. Especially important in this context are performance, security and the ability to cope with failures. It is especially crucial to find a good trade-off between these. In this article we present results of practical tests with multiple clients representing sensors sending notifications to an IoT middleware – DeviceHive. We investigate performance...
-
EM-Driven Multi-Objective Optimization of Antenna Structures in Multi-Dimensional Design Spaces
Publication: Feasible multi-objective optimization of antenna structures is presented. An initial set of Pareto optimal solutions is found using a multi-objective evolutionary algorithm (MOEA) working with a fast surrogate antenna model obtained by kriging interpolation of coarse-discretization EM simulation data. To make the surrogate construction computationally feasible in multi-dimensional design space, the space subset containing non-dominated...
-
Multi agent grid systems
Publication: This chapter presents an idea of merging grid and volunteer systems with multi-agent systems. It gives some basics concerning multi-agent systems and the most widely followed standard. Some deliberations concerning such existing systems were made in order to finally present possibilities of introducing agents into the Comcute system.
-
Low-Cost Multi-Objective Optimization Yagi-Uda Antenna in Multi-Dimensional Parameter Space
Publication: A surrogate-based technique for fast multi-objective optimization of a multi-parameter planar Yagi-Uda antenna structure is presented. The proposed method utilizes response surface approximation (RSA) models constructed using training samples obtained from evaluation of the low-fidelity antenna model. Utilization of the RSA models allows for fast determination of the best possible trade-offs between conflicting objectives in multi-objective...
-
Multi-state multi-reference Møller-Plesset second-order perturbation theory for molecular calculations
Publication: This work presents multi‐state multi‐reference Møller–Plesset second‐order perturbation theory as a variant of multi‐reference perturbation theory to treat electron correlation in molecules. An effective Hamiltonian is constructed from the first‐order wave operator to treat several strongly interacting electronic states simultaneously. The wave operator is obtained by solving the generalized Bloch equation within the first‐order...
-
MULTI-OBJECTIVE OPTIMIZATION PROBLEM IN THE OptD-MULTI METHOD
Publication: New measurement technologies, e.g. Light Detection And Ranging (LiDAR), generate very large datasets. In many cases, it is reasonable to reduce the number of measuring points, but in such a way that the datasets after reduction satisfy specific optimization criteria. For this purpose the Optimum Dataset (OptD) method proposed in [1] and [2] can be applied. The OptD method with the use of several optimization criteria is called...
-
GPU Power Capping for Energy-Performance Trade-Offs in Training of Deep Convolutional Neural Networks for Image Recognition
Publication: In the paper we present an investigation of performance-energy trade-offs in training Deep Convolutional Neural Networks for image recognition. Several representative and widely adopted network models, such as Alexnet, VGG-19, Inception V3, Inception V4, Resnet50 and Resnet152, were tested using systems with Nvidia Quadro RTX 6000 as well as Nvidia V100 GPUs. Using GPU power capping, we found non-default configurations minimizing...
-
Robust Object Detection with Multi-input Multi-output Faster R-CNN
Publication: Recent years have seen impressive progress in visual recognition on many benchmarks; however, generalization to the out-of-distribution setting remains a significant challenge. A state-of-the-art method for robust visual recognition is model ensembling. However, it was recently shown that similarly competitive results could be achieved at a much smaller cost by using a multi-input multi-output (MIMO) architecture. In this work,...
-
Multi Queue Approach for Network Services Implemented for Multi Core CPUs
Publication: Multiple core processors have already become the dominant design for general purpose CPUs. Incarnations of this technology are present in solutions dedicated to such areas as computer graphics, signal processing and also computer networking. Since the key functionality of network core components is fast packet servicing, multicore technology, due to its multitasking ability, seems useful to support packet processing. Dedicated...