TensorHive: Management of Exclusive GPU Access for Distributed Machine Learning Workloads

Paweł Rościszewski; Michał Martyniak; Filip Schodowski

Abstract

TensorHive is a tool for organizing the work of research and engineering teams that use GPU servers for machine learning workloads. Through a comprehensive web interface, it supports reserving GPUs for exclusive use, monitoring hardware, and configuring, executing and queuing distributed computational jobs. With a focus on easy installation and simple configuration, the tool automatically detects the available computing resources and monitors their utilization. Reservations are granted on the basis of flexible access control settings and are protected by pluggable violation hooks. The job execution module includes auto-configuration templates for distributed neural network training jobs in frameworks such as TensorFlow and PyTorch. Documentation, source code, usage examples and issue tracking are available at the project page: https://github.com/roscisz/TensorHive/

Publication version: Accepted or Published Version

Details

Category: Articles
Type: journal articles
Published in: Journal of Machine Learning Research, no. 22, pages 1-5
ISSN: 1532-4435
Language: English
Publication year: 2021
Bibliographic description: Rościszewski P., Martyniak M., Schodowski F.: TensorHive: Management of Exclusive GPU Access for Distributed Machine Learning Workloads // Journal of Machine Learning Research, iss. 22 (2021), pp. 1-5
Verified by: Gdańsk University of Technology
