|
The Distributor facilitates distributed computing on large input files. It serves as a pre- and post-processing tool and interfaces with most standard batch-queuing systems.
In high-throughput virtual screening, one usually deals with hundreds of thousands of compounds. On today's compute farms, an efficient yet simple parallelisation can be achieved by decomposing the input file(s) and letting the software process the much smaller chunks on multiple machines in parallel.
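To make the decomposition idea concrete, the following minimal Python sketch groups the items of a multi-record file into fixed-size chunks. It is an illustration only, not the Distributor's own code: the function names `read_items` and `chunked` are hypothetical, and the `$$$$` record terminator (as used in SD files) is an assumption about the input format.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_items(lines: Iterable[str], terminator: str = "$$$$") -> Iterator[str]:
    """Yield one item (record) at a time.

    Assumption: every record ends with a line containing only `terminator`.
    """
    record: List[str] = []
    for line in lines:
        record.append(line)
        if line.strip() == terminator:
            yield "".join(record)
            record = []
    # Keep a trailing record that lacks a terminator, unless it is blank.
    if any(ln.strip() for ln in record):
        yield "".join(record)


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group items into lists of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Because both functions are generators, a large input file can be streamed line by line and written out chunk by chunk without holding all items in memory.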
The basic concept behind the Distributor is to take the burden of bookkeeping off the user and let the software automatically make sure that the job gets done in its entirety. The Distributor does not reinvent the wheel: years of development have been spent on sophisticated job-management and batch-queuing systems that ensure optimal use of compute resources through load balancing. The Distributor therefore does not replace these standard technologies but interfaces with most of them, and additionally provides automated error reporting (for example, through email), among other things.
In short, given:
- a large set of input data blocks (items) in one or several files
- some arbitrary tool that processes this kind of input
- a number of compute nodes
- a batch-queuing system
the Distributor:
- splits all input files into segments, each containing a subset of the items
- submits small jobs to the batch-queuing system
- checks the jobs' status
- merges the jobs' output
- informs the user via email about the current status
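The split, submit, check and merge steps can be sketched in a few lines of Python. This is a hedged illustration of the general pattern, not the Distributor's implementation: a local thread pool stands in for the batch-queuing system, the names `split`, `process_segment` and `run_distributed` are hypothetical, and the email notification is replaced by a returned list of failed segments.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def split(items: Sequence, n_segments: int) -> List[list]:
    """Split items into contiguous segments of near-equal size."""
    n_segments = max(1, min(n_segments, len(items)))
    base, extra = divmod(len(items), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + base + (1 if i < extra else 0)
        segments.append(list(items[start:end]))
        start = end
    return segments


def process_segment(tool: Callable, segment: list) -> list:
    """One small job: apply the (arbitrary) tool to every item of a segment."""
    return [tool(item) for item in segment]


def run_distributed(
    items: Sequence, tool: Callable, n_segments: int = 4, max_workers: int = 4
) -> Tuple[list, List[Tuple[int, str]]]:
    """Split, submit, check and merge; returns (merged_output, failed_jobs)."""
    segments = split(items, n_segments)
    merged: list = []
    failed: List[Tuple[int, str]] = []
    # Assumption: the thread pool plays the role of the batch queue.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        jobs = [pool.submit(process_segment, tool, seg) for seg in segments]
        # Check each job in submission order so the merged output keeps
        # the order of the original input.
        for index, job in enumerate(jobs):
            error = job.exception()  # blocks until the job has finished
            if error is not None:
                failed.append((index, repr(error)))
            else:
                merged.extend(job.result())
    return merged, failed
```

For example, `run_distributed(list(range(10)), lambda x: x * x, n_segments=3)` processes three segments and returns the squares in input order together with an empty failure list. In a real setup, the `failed` list is the kind of information that would feed a status email or a resubmission step.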