OMRunner - The DRMAA job runner

Synopsis

omrunner [OPTIONS] -f jdf

Description

The OMRunner uses SSH and DRMAA to submit co-allocated OpenMPI jobs to remote clusters. DRMAA provides a common interface to the autonomous local resource managers of the remote clusters. OpenMPI is an open-source, highly configurable MPI-2 implementation developed and maintained by a consortium of academic, research, and industry partners. When a job is submitted to multiple clusters on DAS-3, the OMRunner selects a fast interconnect: in most cases the high-speed Myri-10G interconnect is used, unless the Delft cluster is selected, in which case the Gigabit Ethernet interconnect is used. Besides OpenMPI jobs, the OMRunner can also submit other, non-co-allocated jobs to multiple remote clusters. Jobs compiled with other MPI implementations, such as MPICH, cannot be submitted with the OMRunner.
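
To give an idea of the interface that DRMAA abstracts, the sketch below submits a single executable to a local resource manager through the DRMAA 1.0 C binding and waits for it to finish. It only illustrates the common submit/wait interface that DRMAA offers over the cluster's local resource manager; the OMRunner layers SSH, KOALA scheduling, and co-allocation of multiple such components on top of it. The executable and directory are taken from the examples further below and serve as placeholders.

/* Minimal DRMAA 1.0 (C binding) submit-and-wait sketch.
 * Only an illustration of the common interface DRMAA exposes over a
 * local resource manager; paths and executable are placeholders. */
#include <stdio.h>
#include <drmaa.h>

int main(void)
{
    char err[DRMAA_ERROR_STRING_BUFFER];
    char jobid[DRMAA_JOBNAME_BUFFER];
    char jobid_out[DRMAA_JOBNAME_BUFFER];
    drmaa_job_template_t *jt = NULL;
    drmaa_attr_values_t *rusage = NULL;
    int stat;

    /* Connect to the local resource manager on this head node. */
    if (drmaa_init(NULL, err, sizeof(err)) != DRMAA_ERRNO_SUCCESS) {
        fprintf(stderr, "drmaa_init failed: %s\n", err);
        return 1;
    }

    /* Describe one job component: command and working directory. */
    drmaa_allocate_job_template(&jt, err, sizeof(err));
    drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/home/hashim/bin/cpi-ompi",
                        err, sizeof(err));
    drmaa_set_attribute(jt, DRMAA_WD, "/home/hashim/bin", err, sizeof(err));

    /* Submit the component and block until it finishes. */
    drmaa_run_job(jobid, sizeof(jobid), jt, err, sizeof(err));
    printf("submitted job %s\n", jobid);
    drmaa_wait(jobid, jobid_out, sizeof(jobid_out), &stat,
               DRMAA_TIMEOUT_WAIT_FOREVER, &rusage, err, sizeof(err));

    drmaa_release_attr_values(rusage);
    drmaa_delete_job_template(jt, err, sizeof(err));
    drmaa_exit(err, sizeof(err));
    return 0;
}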

Options

-f <jdf> : the job description file to submit
-flex : the job request is flexible
-optComm : if possible, try to optimize communication
-cm : if possible, try to minimize the number of clusters used
-x <clusters> : comma-separated list of clusters not to be used
-np <processes> : number of processes to run per node
-l <LEVEL> : set the log4j <FATAL|ERROR|WARN|DEBUG> output level
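
For example, the following invocation (hypothetical; the cluster name passed to -x is assumed here and depends on the KOALA site configuration) submits the job described in pois-ompi.jdf as a flexible request with two processes per node, asks the scheduler to minimize the number of clusters used, excludes the Delft cluster, and sets the log4j level to DEBUG:

omrunner -flex -cm -np 2 -x delft -l DEBUG -f pois-ompi.jdf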

Examples

The following are examples of running jobs with the OMRunner.

Simple co-allocated job execution

This example executes an MPI application that calculates pi and exits; a sketch of such a program is shown after the transcript below. The application has been compiled with OpenMPI on DAS-3.

[hashim@fs3 JDFs]$ cat cpi-das3.jdf
+(
&( count = "2")
( directory = "/home/hashim/bin" )
(maxWallTime = "15" )
( executable = "cpi-ompi" )
)
(&( count = "2")
( directory = "/home/hashim/bin" )
(maxWallTime = "15" )
( executable = "cpi-ompi" )
)

[hashim@fs3 JDFs]$ omrunner -f cpi-das3.jdf
Ksched - Assigned job ID 78755
Ksched - Job 78755 Assigned LOW_PRIORITY
Ksched - Reservation for component 1 succeed
Ksched - Placed component 2 on fs3.das3.tudelft.nl
Ksched - Placed component 1 on fs0.das3.cs.vu.nl
Ksched - Reservation for component 2 succeed
Runner - Submitting for execution component 1 to fs0.das3.cs.vu.nl
Ksched - Claiming for processors for job 78755 begins
Runner - Submitting for execution component 2 to fs3.das3.tudelft.nl
DRMAA - Component2@ fs3.das3.tudelft.nl: QUEUED
DRMAA - Component1@ fs0.das3.cs.vu.nl: QUEUED
DRMAA - Component2@ fs3.das3.tudelft.nl: ACTIVE
DRMAA - Component1@ fs0.das3.cs.vu.nl: ACTIVE
Process 0 of 4 on node319
Process 3 of 4 on node076
Process 2 of 4 on node077
Process 1 of 4 on node332
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.022639
Runner - Job 78755 has completed successfully

Compare the output of the OMRunner and that of the KRunner to spot the differences.
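
The output above comes from a cpi-style program in the spirit of the classic MPI pi example. A minimal sketch is given below; the actual cpi-ompi source used on DAS-3 may differ. Compiled with OpenMPI's mpicc (for instance, mpicc -o cpi-ompi cpi.c), the resulting binary can be submitted with the JDF shown above.

/* Minimal sketch of a cpi-style program (after the classic MPI cpi example);
 * the actual cpi-ompi source on DAS-3 may differ. */
#include <math.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int n = 10000, rank, size, i, namelen;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, start, end;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);
    printf("Process %d of %d on %s\n", rank, size, name);

    start = MPI_Wtime();
    /* Midpoint-rule integration of 4/(1+x^2) over [0,1], strided over ranks. */
    h = 1.0 / (double)n;
    sum = 0.0;
    for (i = rank + 1; i <= n; i += size) {
        x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    /* Combine the partial sums on rank 0 and report the result. */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    end = MPI_Wtime();

    if (rank == 0) {
        printf("pi is approximately %.16f, Error is %.16f\n",
               pi, fabs(pi - PI25DT));
        printf("wall clock time = %f\n", end - start);
    }
    MPI_Finalize();
    return 0;
}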

Another co-allocated job example

In this example we execute the Poisson application, which implements a parallel iterative algorithm to find a discrete approximation to the solution of the two-dimensional Poisson equation on the unit square. The job request has four non-fixed components of 16 nodes each, so it requests 64 nodes in total. With the -np 2 switch, two processes are started per node, so the job runs on 64 x 2 = 128 cores.

[hashim@fs3 JDFs]$ cat pois-ompi.jdf
+
( &(count = "16")
( directory = "/home/hashim/bin")
( maxWallTime = "15" )
( executable = "/home/hashim/bin/Pois-ompi" )
( arguments = "16" "8" )
)
( &(count = "16")
( directory = "/home/hashim/bin")
( maxWallTime = "15" )
( executable = "/home/hashim/bin/Pois-ompi" )
( arguments = "16" "8" )
)
( &(count = "16")
( directory = "/home/hashim/bin")
( maxWallTime = "15" )
( executable = "/home/hashim/bin/Pois-ompi" )
( arguments = "16" "8" )
)
( &(count = "16")
( directory = "/home/hashim/bin")
( maxWallTime = "15" )
( executable = "/home/hashim/bin/Pois-ompi" )
( arguments = "16" "8" )
)

[hashim@fs3 JDFs]$ omrunner -np 2 -f pois-ompi.jdf
Ksched - Assigned job ID 78760
Ksched - Job 78760 Assigned LOW_PRIORITY
Ksched - Reservation for component 1 succeed
Ksched - Reservation for component 2 succeed
Ksched - Reservation for component 3 succeed
Ksched - Reservation for component 4 succeed
Ksched - Claiming for processors for job 78760 begins
Ksched - Placed component 4 on fs0.das3.cs.vu.nl
Ksched - Placed component 2 on fs3.das3.tudelft.nl
Ksched - Placed component 1 on fs3.das3.tudelft.nl
Ksched - Placed component 3 on fs2.das3.science.uva.nl
Runner - Submitting for execution component 1 to fs3.das3.tudelft.nl
Runner - Submitting for execution component 3 to fs2.das3.science.uva.nl
Runner - Submitting for execution component 4 to fs0.das3.cs.vu.nl
Runner - Submitting for execution component 2 to fs3.das3.tudelft.nl
DRMAA - Component1@ fs3.das3.tudelft.nl: QUEUED
DRMAA - Component2@ fs3.das3.tudelft.nl: QUEUED
DRMAA - Component4@ fs0.das3.cs.vu.nl: QUEUED
DRMAA - Component3@ fs2.das3.science.uva.nl: QUEUED
DRMAA - Component1@ fs3.das3.tudelft.nl: ACTIVE
DRMAA - Component2@ fs3.das3.tudelft.nl: ACTIVE
DRMAA - Component4@ fs0.das3.cs.vu.nl: ACTIVE
DRMAA - Component3@ fs2.das3.science.uva.nl: ACTIVE
Iter.= 315 Proc. 0/128 : Elapsed total Wtime: 9.37 ( 99.7% CPU)
Runner - Job 78760 has completed successfully
