omrunner [OPTIONS] -f jdf
The OMRunner uses SSH and DRMAA to submit co-allocated OpenMPI jobs to remote clusters. DRMAA provides a common interface to the autonomous local resource managers of those clusters. OpenMPI is an open-source, highly configurable MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners.
When a job is submitted to multiple clusters on DAS-3, the OMRunner selects a fast interconnect for it. The high-speed Myri-10G interconnect is used unless the Delft cluster is selected, in which case Gigabit Ethernet is used instead. Besides OpenMPI jobs, the OMRunner can also submit non-co-allocated jobs to multiple remote clusters. Jobs compiled with other MPI implementations, such as MPICH, cannot be submitted with the OMRunner.
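The interconnect rule described above can be summarized in a few lines. This is only an illustrative sketch, not the OMRunner's actual code: the function name `select_interconnect` is hypothetical, and recognizing the Delft cluster by the `tudelft.nl` suffix of its head node is an assumption based on the host names in the sample output.

```python
# Illustrative sketch only; the OMRunner's real selection logic is not shown here.
MYRI_10G = "Myri-10G"
GIGABIT_ETHERNET = "Gigabit Ethernet"


def select_interconnect(cluster_hosts):
    """Return the interconnect for a job placed on the given cluster head nodes.

    Assumption: the Delft cluster is identified by a host name ending
    in "tudelft.nl".
    """
    uses_delft = any(host.endswith("tudelft.nl") for host in cluster_hosts)
    # Fall back to Gigabit Ethernet as soon as Delft is part of the placement.
    return GIGABIT_ETHERNET if uses_delft else MYRI_10G
```

With this rule, a placement on fs0.das3.cs.vu.nl and fs3.das3.tudelft.nl falls back to Gigabit Ethernet, while a placement that avoids Delft keeps Myri-10G.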
-flex : the job request is flexible
-optComm : if possible, try to optimize communication
-cm : if possible, try to minimize the number of clusters used
-x <clusters> : comma separated list of clusters not to be used
-np <processes> : number of processes to run per node
-l <LEVEL> : set log4j <FATAL|ERROR|WARN|DEBUG> output level
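As a rough illustration of how the options combine, the following sketch assembles an omrunner command line from the switches listed above. The helper `build_omrunner_command` and its parameter names are hypothetical; only the switches themselves come from the list, and any cluster names passed to -x are placeholders, since the expected name format is not specified here.

```python
import shlex

# Log levels as listed for the -l switch above.
LOG_LEVELS = ("FATAL", "ERROR", "WARN", "DEBUG")


def build_omrunner_command(jdf, flexible=False, optimize_comm=False,
                           minimize_clusters=False, excluded_clusters=(),
                           processes_per_node=None, log_level=None):
    """Return the omrunner argument list for the given (hypothetical) settings."""
    args = ["omrunner"]
    if flexible:
        args.append("-flex")
    if optimize_comm:
        args.append("-optComm")
    if minimize_clusters:
        args.append("-cm")
    if excluded_clusters:
        # -x takes a comma-separated list of clusters not to be used.
        args += ["-x", ",".join(excluded_clusters)]
    if processes_per_node is not None:
        args += ["-np", str(processes_per_node)]
    if log_level is not None:
        if log_level not in LOG_LEVELS:
            raise ValueError(f"unsupported log level: {log_level}")
        args += ["-l", log_level]
    # The job description file is always passed with -f.
    args += ["-f", jdf]
    return args


if __name__ == "__main__":
    print(shlex.join(build_omrunner_command("pois-ompi.jdf", processes_per_node=2)))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting.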
The following are examples of running jobs with the OMRunner.
This example executes an MPI application that calculates pi and exits. The application has been compiled with OpenMPI on DAS-3. The job request has two components of two nodes each.
[hashim@fs3 JDFs]$ cat cpi-das3.jdf
+
( &( count = "2" )
( directory = "/home/hashim/bin" )
( maxWallTime = "15" )
( executable = "cpi-ompi" )
)
( &( count = "2" )
( directory = "/home/hashim/bin" )
( maxWallTime = "15" )
( executable = "cpi-ompi" )
)
[hashim@fs3 JDFs]$ omrunner -f cpi-das3.jdf
Ksched - Assigned job ID 78755
Ksched - Job 78755 Assigned LOW_PRIORITY
Ksched - Reservation for component 1 succeed
Ksched - Placed component 2 on fs3.das3.tudelft.nl
Ksched - Placed component 1 on fs0.das3.cs.vu.nl
Ksched - Reservation for component 2 succeed
Runner - Submitting for execution component 1 to fs0.das3.cs.vu.nl
Ksched - Claiming for processors for job 78755 begins
Runner - Submitting for execution component 2 to fs3.das3.tudelft.nl
DRMAA - Component2@ fs3.das3.tudelft.nl: QUEUED
DRMAA - Component1@ fs0.das3.cs.vu.nl: QUEUED
DRMAA - Component2@ fs3.das3.tudelft.nl: ACTIVE
DRMAA - Component1@ fs0.das3.cs.vu.nl: ACTIVE
Process 0 of 4 on node319
Process 3 of 4 on node076
Process 2 of 4 on node077
Process 1 of 4 on node332
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.022639
Runner - Job 78755 has completed successfully
Compare the output of the OMRunner and that of the KRunner to spot the differences.
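When comparing runs, it can help to extract the component placement from the runner output. The sketch below does this for the "Placed component" lines shown in the example above; the function name `parse_placements` is hypothetical, and the line format is assumed to be exactly as printed in that sample output.

```python
import re

# Matches scheduler lines such as
# "Ksched - Placed component 2 on fs3.das3.tudelft.nl"
PLACED = re.compile(r"Ksched - Placed component (\d+) on (\S+)")


def parse_placements(output):
    """Map each component number to the head node it was placed on."""
    placements = {}
    for line in output.splitlines():
        match = PLACED.search(line)
        if match:
            placements[int(match.group(1))] = match.group(2)
    return placements
```

Applied to the output of the cpi example, this maps component 1 to fs0.das3.cs.vu.nl and component 2 to fs3.das3.tudelft.nl.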
In this example we execute the Poisson application, which implements a parallel iterative algorithm to find a discrete approximation to the solution of the two-dimensional Poisson equation on the unit square. The job request has four non-fixed components of 16 nodes each, 64 nodes in total. With the -np 2 switch, two processes run on each node, so the job runs on 128 cores.
[hashim@fs3 JDFs]$ cat pois-ompi.jdf
+
( &(count = "16")
( directory = "/home/hashim/bin")
( maxWallTime = "15" )
( executable = "/home/hashim/bin/Pois-ompi" )
( arguments = "16" "8" )
)
( &(count = "16")
( directory = "/home/hashim/bin")
( maxWallTime = "15" )
( executable = "/home/hashim/bin/Pois-ompi" )
( arguments = "16" "8" )
)
( &(count = "16")
( directory = "/home/hashim/bin")
( maxWallTime = "15" )
( executable = "/home/hashim/bin/Pois-ompi" )
( arguments = "16" "8" )
)
( &(count = "16")
( directory = "/home/hashim/bin")
( maxWallTime = "15" )
( executable = "/home/hashim/bin/Pois-ompi" )
( arguments = "16" "8" )
)
[hashim@fs3 JDFs]$ omrunner -np 2 -f pois-ompi.jdf
Ksched - Assigned job ID 78760
Ksched - Job 78760 Assigned LOW_PRIORITY
Ksched - Reservation for component 1 succeed
Ksched - Reservation for component 2 succeed
Ksched - Reservation for component 3 succeed
Ksched - Reservation for component 4 succeed
Ksched - Claiming for processors for job 78760 begins
Ksched - Placed component 4 on fs0.das3.cs.vu.nl
Ksched - Placed component 2 on fs3.das3.tudelft.nl
Ksched - Placed component 1 on fs3.das3.tudelft.nl
Ksched - Placed component 3 on fs2.das3.science.uva.nl
Runner - Submitting for execution component 1 to fs3.das3.tudelft.nl
Runner - Submitting for execution component 3 to fs2.das3.science.uva.nl
Runner - Submitting for execution component 4 to fs0.das3.cs.vu.nl
Runner - Submitting for execution component 2 to fs3.das3.tudelft.nl
DRMAA - Component1@ fs3.das3.tudelft.nl: QUEUED
DRMAA - Component2@ fs3.das3.tudelft.nl: QUEUED
DRMAA - Component4@ fs0.das3.cs.vu.nl: QUEUED
DRMAA - Component3@ fs2.das3.science.uva.nl: QUEUED
DRMAA - Component1@ fs3.das3.tudelft.nl: ACTIVE
DRMAA - Component2@ fs3.das3.tudelft.nl: ACTIVE
DRMAA - Component4@ fs0.das3.cs.vu.nl: ACTIVE
DRMAA - Component3@ fs2.das3.science.uva.nl: ACTIVE
Iter.= 315 Proc. 0/128 : Elapsed total Wtime: 9.37 ( 99.7% CPU)
Runner - Job 78760 has completed successfully
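The process count reported in the output above (Proc. 0/128) follows from the job description: 4 components x 16 nodes x 2 processes per node. The sketch below reproduces that arithmetic by reading the count fields from a JDF. The function name `job_size` is hypothetical, the regular expression only covers the quoted count syntax used in these examples, and the default of one process per node is an assumption inferred from the first example, where four nodes yield four processes.

```python
import re


def job_size(jdf_text, processes_per_node=1):
    """Return (components, nodes, processes) for a job description.

    Only count = "N" fields are read; every other JDF field is ignored.
    """
    counts = [int(value) for value in re.findall(r'count\s*=\s*"(\d+)"', jdf_text)]
    nodes = sum(counts)
    # processes_per_node plays the role of the -np switch.
    return len(counts), nodes, nodes * processes_per_node
```

For pois-ompi.jdf with `processes_per_node=2` this gives 4 components, 64 nodes, and 128 processes; for cpi-das3.jdf with the default it gives 2 components, 4 nodes, and 4 processes.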