krunner [OPTIONS] -f jdf
The KRunner is the default Globus runner of KOALA. It implements the most basic way of running a job on a grid and can be used for almost any kind of job, but it does not implement the specific requirements that certain job types may have.
-l <LEVEL> : set the log4j output level (FATAL|ERROR|WARN|DEBUG)
-g : stage the executable to the execution site
-flex : mark the job request as flexible
-optComm : try to optimize communication, if possible
-cm : try to minimize the number of clusters used, if possible
-x <clusters> : comma-separated list of clusters not to be used
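These options can be combined in a single invocation. For instance, a flexible job that avoids a particular cluster and logs at the DEBUG level could be submitted as follows (the JDF file name myjob.jdf is hypothetical):

[hashim@fs3 JDFs]$ krunner -l DEBUG -flex -x fs0.das3.cs.vu.nl -f myjob.jdf

Here -flex tells the Ksched that the job request is flexible, -x excludes fs0.das3.cs.vu.nl from placement, and -l DEBUG raises the log output level.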
The following are examples of running jobs with the KRunner.
The first example is a very simple job that just executes "uname -n" and exits. This can be done with the RSL given below, which in this example is stored in the file 'uname-1.jdf'. The simplest way of starting a job is shown.
& ( directory = "/bin" )
  ( arguments = "-n" )
  ( executable = "uname" )
  ( maxWallTime = "15" )
  ( count = "5" )

[hashim@fs3 JDFs]$ krunner -f uname-1.jdf
Ksched - Assigned job ID 78624
Ksched - Job 78624 Assigned LOW_PRIORITY
Ksched - Reservation for component 1 succeed
Ksched - Placed component 1 on fs3.das3.tudelft.nl
Ksched - Claiming for processors for job 78624 begins
Runner - Submitting for execution component 1 to fs3.das3.tudelft.nl
GRAM - Component1 @ fs3.das3.tudelft.nl: PENDING
node358
node301
node362
node342
node310
GRAM - Component1 @ fs3.das3.tudelft.nl: DONE
Runner - Job 78624 has completed successfully
The KRunner sends a new job request to the Ksched, the KOALA scheduler. If the RSL is correct, the Ksched responds with a KOALA job ID and the priority level assigned to the job. After the job has been placed successfully, the Ksched informs the runner of the execution site selected for the component, in this case fs3.das3.tudelft.nl. At the predetermined claiming time, the Ksched instructs the runner to start claiming processors for the job components, and the runner then submits each job component to its selected execution site for execution. The lines node358, node301, node362, node342, and node310 are the standard output redirected from the nodes where the command uname -n has been running. The GRAM status messages are the transition messages coming from the local resource manager informing us about the progress of the job; a successful job component passes through the stages shown in the transcript, from PENDING to DONE (components that first stage files in also pass through STAGE_IN, as in the next example).
In this example we run an MPICH application that calculates pi. The job request, shown below, is semi-fixed and consists of two components. We want the standard output of the run to be appended to the file out.dat. Note that in the RSL we have added the "jobtype" attribute; this is required by the Globus GRAM for MPI jobs.
[hashim@fs3 JDFs]$ cat cpi-mpich.jdf
+
( & ( count = "2" )
    ( directory = "/home/hashim/bin" )
    ( maxWallTime = "15" )
    ( jobtype = "mpi" )
    ( stdout = "out.dat" )
    ( executable = "/home/hashim/bin/cpi.mpich" )
    ( resourcemanagercontact = "fs2.das3.science.uva.nl" ) )
( & ( count = "2" )
    ( directory = "/home/hashim/bin" )
    ( maxWallTime = "15" )
    ( jobtype = "mpi" )
    ( stdout = "out.dat" )
    ( executable = "/home/hashim/bin/cpi.mpich" ) )

[hashim@fs3 JDFs]$ krunner -f cpi-mpich.jdf
Ksched - Assigned job ID 78647
Ksched - Job 78647 Assigned LOW_PRIORITY
Ksched - Reservation for component 1 succeed
Ksched - Placed component 2 on fs0.das3.cs.vu.nl
Ksched - Reservation for component 2 succeed
Ksched - Placed component 1 on fs2.das3.science.uva.nl
Ksched - Claiming for processors for job 78647 begins
Runner - Submitting for execution component 2 to fs0.das3.cs.vu.nl
Runner - Submitting for execution component 1 to fs2.das3.science.uva.nl
GRAM - Component1 @ fs2.das3.science.uva.nl: STAGE_IN
GRAM - Component2 @ fs0.das3.cs.vu.nl: STAGE_IN
GRAM - Component2 @ fs0.das3.cs.vu.nl: PENDING
GRAM - Component1 @ fs2.das3.science.uva.nl: PENDING
GRAM - Component2 @ fs0.das3.cs.vu.nl: DONE
GRAM - Component1 @ fs2.das3.science.uva.nl: DONE
Runner - Job 78647 has completed successfully

[hashim@fs3 JDFs]$ more out.dat
Process 1 of 2 on node011.beowulf.cluster
Process 0 of 2 on node004.beowulf.cluster
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000000
Process 1 of 2 on node218.beowulf.cluster
Process 0 of 2 on node230.beowulf.cluster
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000000
The components are sent to fs2.das3.science.uva.nl, which was fixed in the RSL, and to fs0.das3.cs.vu.nl, which was selected by the Ksched. Since the KRunner does not support co-allocation, the two components are executed independently and hence each produces its own output.
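The pattern for a semi-fixed job follows from the example: a component is pinned to a site by adding the resourcemanagercontact attribute, while components without it are placed by the Ksched. A minimal sketch (the executable path is hypothetical):

+
( & ( count = "2" )
    ( executable = "/home/hashim/bin/myapp" )
    ( resourcemanagercontact = "fs2.das3.science.uva.nl" ) )
( & ( count = "2" )
    ( executable = "/home/hashim/bin/myapp" ) )

The first component will always run on fs2.das3.science.uva.nl; only the second is subject to the scheduler's placement decision.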