Dvbmonkey’s Blog

March 3, 2009

Open MPI Master & Servant Example – BogoMips

Filed under: linux — dvbmonkey @ 12:44 pm

In yesterday’s post I introduced a simple ‘master & servant’ technique where I used the rank-0 node to collate results from all the other nodes. To do this I used MPI_Send and MPI_Recv to send and receive 128-byte MPI_CHAR strings. Today I am extending the example by sending and receiving MPI_FLOAT values, to demonstrate that native C/C++ numerical values can be passed between nodes just as easily.

What are BogoMips?

According to the Wikipedia article, BogoMips are an “unscientific measurement of CPU speed made by the Linux kernel when it boots, to calibrate an internal busy-loop”. If you’ve used Linux for some time you may have noticed the “BogoMips” value in the boot-up console messages. Alternatively, you can cat /proc/cpuinfo to see the values your Linux kernel calculated during boot.
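As a rough illustration of reading those values from a program instead of with cat, here is a minimal sketch. The function names parse_bogomips_line and total_bogomips are made up for this example, and it assumes each CPU entry contains a line of the form “bogomips : 4800.00” (some architectures spell the key “BogoMIPS”).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse one line of /proc/cpuinfo text. Returns 1 and stores the value in
 * *out if the line is a "bogomips : <number>" entry, otherwise returns 0.
 * (Hypothetical helper, not part of any library.) */
int parse_bogomips_line(const char *line, double *out)
{
    assert(line != NULL && out != NULL);

    /* Accept the two spellings of the key mentioned in the text above. */
    if (strncmp(line, "bogomips", 8) != 0 && strncmp(line, "BogoMIPS", 8) != 0)
        return 0;

    const char *colon = strchr(line, ':');
    if (colon == NULL)
        return 0;

    char *end;
    double value = strtod(colon + 1, &end);
    if (end == colon + 1)          /* nothing numeric after the colon */
        return 0;

    *out = value;
    return 1;
}

/* Sum the BogoMips of every CPU listed in the given file (normally
 * "/proc/cpuinfo"). Returns -1.0 if the file cannot be opened. */
double total_bogomips(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1.0;

    char line[256];
    double total = 0.0, value;
    while (fgets(line, sizeof line, fp) != NULL)
        if (parse_bogomips_line(line, &value))
            total += value;

    fclose(fp);
    return total;
}
```

Calling total_bogomips("/proc/cpuinfo") on a Linux box should give the sum over all cores, which is a handy sanity check against the kernel’s own figures.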

Example: bogomips.c

The example below is based on the Linux kernel code in init/main.c and include/linux/delay.h and on the ‘Standalone BogoMips’ code by Jeff Tranter. It is a really simple ‘MPI BogoMips’ calculation: each node takes the average of 10 BogoMips calculations for itself and sends the result with MPI_Send to the rank-0 node, which sums them up and prints the total.

/*
 * Based on code Linux kernel code in init/main.c and include/linux/delay.h
 * and the example code by Jeff Tranter (Jeff_Tranter@Mitel.COM)
 */

#include <stdio.h>
#include <time.h>
#include <mpi.h>

/* ticks per second assumed by the BogoMips formula; used by both delay() variants */
int HZ = 100;

/* NOTE: the guard macro name is an assumption (borrowed from Jeff Tranter's
 * standalone code); build with -DCLASSIC_BOGOMIPS to get the TSC version. */
#ifdef CLASSIC_BOGOMIPS
    /* the original code from the Linux kernel */
    #define rdtscl(low) \
         __asm__ __volatile__ ("rdtsc" : "=a" (low) : : "edx")

    //This delay() is the one used on x86's with TSC after 2.2.14.
    //It won't work on a non TSC x86, period.
    static __inline__ void delay(unsigned long loops)
    {
        unsigned long bclock, now;
        rdtscl(bclock);
        do {
            rdtscl(now);
        }
        while ((now - bclock) < loops);
    }
#else
    /* portable version; compile without optimisation (-O0), otherwise
     * the compiler may remove the empty loop altogether */
    static void delay(int loops)
    {
        long i;
        for (i = loops; i >= 0; i--);
    }
#endif

/* this should be approx 2 Bo*oMips to start (note initial shift), and will
 *    still work even if initially too large, it will just take slightly longer */
unsigned long loops_per_jiffy = (1 << 12);

/* This is the number of bits of precision for the loops_per_jiffy.  Each
 *    bit takes on average 1.5/HZ seconds.  This (like the original) is a little
 *       better than 1% */
#define LPS_PREC 8

int numprocs, rank, namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];

//plagiarized straight from the 2.4 sources.
float calibrate_delay(void)
{
    unsigned long ticks, loopbit;
    int lps_precision = LPS_PREC;

    loops_per_jiffy = (1 << 12);

    /* coarse pass: double the loop count until one delay() spans a clock tick */
    while (loops_per_jiffy <<= 1) {
        /* wait for the start of a clock tick */
        ticks = clock();
        while (ticks == clock())
            /* nothing */ ;
        /* go */
        ticks = clock();
        delay(loops_per_jiffy);
        ticks = clock() - ticks;
        if (ticks)
            break;
    }

    /* fine pass: binary approximation, up to lps_precision bits */
    loops_per_jiffy >>= 1;
    loopbit = loops_per_jiffy;
    while (lps_precision-- && (loopbit >>= 1)) {
        loops_per_jiffy |= loopbit;
        ticks = clock();
        while (ticks == clock());
        ticks = clock();
        delay(loops_per_jiffy);
        if (clock() != ticks)   /* longer than 1 tick */
            loops_per_jiffy &= ~loopbit;
    }

    return (loops_per_jiffy / (500000/HZ)) + (float)((loops_per_jiffy/(5000/HZ))%100) / (float)100;
}

int main(int argc, char *argv[])
{
    int i;
    float bogomips = 0;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    /* average of 10 calibration runs (the loop bound must match the divisor) */
    for ( i = 0; i < 10; i++ ) {
        bogomips += calibrate_delay();
    }
    bogomips = bogomips / (float) 10;

    printf( "[%02d/%02d %s] returned = %f BogoMips\n", rank, numprocs, processor_name, bogomips );

    if ( rank == 0 ) {
        /* master: add our own value to the one received from every other rank */
        float totalBogomips = bogomips;
        for ( i = 1; i < numprocs; i++ ) {
            float f = 0;
            MPI_Recv(&f, 1, MPI_FLOAT, i, 0, MPI_COMM_WORLD, &stat);
            totalBogomips += f;
        }
        printf( "Total = %f BogoMips\n", totalBogomips );
    } else {
        /* servant: send our single float to rank 0 */
        MPI_Send(&bogomips, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

The result of running this should look something like this:

[00/18 mpinode01] 2293.593018 BogoMips
[01/18 mpinode01] 2147.446045 BogoMips
[02/18 mpinode01] 2230.513916 BogoMips
[03/18 mpinode01] 2473.651855 BogoMips
[04/18 mpinode02] 3659.688721 BogoMips
[05/18 mpinode02] 4057.167236 BogoMips
[06/18 mpinode02] 4067.651123 BogoMips
[07/18 mpinode02] 4419.580078 BogoMips
[08/18 mpinode03] 2368.138916 BogoMips
[09/18 mpinode03] 3327.585938 BogoMips
[10/18 mpinode03] 3277.451904 BogoMips
[11/18 mpinode03] 3130.323975 BogoMips
[12/18 mpinode04] 2940.759766 BogoMips
[13/18 mpinode04] 3207.983154 BogoMips
[14/18 mpinode04] 4362.892090 BogoMips
[15/18 mpinode04] 3313.822998 BogoMips
[16/18 mpinode05] 2390.749023 BogoMips
[17/18 mpinode05] 3017.437012 BogoMips
Total = 56686.441406 BogoMips
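
The return expression at the end of calibrate_delay() is fairly dense, so here is the same arithmetic pulled out into a small stand-alone function. This is only a sketch: the name bogomips_from_lpj is made up for this post, and it assumes HZ divides 5000 evenly (true for the value of 100 used in the listing). In effect it computes loops_per_jiffy × HZ / 500000, truncated to two decimal places.

```c
#include <assert.h>

/* Convert a calibrated loops-per-jiffy count into a BogoMips figure with two
 * decimal places, mirroring the return expression of calibrate_delay().
 * hz is the number of jiffies per second (100 in the listing). */
float bogomips_from_lpj(unsigned long loops_per_jiffy, int hz)
{
    assert(hz > 0 && 5000 % hz == 0);   /* keep the integer divisions exact */

    /* whole part: loops per second divided by 500000 */
    unsigned long whole = loops_per_jiffy / (500000 / hz);
    /* fractional part: the next two decimal digits of the same quotient */
    unsigned long hundredths = (loops_per_jiffy / (5000 / hz)) % 100;

    return (float)whole + (float)hundredths / 100.0f;
}
```

For instance, a loops_per_jiffy of 11468800 with HZ = 100 gives 11468800 / 5000 = 2293 for the whole part and (11468800 / 50) % 100 = 76 for the hundredths, i.e. about 2293.76 BogoMips.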


March 2, 2009

An Open MPI Master & Servant Example

Filed under: linux — dvbmonkey @ 2:26 pm

Building on the Getting started… post from last week I’ve knocked up a quick example showing one way to get your MPI processes to communicate with one another.


#include <stdio.h>
#include <mpi.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
   int numprocs, rank, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Get_processor_name(processor_name, &namelen);

   if ( rank == 0 ) {
      printf( "[%02d/%02d %s]: I am the master\n", rank, numprocs, processor_name );
      // Tell the servants to do something
   } else {
      printf( "[%02d/%02d %s]: I am a servant\n", rank, numprocs, processor_name );
      // Wait for something to do
   }

   // Every MPI program must shut the library down before exiting
   MPI_Finalize();
   return 0;
}

Build this with mpicc master_servant.c -o master_servant and run it. You should get something like:

[00/08 mpinode01]: I am the master
[01/08 mpinode01]: I am a servant
[02/08 mpinode01]: I am a servant
[03/08 mpinode01]: I am a servant
[04/08 mpinode02]: I am a servant
[05/08 mpinode02]: I am a servant
[06/08 mpinode02]: I am a servant
[07/08 mpinode02]: I am a servant

Ok, this means that based on the rank returned by MPI_Comm_rank we can decide which instance of the program is going to act as the “master” and which instance(s) are going to act as “servants” – pretty neat!

In the next example we build on this by getting the program instances to communicate with one another. It is borrowed from the example found here.


#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
   char idstr[32], buff[128];
   int numprocs, rank, namelen, i;
   char processor_name[MPI_MAX_PROCESSOR_NAME];

   MPI_Status stat;
   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Get_processor_name(processor_name, &namelen);

   // Based on example from https://wiki.inf.ed.ac.uk/pub/ANC/ComputationalResources/slides.pdf
   if (rank == 0) {
      // This is the rank-0 copy of the process
      printf("We have %d processors\n", numprocs);
      // Send each process a "Hello ... " string
      for(i = 1; i < numprocs; i++) {
         sprintf(buff, "Hello %d... ", i);
         MPI_Send(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD);
      }
      // Go into a blocking-receive for each servant process
      for(i = 1; i < numprocs; i++) {
         MPI_Recv(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD, &stat);
         printf("%s\n", buff);
      }
   } else {
      // Go into a blocking-receive waiting
      MPI_Recv(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
      // Append our identity onto the received string
      sprintf(idstr, "Processor %d ", rank);
      strcat(buff, idstr);
      strcat(buff, "reporting!");
      // Send the string back to the rank-0 process
      MPI_Send(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
   }

   // Every MPI program must shut the library down before exiting
   MPI_Finalize();
   return 0;
}

Build this example with mpicc master_servant2.c -o master_servant2 and run it. You should get output like the following:

We have 8 processors
Hello 1... Processor 1 reporting!
Hello 2... Processor 2 reporting!
Hello 3... Processor 3 reporting!
Hello 4... Processor 4 reporting!
Hello 5... Processor 5 reporting!
Hello 6... Processor 6 reporting!
Hello 7... Processor 7 reporting!
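
The servant in the listing builds its reply with sprintf and strcat, which is fine for these short strings but would overflow the 128-byte buffer if the text ever grew. As a sketch of a bounded alternative (the helper name append_identity is made up here, and it is plain C with no MPI calls so it can be tried on its own):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append "Processor <rank> reporting!" to the NUL-terminated string in buff
 * without ever writing past buff[size - 1].
 * Returns 1 on success, 0 if buff was not terminated or the text was cut. */
int append_identity(char *buff, size_t size, int rank)
{
    assert(buff != NULL && size > 0);

    /* Find the current end of the string, looking no further than size bytes. */
    const char *nul = memchr(buff, '\0', size);
    if (nul == NULL)
        return 0;
    size_t used = (size_t)(nul - buff);

    /* snprintf reports how many characters it wanted to write, so a result
     * that does not fit in the remaining space means truncation. */
    int n = snprintf(buff + used, size - used, "Processor %d reporting!", rank);
    return n >= 0 && (size_t)n < size - used;
}
```

In the servant branch this would stand in for the sprintf/strcat lines, called as append_identity(buff, sizeof buff, rank) right after the MPI_Recv.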

Now you can use this master/servant technique to partition work across instances of your MPI-capable program.
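
To make “partition work” a little more concrete, here is a minimal sketch of one common way to do it: splitting a number of work items into contiguous ranges, one per rank. The struct and function names are invented for this example, and the code is plain C so the arithmetic can be checked without a cluster.

```c
#include <assert.h>

/* Half-open range [begin, end) of work items assigned to one rank. */
struct work_range {
    long begin;
    long end;
};

/* Split `total` items as evenly as possible over `numprocs` ranks.
 * The first (total % numprocs) ranks each get one extra item. */
struct work_range range_for_rank(long total, int numprocs, int rank)
{
    assert(total >= 0 && numprocs > 0 && rank >= 0 && rank < numprocs);

    long base = total / numprocs;    /* items every rank gets */
    long extra = total % numprocs;   /* left-over items to hand out */

    struct work_range r;
    r.begin = rank * base + (rank < extra ? rank : extra);
    r.end = r.begin + base + (rank < extra ? 1 : 0);
    return r;
}
```

With 10 items on 4 processes this gives the ranges [0,3), [3,6), [6,8) and [8,10). If rank 0 only collates results, as in the examples above, you would pass numprocs - 1 and rank - 1 so that only the servants receive work.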


If you get an error like “[hostname][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113”, try shutting down iptables on the MPI nodes. It was a quick fix for me, though I am sure there is a ‘proper’ way to configure it. Keep in mind it’s probably not a good idea to switch off iptables on a machine that is connected to the open internet; the machines I have used in this guide are all on an internal network.
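
On Linux, errno 113 is EHOSTUNREACH (“No route to host”), which is typically what the connecting side sees when a firewall rejects the connection with icmp-host-prohibited. If you want to decode such numbers yourself, a small sketch follows; the function name describe_connect_errno is made up for this post.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Write a one-line description of a connect() errno value into out.
 * Returns 1 if the code is EHOSTUNREACH (the "firewall is probably in the
 * way" case discussed above), 0 for any other error. */
int describe_connect_errno(int err, char *out, size_t size)
{
    assert(out != NULL && size > 0);

    /* strerror() gives the system's own text for the error number. */
    snprintf(out, size, "errno=%d: %s", err, strerror(err));
    return err == EHOSTUNREACH;
}
```

Passing 113 on a Linux machine should produce the “No route to host” text; other platforms may number their error codes differently, which is why the code compares against the EHOSTUNREACH constant rather than the literal 113.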

February 27, 2009

Getting started with Open MPI on Fedora

Filed under: linux — dvbmonkey @ 4:36 pm

I recently rediscovered the world of parallel computing after wondering what to do with a bunch of mostly idle Linux boxes, all running various versions of Fedora Core Linux. I found this guide particularly useful and decided to elaborate on the subject here.


Open MPI is an open-source implementation of the Message Passing Interface, which allows programmers to write software that runs on several machines simultaneously. Furthermore, it allows these copies of the program to communicate and cooperate with each other to, say, share the load of an intensive calculation amongst themselves or daisy-chain the results from one ‘node’ to another. This is not new; it’s been around for decades, and today it is one of the main techniques used on supercomputing platforms.

The basic principle is that you need two things: firstly, the MPI development suite in order to build your MPI-capable applications (e.g. Open MPI), and secondly, a client/server queue manager to distribute the programs to remote computers and return the results (e.g. TORQUE). Both these components are distributed by the Fedora Project and are readily available.

Setting up the TORQUE server

Firstly, you will need to doctor the /etc/hosts file, placing your preferred hostname in front of “localhost” on the loopback-address line, for example: mpimaster localhost.localdomain localhost

Now you will need to install the following packages, using something like YUM. Note that the package torque-client will require some GUI-related libraries (freetype, libX*, tcl, tk etc.) even if you’re not using X on the torque-server.

$ sudo yum install torque torque-client torque-server torque-mom libtorque

Next you will need to do some initial setup with the commands below. If you get a warning that pbs_server is already running, do a /etc/init.d/pbs_server stop first.

$ sudo /usr/sbin/pbs_server -t create
$ sudo /usr/share/doc/torque-2.1.10/torque.setup root

Now create the file /var/torque/mom_priv/config and put the hostname of this server in it:

$pbsserver mpimaster

Create another file, /var/torque/server_priv/nodes, which will contain a list of all the nodes/clients we’re going to be using. The parameter “np=4” describes the number of processors (or cores) available on this node; both clients below have a quad-core processor, so I have set “np=4”. If you need to add more nodes to your MPI cluster at a later time, this is where you configure them.

mpinode01 np=4
mpinode02 np=4

We create another config file, /var/torque/server_name, this time containing just the hostname of the server machine:

mpimaster

Now we update iptables to allow incoming connections to the server. Below is an example of my own configuration; the two additional lines are the ones opening tcp/udp ports 15000 to 15004. Once done, run sudo /etc/init.d/iptables restart to pick up the new settings.

# Firewall configuration written by system-config-firewall
# Manual customization of this file is not recommended.
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 15000:15004 -j ACCEPT
-A INPUT -p udp -m udp --dport 15000:15004 -j ACCEPT

-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited

IMPORTANT: Commands are sent to the client nodes over RSH/SSH. In order to make this all work, it’s assumed you’ve set up key-based SSH from the server to each of the client nodes.

All done, a quick restart of the torque server and we’re onto setting up our client nodes.

$ sudo /etc/init.d/pbs_server restart
$ sudo /etc/init.d/pbs_mom restart

Setting up the TORQUE client nodes

Going for speed/efficiency, I devised a one-line shell command that installs and configures each client when run as root:

# yum -y install torque-client torque-mom && echo -e "\tmpimaster" >> /etc/hosts && echo "mpimaster" >> /var/torque/server_name && echo "\$pbsserver mpimaster" >> /var/torque/mom_priv/config && /etc/init.d/pbs_mom start

But basically it breaks down into the following:

* Install the client software
$ sudo yum install openmpi torque-client torque-mom

* Add the server’s hostname and address to the /etc/hosts file
# echo -e "\tmpimaster" >> /etc/hosts

* Set the server’s hostname in the config file(s)
# echo "mpimaster" >> /var/torque/server_name
# echo "\$pbsserver mpimaster" >> /var/torque/mom_priv/config

* Start the service
# /etc/init.d/pbs_mom start

Testing it out

From the ‘mpimaster’ machine, you should be able to issue the command pbsnodes -a and see the client machines connected, e.g.

$ pbsnodes -a
state = free
np = 4
ntype = cluster
status = opsys=linux,uname=Linux pepe #1 SMP Wed Jan 21 01:54:56 EST 2009 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=861421,
totmem=5359032kb,availmem=5277996kb,physmem=4146624kb,ncpus=4,loadave=0.00,netload=104310870,state=free,jobs=? 0,rectime=1235751237

state = free
np = 4
ntype = cluster
status = opsys=linux,uname=Linux taz #1 SMP Wed Jan 21 01:54:56 EST 2008 i686,sessions=? 0,nsessions=? 0,nusers=0,idletime=366959,
totmem=5359048kb,availmem=5277268kb,physmem=4146640kb,ncpus=4,loadave=0.00,netload=46008061,state=free,jobs=? 0,rectime=1235751223

If you see this, congratulations, you are ready to rock! If your client nodes are not connected, check the configuration and the network connectivity, and lastly check that the ‘pbs_mom’ service is running on each client; optionally, try restarting it.

MPI Development

You’ll need to install a couple of additional packages on your development machine:

$ sudo yum install openmpi openmpi-devel openmpi-libs

Now let’s start with the inevitable ‘Hello World!’ example:


#include <stdio.h>
#include <mpi.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
   int numprocs, rank, namelen;
   char processor_name[MPI_MAX_PROCESSOR_NAME];
   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Get_processor_name(processor_name, &namelen);
   printf("Hello World! from process %d out of %d on %s\n", rank, numprocs, processor_name);

   // Every MPI program must shut the library down before exiting
   MPI_Finalize();
   return 0;
}

Normally we’d just use gcc to build this, but for convenience Open MPI provides mpicc, a wrapper that handles the include and library paths for you.

$ mpicc hello.c -o hello

In order to tell Open MPI / Torque where to run your application we must provide it with a “hostfile” (called myhostfile here), similar to the file /var/torque/server_priv/nodes we made earlier:


mpinode01 slots=4
mpinode02 slots=4
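
The slots values add up to the number of processes you can expect across the cluster, here 4 + 4 = 8. As a sketch of that bookkeeping, the code below parses lines of exactly the form shown above; real Open MPI hostfiles accept more syntax than this, and the function names are made up for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Parse one hostfile line of the form "<hostname> slots=<n>".
 * Returns the slot count, or 0 if the line does not match that form. */
int slots_on_line(const char *line)
{
    assert(line != NULL);

    char host[256];
    int slots = 0;
    if (sscanf(line, "%255s slots=%d", host, &slots) != 2 || slots < 0)
        return 0;
    return slots;
}

/* Add up the slots over an array of hostfile lines. */
int total_slots(const char *const lines[], int count)
{
    assert(lines != NULL && count >= 0);

    int total = 0;
    for (int i = 0; i < count; i++)
        total += slots_on_line(lines[i]);
    return total;
}
```

Feeding it the two lines of myhostfile gives 8, which matches the “out of 8” in the Hello World output further down.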

Now we’re ready to run it for the first time. Note that in this example I did my development work on the machine acting as the ‘mpimaster’; if you try submitting an MPI job from another machine you might need a slightly different configuration.

$ mpirun --hostfile myhostfile hello
Hello World! from process 0 out of 8 on mpinode01
Hello World! from process 1 out of 8 on mpinode01
Hello World! from process 2 out of 8 on mpinode01
Hello World! from process 3 out of 8 on mpinode01
Hello World! from process 4 out of 8 on mpinode02
Hello World! from process 5 out of 8 on mpinode02
Hello World! from process 6 out of 8 on mpinode02
Hello World! from process 7 out of 8 on mpinode02

Voilà, you have just submitted an MPI task and had it execute on a number of your processors.

MPI makes distribution and communication between copies of your program easy; however, it’s up to you to use this potential to provide a real speed-up in a real-world application. A really simple example is a program that operates on a set of 8 large files. Normally, running on a single processor, you would process these files sequentially. Using MPI you could load 8 copies of your program on 8 processing nodes and have each node process a different file, effectively giving you up to an 8-times speed-up compared to running it on a single processor.
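
The 8-times figure is the best case, and it only holds when the files divide evenly over the processes. A quick sketch of the arithmetic, assuming all files take the same time to process and ignoring start-up and communication costs (both function names are invented for this example):

```c
#include <assert.h>

/* Number of sequential "rounds" needed when `files` equally sized files are
 * spread over `procs` processes, each handling one file at a time. */
int rounds_needed(int files, int procs)
{
    assert(files >= 0 && procs > 0);
    return (files + procs - 1) / procs;   /* ceiling division */
}

/* Ideal speed-up over a single process under the assumptions above. */
double ideal_speedup(int files, int procs)
{
    int rounds = rounds_needed(files, procs);
    return rounds == 0 ? 1.0 : (double)files / rounds;
}
```

With 8 files on 8 processes there is one round and the speed-up is 8. With 8 files on 3 processes there are 3 rounds, so the speed-up drops to 8/3, or roughly 2.7, because one process sits idle during the last round.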

I’ve loosely tested the approach described here on different systems running Fedora Core Linux versions 8, 9 & 10. Any questions / comments welcomed!

If you run into trouble, first try the Open MPI FAQ. Personally, I encountered the following problems:

  • mpirun appears to ‘hang’: caused by iptables; I just shut down iptables to resolve the issue.
  • Fedora Core 7: the package sets the wrong library path in /etc/ld.so.conf
  • Fedora Core 7: the package included with the distribution ‘doesn’t work’ because of library issues

Updated: 2nd March 2009
Oops! As Jeff Squyres pointed out in his comment below, the way I configured things in the original post meant that “mpirun” just spawned 8 processes on my localhost, not on the remote nodes. I’ve reworked the configuration to account for this. Many thanks Jeff!
