Distributed Computing using MOE

Jocelyn Demers and Paul Labute
Chemical Computing Group Inc.
 

 

Introduction

Distributed computing requires coordinating the efforts of a collection of processing nodes linked together by a network. It is often the only way to reduce the time required for a lengthy computation to a tolerable level.

MOE and MOE/batch are the interactive and batch versions, respectively, of Chemical Computing Group's Molecular Operating Environment software for molecular modeling and drug discovery. MOE/smp is an acronym for MOE / Scalable Multi-Processor. MOE/smp is an enhancement to MOE and MOE/batch allowing multiple MOE processes to cooperate on large-scale distributed computing jobs.

The principal obstacles to distributed computing are hardware availability, node interconnection, and message routing. In this article, we discuss these obstacles and show how they have been overcome in the design of MOE/smp. We also describe new functions added to MOE's integrated high-level programming language, SVL (Scientific Vector Language), to support parallel programming, and give an example of their usage.

Distributed Computing

Central to the notion of distributed computing is the simple fact that more than one processor is available. Currently, multi-processor servers deliver excellent performance, but their high cost limits their numbers. In contrast, low-cost Intel computers inspired the concept of "compute farms", in which a network of PCs can be treated as a single supercomputer. However, this approach raises non-trivial issues of operating system compatibility and software portability. Fortunately, MOE's ability to run on numerous platforms resolves the portability problems and permits MOE/smp to support heterogeneous clusters of computers running different operating systems.

The missing piece of the puzzle is a standard way to launch a process on a remote machine running an unspecified operating system. MOE/smp relies on the remote execution ("rexec") protocol to fulfill this need. Since the rexec daemon is standard on Unix operating systems, but not on the Microsoft Windows family, a version of rexec for Windows has been developed and is included with MOE/smp. No other special daemons are required to run on each processor, a shared file system is not assumed, and the existing LAN is used for inter-processor communication with no changes required in its physical topology. Thus, setting up a distributed computing environment based on MOE/smp requires little if any intervention from specialized IT personnel.

In order for the processing nodes to cooperate, they must be interconnected to allow intercommunication. (Note that the term "interconnection" as used here does not refer to how nodes are physically linked, but rather to the virtual links used for communication. The topology of this virtual communications network is mapped onto whatever the physical network topology happens to be.) Too many connections per processor may exceed operating system limits, while too few oblige processors to divert excessive resources to routing messages. The longest path in the network between any two processors determines the maximum delay to transmit a message from one processor to another; this is known as latency. Too few connections per processor increases latency. If connections are not evenly distributed, or if the routing algorithm favors particular paths through the cluster, traffic bottlenecks can develop. Figure 1 illustrates common interconnection strategies.


Figure 1: A Comparison of Processor Interconnection Strategies

Each strategy illustrated in Figure 1 has its own advantages and disadvantages, but all are subject to bottlenecks. In order to avoid this problem, MOE/smp uses a hypercube interconnection network.

The Hypercube

The hypercube interconnection strategy consists of connecting n = 2^k processors in a k-dimensional cube. Four simple hypercubes, consisting of 1, 2, 4, and 8 nodes, are shown in Figure 2.


Figure 2: Hypercube Interconnection Strategy

If there are n processors and they are numbered from 0 to (n-1) in such a way that addresses of connected processors differ by exactly 1 bit in binary representation as illustrated in Figure 3, then the hypercube exhibits some desirable properties.
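
To make the numbering concrete, the following Python sketch (an illustration of the idea, not MOE/smp code) enumerates each node's neighbors by flipping one address bit at a time:

    def neighbors(node, k):
        # Direct neighbors in a k-dimensional binary hypercube: all
        # addresses differing from this node in exactly one bit.
        return [node ^ (1 << b) for b in range(k)]

    # Figure 3 numbering for the 8-node cube (k = 3):
    for n in range(8):
        print(n, format(n, '03b'), '<->', neighbors(n, 3))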


Figure 3: Hypercube Processor Numbering Scheme

Routing in the cube reduces to finding any direct neighbor that decreases the number of bit differences between the address of the recipient and the address of the messenger. In Figure 3 for example, to route a message from node 0 (000 in binary) to node 7 (111), the path through node 4 (100) and node 6 (110) can be used, since at each processor along the path the bit differences with 7 are reduced by one. This routing methodology is very efficient since it is "local" and "memoryless": it requires neither global knowledge of the network nor an internal database lookup. Furthermore, randomly selecting a neighbor on a path toward the destination eliminates bottlenecks by distributing the routing over all possible candidates.
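
This routing rule is easy to state in code. The following Python sketch (again illustrative, not MOE/smp's implementation) picks, at each hop, a random one of the bits still differing from the destination address and corrects it:

    import random

    def route(src, dst):
        # Hop to a randomly chosen neighbor that corrects one of the
        # bits still differing from the destination address.
        path = [src]
        node = src
        while node != dst:
            diff = node ^ dst                          # bits still differing
            bits = [b for b in range(diff.bit_length()) if diff >> b & 1]
            node ^= 1 << random.choice(bits)           # flip one of them
            path.append(node)
        return path

    print(route(0, 7))   # e.g. [0, 4, 6, 7], one of the Figure 3 paths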

This numbering scheme has another advantage. It is possible to induce a sub-tree in the hypercube according to the following simple rule: the "parent" of a processor is its direct neighbor with the lowest address. This is illustrated for an 8-node hypercube in Figure 4, where the original hypercube is shown on the left and its sub-tree is shown on the right.


Figure 4: Sub-tree in a Hypercube of 8 Nodes

The "depth" of a processor in the tree is the number of non-zero bits in its address in binary representation. The "latency" or longest path between two nodes in the tree is log2 n, where n is the number of nodes. For example in Figure 4, node 7 has three non-zero bits in its binary address (111) and is three "hops" from the root node, node 0. Since the path from 0 to 7 is the longest, the latency is three. The sub-tree is useful for certain kinds of inter-processor communication, such as propagating a message to all nodes to initialize or shutdown all processors. Since its latency is so low, the sub-tree enables such operations to be accomplished relatively rapidly.

The hypercube concept can be generalized to a base greater than 2. If the base is a power of 2, say 2^i, it is convenient to consider this as grouping together each set of i adjacent bits in the binary representation of each node's address. In fact, the base determines the trade-off between the longest path in the network (HOP) and the number of connections per processor (CON): a larger base decreases latency but increases the number of connections per processor. The following table illustrates this:

 #Processors    Base 2      Base 4      Base 8      Base 16
                CON  HOP    CON  HOP    CON  HOP    CON  HOP
          16      4    4      6    2     14    2     15    1
          64      6    6      9    3     14    2     30    2
         256      8    8     12    4     21    3     30    2
        1024     10   10     15    5     21    4     45    3
        4096     12   12     18    6     21    4     45    3
       16384     14   14     21    7     35    5     60    4
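
When the processor count is an exact power of the base, these figures follow from a simple rule: with d base-b digits per address, each digit contributes (b - 1) alternative values, giving CON = d(b - 1) and HOP = d. The Python sketch below computes this idealized exact-power case (it reproduces, for example, the Base 2, Base 4, and Base 16 rows above):

    import math

    def cube_stats(n, base):
        # CON and HOP for a base-b hypercube with n = base**d processors.
        d = round(math.log(n, base))
        assert base ** d == n, 'n must be an exact power of the base'
        return d * (base - 1), d   # each of d digits has (base - 1) alternatives

    print(cube_stats(64, 4))    # (9, 3)  -- matches the Base 4 row for 64
    print(cube_stats(16, 16))   # (15, 1) -- matches the Base 16 row for 16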

MOE/smp

MOE/smp is an enhancement to MOE and MOE/batch allowing several MOE processes to cooperate on large-scale distributed computing jobs. The heart of MOE/smp is a message passing, or "transport" layer that allows SVL tasks running on different processors to communicate with each other.

Since MOE/smp's transport layer is based on the hypercube interconnection strategy, it benefits from all of its advantages. The O(log2 n) recursive communications strategy enables efficient startup and shutdown of large networks. The routing algorithm minimizes communications cost and eliminates bottlenecks. The hypercube network provides scalable control over the complexity of the system even to very large numbers of processors.

Multi-processing with MOE/smp is relatively easy to set up. The same session can run on a variety of computers - for example, laptops, workstations, and multi-processor supercomputers - and each computer can run a different operating system, including various flavors of both Unix and Windows. MOE/smp uses the existing LAN for inter-processor communication, requires no special daemons on the nodes other than standard rexec, and respects typical OS socket limits. No shared file system is assumed. For these reasons, little or no intervention from specialized IT personnel is required for setup.

Figure 5 illustrates the execution sequence of a typical MOE/smp session. It begins by determining the available processors; the list can be specified by the user or obtained from any of a variety of third-party load-balancing utilities. The root node then recursively launches child MOE processes on the specified processors. At this point, several MOE processes are active, and they complete the structure of the hypercube by connecting to their neighbors. Once this is done, the distributed computing platform is ready and awaits its first instructions.

Figure 5: MOE/smp Execution Sequence
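
The recursive launch can be pictured with the sub-tree of Figure 4 (a plausible scheme, not necessarily MOE/smp's exact internal protocol): each node launches the neighbors obtained by setting a bit above its own highest set bit, so a network of 2^k nodes starts up in k parallel rounds. A Python sketch:

    def children(node, k):
        # Startup-tree children in a 2**k-node hypercube: set any bit
        # above the node's highest set bit (the inverse of the
        # "parent = clear the highest set bit" rule).
        return [node | (1 << b) for b in range(node.bit_length(), k)]

    # 8-node launch: 0 spawns 1, 2, 4; then 1 spawns 3, 5 and 2 spawns 6;
    # then 3 spawns 7 -- log2(8) = 3 rounds when children launch in parallel.
    for n in range(8):
        print(f"node {n} launches {children(n, 3)}")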

The computing power of MOE/smp is put at the disposal of the SVL programmer through a family of functions whose names begin with the prefix "mpu_". In the SVL distributed computing model, the process executing the current SVL task is assumed to be one of n such processes in constant communication with each other on a network. Each such process is called a host, and each host may be running on a different computer. A host's number is an integer in the range 1, ..., MPU_HOSTCOUNT. The SVL constant MPU_THISHOST is the host number of the current process.

For most applications, mpu_batch is the only multi-processor function needed. It automatically handles the distribution of jobs and the collection of results, including any errors. For example, suppose that a global function MyCalculation exists on all hosts, and that on host number 1 a function called GetNextItem returns the next value to give to MyCalculation, or [], the empty vector, when there are no more (e.g. at end of file). The following code could be used to distribute the calls to MyCalculation:

    function ParallelMyCalculation []
        loop
            local item = GetNextItem [];
            local cmd = select ['', 'call', isnull item];

            local [res,code,seqno,id] = mpu_batch [cmd,'MyCalculation',item,item];

            if code == '' then
                // ... res contains the successful result
                // ... id contains the original item passed (udata)
            elseif code == 'error' then
                // ... res contains the token error message
                // ... id contains the original item passed (udata)
            endif
        until code == 'eof' endloop
    endfunction
If there were P items returned by GetNextItem, then the loop in the above code would be executed P + MPU_HOSTCOUNT times; the extra MPU_HOSTCOUNT iterations drain the results still outstanding on the hosts after the input is exhausted.
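
For comparison, the distribute-and-collect pattern that mpu_batch encapsulates resembles the following Python analogue (this is not MOE/SVL code; my_calculation is a hypothetical stand-in for MyCalculation):

    from concurrent.futures import ProcessPoolExecutor, as_completed

    def my_calculation(item):      # hypothetical stand-in for MyCalculation
        return item * item

    if __name__ == '__main__':
        with ProcessPoolExecutor() as pool:
            # Submit every item, then collect results (or errors) as they
            # finish; mpu_batch folds this bookkeeping into a single call.
            futures = {pool.submit(my_calculation, it): it for it in range(10)}
            for fut in as_completed(futures):
                item = futures[fut]
                try:
                    print(item, '->', fut.result())   # like code == ''
                except Exception as err:              # like code == 'error'
                    print(item, 'failed:', err)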

SVL also includes message passing functions based on the send-receive-reply model for programmers requiring more sophisticated control over distributed computing applications.

Summary

MOE/smp is an enhancement to MOE and MOE/batch that enables each of these to be used for distributed computing. Its scalable base-b hypercube network design provides efficient O(logb n) operation in terms of startup and shutdown, connections required per processor, and routing.

MOE's portability enables MOE/smp to support heterogeneous networks of a variety of computing equipment, including servers, rack clusters, desktops, and laptops, running a variety of operating systems, including Windows, Linux, HP-UX, Sun Solaris, and SGI Irix. Since it runs on the existing LAN and is easy to set up, MOE/smp makes feasible "just in time" clusters built from whatever equipment is available on the spot. Thus, MOE/smp permits small-scale "research clusters" to speed up individual calculations, medium-scale "server clusters" to speed up departmental calculations, and large-scale "compute farms" to speed up strategic calculations.