Journal Articles



SVL: The Scientific Vector Language


M. Santavy, P. Labute
Chemical Computing Group Inc.


Contents

Introduction | The Data Parallel Trend | Existing Languages
The Flavor of SVL | Integration in MOE
An Example SVL Program


1. Introduction

SVL is a new high-performance data-parallel programming language built into the MOE Molecular Operating Environment. SVL is an embedded language; that is, its compiler and run-time environment are an integral part of MOE. SVL serves as the command, macro, scripting, and high-performance computing language of MOE.

The tight integration of SVL and the molecular computing services of MOE result in a unique, high-performance and flexible molecular operating environment that minimizes the time taken to convert ideas into results.

SVL was designed

  • for large scale scientific computations;
  • to reduce programming time and program size;
  • to be parallelizable and vectorizable;
  • to deliver high performance;
  • to be interactive.

SVL was designed to be the vehicle for the exploitation of both sequential and parallel computing platforms. The future of industrial scientific computing lies in the efficient use of the entire spectrum of computer platforms from low end Pentium Pro platforms through shared memory multiprocessors to high-end vector pipeline and massively parallel architectures.

In this document, we will outline the ideas behind SVL and data-parallelism as a programming paradigm. We present some of the reasoning behind the decision to design and implement a new programming language. Finally, we attempt to give the flavor of SVL (although a thorough description is beyond the scope of this document).


2. The Data-Parallel Trend

A language is called collection-oriented if aggregate data structures (e.g., sets, sequences, arrays, vectors and lists) as well operations for manipulating them "as a whole" (summing, sorting, permuting and applying a function to each member) are primitives in the language.

Collection-oriented languages discourage explicit inner loops. For this reason they are, in most cases, ideally suited for implementation on massively parallel machines. The parallelism inherent in the primitive operations removes the need for sophisticated compiler analysis to uncover available parallelism.

Indeed, the designers of high-level languages for massively parallel machines have turned to collection-oriented languages because of the difficulty in automatically exploiting parallelism. Languages such as C*, *LISP, CM-LISP (Connection Machine), AL and Apply (Warp), Parallel Pascal (MPP) and the array extensions to Fortran and High Performance Fortran (Cray) all exploit collection-oriented primitives and constructs.

SVL is a collection oriented language; more precisely, it is a data-parallel language. A data-parallel language has the following important characteristics:

  • Imperative Style. Imperative languages (e.g., C and Fortran) can be compiled more efficiently and are more familiar to the wider programming community. Functional languages (e.g., Haskell and SETL) currently exhibit poor array performance, inefficient use of memory and poorly support of random numbers.

  • Explicit Parallelism. The programmer and compiler work together to produce efficient parallel (and sequential) code. It is unreasonable to expect a compiler to extract parallelism that has been hidden inside sequential control and data structures.

  • Processor-Primitive View. A critical problem in generating code for distributed memory machines is determining how to distribute the data among the individual memories available. Data parallel languages encourage the expression of algorithms in the fundamental unit of parallelism; that is, the operations performed by the processors.

  • SIMD focus. Each processor executes the same instructions in a synchronous way; that is, language is Single Instruction stream, Multiple Data stream (SIMD).

  • Global Data Access. Each processor has access to the memories of any other processor. Processor interaction is through expressions of the language rather than explicit message passing.

The trend towards data-parallelism is due to a number of key observations and wide-spread experiences including

  • the difficulty in programming parallel computers with the message passing paradigm;

  • the success of vector architectures (e.g., Cray) and shared memory multi-processors;

  • the natural mapping of scientific calculations onto the primitive operations of data parallel languages;

  • that sequential processor engineering assumptions (e.g., localization of memory access) are not as likely to be violated in the data-parallel paradigm.

The result has been a wealth of research (e.g., at CMU and MIT) into data parallel languages. Many language groups have shown that the paradigm is extremely successful in that it allows for efficient implementations on a wide variety of computer architectures, even sequential processors.

Data parallel languages in general, and SVL in particular, have five important strengths:

  • Versatility. Data parallelism is the most natural paradigm for a large proportion of scientific and engineering computing. In biology, chemistry, chemical engineering, geology, earth and space science, physics, astronomy, computer science and other fields, experience has shown that the source of parallelism is essentially always data parallelism.

  • Practicality. It is easy to convert Fortran or C paradigms into the SVL paradigm. The imperative style of data-parallel languages eliminates the radical paradigm normally associated with parallel programming.

  • Programmability. SVL programs are easier to write than programs using lower level message passing constructs. The synchronous execution model means that there is only one thread of control; therefore, deadlock and race conditions are eliminated. Global data access eliminates message passing considerations from the programming process even when the underlying architecture has distributed memory.

  • Portability. Software portability is of critical importance. Very often, portability and maintainability are more important than raw performance due to the short life-span of high-performance computing architectures. SVL programs are more portable than programs written in a language closer to the underlying architecture.

  • Reasonable Performance. Data-parallel languages exhibit high speedups on a wide variety of computer architectures. Naturally, hand-coded programs using lower level parallelism can be faster. The situation is, however, analogous to assembly language. A proficient assembly language programmer can always write programs that run faster than C or Fortran programs yet very little programming is actually done in assembly language. Other considerations, such as portability, maintainability and programmer productivity in general outweigh the speed advantages of assembly language. As parallel computing enters the mainstream, extracting microsecond speedups will become less important.


3. Existing Languages

The decision to design a new programming language was not taken lightly. There are, perhaps, a dozen credible programming languages that, on the surface, appear to satisfy the requirements. However, after careful study, the conclusion was reached that a new language was necessary. In this section, some of the reasoning leading to this conclusion will be described.

Existing candidate languages for scientific computing can be divided into two broad classes. (Please note, however, that this classification is not strict in the sense that most languages contain features from many classes.)

Imperative languages are, loosely speaking, the traditional languages and the more recent object oriented languages. Examples of languages in this class are C, Fortran, Pascal, Modula, C++, Ada, SML, and APL.

Imperative languages like C and Fortran can deliver high performance and can be used for large scale problems. Unfortunately, they are not interactive, parallelize poorly, and are not concise. The goals of High Performance Fortran are a step in the right direction; however, the result is Fortran with a few vector constructs and not a fundamental change. Object oriented languages like C++ and SML can at times be concise; however, they still suffer from the drawbacks of the other imperative languages.

It was not the imperative nature of these languages that made them poor candidates for embedding into MOE. Rather, it was the fact that they were low level languages; that is, paradigms in which the programmer must be concerned with things like memory allocation and management, as well as all looping and iteration control. Furthermore, this low level focus makes programs written in these languages difficult to parallelize (e.g., the efforts toward automatic vectorization of Fortran programs have led to the conclusion that a new language, High Performance Fortran, was required).

Functional languages are modern languages (in some cases, still experimental) that attempt to provide powerful new features and constructs by sacrificing the imperative notion of state. In functional languages, side effects are banned (e.g. global data structures or COMMON blocks). A program becomes, in effect, one large mathematical expression or function. A consequence of this is that there are no traditional looping constructs and recursion must be used (although compilers can convert some recursive algorithms into loops). In exchange for these sacrifices, functional languages offer powerful notational syntax, elegant and concise programs and a strong mathematical flavor. Examples of languages in this class are Haskell, Lisp, NESL, SETL, and ML.

A recent survey (Zorn 1995) of the use of nonprocedural languages for numerical computing revealed that functional programs can be elegant, concise and very expressive for mathematical algorithms; however, existing functional language implementations have a number of severe problems that make them (currently) unsuitable for industrial application:

  • Poor array performance ­ Although most functional languages are experimental and have immature implementations, the paradigm forces certain inefficiencies: lack of in-place updating results in the fact that array manipulations are slower by a factor of 50 to 10,000 and required memory garbage collection produces high levels of fragmentation and virtual memory usage.

  • Poor support for large scale problems ­ Functional languages generally require vast amounts of memory even for moderately sized problems; for example, Gaussian elimination with Haskell uses well over 16Mb for a 20 by 20 matrix.

  • Poor support for random numbers ­ The elimination of side effects results in a difficulty handling random numbers. This is unfortunate, since random algorithms are extremely important in simulations and parallel algorithms.

The latter two problems are inherent to the functional paradigm; however, the first is not. Indeed, languages such as NESL attempt to solve the first problem; however, they are still in the experimental stage and are therefore more suited to language research than industrial application.


4. The Flavor of SVL

SVL is both a compiled and interactive language. In this way SVL can serve as a command language and as a forum for the development of complex scientific applications. The interactive component of the language consists of the evaluation of expressions that make up the core of applications programs. Flow control constructs and function definitions form the remainder of the language.

The unit of data in SVL is the vector. Vectors are manipulated with functions that operate on a vector and return a result vector. Syntactic shorthands are provided for common functions such as addition and multiplication. A statement of data manipulation is an expression. Expressions and flow constructs can be grouped into functions.

SVL is a very rich language. A full description is beyond the scope of this document; however, the flavor of SVL will be described below.

An SVL vector is a quantity very much like the array that is found in many programming languages. The general ordered vector x has the form

[x(1),...,x(n)]

Each vector x(i) is called an element of x and n is the length of x. Vectors can nest to arbitrary depths since each of the elements of x is, itself, a vector. If n=0 then x is a null vector while if n=1 then then x is called a unit.

All vectors are constructed from the set of atomic unit vectors. This set includes all

  • 8-bit ASCII characters;
  • IEEE double precision floating point;
  • 32-bit signed integers;
  • Strings of 8-bit ASCII characters.

Atomic unit vectors have the unique property that they are equal in value to their first element. It is this property that eliminates the need to distinguish between scalars and vectors.

SVL supports the traditional infix syntax for mathematical operators; for example,

    svl> 3 * 4;
    svl> [1,2,3] + [4,5,6];
    svl> -3 * [4,5] + [6,7];

are all legal SVL expressions. Operators proceed element-wise, pairing an element from one operand with the corresponding element in the other operand. In the case where a unit vector operates with another vector, the unit vector automatically extends to the length. This principle of unit extension provides scalar-vector multiplication, for example. Unit extension is not limited to atomic units. To form an outer product one uses unit extension as follows:

    svl> [2,3,4] * [ [5,6,7] ];

The vector on the right is a unit; it consists of a single element, a 3-vector. It is extended to each element of the vector on the left. The resulting vector is

    [ [10,12,14], [15,18,21], [20,24,28] ]

SVL provides a rich set of built-in functions such as trigonometric functions, random number generators, sampling, sorting and permutation functions as well as mathematical reduction functions. For example, the add function sums all of the elements in a vector. The dot product of two vectors x and y is computed with add(x*y). while the 2-norm of a vector is computed with

    svl> sqrt add sqr x;

which almost reads as "the square root of the sum of the squares". The power of the unit extension principle and the natural syntax are extremely expressive. For example, if u = [x,y,z] is a vector containing three length n-vectors of coordinates, then the norm of each of the n 3-vectors [x(i),y(i),z(i)] can be computed in parallel with the code fragment sqrt add sqr u. Note the similarity with the code fragment for the 2-norm of a simple n-vector.

The familiar Velocity Verlet integration step (used in molecular dynamics simulations) can be coded with

    pos = pos + vel * dt - acc * sqrt dt / 2;
    vel = vel - acc * dt/2;
    [F,acc] = potential pos;
    acc = acc / [mass];
    vel = vel - acc * dt / 2

which operates on all atoms at the same time. Note that this is practically the theoretical description of the algorithm. The above fragment of SVL code, together with a few lines of code for initialization and flow control form a complete dynamics simulation program.


5. Integration in MOE

SVL is an integral part of MOE. It does not consist of a separate compiler and execution model but rather a tightly coupled command and control paradigm.

Specific SVL functions are provided to modify and examine the molecular data structures used for visualization and calculation. SVL can control such things as how molecules are rendered, the geometry and properties of atoms, and configuration of the potential energy model.

SVL is the key to the programmability of the MOE interface in that menu commands trigger the execution of SVL expressions. Any program written in SVL can be run by making an attachment to the MOE menus. Interaction and dialog with users is supported with an automatic graphical interface constructor. Thus any SVL program can have a graphical user interface.

SVL is interactive and, indeed, is the command language of MOE. All functions available in programs are available at the MOE command line. New commands can be tested interactively and, when ready, introduced to the menus. The high level of integration of SVL and the molecular modeling services of MOE result in an extremely flexible and efficient environment for computational chemistry.

With traditional programming languages and script-based tool control, many chemists are discouraged from implementing new techniques. In many cases, a simple idea can only be tested after days or months of programming. With MOE and SVL, new ideas and methods can be tried out in a matter of hours.


6. An Example SVL Program

MOE is built on the embedded language architecture; that is, MOE contains a compiler, linker and run-time execution environment for SVL, the Scientific Vector Language. With SVL, programs can be developed and run from within MOE.

We now present an example of what an SVL program looks like. The context of this example is Conformational Search; more specifically, a modified version of the Random Incremental Pulse Search (RIPS) as described in

    D.M. Ferguson, D.J. Raber
    A New Approach to Probing Conformational Space with Molecular Mechanics: Random Incremental Pulse Search.
    J. Am. Chem. Soc. 1989, 111, pp4371-4378

Given that a molecule has been loaded into the MOE system, the RIPS methodology proceeds as follows. (Each step will be described below.)

  1. Perturb conformation.
  2. Minimize Energy.
  3. Check for duplicate conformation.
  4. If new conformation then write to database.
  5. If maximum iterations not exceeded go to Step 1; else stop.

The program will be developed in the following sections and a program listing will appear at the bottom of the page. From the outline described above the methodology requires the services of the internal molecular objects, the forcefield, energy minimization and structural comparison and molecular database services. We structure the program as follows.

    // Note that '//' characters introduce a comment which extends
    // to the end of the current line. SVL also supports C-style
    ../*...*/' comments.

    #set title 'Random Incremental Pulse Search'

    // RIPS is the main function that initiates the search. RIPS will
    // accept an argument consisting of a number of (optional) named
    // parameters. For example, RIPS [file:'output.mdb', delta:0.1].

    global function RIPS options
      // ... guts to come later ...
    endfunction

The #set line is optional and sets the title of the program module. This title will appear the graphical module management window of MOE. The global keyword indicates that the function to be defined will be visible at the MOE command line. (The local keyword indicates that the function to be defined is private to the program module.)

The semantics of the function definition is as follows. RIPS accepts one argument, called options. This argument may take on many values depending on what the caller passes in; however, the intent is that the caller will use a calling sequence of the form

    RIPS [maxit:20, file:'my_database.mdb']
to invoke the search. In this hypothetical calling sequence, the caller wants to perform 20 iterations and store unique conformers to the database file my_database.mdb. Also, we have assumed that the parameter names gtest and file are meaningful to RIPS.


Perturb the Conformation

Each attempt to generate a new conformation consists of a conformation perturbation followed by energy minimization. The described methodology uses a uniform random Cartesian coordinate perturbation.

    // perturb is a local function to perturb ALL the coordinates
    // of the loaded molecule. The sole argument to the function is
    // the radius of the Cartesian perturbation (in Angstroms).
    // Notes follow the code.

    local function perturb d
      local atoms = Atoms[];
      local pos = aPos atoms;
      aSetPos [atoms, pos + 2*d * randU one pos - d];
    endfunction

Notes:

  • Atoms is a function that takes no arguments (so it is passed the null vector represented by []) and returns a vector of internal molecular object keys to all of the atoms in the currently loaded molecule.

  • aPos is a function that maps molecular object atom keys (as returned by Atoms into Cartesian coordinates. The return vector is of length 3. Element 1 is a vector containing all of the x coordinates, element 2 contains all of the y coordinates and element 3 contains all of the z coordinates.

  • aSetPos modifies the coordinates of the atoms specified by key to the given coordinates (in the same format that aPos returns). Formally, the function is declared as aSetPos[atom_keys,[x,y,z]].

  • one is a function that maps all numeric values in the given vector (no matter how deeply they are nested) to 1.

  • randU is a function that multiplies each numeric value in the given vector (no matter how deeply it is nested) by a uniform random variate in [0,1].


Minimize Energy

Energy minimization is performed with a call to the Molecular Mechanics services of (a piece of SVL code defined in another module). In MOE, these services are accessed with the MOE function (that must be declared as external).

    function MM; // import MM function
    ...
    MM [gtest:0.001];
Like the RIPS function that is being written, MM accepts a vector of tuning parameters. In this case, we are specifying an RMS gradient test of 0.001.

We will now fill in some of the RIPS function body (although we will modify it shortly). The first version will perform a specific number of perturbations and energy minimizations.

    function MM, perturb;

    global function RIPS options
      const DEFAULTS = [maxit:20, delta:0.2];
      options = cat [options, DEFAULTS]; // add in defaults
      local i;

      for i = 1, options.maxit loop
        perturb options.delta;
        MM [gtest:1e-3];
      endloop
    endfunction

Notes:

  • cat is a function that catenates all of the elements of the given vector. In this context, RIPS is adding default values to the parameters in case they were not specified.

  • The options.maxit and options.delta fragments select out the first value in the options vector tagged with the tokens 'maxit' and 'delta' respectively. (Think of options as a structure in C).

  • The local i; fragment declares a variable called i that is private to the RIPS function.


Check for Duplicate Conformation

In order to produce unique conformations two things will have to be done. Firstly, all of the unique conformations generated will have to be kept somewhere. Secondly, as each new conformation is generated it will have to be compared to the known ones.

We modify the RIPS function to keep all of the unique conformers in a single vector as follows.

    global function RIPS options
      // ... same as bove

      // 'conformers' will hold each unique conformation
      // as it is generated. 'atoms' will hold the object keys
      // of all the atoms in the molecule.

      local conformers = [];
      local atoms = Atoms [];
      local pos,p;

      for i = 1, options.maxit loop
        perturb options.delta;
        MM [gtest:1e-3];
        pos = aPos atoms;

        if new_conf [pos,conformers] then
          conformers = append [conformers, pos];
        endif
      endloop
    endfunction
Notes:
  • new_conf is a function (described below) that returns 'true' if the first passed conformer is a member of the set specified by the second passed element.

  • If the conformation is new we use the append function to add the new conformer to the conformers vector.
The new_conf function can be coded as follows:
    function Superpose;

    local function new_conf [pos,list]
      local p, dist;

      for p in list loop
        dist = Superpose [ [pos,p] ]; // get RMS distance
        if dist < 0.1 then
          return 0; // have it: return false
        endif
      endloop

      return 1; // don't have it: return true
    endfunction
Notes:
  • Superpose is an SVL function defined in another function that performs optimial rigid body superposition and returns the mean square distance between the given point sets (in this case, the coordinates of all of the atoms).

  • The for syntax used assigns to p each element of the vector list in turn.
As given, the new_conf function looks very traditional: an explicit loop is performed over each element in the list vector. A more advanced version of the same function would look like the following.
    local function new_conf [pos,list]
      local dist = app Superpose app nest tr [list,[pos]];
      return not orE dist;
    endfunction
Notes:
  • nest is an SVL function that nests its argument. For example, nest [1,2,3] returns [[1,2,3]]

  • tr is the transpose function of SVL.

  • app is a modifier to the function calling mechanism. Normally, a function applies to the given vector. The prefix app makes the function apply to each element of the vector.

  • not is the element-wise logical negation operator.

  • orE is the element-wise logical disjunction operator. In this case it is ORing each element of the given vector (the same way that add sums the elements of the given vector.


Write Conformation to Database

Saving the conformations is accomplished by creating a MOE molecular database file (which can be visualized with the Database Viewer). We encapsulate the writing of the conformations in the following function.

    local function save_conformers [filename,list]
      local mdb, mol_data, pos;

      mdb = db_Open [filename,'create'];
      db_CreateField [mdb, 'mol', 'molecule'];

      mol_data = db_ExtractMolecule Atoms[];

      for pos in list loop
        mol_data(4)(MOL_ATOM_X) = pos(1);
        mol_data(4)(MOL_ATOM_Y) = pos(2);
        mol_data(4)(MOL_ATOM_Z) = pos(3);
        db_Write [mdb, 0, [mol:mol_data]];
      endloop

      db_Close mdb;
    endfunction
Notes:
  • db_Open is a function that opens or creates (in this case) a MOE Molecular Database File.

  • db_CreateField is a function that creates a new field in the given molecular database. The second argument is the name of the field and the third argument is the data type of the field (in this case, molecular data).

  • db_ExtractMolecule is a function that returns a nested vector containing the appropriate data for writing molecules using the db_Write function.

  • The body of the for loop modifies the coordinate vectors in the molecular data vector and writes out each conformer.

  • db_Write is a function used for writing new data or updating data in a molecular database. The first argument specifies the database, the second the entry key (in this case 0 to indicate the creation of a new entry) and the third argument is the contents of the entry.
We have deviated slightly from the original program outline in that we are writing the conformers to a database just before termination. This was done for clarity; however, it would be a simple matter to write each new conformation out as it is generated.


Program Listing

Putting all of the code fragments together, we end up with the following SVL program (roughly 100 lines).

    // rips.svl -- RIPS conformational search

    #set title 'Random Incremental Pulse Search'

    function MM, Superpose;

    // save_conformers writes out the given conformations
    // to a specified new molecular database

    local function save_conformers [filename,list]
      local mdb, mol_data, pos;

      mdb = db_Open [filename,'create'];
      db_CreateField [mdb, 'mol', 'molecule'];

      mol_data = db_ExtractMolecule Atoms[];

      for pos in list loop
        mol_data(4)(MOL_ATOM_X) = pos(1);
        mol_data(4)(MOL_ATOM_Y) = pos(2);
        mol_data(4)(MOL_ATOM_Z) = pos(3);
        db_Write [mdb, 0, [mol:mol_data]];
      endloop

      db_Close mdb;
    endfunction

    // perturb is a local function to perturb ALL the coordinates
    // of the loaded molecule. The sole argument to the function is
    // the radius of the Cartesian perturbation (in Angstroms).
    // Notes follow the code.

    local function perturb d
      local atoms = Atoms[];
      local pos = aPos atoms;
      aSetPos [atoms, pos + 2*d * randU one pos - d];
    endfunction

    // new_conf returns 1 iff the given conformer is not already
    // in the given list of conformers and 0 otherwise

    local function new_conf [pos,list]
      local dist = app Superpose app nest tr [list,[pos]];
      return not andE dist;
    endfunction

    // RIPS is the main function that initiates the search. RIPS will
    // accept an argument consisting of a number of (optional) named
    // parameters.

    global function RIPS options
      const DEFAULTS = [maxit:20, delta:0.2, file:'rips.mdb'];
      options = cat [options, DEFAULTS]; // add in defaults

      // 'conformers' will hold each unique conformation
      // as it is generated. 'atoms' will hold the object keys
      // of all the atoms in the molecule.

      local conformers = [];
      local atoms = Atoms [];

      local pos,p,i;

      for i = 1, options.maxit loop
        perturb options.delta;
        MM [gtest:1e-3];
        pos = aPos atoms;

        if new_conf [pos,conformers] then
          conformers = append [conformers, pos];
        endif
      endloop

      save_conformers [options.file,conformers];
    endfunction