Journal Articles

The P's and Q's of Methodology Development

P. Labute
Chemical Computing Group Inc.

Suppose that Professor Peabody develops an interesting and useful chemistry program called P. Using P is simply a matter of contacting Professor Peabody and obtaining P. Naturally, P runs on all in-house hardware platforms, fits perfectly with the in-house methodology and requires no further modification to support the chemical problems at hand. As time passes, new versions of P become available and are always in tune with current in-house thinking and methodology direction.

Unreasonable? Of course it is. Front-line researchers are just not concerned with systems integration, deployment and maintenance but with Science. Why should they be? After all, unless these researchers are directly supported by industry, there is no reason for them to consider any methodology or theoretical program but their own. Issues such as integration and maintenance belong to the domain of the commercial software companies.

Let's repeat the exercise.

Suppose that ChemCo Inc. develops an interesting and useful chemistry program called Q. Using Q is simply a matter of contacting ChemCo and licensing Q. Of course, Q runs on all in-house hardware platforms, fits perfectly with the in-house methodology and was designed to support the chemical problems at hand. As time passes, new versions of Q become available and are always in tune with current in-house thinking and methodology direction.

Unreasonable? No, this is supposed to be the raison d'être of software companies. This is the rationale for the high price tags on their software and the yearly maintenance fees. Sadly, this scenario is, perhaps, unrealistic and rarely, if ever, encountered in real life. Indeed, entertaining the notion is more likely than not to produce the thought "Fat Chance!".

It might be tempting to say that the current state of affairs is only a reflection of the current batch of chemistry software companies, and that the situation will improve with time. One may even draw an analogy with, say, the early CAD industry. Early tools were specialized point solutions, integrated poorly and did not always fit in-house methodology. With the passage of time, industrial methodology was understood more fully, standards emerged and tool integration became the norm. But such analogies break down when one considers the fundamental activities of the users of the software tools: computational chemistry software tools are used in discovery research. Software tools in other industries deal mainly with automation of well-understood processes; discovery research, on the other hand, is a blend of science, knowledge, art and experience. Methodology, direction and focus differ widely. It is unlikely that the discovery "process" will ever be understood to the point where it can be fully automated.

If the problem is not the maturity of the computational chemistry software industry, then what is the problem? Consider, again, ChemCo and its chemistry program Q. Where did Q come from? Consider the cases.

  1. Program Q is really Professor Peabody's program P commercialized by ChemCo. ChemCo might have made it possible to run P from within its other tools, or might have refined P's user interface. Professor Peabody's group will continue to improve P and release new versions. While ChemCo will, presumably, be responsible for the industrial integration of P, Professor Peabody is responsible for methodology direction. With the commercialization being handled by ChemCo, Professor Peabody proceeds with his own research program.

  2. Program Q is Professor Peabody's program P commercialized by Professor Peabody's new software company ChemCo. Professor Peabody continues with his own research program but in a non-academic setting.

  3. Program Q was developed by ChemCo's staff of scientists and software developers or is a complete rewrite of Professor Peabody's program acquired by ChemCo. The problem here is that ChemCo must either prescribe methodology or attempt to support the in-house methodology of all of its customers. The former case leads to inflexible software and the latter case is not viable from a business perspective.

In each of the cases presented above, there is nothing to encourage the external software provider to remain in tune with its customers' thinking and methodology. The front-line researcher is concerned with a particular methodology while the commercial firm cannot be in tune with all of its customers simultaneously.

It is difficult for the current leading suppliers of computational chemistry software to charge a high enough price for their software to recover their research and development costs. This problem is particularly acute since high-level methodology software often has a relatively short shelf-life. The leading suppliers are being forced into being pure commercialization firms or service and consulting firms. While it is not yet clear what the full impact of this will be, one can be relatively certain that the current set of software providers will have ever less impact upon methodology development.

If chemistry software firms turn into service and consulting firms, then computational chemistry will have come full circle: new methodology will be developed principally by the people who started computational chemistry in the first place, the industrial and academic researchers.

Suppose that a researcher publishes an interesting and possibly useful new methodology. The method probably will not be quickly incorporated into existing commercial computational chemistry software. Worse, if the researcher is a member of the scientific staff of a competing research company, it is unlikely that the executable software can be obtained. What then? Evaluating the new method in-house requires custom software development.

It is this author's belief that in-house methodology development capabilities will be of fundamental importance to the long-term competitiveness of industrial research companies.

The ability to exploit the full spectrum of available computing platforms is essential to a cost-effective computational chemistry development and deployment strategy. Through the years, several industrial computing models have been proposed and implemented. Each model claims to make maximal use of the hardware dollar, and each has its strengths and weaknesses.

Each new generation of hardware and operating software has caused swings to one or another computing model with promises of the "ultimate solution" to maximize use of the hardware dollar. Personal computers were used to off-load departmental computers. Parallel Virtual Machine software claims to maximize use of those wasted workstation cycles. Pentium Pro processors promise an economical workstation model. Internet and intranet technology promotes the time-sharing model.

Which model will win? The assumption that one particular model will emerge as the winner forces hardware planners to predict which one it will be. The most likely scenario is that none of the models will emerge as the dominant and most efficient method of exploiting computing power. At each point in time, and for each situation, one of the models will seem to make the most sense; however, this advantage will most likely be short-lived. It is pointless to fight against this mode of constant change: a long-term hardware strategy must not only take this state of affairs into account but take advantage of it.

The key to taking advantage of the constant change in hardware is to be in a position to ignore it. After all, why should anyone care that hardware manufacturer X currently has the fastest workstation around? In a little while, company Y and then perhaps company Z will have the fastest system. Why bother?

Almost all hardware planning problems are caused by software. Supercomputers are purchased because they are supposed to be the fastest machines on the planet. Desktop workstations should be purchased because of price/performance ratios, servicing and offers of reasonable performance, graphics and networking features. Hardware purchasing choice is often restricted by the lack of portability of software. Truly portable software gives hardware planners the ability to consider a hardware purchase on the merits of an individual offer without fear of software breakdowns.

The portability problem is not trivial. It is one thing to write portable high-performance UNIX software to run on a variety of single processor workstations. It is another thing to write software that is efficient on PCs, workstations and supercomputers. PC-based systems suffer from primitive operating systems (likely to be fixed with the adoption of Windows/NT). Parallel processors are difficult to program; the different architectures and programming paradigms do not promote portability. If one asks how parallel computers are used, one often hears the refrain "a while back we had a post-doc who wrote some code for it but since then, nobody has had the time to figure it out."

It is tempting to hope that the current commercial software vendors are coming to the rescue. After all, aren't software companies good at this sort of thing? Notwithstanding the fact that none of the computational chemistry software suppliers offers truly portable products, portability is properly in the domain of the commercial software firms. In-house methodology development groups should not have to concern themselves with monitoring and actively supporting the latest hardware and operating system releases. Not only does this activity have little to do with solving in-house chemistry problems, but it can also be a management nightmare since each combination of hardware and software must be tested and directly supported.

Minimizing exposure to hardware change is, therefore, directly dependent on software portability. In the context of end-use of computational chemistry software, portability gives hardware choice and configuration freedom. For in-house methodology developers, the ability to write portable software easily combines the benefits of custom development and deployment flexibility.

Is it possible to write truly portable software in-house without high maintenance costs? There is evidence to suggest that it is possible, and that the solution to the portability problem will come from modern programming language research.

Traditional programming languages such as Fortran and C have done much to insulate the programmer from the form and idiosyncrasies of the instruction set of a particular computer. With the definition of standard language syntax, semantics and libraries, programmers could reasonably expect that their programs could run on all machines supporting a standard compiler. With the large body of Fortran and C programs in existence, hardware manufacturers are forced to deliver standard compilers in order to tap into that body of software.

The situation with operating systems and graphical interfaces is not as well developed. Competing flavors of UNIX, Windows, Windows/NT and Macintosh OS have a large installed base. Graphical interface toolkits such as X/Motif and Windows are still very popular. However, there is an emerging set of fundamental operating and graphical system capabilities that make cross-platform development relatively painless. Wrappers can be written to isolate the application programmer from the particulars of the underlying operating system and window interface, and OpenGL is likely to become the standard 3D rendering interface. While it will take some time before there is a true operating system and interface standard, it is possible for programmers to operate under the assumption that such a standard is already in place.
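The wrapper idea mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `scratch_dir`, the directory name `chem_scratch`, and the fallback paths are hypothetical and do not come from any particular product. The point is that only the wrapper knows which operating system is underneath; application code calls it and never branches on the platform.

```python
import os
import sys


def scratch_dir(platform=sys.platform, env=os.environ):
    """Return a per-job scratch directory path, hiding platform differences.

    Hypothetical wrapper: the name, the directory layout and the
    fallback locations are illustrative assumptions only.
    """
    if platform.startswith("win"):
        # Windows-style path separators and environment variable.
        base = env.get("TEMP", r"C:\Temp")
        return base + "\\chem_scratch"
    # UNIX-like systems (and anything unrecognized) fall through to here.
    base = env.get("TMPDIR", "/tmp")
    return base + "/chem_scratch"


def application_code():
    # Application code calls only the wrapper; it contains no
    # operating-system-specific logic of its own.
    return scratch_dir()
```

Porting the application to a new platform then means adding one branch to the wrapper rather than touching every caller.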

The portability problems in high-performance scientific computing come from differences in multiprocessor architecture and the almost total lack of standards for the parallel programming paradigm. Most parallel processors achieve high performance only when a problem is mapped precisely onto a machine architecture. For this reason, most supercomputer manufacturers deliver compilers with special extensions to Fortran or C that allow the programmer to exploit the parallel features of the computer. Because each architecture is different, this practice leads to programming difficulty and non-portable code.

This portability problem was solved successfully by the developers of linear algebra codes. Early versions of these libraries were programmed in straight Fortran. As the libraries were ported to parallel and supercomputer architectures, it was noticed that only a small portion of the code dictated the overall performance of the library routines. This small portion became known as the Basic Linear Algebra Subprograms (BLAS). Each subroutine performed a simple operation such as vector addition or a simple matrix operation. This simplicity made it easy to port the BLAS to widely different architectures. By constraining the remainder of the software to use the basic subroutines, the software written using them achieved both portability and high performance on the full spectrum of computing equipment. Indeed, hardware manufacturers are quick to port the BLAS to each new generation of hardware, and thus effectively port a huge body of scientific software with minimal effort.

This landmark success story was not lost on computer programming language researchers. The fundamental reasons for the success were quickly identified:

  1. A small number of subroutines were called primitive and other algorithms were expressed in terms of these primitives.

  2. The primitive routines dealt with entire collections of objects like vectors and matrices and not with individual numbers or scalars.

These fundamentals were the inspiration for the family of programming languages known as collection-oriented languages.
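The two reasons listed above can be illustrated with a toy sketch in Python. It is not the BLAS itself: the three primitives below are plain-Python stand-ins whose names are loosely borrowed from Level 1 BLAS operations (dot, axpy, scal), and the higher-level routines are hypothetical examples chosen for brevity. The higher-level routines are written only in terms of the primitives, and every primitive operates on a whole vector rather than on individual numbers.

```python
import math

# --- Primitives: the only routines that would need per-machine tuning. ---

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def axpy(alpha, x, y):
    """Return alpha * x + y, computed elementwise."""
    return [alpha * a + b for a, b in zip(x, y)]


def scal(alpha, x):
    """Return the vector x scaled by alpha."""
    return [alpha * a for a in x]


# --- Higher-level routines: expressed only through the primitives. ---

def norm2(x):
    """Euclidean length of x."""
    return math.sqrt(dot(x, x))


def normalize(x):
    """Return x scaled to unit length."""
    return scal(1.0 / norm2(x), x)


def matvec(rows, x):
    """Matrix-vector product, one dot product per matrix row."""
    return [dot(row, x) for row in rows]


def remove_component(v, u):
    """Subtract from v its projection onto the unit vector u."""
    return axpy(-dot(u, v), u, v)
```

If `dot`, `axpy` and `scal` were replaced by tuned implementations for a given machine, `norm2`, `normalize`, `matvec` and `remove_component` would benefit without being edited.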

Collection-oriented languages are ideally suited for implementation on parallel machines. The inherent parallelism in the primitives removes the need for sophisticated compiler analysis to exploit available parallelism. Indeed, the designers of high-level languages for massively parallel machines have turned to collection-oriented languages because of the difficulty in automatically exploiting parallelism. Languages such as C*, *LISP, CM-LISP (Connection Machine), AL and Apply (Warp), Parallel Pascal (MPP) and the array extensions to Fortran and High Performance Fortran (Cray) all exploit collection-oriented primitives and constructs.

What is interesting about collection-oriented languages is that the language implementors are responsible for porting the language primitives from machine to machine. Thus, programs written using these languages become portable across the full spectrum of computing equipment. This is not to say that Fortran or C will disappear. On the contrary, these languages will always have their place; however, it is important to realize that the new high-performance collection-oriented languages offer complete use of the spectrum of computing platforms in a truly portable manner. For computational chemistry methodology developers, these languages provide a mechanism to purchase portability from an external supplier and retain the flexibility of custom and portable methodology development.
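The division of labor described above, in which the implementor ports the primitives and the program stays unchanged, can be sketched as follows. This is a minimal Python illustration under stated assumptions: `serial_map`, `threaded_map` and `squared_distances` are hypothetical names, and the thread pool merely stands in for a machine-specific implementation of the primitive. It makes no claim about actual speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_map(fn, xs):
    """Reference implementation of the primitive: one element at a time."""
    return [fn(x) for x in xs]


def threaded_map(fn, xs, workers=4):
    """Alternative implementation of the same primitive using a thread pool.

    Executor.map returns results in input order, so callers observe
    the same result as with serial_map.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, xs))


def squared_distances(points, origin, collection_map=serial_map):
    """Application code, written once against the primitive's interface."""
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(p, origin))
    # The whole collection is handed to the primitive in a single call.
    return collection_map(dist2, points)
```

Calling `squared_distances(points, origin)` and `squared_distances(points, origin, collection_map=threaded_map)` gives the same list; only the implementation of the primitive differs.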

A viable long-term scientific software strategy for the chemical discovery research company R must take two important factors into account.

Firstly, commercial scientific software suppliers are coming under increasing pressure to become consultants and systems integrators rather than methodology developers. Company R will have to evaluate carefully the impact of this trend and develop strategies for ensuring the supply of new methodology.

Secondly, R must develop strategies to minimize its exposure to hardware change. The absolute worst case for company R is to be dependent on dead-end software running on outdated hardware. Software portability is the key to extracting all of the benefits of computing equipment from PCs to supercomputers as well as maximizing hardware configuration freedom and platform choice.

One strategy is to rely solely on commercial software suppliers for methodology. In this case, it is the supplier that must deal with the issues raised. It is important to find out, from these suppliers, exactly what their plans are for increasing the portability of their software as well as their plans for new methodology development. Another strategy is to develop methodology in-house. In this case, critical decisions must be made regarding software development tools. The new parallel languages offer the ability to write truly portable code without the high maintenance costs.

The decision to rely on external suppliers for methodology or to develop it in-house is not always an easy one. What is clear is that if company R is dependent on discovery-based research it must take steps to control and ensure the supply of new methodology. A likely scenario is that company R will rely on external suppliers for well-defined software tools and augment these tools with new methodology developed in-house.

In the end, Professor Peabody will still be producing interesting new methodology ideas, as will scientists in competing industrial firms. The challenge is to find a cost-effective way to incorporate and possibly improve on these new ideas without waiting for ChemCo's commercial version.

Paul Labute is the Director of Research and Development at Chemical Computing Group Inc., 1010 Sherbrooke Street W, Suite 910, Montreal, Quebec, Canada H3A 2R7. He can be reached by telephone at (514) 393-1055.