I worked with Clark Wiedmann from 1984 to 1986, helping him finish STSC's APL compiler. For information about how the compiler works, see the papers by Wiedmann and Weigang in the APL '85 Conference Proceedings and papers by Wiedmann in the '84 and '86 Proceedings. The compiler is still available as a product; for information call Manugistics at (301) 984-5000 or send e-mail to aplsales@manu.com.
Briefly, the compiler runs under the APL*PLUS Mainframe System, an enhanced version of VSAPL that includes nested arrays. The compiler translates APL all the way down to IBM 370 machine language. Compiled functions coexist in the workspace with variables and interpreted functions. The compiler does not produce stand-alone code; compiled functions make calls to the "compiler support package" in the interpreter. These calls handle everything from memory management to whole APL operations. Not all APL primitives are "compiled"; some (like execute or gradeup) are performed by calling the interpreter. This is done to avoid making the compiled functions huge and to avoid work that would not yield much speedup.
The compiler does not require "a LOT of adaptations" to your code. Almost any standard (non-nested) APL program can be compiled; the only requirements are declarations for subfunctions and for global and shared variables, and branching only to labeled lines. However, a few additional declarations and modifications to your program can let the compiler optimize it better and produce faster code. Because the compiler is applied selectively to CPU bottlenecks, most users are willing to make these modifications to get the most out of it.
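For example, the labeled-branch requirement just means that a branch must name a label rather than a raw line number. Here is a hypothetical fragment (the variable X and the label SKIP are made up for illustration):

     @ Branching to a line number does not satisfy the requirement:
     {goto}(X>0)/7
     @ Branch to a label instead (with SKIP: placed at the start of what was line 7):
     {goto}(X>0)/SKIP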
As an example, consider the grouped summation problem discussed here recently. (I've reduced the problem to one-way grouping for simplicity.) The fastest interpreted solutions involved sorting and partitioned summation. However, as Tom Chwastyk pointed out, this can give inaccurate results due to round-off error if the data has a wide range of values. To avoid this possibility, one of several slower algorithms must be used. The simplest solution was not even mentioned, because it involves an unthinkable (for APL) amount of looping.
     {del}Z{is}G GROUPSUM D;I
[1]  @Sums elements of {omega} according to groups {alpha}
[2]  Z{is}(0{max}{max}/G){reshape}0
[3]  I{is}{shape}D & {goto}(I=0)/0
[4]  L1:Z[G[I]]{is}Z[G[I]]+D[I]
[5]  {goto}(0<I{is}I-1)/L1
     {del}

      1 2 1 4 GROUPSUM 10 5 3 8
13 5 0 8
This algorithm has two desirable properties: its execution time is proportional to the length of the argument (not the length squared, or the number of groups times the number of elements in the most populous group, or even N log N), and it does not suffer from the round-off problem.
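For comparison, here is roughly what the sorting-and-partitioned-summation approach looks like. This is my reconstruction of the usual idiom (the name SORTSUM is made up), not necessarily the code from the earlier discussion or the code timed below, and like GROUPSUM it assumes origin-1 indexing. The running sum on line [4] and the differencing on line [7] are where round-off can creep in when the data spans a wide range of magnitudes: each group's sum is recovered as the difference of two running totals that may be enormous compared with the group's own elements.

     {del}Z{is}G SORTSUM D;I;S;C;E;T
[1]  @Sums elements of {omega} according to groups {alpha}, by sorting
[2]  I{is}{gradeup}G
[3]  S{is}G[I]
[4]  C{is}+\D[I]
[5]  E{is}S{notequal}(1{drop}S),0
[6]  T{is}E/C
[7]  T{is}T-0,{neg}1{drop}T
[8]  Z{is}(0{max}{max}/G){reshape}0
[9]  Z[E/S]{is}T
     {del}

Line [5] marks the last element of each group in the sorted order, line [6] picks up the running total at each of those points, and line [9] scatters the per-group totals into the result, leaving groups absent from G at zero. On the example above it produces the same 13 5 0 8.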
Interpreted, GROUPSUM runs very slowly because it processes scalars, which makes interpreter overhead (including dynamic memory management) dwarf the time spent doing useful work. However, compiling the program makes it run about 50 times faster (for arguments of 10,000 elements), which is about 2.5 times faster than the sorting solution. The only changes I made to obtain this speedup were declarations for the arguments and result:
[6]  @ DECLARE REAL: D[] Z[]
[7]  @ DECLARE INTEGER: G[]
Declaring the argument D to be real causes the compiler to generate floating-point operations for arithmetic on D. If D comes in as an integer array, it will automatically be converted to real type. (The result Z is declared real to help the compiler deduce that it should not be Boolean.)
If this function were responsible for a significant fraction of the CPU time in your application, it would be worth spending some time optimizing it for the compiler. Here is an optimized version:
     {del}Z{is}G GROUPSUM2 D;I;R
[1]  @Sums elements of {omega} according to groups {alpha}
[2]  Z{is}0{reshape}R{is}0 & {goto}(0={shape}G)/0
[3]  Z{is}(I{is}{max}/G){reshape}R & {goto}(I=0)/0
[4]  I{is}{shape}D
[5]  Z[1]{is}Z[1]{times}G[1]
[6]  L1:Z[G[I]]{is}Z[G[I]]+D[I]
[7]  {goto}(0<I{is}I-1)/L1
[8]  @ DECLARE REAL: D[] R
[9]  @ DECLARE INTEGER: G[] I
     {del}
The changes allow the compiler to "expand" (perform with compiled code) some operations it would otherwise pass to the interpreter, and they "hoist" operations out of the L1: loop (do them just once before entering the loop).
On line [3], assigning the result of the maximum reduction to an integer variable allows the compiler to expand the operation. Without the assignment, the compiler passes the operation to the interpreter because the type of the result is unknown ({max}/{iota}0 returns a real result). (Operations on arrays of unknown type are generally passed to the interpreter because it is usually impossible to beat the interpreter when types are unknown.) With the assignment, the compiled code can check for an empty G (generating a declaration error if it is empty) and then perform the reduction with an integer-maximum loop.
On lines [2] and [3], Z is initialized by reshaping a declared real (R), which avoids first creating a Boolean array that must then be converted to real type. Z itself is no longer declared because the compiler's data-flow analysis figures out that it is real.
Line [5] hoists several operations out of the L1: loop. Using indexing causes the compiled code to fetch the value of #IO (and check for an undefined #IO; this is pre-leaky-localization code), and referencing Z and G causes the code to compute the addresses of those variables. Because there are no memory allocations within the loop, these addresses can be used inside the loop without being recomputed.
This version runs more than three times faster than the unoptimized compiled version, and 8.5 times faster than the sorting algorithm (for 10,000-element arguments). Compilation has sped up the looping function by a factor of 175.
So, yes, you might end up making a number of changes to get a compiled function to run as fast as possible. You do the same thing with the interpreter. The kinds of changes you make for the compiler are different from those you make for the interpreter because the compiler and the interpreter are very different animals.
[You may be wondering about compiling the sorting algorithm. It is not improved by compilation; in fact it runs very slightly (about 1.5%) slower when compiled. Many operations are passed to the interpreter, and interpreter calls from a compiled function are not quite as fast as calls from an interpreted function.]
If this tidbit has piqued your curiosity, I suggest you read "An Introduction to STSC's APL Compiler" in the APL '85 Conference Proceedings. This paper should answer many of your questions.
Jim