This article first appeared in the APL85 Conference Proceedings, APL Quote Quad, Vol. 15, No. 4 (May 1985). Copyright © 1985 ACM. Reproduced by permission of the Association for Computing Machinery.
Properly programmed, computers can execute simple operations on scalar variables (such as I{<-}I+1) hundreds of times faster than an APL interpreter. As a result, highly iterative, scalar-oriented programs are incredibly inefficient in interpreted APL. Compilation can speed up such programs enough to make them practical and, in some cases, faster than equivalent non-iterative programs.
Although many programs can be written efficiently in APL without iteration, some programs absolutely require iteration. A user who needs such a program is practically forced to use a different language. Other programs can be written non-iteratively, but are more clearly expressed with iteration. APL programmers are often forced to sacrifice clarity and simplicity in order to make a program run fast enough to use. Clearly, these aspects of APL programming neither make the language easier to use nor enhance programmer productivity.
Have you found any useful algorithms in journals lately? If you are an APL programmer, the answer is probably "no". Virtually all published programs are scalar-iterative. In their raw state (before parallelization), such programs generally run too slowly to be useful. APL users are intellectually isolated from the rest of the computing community by an implementation detail: APL is always interpreted.
STSC's APL compiler provides relief from the tyranny of interpretation. In a compiled function, a simple operation like I{<-}I+1 can often be done with a single machine instruction. Programs that the interpreter executes most inefficiently, scalar-iterative programs, are sped up dramatically (8 to 200 times faster) by compilation.
However, compilation cannot speed up all programs equally. Programs which are non-iterative or which process large arrays are usually executed fairly efficiently by the interpreter. Such programs can be sped up only moderately, even if written by hand in assembly language.
Compilation is not a magical process; it is a series of conceptually simple transformations that translate an APL program to optimized machine instructions. A basic understanding of machine language and the compiler's transformations enables one to examine a program and estimate how much it can be sped up by compilation.
The programs and data in an APL active workspace are typically stored in the "central memory" of the computer. Memory may be thought of as a vector of "bytes" (a byte is 8 bits of data). A character is represented by a single byte, an integer by a sequence of 4 bytes (typically), and a real (floating point) number by 8 bytes. Data stored in memory is identified by location measured in bytes from the start of memory.
In addition to central memory, computers generally have "registers", which are fast-access memory locations, usually on the CPU chip. Most machines have a dozen or so registers to hold integer values and a few more to hold real numbers. Using registers, rather than memory, to hold values results in faster program execution.
Machine instructions operate on arguments in registers and, in some cases, arguments in memory. Instructions are available to load 1, 4, or 8 bytes from a given location in memory into a register, or to store a register into memory. Arithmetic instructions add, subtract, multiply, divide, or compare two numbers of a particular type (real or integer). Conversion between integer and real type is done with explicit instructions. Logical instructions apply Boolean operations (such as ^, {or}, {/=}, shifts, or rotates) to integers viewed as 32-element bit vectors.
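As a rough illustration (the exact instruction sequence varies from machine to machine), the statement I{<-}I+1 for an integer scalar I held in memory might compile to three instructions:

   Load a register r with the 4 bytes at the address of I
   r {<-} r + 1
   Store r back into the 4 bytes at the address of I

If I is kept in a register for the duration of a loop, the load and store disappear and the statement becomes a single add instruction.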
An APL array is stored in memory as a vector (the array ravelled) together with additional data fields giving the type, rank, and shape of the array. The symbol table, a data structure which is invisible to APL users, links an array with its name, and allows programs to find where an array starts in memory (the "address" of the array).
A new array is created by "allocating" memory for it, a process which finds an unused section of memory large enough to hold the array. If there are no segments large enough, a "garbage collect" must be performed, a process which slides arrays around in memory to consolidate free space. After an allocate, a program must not use addresses calculated before the allocate because the arrays may have been moved by a garbage collect.
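In outline, allocation behaves something like the following sketch (illustrative only; the interpreter's actual bookkeeping is more involved):

   Search free storage for an unused segment of at least "size" bytes
   If no such segment exists then
      Garbage collect: slide arrays together to consolidate free space
      If there is still no segment of "size" bytes then
         Signal WS FULL
      endif
   endif
   Mark the segment as in use and return its address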
Allocating (and later freeing) memory for an array takes about 200 machine instructions (regardless of array size). Clearly, any compiler which does not eliminate allocates for scalars (as in I{<-}I+1) cannot even remotely approach ultimate machine efficiency for scalar-iterative programs. Once run-time syntax analysis is eliminated, memory allocations are the major source of inefficiency in APL programs.
APL array operations are implemented with scalar machine instructions and loops. For instance, the sum of a vector (Z{<-}+/R) is computed with:
   Allocate Z
   sum {<-} 0
   Loop for i {in} {iota}{rho}R
      sum {<-} sum + R[i]
   endloop
   Z[0] {<-} sum
[The notation "Loop for i {in} {iota}N" means to execute the statements in the loop with i = 0, 1, 2, ... N-1. Indexing in the pseudo-code examples always uses zero origin, regardless of the setting of #IO. Arrays of any rank may be indexed as vectors.]
APL-style indexing involves many discrete machine operations. First, #IO is subtracted from the indices, and they are checked for index errors. Multidimensional array indices are combined into a single index into the ravelled array. The vector index is multiplied by BPE (the number of bytes per element) to get the byte offset, and then added to the address of the array, giving the location of the element in memory. For example, indexing a real matrix M[I;J] with scalars I and J is implemented with:
   nrows {<-} ({rho}M)[0]
   i {<-} I - #IO
   If (i<0) or (i{>=}nrows) then
      Signal index error
   endif
   j {<-} J - #IO
   ncols {<-} ({rho}M)[1]
   If (j<0) or (j{>=}ncols) then
      Signal index error
   endif
   vecind {<-} (i{times}ncols) + j
   BPE {<-} 8                              @ 8 bytes/element
   loc {<-} (vecind{times}BPE) + Address of M
   result {<-} MEMORY[loc+{iota}BPE]
[The notation "MEMORY[loc+{iota}BPE]" denotes BPE consecutive bytes of memory starting at location loc.]
Boolean arrays are more difficult to index because most computers cannot fetch an arbitrary bit from memory. Instead, the byte containing the desired bit is fetched, shifted so the desired bit is rightmost, then AND-ed with a mask to zero out all but the lowest-order bit, resulting in an integer zero or one. Storing into a Boolean array is equally complicated.
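For instance, fetching element i of a Boolean vector B might expand to something like this (a sketch assuming eight bits per byte, with element 0 stored in the high-order bit of the first byte):

   byteindex {<-} {floor} i {divide} 8
   bitoffset {<-} 8 | i
   loc {<-} byteindex + Address of B
   byte {<-} MEMORY[loc+{iota}1]
   byte {<-} byte shifted right (7-bitoffset) bits
   bit {<-} byte ^ 1                       @ mask off all but the low-order bit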
The first steps in compilation are lexical analysis and parsing [2], which convert the APL program into an ordered list of interpreter calls. This resolves questions of valence (whether a name denotes a function or variable) and order of execution. The interpreter calls are represented in an "intermediate language" which has operations for all APL primitives as well as operations which correspond to machine instructions.
Data flow analysis is performed on the intermediate code to determine the types and ranks of variables and intermediate results. Declarations in the APL program start the process off by giving the type and rank of the program arguments and globals. The compiler follows the branching flow of the program forward and backward, determining the type and rank of results from information about arguments, and vice versa. A report is generated for the programmer giving the types and ranks the compiler has deduced for variables in the program and indicating where more declarations would be helpful.
Next, selected interpreter calls are expanded to scalar operations and loops. Complex APL operations such as {gradeup} or {domino} are not expanded; the compiler concentrates on simpler operations with more potential for improvement. Operations on arguments of unknown type are generally not expanded. If expanded, such operations would be little faster than the interpreter and would tend to make the compiled function huge. Operations that the compiler currently expands include:
   + - {signum} {divide} * | {ceiling} {floor} {rho} {iota}A ,A
   A[...]                                  (where types known)
   A[...]{<-}B
   A{take}B  A{drop}B                      (vector B only)
   A,B                                     (for rank {<=}1)
   A{iota}B  B{epsilon}A                   (scalar B only)
   = {/=} < > {<=} {>=} ~ ^ {or} {nand} {nor}   (not for Boolean array arguments)
   f/  f{slashbar}  {jot}.f                (fns + - {times} {divide} {ceiling} {floor})
   f\  f{backslashbar}                     (fns + {times} {ceiling} {floor})
   A{transpose}B                           (for constant A)
   {transpose}A                            (if rank A known)
   {->}L  {->}C/L  {->}C{rho}L  {->}C{take}L  {->}C{drop}L   (for label L)
Operations the compiler declines to expand get passed to the interpreter.
After expanding selected APL operations, the intermediate code is optimized. The optimizations further customize the expanded code to the specific case involved and remove many redundant operations. A sequence of optimizations is applied repeatedly until the intermediate code stabilizes.
The peephole optimizer looks at each instruction in the intermediate code and checks for possible improvements based on the particular instruction involved. For instance, i{<-}j+0 is replaced with i{<-}j. The peephole optimizer also finds loops with constant iteration counts. Zero-iteration loops are removed, one-iteration loops are replaced with straight-line code, and some two-iteration loops are expanded to two copies of the body of the loop. If-then-else's with known conditions are replaced with unconditional code.
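As an invented illustration, a loop whose iteration count is known to be one, such as

   Loop for i {in} {iota}1
      Z[i] {<-} V[i] + 1
   endloop

is replaced with the straight-line code

   i {<-} 0
   Z[i] {<-} V[i] + 1

after which the other optimizers can substitute the constant index and remove the dead loop counter.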
The dead variable optimizer finds operations which define variables that are not referenced later. Unless the operation has side effects or necessary error checks, it is removed. Although dead variables are uncommon in user programs, the code expanders and other optimizations generate dead variables in profusion.
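For instance, in the invented fragment

   t {<-} i + 1
   z {<-} i {times} 2

if t is never referenced later, the first instruction defines a dead variable and is removed.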
The copy optimizer finds where a copy of a variable is used and changes the operations to use the original variable instead of the copy. For instance, the instructions
   copy {<-} original
   z {<-} copy + y
are replaced with:
   copy {<-} original
   z {<-} original + y
This makes variable copy dead if it is not referenced elsewhere; the dead variable optimizer will remove the instruction copy{<-}original.
The expression optimizer finds common subexpressions and prevents them from being recalculated. For instance, the instructions
   j {<-} a + i
   x {<-} V[j]
   k {<-} a + i          (note k=j)
   y {<-} W[k]
are replaced with:
   j {<-} a + i
   x {<-} V[j]
   k {<-} a + i
   y {<-} W[j]           (use j, not k)
This makes variable k dead; the dead variable optimizer will remove the instruction k{<-}a+i. Again, although common subexpressions may not be too frequent in user programs, the code expanders generate many of them.
Loop merging joins the bodies of certain loops with equal iteration counts, getting the job of two loops done with a single loop. For instance, the expansion for Z{<-}{floor}0.5+V starts out with two loops:
   Allocate Temp
   Loop for i {in} {iota}{rho}V
      t {<-} 0.5 + V[i]
      Temp[i] {<-} t
   endloop
   Allocate Z
   Loop for j {in} {iota}{rho}V
      u {<-} {floor} Temp[j]
      Z[j] {<-} u
   endloop
Loop merging transforms the code to:
   Allocate Temp
   Allocate Z
   Loop for i {in} {iota}{rho}V
      t {<-} 0.5 + V[i]
      Temp[i] {<-} t
      u {<-} {floor} Temp[i]
      Z[i] {<-} u
   endloop
The copy and dead variable optimizations clean up the code to produce:
   Allocate Z
   For i {in} {iota}{rho}V
      t {<-} 0.5 + V[i]
      u {<-} {floor} t
      Z[i] {<-} u
   endfor
Note how loop merging made it possible to eliminate the allocate for the intermediate array result Temp. Because of loop merging, it is possible to write programs that use array operations, yet iterate without allocates. We will see later that this is vital to speeding up programs.
The loop induction expression optimizer reduces multiplication of a loop counter to addition. This eliminates many of the multiplications implicit in array indexing. For instance, Z{<-}V[3+{iota}5] for integer vector V is transformed from:
   BPE {<-} 4                              @ 4 bytes/element
   Allocate Z
   Loop for i {in} {iota}5
      j {<-} (3+#IO+i)-#IO
      If (j<0) or (j{>=}{rho}V) then
         Signal index error
      endif
      source {<-} (j{times}BPE) + Address of V
      t {<-} MEMORY[source+{iota}BPE]
      dest {<-} (i{times}BPE) + Address of Z
      MEMORY[dest+{iota}BPE] {<-} t
   endloop
to:
   BPE {<-} 4                              @ 4 bytes/element
   Allocate Z
   i {<-} -1
   j {<-} (3+#IO+i)-#IO
   first {<-} j + 1
   If (first<0) or (first{>=}{rho}V) then
      Signal index error
   endif
   last {<-} j + 5
   If (last<0) or (last{>=}{rho}V) then
      Signal index error
   endif
   source {<-} (j{times}BPE) + Address of V
   s {<-} Address of Z
   e {<-} s + (BPE{times}(5-1))
   Loop for dest = s up to e
      source {<-} source + 4
      t {<-} MEMORY[source+{iota}BPE]
      MEMORY[dest+{iota}BPE] {<-} t
   endloop
Although the code is not reduced in size by this optimization, most of the operations are moved out of the loop, speeding up execution if the loop iterates many times.
The final phase of compilation translates the intermediate code to machine language. This final translation allows the intermediate language to be customized to the needs of the optimizers, and facilitates adapting the compiler to different target machines. Control structure instructions (such as Loops and Ifs) are expanded to branches and labels. Some higher level intermediate instructions (such as maximum) are expanded to a sequence of machine instructions. Frequently used variables are assigned to registers. The compiler optionally produces an assembler listing of the code it generates, enabling us to judge the quality of the code and note opportunities for further improvement.
How do these transformations and optimizations affect the speed of actual programs? The speedup is different for each program, but two broad classes may be distinguished.
If a program iterates and the compiler is able to eliminate allocates in the loops, the program is likely to be sped up 8 to 200 times by compilation. The more the program iterates, the greater the speedup tends to be. If operations in the loops return non-scalar arrays, the speedup tends to decrease as the average array size increases.
Non-iterative programs with few opportunities for loop merging and few scalar arrays are typically 0.9 to 2 times faster after compilation, and tend to be sped up less when processing large arrays.
Many programs fall somewhere between these two categories, perhaps iterating with one or two allocates in a loop, or not iterating but having lots of mergeable operations or scalar arrays. Such programs may be sped up anywhere from 1.5 to 8 times by compilation.
The following function is an example of the kind of scalar-iterative algorithm that cannot be practically implemented using an interpreter. The problem is to find a contiguous segment of a given vector having the maximum possible sum [3]. If the vector contains only positive numbers, the answer is the whole vector. However, if there are both positive and negative numbers in the vector, some subsegment may give a larger sum. If all the numbers are negative, the answer is the empty segment with sum zero.
The algorithm may be described as follows: Compute a running sum of the argument. If the sum becomes negative, reset the sum to zero and mark the next element as the new "start" of the vector. If the sum takes on a value greater than any previously encountered, mark the current element as the end-of-segment and the current "start" position as the start-of-segment. Repeat for each element of the argument.
      {del}Z{<-}MAXSEG V;I;S;X;B;E;F;N
[1]   @ DECLARE INTEGER: V[] S{jot}
[2]   N{<-}{iota}1  @ HOIST #IO REFERENCE
[3]   N{<-}{rho}V
[4]   I{<-}B{<-}E{<-}F{<-}X{<-}S{<-}0
[5]   L1:{->}(N<I{<-}I+1)/L3
[6]   S{<-}S+V[I]
[7]   {->}(S{>=}0)/L2 & S{<-}0 & F{<-}I
[8]   L2:{->}(S{<=}X)/L1 & E{<-}I & B{<-}F & X{<-}S
[9]   {->}L1
[10]  L3:Z{<-}N{take}(-E){take}(E-B){rho}1
      {del}
Line [1] is a declaration to the compiler that argument V is an integer vector and that variable S is an integer scalar. The declaration for S is included because S is initialized with the constant zero, which is regarded by the compiler as a Boolean value. Because variables I, B, E, X, and F are defined from S, the compiler knows that they are integer scalars. The statement N{<-}{iota}1 on line [2] forces the compiler to fetch the value of #IO before entering the user loop (lines [5-9]). Without this hoisting, #IO would be fetched each time line [6] was executed, slightly reducing the speed of the program. [Incidentally, origin 1 is assumed in the example APL programs.]
The compiler eliminates allocates for all the scalar variables in MAXSEG. Because MAXSEG processes only scalars in its loop, it is sped up more for larger arguments. The final statement, which generates the result mask, takes 3 memory allocations, dominating execution time for small arguments and limiting speedup potential. For larger arguments, a greater proportion of the time is spent in the loop, and the dramatic speedup of this section of the code becomes apparent.
   Argument     Speedup
    Length       Factor
        1            3
       10           13
      100           72
     1000          182
    10000          204
[The speedup factor is the execution time for the interpreted function divided by the time for the compiled function.]
Box-Jenkins time series analysis involves modelling time-variant data such as daily mean temperatures. An ARMA(N,M) time series model is "evaluated" by computing each predicted observation as the weighted sum of the previous N observations and the previous M differences between predicted and observed values. The process is inherently iterative; the next observation in the model cannot be computed until the previous observation has been computed. The following function evaluates the predicted values for an ARMA(N,N) time series model. The right argument is the time series data; the left argument is a 2-by-N matrix of autoregression and moving average coefficients.
      {del}P{<-}AB ARMA V;E;A;B;I;J;D;R
[1]   @ DECLARE REAL: V[] AB[;] R{jot}
[2]   E{<-}{rho}V
[3]   I{<-}''{rho}{neg}1{take}{rho}AB
[4]   A{<-}AB[1;] & B{<-}AB[2;]
[5]   P{<-}E{rho}R{<-}0 & D{<-}E{rho}R
[6]   J{<-}1+I-{iota}I
[7]   L1:{->}(E<I{<-}I+1)/0
[8]   P[I]{<-}(+/A{times}V[I-J])-+/B{times}D[I-J]
[9]   D[I]{<-}P[I]-V[I]
[10]  {->}L1
      {del}
The declaration on line [1] tells the compiler that argument V is a real vector, that argument AB is a real matrix, and that variable R is a real scalar. Arrays P and D are initialized on line [5] by reshaping R because the compiler regards the constant zero as a boolean value.
The compiler-generated loops for the array operations on line [8] all merge, eliminating allocates in the user loop (lines [7-10]). The following table gives speedup factors for various right arguments given a left argument specifying an ARMA(N,N) model:
      Loop      |            N
   Iteration    |
     Count      |     3      6     12
   -------------+-----------------------
        10      |    13     11     10
       100      |    29     24     19
      1000      |    34     28     21
For a given N, the speedup increases as the program iterates more. For a given iteration count, the speedup decreases as N increases because the program processes larger arrays.
INTSNDP calculates the integral of the standard normal probability density function from negative infinity to X, using an approximation from [1]. This function is distinctive in that it consists of a sequence of 21 consecutive scalar operations which merge, upon compilation, into one loop and one allocate.
      {del}Z{<-}INTSNDP X;T;S
[1]   @ DECLARE REAL: X[]
[2]   T{<-}{divide}1+0.2316419{times}|X
[3]   S{<-}T{times}0.31938153+T{times}{neg}0.356563782+T{times}1.781477937+T{times}{neg}1.821255978+T{times}1.330284429
[4]   Z{<-}|(X{>=}0)-S{times}0.39894228040143{times}*{neg}0.5{times}X{times}X
      {del}
In spite of the loop merging feat, the function is sped up only modestly. For arguments of 100 elements or more, the cost of the multiplication and exponentiation dwarfs the cost of 21 allocates, and it becomes apparent that the compiler's single loop is just twice as fast as the interpreter's 21 loops.
   Argument     Speedup
    Length       Factor
        1          5.5
       10          3.9
      100          2.3
     1000          2.0
    10000          2.1
The STATS function computes descriptive statistics (number of observations, minimum, maximum, mean, mean deviation, and variance) of a vector argument.
      {del}Z{<-}STATS A;N;L;G;M;D;S;SS
[1]   @ DECLARE REAL: A[] Z[]
[2]   N{<-}''{rho}{rho}A
[3]   L{<-}{min}/A & G{<-}{max}/A
[4]   M{<-}(+/A){divide}N
[5]   D{<-}|A-M
[6]   S{<-}+/D & SS{<-}+/D*2
[7]   Z{<-}6{rho}0 & Z[1]{<-}N
[8]   Z[2]{<-}L & Z[3]{<-}G & Z[4]{<-}M
[9]   Z[5]{<-}S{divide}N & Z[6]{<-}SS{divide}N-1
      {del}
Line [1] is a declaration to the compiler that A and Z are real vectors. The declaration for Z is included because the compiler regards 6{rho}0 as a boolean array. Z is set one element at a time to avoid allocates associated with forming the result using catenate.
This version of the program has just one allocate, for the result. There are only two loops, the minimum possible (one calculates {min}/A, {max}/A, and +/A, the other calculates |A-M, +/D, and +/D*2). The table below gives speedup factors for various length arguments.
   Argument     Speedup
    Length       Factor
        1          4.0
       10          5.1
      100          4.0
     1000          2.9
    10000          2.5
pMAXRED calculates the largest value in each segment of a partitioned vector. The right argument is the data vector; the left argument is a conforming Boolean vector with ones marking the start of each segment. The result has as many elements as there are ones in the left argument, and the J-th element holds the largest value in the J-th segment of the right argument.
To avoid iteration, pMAXRED is usually expressed as in [4]:
      {del}Z{<-}B pMAXRED D
[1]   Z{<-}D[({gradedown}D)[B/{gradeup}(+\B)[{gradedown}B]]]
      {del}
Non-iterative though it may be, this is not a particularly efficient solution to the problem. The grade primitives are expensive, and they are not essential to computing the result. A straightforward iterative algorithm is:
      {del}Z{<-}B pIMAXRED D;J;I;M
[1]   @ DECLARE INTEGER: B[] D[] MININT{jot}
[2]   J{<-}+/B & Z{<-}J{rho}{neg}1 & {->}(J=0)/0
[3]   I{<-}''{rho}{rho}D & M{<-}MININT
[4]   L1:M{<-}M{max}D[I]
[5]   {->}B[I]{drop}L2 & Z[J]{<-}M & J{<-}J-1 & M{<-}MININT
[6]   L2:{->}(0<I{<-}I-1)/L1
      {del}
Line [1] is a declaration to the compiler that B and D are integer vectors and that MININT is an integer scalar. (MININT holds the minimum value representable in the integer datatype, typically -2*31.) The right argument is declared integer because a choice had to be made; otherwise the compiler would pass most of the operations to the interpreter. A separate function, pRMAXRED, would presumably be used for real data. The left argument is declared integer because empirical tests showed it to be faster that way (Boolean indexing is slow).
Compilation speeds up pIMAXRED considerably, to the point that it is faster than the clever non-iterative algorithm.
   Argument     Speedup     Speedup over
    Length       Factor     non-it. version
        1           3            1.9
       10          10            2.5
      100          33            4.3
     1000          44            6.5
    10000          46            8.4
[The third column gives the execution time for the non-iterative version divided by the time for the compiled iterative version.]
One ought not get the impression that all functions can be sped up by rewriting them iteratively. If the non-iterative function is not forced to use expensive or unnecessary operations, it will be hard to beat with an iterative APL program. Because of index error checking and lack of loop induction optimizations for user loops, compiled APL functions are not as efficient as the assembly language programs making up the interpreter.
For instance, the partitioned plus reduction operation, pPLUSRED, is implemented non-iteratively as:
pPLUSRED:T-{neg}1{drop}0,T{<-}(1{rotate}{alpha})/+\{omega}
which executes with 6 allocates and practically no unnecessary arithmetic. The iterative version, similar to pIMAXRED, is sped up plenty by compilation, but not enough to surpass the non-iterative version for large arguments.
   Argument     Speedup     Speedup over
    Length       Factor     non-it. version
        1           4           1.62
       10          11           1.63
      100          36           0.88
     1000          51           0.43
    10000          52           0.36
Although using a compiled iterative version of pIPLUSRED probably would not speed up an application, the factor of 3 slowdown is modest in comparison to the factor of 150 slowdown for the same program interpreted.
The assembler listing of the code generated by the compiler for the ARMA function is impressive. When the compiler generates the right intermediate instructions, the final machine code can hardly be improved.
However, not all programs are so fortunate. There are many optimizations a programmer would apply which the compiler does not (yet) know about. Although we could write more optimizations for the compiler, we could never outperform a clever human. Programmers can invent new optimizations for new programs and need only understand how to apply an optimization to the program at hand. For an optimization to be implemented in the compiler, its applicability to every possible program must be thoroughly understood.
The difficulty of implementing optimizations is of little consolation to the frustrated user who knows what intermediate code he wants, but is unable to get the APL compiler to generate it. We experienced this problem when attempting to use the compiler to speed up itself. By mid-1983, the compiler (written entirely in APL) had grown so huge and slow that it was no longer possible to test it adequately, let alone add new optimizations to it. It absolutely had to be sped up and was not up to the task of compiling itself.
The solution adopted was to create a new language to generate intermediate code for the compiler. Called TABL (for Tool and Application Building Language), it has all the intermediate language instructions available, plus additional features found in conventional compiled languages (such as data structure abstraction). Using TABL, it is possible to "call the shots" and get the compiler to generate whatever intermediate code you want. Although TABL is thought of as a low-level language, it is functionally a superset of APL.
The TABL compiler was relatively simple to build. About 80 percent of the code is shared with the APL compiler. TABL has proven very effective in practice. We rewrote the slowest parts of the optimizer in TABL and found speedups of 100 to 200 times over the interpreted APL versions. By rewriting about 2 percent of the compiler in TABL, the overall speed of the compiler was increased by a factor of 5 and development was able to continue.
Below are examples of simple APL expressions that a programmer writing in a low-level language could implement better than the compiler at present. Although some of the expansions can be mimicked with iterative APL code, needless index error checking and other inefficiencies may prevent this from being a viable alternative. All of the desired expansions are easily expressed in TABL.
The expression M[I;]{<-}M[I;]{divide}P, for scalars I and P, may be implemented with:
   For j {in} {iota}1{drop}{rho}M
      M[I;j] {<-} M[I;j] {divide} P
   endfor
The compiler cannot use this algorithm if M is global. If the expression gives a domain error, the semantics of APL require that no elements of M be changed. The algorithm above would change those elements occurring before the first non-zero element in M[I;]. Instead, the compiler performs the indexed reference and indexed assignment as two separate operations and does an allocate for the result of M[I;]{divide}P.
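The code actually produced behaves more like the following sketch (simplified; the compiler's expansion differs in detail):

   Allocate Temp                          @ result of M[I;]{divide}P
   Loop for j {in} {iota}1{drop}{rho}M
      Temp[j] {<-} M[I;j] {divide} P      @ any domain error is signalled here,
   endloop                                @ before M has been touched
   Loop for j {in} {iota}1{drop}{rho}M
      M[I;j] {<-} Temp[j]
   endloop
   Free Temp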
The statement V[(V='{neg}')/{iota}{rho}V]{<-}'-', which changes high minus signs in vector V to low minus signs, may be implemented with:
   Loop for i {in} {iota}{rho}V
      If V[i]='{neg}' then
         V[i] {<-} '-'
      endif
   endloop
The compiler presently passes compression to the interpreter, but even if compression were expanded, it is hard to see how this algorithm could be derived from it. Consider the expansion for compression: it takes one pass through the left argument just to figure out the shape of the result, then another pass to copy elements of the right argument to the result, as sketched below. How does one avoid having to materialize the Boolean left argument to compression? How does one identify cases of compression where one pass suffices?
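The straightforward two-pass expansion described above looks something like this sketch, with B standing for the Boolean left argument and R for the right argument (hypothetical names):

   n {<-} 0
   Loop for i {in} {iota}{rho}B           @ first pass: count the ones
      n {<-} n + B[i]
   endloop
   Allocate Z with n elements
   j {<-} 0
   Loop for i {in} {iota}{rho}B           @ second pass: copy selected elements
      If B[i]=1 then
         Z[j] {<-} R[i]
         j {<-} j + 1
      endif
   endloop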
The expression B{<-}N{drop}B,N{rho}0 for a 32-element (or shorter) Boolean vector B may be implemented in a single machine instruction as an N-bit arithmetic shift, assuming the bits of a boolean vector are stored consecutively in a word. (They are not in VSAPL.) A bitwise shift instruction is useful for implementing CRC checksums and hashing. The compiler performs this statement with 3 allocates.
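Assuming the 32 bits are packed into a single machine word with element 0 in the most significant bit, the whole statement reduces to something like:

   word {<-} MEMORY[(Address of B)+{iota}4]
   word {<-} word shifted left N bits, with zero fill
   MEMORY[(Address of B)+{iota}4] {<-} word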
The statement Z{<-}(N{drop}V){iota}S may be implemented without an allocate for N{drop}V by simply starting the search for S past the Nth element of V. The APL compiler is not presently able to avoid the allocate. When this statement is embedded in a loop, the result may be a program which takes execution time proportional to ({rho}V)*2. If implemented without an allocate, the proportionality factor drops to ({rho}V).
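The desired expansion might look something like this sketch (zero-origin pseudo-code, with the result adjusted to #IO at the end):

   n {<-} {rho}V
   Z {<-} n - N                           @ default result: S not found
   Loop for i {in} {iota}n-N
      If V[N+i] = S then
         Z {<-} i
         Exit the loop
      endif
   endloop
   Z {<-} Z + #IO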
The "delete trailing blanks" idiom DTB:({reverse}{or}\{reverse}{omega}{/=}' ')/{omega} may be implemented without the need to pack, reverse, and unpack Boolean arrays. Simply scan the argument from the right until a non-blank is encountered, and copy the rest of the argument to the result. The compiler implements this idiom literally, expands only the not-equal operation, and passes the rest to the interpreter, resulting in 5 allocates.
An APL compiler has been constructed. The compiler complements the interpreter by substantially speeding up iterative and scalar-oriented programs, which the interpreter executes very inefficiently. Certain non-iterative programs are sped up significantly, but many are not. Some flagrantly wasteful non-iterative programs can be rewritten iteratively to obtain faster execution.
Conditioned by years of using interpreters, APL programmers have come to regard "loop" as a dirty word. State-of-the-art algorithms designed for other languages are considered irrelevant to APL unless they can be expressed non-iteratively. Likewise "efficient" APL algorithms are usually of no interest to users of other languages. APL aficionados have restricted themselves to problems which can be practically solved using the interpreter, and less enthusiastic users are often forced away from APL because the problems they tackle do not lend themselves to non-iterative solutions. The compiler has the potential to align APL users' views of efficiency with those of the rest of the computing community, to facilitate the exchange of algorithms between APL and non-APL users, and to expand the range of problems solvable in APL.
In addition to the APL compiler, the TABL compiler may be used to express machine-oriented algorithms that are difficult or impossible to state in APL. The combination of the APL and TABL compilers ushers in a new era of efficiency for APL and opens up whole new fields where APL could not be applied before.
[1] Abramowitz, M. (ed.) Handbook of Mathematical Functions, U.S. Govt. Printing Office, Washington, D.C., 1964. Formula 26.2.17, page 932.
[2] Aho, A., and Ullman, J. Principles of Compiler Design, Addison-Wesley, Reading, Mass., 1977.
[3] Bentley, J. Algorithm Design Techniques. Communications of the ACM 27, 9 (September 1984), 865-871.
[4] Smith, B. A Programming Technique for Non-Rectangular Data. Proc. APL 79 Conference, Rochester, New York, (June 1979), 262-267.
[5] Weigang, J. APL*PLUS Compiler Guide, Publication number P126, STSC, Rockville, MD, 1984.
[6] Wiedmann, C. Steps Towards an APL Compiler. Proc. APL 79 Conference, Rochester, New York, (June 1979), 321-328.
[7] Wiedmann, C. A Performance Comparison between an APL Interpreter and Compiler. Proc. APL 83 Conference, Washington, DC, (April 1983), 211-217.
[8] Wiedmann, C. Efficiency in the APL Environment. Proc. APL 85 Conference, Seattle, Washington, (May 1985).