Fork me on GitHub

Flatorize: Generate fast, flat, factorized code for mathematical expressions

by Guillaume Lathoud [1], April 2013, October 2014

flatorize lets you write well-encapsulated math code - i.e. many small functions - and automatically generates flat, very fast JavaScript code.

Speedups exceed +1000% in many cases, including matrix multiplication and the Discrete Fourier Transform (DFT).

Additionally:

Tests on the command line:

External links:

Contents #

Idea #

Think like this:

Matrix multiplication a * b: Each row of a with each column of b

...write code like you think, without caring about performance...

# Python   c = a * b
c = [[ sum(x*y for x,y in zip(ra,cb)) for cb in zip(*b) ] for ra in a ]
// JavaScript
function matmulrows_zip( a, b )
{
    return a.map( function (ra) { 
        return zip.apply( null, b ).map( function (cb) {
            return sum( 
                zip( ra, cb )
                    .map( function (xy) { return xy[ 0 ] * xy[ 1 ]; } )
            );
        } );
    } );
}

// --> Well encapsulated code, but slow due to function call overhead.

...and let flatorize transform the above into very fast JavaScript code:

// Generated "flatorized" code == factorized + flattened
// 
// - factorized: avoid repeating computations.
// - flattened:  no function call.

Details (HOWTO)#

The two core ideas are to factorize and to flatten the implementation of mathematical expressions, hence the name flatorize.

Consider the function f of complex numbers a,b,c:

function f(a,b,c) { return csub( csub(a,cadd(b,c)), cadd(b,c) ); };

Factorize #

To spare CPU, one can modify f by evaluating add(b,c) only once:

function f(a,b,c) { 
    var bc = add(b,c);
    return sub( sub(a,bc), bc ); 
}

Flatten #

Small "building block" functions like add or sub are useful to write code, but at runtime the CPU spends more time in function overhead than just doing the simple operation(s) in them.

In more complex cases like recursive expressions, the function overhead becomes even bigger.

One can eliminate function overhead by inlining each small function, that is replacing each small function with its implementation:

function f(a,b,c) { 
    var bc = b+c;
    return (a - bc) - bc;
}

The CPU performance should be pretty good at this point.

One could try to further optimize, e.g. writing a-2*bc, which amounts to mathematical expression simplification. This can be quite complicated, and it is not sure how much CPU speed can be gained this way - especially compared to function overhead elimination - so if we ever attempt simplification, we'll keep it very simple.

Usage: expr, part and flatorize #

flatorize('a,b,c', function f(a,b,c) { ... }) returns a new function, obtained by:

  1. calling f('a','b','c'), which returns an expression,
  2. factorizing and flattening the expression,
  3. generating the new code for the expression.

f(a,b,c) can use 3 ways to build and return an expression:

Example: complex numbers #

We now work with complex numbers, which we represent as arrays of two numbers [re,im]. Consider the function f of complex numbers a,b,c:

f = ;

To produce an optimized implementation f2 of the function f, we use flatorize and write the following code:

FZ    = flatorize;
cadd  = FZ('a,b',   function (a,b)   { 
    return cplx( FZ.expr( creal(a), '+', creal(b) ),
                 FZ.expr( cimag(a), '+', cimag(b) ) 
               ); 
});
csub  = FZ('a,b',   function (a,b)   { 
    return cplx( FZ.expr( creal(a), '-', creal(b) ),
                 FZ.expr( cimag(a), '-', cimag(b) ) 
               ); 
});
creal = FZ('a',     function (a)     { return FZ.part( a, 0 ); });
cimag = FZ('a',     function (a)     { return FZ.part( a, 1 ); });
cplx  = FZ('re,im', function (re,im) { return [ re, im ]; });

f  = 
f2 = FZ('a,b,c',f);

All functions returned by flatorize can be used to build further expressions: in the above example, creal is used to express cadd, which itself is used to expressed f. Definition order does not matter.

All functions returned by flatorize can also be called with values, for computation. For example, calling f2(a,b,c) executes the flat, factorized:

Example: matrix multiplication (arrays of rows)#

LOOPS #

Consider matrices expressed as 2-dimensional arrays of rows. Here is an implementation of matrix multiplication using 3 nested loops:

// LOOPS

Guessing the intent of matmulrows_loops from its implementation alone can be hard. Concepts are not encapsulated into separate functions, which slows down understanding, debugging and maintenance.

ZIP #

Can we express the concept of matrix multiplication in a more concise and better encapsulated way? For example, in Python:

# Python

a = [ [1,2,3,4,], [5,6,7,8,], [9,10,11,12,], ]
b = [ [13,14,], [15,16,], [17,18,], [19,20,]]

# Matrix multiplication: c = a * b

# Each number in c is the dot-product 
# of a row ra with a column cb

c = [[ sum(x*y for x,y in zip(ra,cb)) for cb in zip(*b) ] for ra in a ]

# zip(*b) implements the "transposed matrix"

# Result:
# >>> c
# [[170, 180], [426, 452], [682, 724]]

The expression is concise, shorter than the nested loops used above, and the various concepts are well encapsulated: map, sum and zip.

The reference to rows of a and columns of b is explicit through variables ra and cb, respectively. This can facilitate understanding.

A similar implementation in JavaScript requires to define zip and sum:

// ZIP (JavaScript)

Great! But given the number of nested function calls involved, matmulrows_zip may well run much slower than matmulrows_loops:

(Feel free to do it multiple times.)

(The speed measurement can last long in some browsers.)

Example of result:

ZIP   matrix multip.: 3.56e+4 matmulrows_zip(a,b) calls per second

LOOPS matrix multip.: 1.76e+6 matmulrows_loops(a,b) calls per second
-> speedup relative to ZIP: +4855%

ZIPFLAT: symbols first, numbers later #

Can we keep the nice encapsulations in matmulrows_zip *and* speed it up? We now write similar code using flatorize: instead of working on numbers, we work on symbols:

The core remains very similar. Indeed, compare the number version:

// ZIP

with the new symbolic version:

// ZIPFLAT

All we had to do was to replace:

Here is the generated implementation:

// ZIPFLAT
// Generated code (function matmulrows_zip_342)


(This code, as a few others below, was produced and inserted as you loaded the page.)

Now a look at speed:

(Feel free to do it multiple times.)

(The speed measurement can last long in some browsers.)

Example of result:

ZIP     matrix multip.: 3.53e+4 matmulrows_zip(a,b) calls per second

LOOPS   matrix multip.: 1.50e+6 matmulrows_loops(a,b) calls per second
-> speedup relative to ZIP: +4147%

ZIPFLAT matrix multip.: 1.69e+6 matmulrows_zip_342(a,b) calls per second
-> speedup relative to ZIP: +4698%

Depending on the JavaScript engine, LOOPS written by hand may run faster or slower than the generated ZIPFLAT code. However, both speeds have similar orders of magnitude.

More importantly:

We write well-encapsulated code, flatorize it automatically, and the generated code runs orders of magnitude faster (ZIPFLAT).

Trade-off: the matrix dimensions must be known to generate the flatorize'd code.

Example: Discrete Fourier Transform (DFT) #

We now implement a DFT using the Cooley-Tukey algorithm for both real & complex inputs (option). The implementation dft_ditfft2 shows that we can express recursive mathematical expressions in a "natural" way, that is very close to the original pseudocode.

dftreal16flat = flatorize('arr', dft_exprgenF( 4, { real: true } ));


Let us now test it on a "small" scale (16 point). First, let us generate the code:

 

So the code generation of a "flatorized" 16-point DFT takes very little time. This has to be done only one time.

Now that we have a "flatorized" implementation, let us check whether it runs faster than the original Cooley-Tukey:

(Feel free to do it multiple times.)

(The speed measurement can last long in some browsers.)

On a linux Ubuntu 12.04, I obtained with Firefox 21 a speedup of about +2000%, and with Chrome 27 about +1250%.

Pushing it: 1024-point Discrete Fourier Transform (DFT) #

dftreal1024flat = flatorize('arr', dft_exprgenF( 10, { real: true } ));

Since the recursion depth \(log_2{1024}=10\) starts to be quite significant, one can wonder how long the call to `flatorize` itself will take.

 (This one can last a few seconds.)

Now, let us measure how fast the flatorized DFT1024 implementation runs, as compared with the original Cooley-Tukey:

(Feel free to do it multiple times.)

(The speed measurement can last long in some browsers.)

In Chrome 27 on a linux Ubuntu 12.04, I got a small speedup +50% by the first measurement, and afterwards a consistent speedup of about +150%. In Firefox 21 I got a consistent speedup of about +350%.

Pushing it: all performance results #

Speedup brought by flatorize:

Speedup on:matmul342DFT16DFT32DFT64DFT128DFT256DFT512DFT1024

Firefox 26 (1st)+615%+2532%+1755%+1011%+638%+602%+597%+624%
Firefox 26 (2nd)+399%+2067%+1309%+830%+649%+654%+702%+301%
Firefox 26 (3rd)+434%+1945%+2076%+936%+713%+743%+685%+383%

Chrome 32 (1st)+87%+1388%+1072%+1421%+577%+264%-3%-67%
Chrome 32 (2nd)+102%+1044%+1284%+1557%+1409%+1600%+1440%+1462%
Chrome 32 (3rd)+98%+1107%+1141%+1285%+1428%+1404%+1515%+1580%

V8 3.19.10 (1st)+74%+1548%+1987%+1271%+548%+194%-12%-77%
V8 3.19.10 (2nd)+80%+1735%+2618%+2233%+1695%+1785%+1548%+1587%
V8 3.19.10 (3rd)+79%+1643%+2461%+2233%+1695%+1793%+1546%+1590%

(Feel free to do it multiple times.)

(The speed measurement can last long in some browsers.)

Result analysis: speedup #

Firefox 26 shows excellent speedups everywhere.

Chrome 32 and V8 3.19.10 mostly show excellent speedups, especially after the first pass, which likely triggers code optimization.

Result analysis: memory #

In all environments the memory consumption remains very stable.

Conclusion #

The huge performance improvements validate the interest of flatorize.

plugin is available that generates asm.js code, which runs even faster (tested in Firefox and Chrome).

Two other plugins generate fast code for the C language and the D language.

See also: Speed tests of the various solutions & languages (JS, asm.js, C).

In the future, one could also try WebCL, NaCl or PNaCl.


Acknowledgments #

Big thanks to the V8 developers, who solved issue #2612 in record time (concerning Chrome 24 and V8 3.17).


Disclaimer #

By no means did I intend flatorize to compete with highly specialized and/or native implementations of a particular task (e.g. FFTW to implement DFT).

Rather, this simple, task-agnostic approach – factorizing and flattening – can give huge enough performance gains in many cases, which spares the development and maintainance costs of writing optimized native code by hand.


Miscellaneous details #

Above, the three main functions were described: flatorize, flatorize.expr and flatorize.part. In addition:

  1. The method getDirect and the convenience function flatorize.now give you direct access to the flatorized implementation, which you can use for computations (even better performance). flatorize.now makes sense when you know that you will never use the flatorized function to build further expressions, e.g. in specific cases of a parameterizable implementation (matmul342); else use getDirect.
  2. Within a function that builds an expression, a call to the method evalnow delivers a value (instead of an expression). See above the call cpol.evalnow within dft_exprgenF.

For more details about these two points, and examples, you can have a look at the sources on github or here (files flatorize.js and examples.js).

.

V8: Several .sh files are provided to run the tests on V8, including d8_tryit_speed_all.sh and d8_tryit_speed_dft1024.sh; the rest on github.

.

Flatorizing DFT1024 in a tractable time – order of magnitude: 1 second – actually required a bit of work. The method is not specific to DFT, but rather very generic. The resulting, cleaned-up implementation: flatorize.js


License #