flatorize_asmjs: Generate fast TypedArray code that is compatible with asm.js

by Guillaume Lathoud [1], September 2014

This page presents flatorize.getAsmjsGen(), a plugin method (GitHub source) that builds on top of flatorize (see the main article, GitHub source).

The examples below show how to use flatorize.getAsmjsGen() to generate asm.js/TypedArray code that runs very fast in at least Firefox and Chrome.

Each input or output can be a number or an array of numbers. All arrays are grouped into a single one (inputs and/or output), so they must have the same type: double, float or int in flatorize notation, i.e. respectively Float64Array, Float32Array or Int32Array in JavaScript notation.
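The correspondence between the flatorize type names and the Typed Array constructors can be written down as a small lookup table. The sketch below is only illustrative: `TYPE_TO_ARRAY` and `bufferBytes` are hypothetical names and are not part of flatorize.

```javascript
// Hypothetical lookup table (not part of flatorize):
// flatorize type name -> Typed Array constructor.
var TYPE_TO_ARRAY = {
    double : Float64Array,
    float  : Float32Array,
    int    : Int32Array
};

// Hypothetical helper: number of bytes needed to store `count` values
// of the given flatorize type in one shared buffer.
function bufferBytes( type, count )
{
    if (!Object.prototype.hasOwnProperty.call( TYPE_TO_ARRAY, type ))
        throw new Error( 'Unknown type: ' + type );

    return count * TYPE_TO_ARRAY[ type ].BYTES_PER_ELEMENT;
}

// a:[2 float],b:[2 float],c:[2 float]->d:[2 float]  => 8 floats in total
console.log( bufferBytes( 'float', 8 ) ); // 32
```

Since all inputs and outputs share one array, a single element size applies to the whole buffer, which is why the types must match.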

HOWTO: 2-step example

Here is an expression definition that uses complex numbers (details in the main article):

// f:

A call to flatorize():

// note the type declarations, ignored by flatorize but used later for asm.js
f2 = flatorize('a:[2 float],b:[2 float],c:[2 float]->d:[2 float]',f);

...generates flatorized JavaScript code:

// f2.getDirect():

Then, a call to flatorize.getAsmjsGen():

...returns an asm.js generator:

// f2_asmjsGen:

The generator can be used as follows to compile and use the asm.js code:

(This check, as a few others below, ran as you loaded the page.)
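As a rough illustration of what compiling and calling an asm.js module involves in general, here is a hand-written sketch. It is not the code generated by flatorize: the expression (d = a + b * c on complex numbers), the heap layout, the use of doubles instead of floats, and the names `complexModule`, `io` and `compiled` are all assumptions made for this example.

```javascript
// Hand-written asm.js-style module (NOT generated by flatorize).
// Assumed heap layout, in doubles:
//   a = [0,1], b = [2,3], c = [4,5], d = [6,7], each stored as [re, im].
function complexModule( stdlib, foreign, heap )
{
    "use asm";
    var float64 = new stdlib.Float64Array( heap );

    // d = a + b * c, written in place into the shared heap
    function f()
    {
        var a_re = 0.0, a_im = 0.0, b_re = 0.0, b_im = 0.0, c_re = 0.0, c_im = 0.0;

        a_re = +float64[ 0 ];  a_im = +float64[ 1 ];
        b_re = +float64[ 2 ];  b_im = +float64[ 3 ];
        c_re = +float64[ 4 ];  c_im = +float64[ 5 ];

        float64[ 6 ] = a_re + ( b_re * c_re - b_im * c_im );
        float64[ 7 ] = a_im + ( b_re * c_im + b_im * c_re );
    }

    return { f : f };
}

// "Compile": pass the standard library, a foreign object and the heap buffer.
var buffer   = new ArrayBuffer( 0x10000 ); // 64 KiB
var io       = new Float64Array( buffer ); // view used to write inputs / read outputs
var compiled = complexModule( globalThis, {}, buffer );

// "Use": write the inputs, call, read the output from the same buffer.
io.set( [ 1, 2,   3, 4,   5, 6 ] ); // a = 1+2i, b = 3+4i, c = 5+6i
compiled.f();
console.log( io[ 6 ], io[ 7 ] ); // -8 40
```

An engine without asm.js support simply runs the module as ordinary JavaScript, so the result is the same either way.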

Summary:

We used two steps to create the asm.js generator f2_asmjsGen. First, we called flatorize, then we called flatorize.getAsmjsGen():

// Note the type declarations, ignored by flatorize but used later for asm.js
f2 = flatorize('a:[2 float],b:[2 float],c:[2 float]->d:[2 float]',f);

// Now the type declarations will matter

Having the intermediate flatorized implementation f2 can be useful to build other flatorized implementations, i.e. to write well-encapsulated, maintainable code using many small functions.

We only need the second step — a faster asm.js implementation — for the functions actually used in massive computations.

HOWTO: 1-step shortcut

If an intermediate flatorized implementation is not needed, one can directly create the asm.js generator in a single step:

2-dimensional array example: matrix multiplication

A call to flatorize() (details in the main article):


...generates flatorized JavaScript code:

// matmulrows_zip_342.getDirect():

Then, a call to flatorize.getAsmjsGen():

...returns an asm.js generator:

// matmulrows_zip_342_asmjsGen:

The generator can be used as follows to compile and use the asm.js code:
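To illustrate how 2-dimensional arrays can share a single flat Typed Array, here is a hand-written sketch. It assumes that "342" stands for a 3x4 matrix multiplied by a 4x2 matrix, and that matrices are stored row by row, one after the other; the actual layout chosen by flatorize.getAsmjsGen() may differ. It also uses loops for brevity, whereas flatorized code is unrolled. All names are hypothetical.

```javascript
// Assumed dimensions: a is I x J, b is J x K, c = a * b is I x K.
var I = 3, J = 4, K = 2;

// Assumed layout in the single flat array: a, then b, then c (row-major).
var A_OFFSET = 0;
var B_OFFSET = A_OFFSET + I * J;
var C_OFFSET = B_OFFSET + J * K;
var COUNT    = C_OFFSET + I * K;

// In-place multiplication: reads a and b, writes c, all in the same array.
function matmulInPlace( float64 )
{
    for (var i = 0; i < I; i++)
    {
        for (var k = 0; k < K; k++)
        {
            var sum = 0;
            for (var j = 0; j < J; j++)
                sum += float64[ A_OFFSET + i * J + j ] * float64[ B_OFFSET + j * K + k ];

            float64[ C_OFFSET + i * K + k ] = sum;
        }
    }
}

// Flatten an array of rows into one array of numbers.
function flatten( rows )
{
    return rows.reduce( function ( acc, row ) { return acc.concat( row ); }, [] );
}

var heap = new Float64Array( COUNT );
heap.set( flatten( [ [ 1, 2, 3, 4 ], [ 5, 6, 7, 8 ], [ 9, 10, 11, 12 ] ] ), A_OFFSET );
heap.set( flatten( [ [ 1, 0 ], [ 0, 1 ], [ 1, 0 ], [ 0, 1 ] ] ), B_OFFSET );

matmulInPlace( heap );

var c = Array.prototype.slice.call( heap, C_OFFSET, C_OFFSET + I * K );
console.log( c ); // [ 4, 6, 12, 14, 20, 22 ]
```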

Discrete Fourier Transform: DFT16 (real signals)

A call to flatorize() (details in the main article):

...generates flatorized JavaScript code:

// dftreal16flat.getDirect():

Then, a call to flatorize.getAsmjsGen():

...returns an asm.js generator:

// dftreal16flat_asmjsGen:

The generator can be used as follows to compile and use the asm.js code:

asmjs_dftrealflat_check( 16 );
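The body of asmjs_dftrealflat_check() is not shown here. One plausible way to verify a fast DFT implementation is to compare its output with a naive reference implementation; the sketch below shows such a reference. The function names and the [re, im] output format are assumptions for this example, not taken from flatorize.

```javascript
// Naive O(N^2) reference DFT of a real signal.
// Returns an array of N pairs [ re, im ].
function naiveDftReal( x )
{
    var N   = x.length;
    var out = new Array( N );

    for (var k = 0; k < N; k++)
    {
        var re = 0, im = 0;
        for (var n = 0; n < N; n++)
        {
            var angle = -2 * Math.PI * k * n / N;
            re += x[ n ] * Math.cos( angle );
            im += x[ n ] * Math.sin( angle );
        }
        out[ k ] = [ re, im ];
    }
    return out;
}

// Largest absolute difference between two spectra of the same length.
function maxAbsError( specA, specB )
{
    var err = 0;
    for (var k = 0; k < specA.length; k++)
    {
        err = Math.max( err
                        , Math.abs( specA[ k ][ 0 ] - specB[ k ][ 0 ] )
                        , Math.abs( specA[ k ][ 1 ] - specB[ k ][ 1 ] )
                      );
    }
    return err;
}

// Sanity check: the DFT of a unit impulse is 1 + 0i in every bin.
var impulse = [ 1 ];
for (var n = 1; n < 16; n++)
    impulse.push( 0 );

var flatSpectrum = impulse.map( function () { return [ 1, 0 ]; } );
console.log( maxAbsError( naiveDftReal( impulse ), flatSpectrum ) < 1e-12 ); // true
```

A check of a fast implementation would then compute maxAbsError() between its output and naiveDftReal() on a test signal, and require the error to stay below a small tolerance.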

Discrete Fourier Transform: DFT1024 (real signals)

A call to flatorize() (details in the main article):

...generates flatorized JavaScript code:

(Might last a few seconds.)

Then, a call to flatorize.getAsmjsGen():

var dftreal1024flat_asmjsGen = flatorize.getAsmjsGen( 
  { switcher: dftreal1024flat, name: "dftreal1024flat" } 
);

...returns an asm.js generator:

(Might last a few seconds.)

The generator can be used as follows to compile and use the asm.js code:

asmjs_dftrealflat_check( 1024 );

Performance (1): with/without "use asm"

We compare the speed with and without the "use asm" statement, on DFT1024. The only difference is whether the "use asm" statement appears; the rest of the code remains the same.

(Feel free to do it multiple times.)

(The speed measurement can last long in some browsers.)
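For readers who want to reproduce this kind of measurement outside the page, here is a minimal sketch of how a speed in iterations/second and a speedup percentage can be computed. The functions `measureSpeed` and `speedupPercent` are hypothetical and are not the code behind the button on this page.

```javascript
// Hypothetical helper: call `fn` repeatedly for at least `minMillis`
// milliseconds and return the speed in iterations/second.
function measureSpeed( fn, minMillis )
{
    var BATCH      = 1000;
    var duration   = Math.max( 1, minMillis ); // guarantees elapsed > 0 at the end
    var iterations = 0;
    var start      = Date.now();
    var elapsed    = 0;

    do {
        for (var i = 0; i < BATCH; i++)
            fn();

        iterations += BATCH;
        elapsed     = Date.now() - start;
    } while (elapsed < duration);

    return iterations / ( elapsed / 1000 );
}

// Hypothetical helper: relative speedup in percent, rounded,
// e.g. 10 means "+10%".
function speedupPercent( baselineSpeed, improvedSpeed )
{
    return Math.round( ( improvedSpeed / baselineSpeed - 1 ) * 100 );
}

// Using the rounded speeds of the first and second Chrome 38 runs:
console.log( speedupPercent( 1.12e+4, 1.23e+4 ) ); // 10
console.log( speedupPercent( 3.48e+4, 3.45e+4 ) ); // -1
```

Because the first run often includes the "warm-up" of the engine, it seems safer to repeat the measurement several times and discard the first result.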

Example of result:

__________
Firefox 32:

without "use asm": speed: 2.15e+3 iterations/second.
with    "use asm": speed: 2.41e+4 iterations/second.
-> speedup: +1018%

without "use asm": speed: 2.13e+3 iterations/second.
with    "use asm": speed: 2.41e+4 iterations/second.
-> speedup: +1030%

without "use asm": speed: 2.10e+3 iterations/second.
with    "use asm": speed: 2.34e+4 iterations/second.
-> speedup: +1013%

__________
Chrome 38:

without "use asm": speed: 1.12e+4 iterations/second.
with    "use asm": speed: 1.23e+4 iterations/second.
-> speedup: +10%

without "use asm": speed: 3.48e+4 iterations/second.
with    "use asm": speed: 3.45e+4 iterations/second.
-> speedup: -1%

without "use asm": speed: 3.47e+4 iterations/second.
with    "use asm": speed: 3.52e+4 iterations/second.
-> speedup: +1%

speedup: as expected, Chrome does not care about "use asm", whereas in Firefox having the "use asm" statement leads to a +1000% speedup.

speed: at first, Chrome runs slower than Firefox, but afterwards, Chrome has the highest speed. Most likely the repeated use of the code triggers an extra optimization in Chrome after it "warms up".

Conclusion:

Use asm.js for a dramatic speedup in Firefox (+1000%).

Performance (2): Typed Arrays vs. normal arrays

We compare the speed with Typed Arrays & with normal arrays, on DFT1024. We replace:

// Using Typed Arrays

var float64 = new stdlib.Float64Array( heap );

...

dftrealflat_buffer = 
    new ArrayBuffer( dftrealflat_asmjsGen.buffer_bytes )

with:

// Using normal arrays

var float64 = heap;

...

dftrealflat_buffer = 
    new Array( dftrealflat_asmjsGen.count )

To have a meaningful comparison, we remove "use asm" in both cases, because the "normal array" version cannot be compiled as asm.js anyway.
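The idea of swapping only the storage, while keeping the computation identical, can be sketched on a toy function. This is an illustration with hypothetical names (`sumOfSquares`, `typedStore`, `normalStore`), not the DFT1024 code used in the test.

```javascript
// Same computation body for both variants: reads store[0..2], writes store[3].
function sumOfSquares( store )
{
    store[ 3 ] = store[ 0 ] * store[ 0 ]
        + store[ 1 ] * store[ 1 ]
        + store[ 2 ] * store[ 2 ];
}

var count        = 4;
var buffer_bytes = count * Float64Array.BYTES_PER_ELEMENT;

// Variant 1: Typed Array on top of an ArrayBuffer (zero-initialized).
var typedStore = new Float64Array( new ArrayBuffer( buffer_bytes ) );

// Variant 2: normal array. `new Array( count )` only creates empty slots,
// so we fill it explicitly to start from the same state.
var normalStore = new Array( count );
for (var i = 0; i < count; i++)
    normalStore[ i ] = 0;

[ typedStore, normalStore ].forEach( function ( store ) {
    store[ 0 ] = 1;
    store[ 1 ] = 2;
    store[ 2 ] = 3;
    sumOfSquares( store );
} );

console.log( typedStore[ 3 ], normalStore[ 3 ] ); // 14 14
```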

(Feel free to do it multiple times.)

(The speed measurement can last long in some browsers.)

Example of result:

__________
Firefox 32:

with normal array: speed: 2.00e+3 iterations/second.
with  Typed Array: speed: 2.14e+3 iterations/second.
-> speedup: +7%

with normal array: speed: 2.08e+3 iterations/second.
with  Typed Array: speed: 2.14e+3 iterations/second.
-> speedup: +3%

with normal array: speed: 2.11e+3 iterations/second.
with  Typed Array: speed: 2.12e+3 iterations/second.
-> speedup: +1%

__________
Chrome 38:

with normal array: speed: 1.06e+4 iterations/second.
with  Typed Array: speed: 1.13e+4 iterations/second.
-> speedup: +6%

with normal array: speed: 2.95e+4 iterations/second.
with  Typed Array: speed: 3.42e+4 iterations/second.
-> speedup: +16%

with normal array: speed: 2.95e+4 iterations/second.
with  Typed Array: speed: 3.48e+4 iterations/second.
-> speedup: +18%

speedup: almost none in Firefox, and about +15% to +20% in Chrome.

speed: Since "use asm" was removed for this comparison, Firefox runs slower than previously. Chrome exhibits the same "warm up" behaviour.

Conclusion:

Use Typed Arrays for a speedup in Chrome (+15% to +20%).
Coding for asm.js brings you this speedup as a by-product.

Performance (3): in-place output vs. new output array

We compare the speed of flatorize, which outputs a new array at each call,

return [ _1k, _c3, _4b ];

...with the speed of flatorize.getAsmjsGen(), which generates an in-place implementation with Typed Arrays:

float64[ 0 ] = _1k;
float64[ 1 ] = _c3;
float64[ 2 ] = _4b;

The speed tests run on DFT1024. Since flatorize uses normal arrays, and the two previous results showed that the array type affects the speed, we modify the code generated by flatorize.getAsmjsGen() so that it uses normal arrays as well. This keeps the comparison meaningful.
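The two output styles being compared can be sketched on a toy function, here a rotation of a 2D point by 90 degrees. Both functions are hypothetical hand-written examples using normal arrays, not code generated by flatorize.

```javascript
// Style 1: a new output array is allocated at each call.
function rotateNew( x, y )
{
    return [ -y, x ];
}

// Style 2: in-place output. Inputs and outputs live in one preallocated
// array: store[0..1] = input point, store[2..3] = output point.
function rotateInPlace( store )
{
    store[ 2 ] = -store[ 1 ];
    store[ 3 ] =  store[ 0 ];
}

// Style 1: each call returns a distinct array object.
var first  = rotateNew( 1, 2 );
var second = rotateNew( 1, 2 );
console.log( first, first !== second ); // [ -2, 1 ] true

// Style 2: no allocation per call, the same array is reused.
var store = [ 1, 2, 0, 0 ];
rotateInPlace( store );
console.log( store ); // [ 1, 2, -2, 1 ]
```

The second style avoids creating one array per call, which is the difference this test isolates.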

(Feel free to do it multiple times.)

(The speed measurement can last long in some browsers.)

Example of result:

__________
Firefox 32:

with new output array: speed: 2.05e+3 iterations/second.
with   in-place array: speed: 2.15e+3 iterations/second.
-> speedup: +5%

with new output array: speed: 2.09e+3 iterations/second.
with   in-place array: speed: 2.15e+3 iterations/second.
-> speedup: +3%

with new output array: speed: 2.09e+3 iterations/second.
with   in-place array: speed: 2.15e+3 iterations/second.
-> speedup: +3%

__________
Chrome 38:

with new output array: speed: 3.38e+3 iterations/second.
with   in-place array: speed: 4.81e+3 iterations/second.
-> speedup: +42%

with new output array: speed: 1.17e+4 iterations/second.
with   in-place array: speed: 2.94e+4 iterations/second.
-> speedup: +151%

with new output array: speed: 1.19e+4 iterations/second.
with   in-place array: speed: 3.01e+4 iterations/second.
-> speedup: +153%

speedup: very little in Firefox, but quite high in Chrome.

speed: Chrome exhibits the same "warm-up" behaviour as above. Interestingly, after the "warm-up", in-place arrays are even better optimized.

Performance: all together

We compare the speed of flatorize with the speed of flatorize.getAsmjsGen(), with all improvements activated ("use asm", Typed Arrays, in-place output).

(Feel free to do it multiple times.)

(The speed measurement can last long in some browsers.)

Example of result:

__________
Firefox 32:

flatorize              : speed: 2.12e+3 iterations/second.
flatorize.getAsmjsGen(): speed: 2.49e+4 iterations/second.
-> speedup: +1075%

flatorize              : speed: 2.12e+3 iterations/second.
flatorize.getAsmjsGen(): speed: 2.42e+4 iterations/second.
-> speedup: +1045%

flatorize              : speed: 2.12e+3 iterations/second.
flatorize.getAsmjsGen(): speed: 2.39e+4 iterations/second.
-> speedup: +1028%

__________
Chrome 38:

flatorize              : speed: 3.55e+3 iterations/second.
flatorize.getAsmjsGen(): speed: 6.05e+3 iterations/second.
-> speedup: +70%

flatorize              : speed: 1.23e+4 iterations/second.
flatorize.getAsmjsGen(): speed: 3.51e+4 iterations/second.
-> speedup: +186%

flatorize              : speed: 1.14e+4 iterations/second.
flatorize.getAsmjsGen(): speed: 1.33e+4 iterations/second.
-> speedup: +17%

flatorize              : speed: 1.23e+4 iterations/second.
flatorize.getAsmjsGen(): speed: 3.47e+4 iterations/second.
-> speedup: +183%

flatorize              : speed: 1.25e+4 iterations/second.
flatorize.getAsmjsGen(): speed: 3.54e+4 iterations/second.
-> speedup: +183%

Not much left to say: large speedups in both browsers, more than +1000% in Firefox and about +180% in Chrome once the code has "warmed up".

Conclusion

Writing asm.js code brings high speedups in Firefox and Chrome. flatorize.getAsmjsGen() conveniently generates such code for you.

See also: more speed tests of the various solutions (JS, asm.js, C...).

Additional remark: flatorize vs. flatorize.getAsmjsGen()

flatorize already generates very fast code (see the main article), and flatorize.getAsmjsGen() generates even faster code.

Usage trade-off: while flatorize always creates a new output array, flatorize.getAsmjsGen() uses side effects — in-place output — which requires slightly more care, but provides an extra speedup.
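The "slightly more care" needed with in-place output can be illustrated as follows: a view on the output area is overwritten by the next call, so a result that must survive has to be copied first. This is a toy sketch with hypothetical names, not flatorize code.

```javascript
// Shared storage: io[0] = input, io[1] = output.
var io = new Float64Array( 2 );

// In-place computation: output = input squared.
function squareInPlace()
{
    io[ 1 ] = io[ 0 ] * io[ 0 ];
}

io[ 0 ] = 3;
squareInPlace();

var view = io.subarray( 1, 2 ); // shares memory with `io`
var copy = io.slice( 1, 2 );    // independent copy of the current output

io[ 0 ] = 4;
squareInPlace(); // overwrites the output area

console.log( view[ 0 ], copy[ 0 ] ); // 16 9
```

With flatorize, which returns a new array at each call, this question does not arise.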

Additional remark: Firefox, asm.js and small tasks

The results above show an excellent extra speedup of flatorize.getAsmjsGen() over flatorize on a computationally intensive task like DFT1024.

However, on a smaller task like DFT16, in the Firefox case, you would get such a speedup only when the generated code is called from asm.js client code, and not when it is called from non-asm.js client code. See also this stackoverflow page.

Additional remark: JavaScript vs. C

On an Ubuntu laptop, for the (heavy) DFT1024 case, I measured a speed of about 47800 iterations per second for Chrome 39 and about 60000 iterations per second for clang (I had to give up on the increasingly unreliable GCC).

This is fast enough for me to do scientific computation in the browser, with a much faster and simpler development process (JavaScript) than in C.

How to run the speed test:

flatorize_asmjs on Chrome 39: open the present page with Chrome 39, go to any performance section (for example the first performance test), and run the test 3-4 times by clicking on the button that says "Measure the speed!". After the first run, Chrome should have optimized the code. Pick the median speed of the next few runs.

flatorize_c with clang: install V8, Python 3 and clang (for the latter two something like sudo apt-get install python3 and sudo apt-get install clang should be enough). In the command line do this:

  
jars@jars-desktop:~/gl/flatorize$ cd test/
jars@jars-desktop:~/gl/flatorize/test$ ./test_c_v8_speed.py
  

...and wait. This runs all necessary unit tests, then at the end the DFT1024 speed test. The final two lines should look like this:

test_v8_c_speed: (3) evaluate the speed of the C implementation of asmjs_dftreal1024

test_v8_c_speed done, speed in clang: 59856.2169386728 iterations/second = 65536 iterations / 1.094890445 seconds
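The speed on the last line is simply the number of iterations divided by the elapsed time. The short sketch below redoes this arithmetic and relates it to the Chrome 39 figure quoted above; the variable names are chosen for this example only.

```javascript
// Figures taken from the output line above.
var iterations = 65536;
var seconds    = 1.094890445;

var clangSpeed = iterations / seconds;
console.log( clangSpeed.toFixed( 1 ) ); // 59856.2

// Chrome 39 speed quoted earlier in this section (iterations/second).
var chromeSpeed = 47800;

// Chrome speed as a percentage of the clang speed.
var percentOfClang = Math.round( 100 * chromeSpeed / clangSpeed );
console.log( percentOfClang ); // 80
```

In other words, on this machine the JavaScript version reached about 80% of the speed of the C version.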

More speed tests and comparisons

More speed tests of the various solutions & languages (JS, asm.js, C...).
