## Devectorize -- A Julia Framework for De-vectorized Evaluation

*Devectorize* is a Julia framework, which provides macros and functions to *de-vectorize* a vector expression. With *Devectorize*, users can write computations in high-level vectorized way and at the same time enjoying the high run-time performance of de-vectorized loops. *Devectorize* automatically translates vectorized expressions into faster tight-loops, which often results in 2x - 8x performance gain.

*Devectorize*

Why In many programming languages (including Julia), expressions are immediately evaluated upon construction. This simple strategy often results in less than optimal behaviors, which, for example, include *creation of unnecessary temporaries* and *repeated memory round-trips*. Consider the following example,

`r = a .* b + c .* d + a`

With immediate evaluation, three temporaries, respectively for storing the results of `a .* b`

, `c .* d`

, and `a .* b + c .* d`

. Also, the array `a`

will be traversed twice. Moreover, computation on large arrays is often memory-bound -- the run-time performance largely depends on how many times you have to scan the arrays.

For the formula above, a much more efficient way to evaluate it can be expressed using for-loops as follows

```
n = length(a)
r = zeros(n)
for i = 1 : n
r[i] = a[i] * b[i] + c[i] * d[i] + a[i]
end
```

With this piece of code, you can get all the results in one pass, without creating any temporary arrays. However, low-level for-loops are often much longer and more difficult to read, write, and maintain.

Is it possible to combine the elegance of high-level expressions and the performance of low-level for-loops?

The answer is *Yes*. Let's look at the examples above, we can hold off the evaluation of all the temporaries until the assignment to `r`

happens -- at this point, an integrated loop is emitted to compute all results in one pass.

The powerful meta-programming framework of Julia makes it possible to achieve this goal using incredibly simple syntax. Taking advantage of this framework, *Devectorize* provides a macro `@devec`

:

`@devec r = a .* b + c .* d + a`

This statement is exactly the same as the one we saw above -- except for the macro `@devec`

, which performs all the magic of translating the formula into a one-pass loop behind the scenes.

The remaining part is organized into two section: **Basic Usage**, which introduces how to use *Devectorize* to improve the performance of your code, and **Design of the Framework**, which provides a brief overview of the framework and its structures.

## Basic Usage

You may install the package using Julia's official package manager:

`Pkg.add("Devectorize")`

To keep it always updated, you may have to switch Devectorize to the `master`

branch, and `git pull`

the latest commits.

For ordinary use, you only have to remember one macro -- `@devec`

. Putting it before the assignments that you want to *de-vectorize*, it will automatically translate your expressions into efficient loops. For example, you can write

`@devec r = exp(a + b) .* sum(c)`

To inspect the code generated by *Devectorize*, you can use the `@inspect_devec`

macro:

`@inspect_devec r = exp(a + b) .* sum(c)`

This statement will prints the generated codes (prior to evaluating them).

### Benchmark

Here is a table of benchmark results on some typical cases.

julia vec | @devec | hand-coded loop | |
---|---|---|---|

simple-ewise | 1.0000 | 2.6032x | 2.5719x |

complex-ewise | 1.0000 | 2.4581x | 2.4364x |

shift-dot | 1.0000 | 8.3237x | 8.2959x |

colwise-sum | 1.0000 | 1.3321x | 1.2771x |

rowwise-sum | 1.0000 | 4.2736x | 4.2444x |

colwise-eucdist | 1.0000 | 5.6502x | 5.5356x |

*The result was obtained with Julia commit 3f92b13210 (2013-02-03) on Mac OS X 10.8, using the script test/bench_devec.jl, which comes with the Devectorize package.*

Here, we use vectorized Julia code as the baseline, and report the performance gains (for example, if the baseline takes 1 sec, and devec takes 0.5sec, then the gain is 2x). We can see that codes tagged with the `@devec`

macro typically performs 2x to 5x faster than vectorized codes, and is comparable (sometimes even slightly faster than) a hand-coded for loop.

It is important to note that *Devectorize* only recognizes a subset of expressions of Julia (but the most commonly used subset), as listed below.

### Element-wise map of numbers and arrays

```
@devec r = a + b + c
@devec r = sin(a) + exp(a + 1.0) .* log(c)
```

Here is the list of operators and functions currently supported by *Devectorize*:

```
+, -, .+, .-, .*, ./, .^, max, min, clamp, blend,
.==, .!=, .<, .>, .<=, .>=,
sqrt, cbrt, sqr, rcp, floor, ceil, round, trunc,
exp, log, log10, exp2, log2, expm1, log1p,
sin, cos, tan, asin, acos, atan,
sinh, cosh, tanh, asinh, acosh, atanh,
erf, erfc, gamma, lgamma, digamma
```

**Notes:**

Operator

`*`

and`/`

are not supported, as they entail complex semantics depending on the arguments which may only be known at run-time. Users can use`.*`

and`./`

to express element-wise multiplication and division, which are perfectly supported in*Devectorize*.These three functions:

`sqr`

(x -> x * x),`rcp`

(x -> 1 / x), and`blend`

((c, x, y) -> c ? x : y) are not in the Base module of Julia. They are provided by*Devectorize*as extensions to make it easier to write vectorized expressions (and then`@devec`

it).

### Simple references

```
@devec r = x[:,1] + y[:,2]
@devec r = a[i,:] .* b
@devec r[:,j] = x + sin(a[:,j])
```

Simple reference here means the reference expressions in either of the following forms:
``a[:], a[i], a[i, j], a[:, :], a[:, i], a[i, :]`

, where i can be either an integer or a symbol that refers to an integer variable. Reference expressions can appear in both left and right hand side of an assignment. Support of more flexible references is planned for future releases.

Note that when you write

`r = a + b .* c`

*Devectorize* will emit codes that creates an array to store the results and bound it to `r`

, this process may entails some overhead of inferring the type and shape of the result and creating a new array. When `r`

has been created, you may eliminate such runtime overheads by writing

```
r[:] = a + b .* c
```

Then, the results will be directly written to `r`

, and no array will be created before evaluation.

### Op-assignment

```
@devec r += a
@devec r[:,i] .*= sin(a)
```

*Devectorize* will automatically translate them into ordinary assignment expressions.

### Full/Colwise/Rowwise reduction

```
@devec r = sum(a + b)
@devec r = maximum(sin(a), 1)
@devec r[:,j] = mean(a, 2)
```

*Devectorize* currently recognizes five reduction functions `sum`

, `maximum`

, `minimum`

, `mean`

, and `dot`

.

### Hybrid expressions

Consider the example below,

`@devec r = (a - sum(a)) .* b`

This seemingly simple expression actually requires two loops to evaluate, one for computing `sum(a)`

, and the other for the top-level element-wise expression. *Devectorize* recognizes such situations, and will emit correct codes to perform the evaluation. For the example above, *Devectorize* will first break the expression into two ones, as

```
@devec tmp1 = sum(a)
@devec r = (a - tmp1) .* b
```

Note that *Devectorize* only breaks expressions only when it is really necessary to do so, and tries to generate as few memory traversals as possible.

### Block expressions

```
@devec begin
a = sin(x) - cos(y)
b = sum(a) + exp(z)
c = x .* y - maximum(b)
end
```

In current implementation, *Devectorize* simply de-vectorizes each assignment expression respectively. In future version, it may use a more sophisticated algorithm to identify opportunities of sharing some computation across expressions.

## Design of the Framework

In *Devectorize*, the process of translating a Julia expression into de-vectorized codes goes through two stages:

- translate a Julia expression to a typed expression (enriched with semantic information), using
`texpr`

compile the typed expression into de-vectorized codes, using

`compile`

, which itself takes three steps:- decompose a given expression into a sequence of basic expressions (e.g. break a hybrid expression or a block expression)
- compose loops for each basic expression via a
*back-end*factory, using`compose`

- integrate all generated loops into a code block and return

### Typed expressions

Julia front-end parses any input expression into an instance of `Expr`

, which contains only syntatic information but not semantic information. To generate the code, one has to first understand the *semantics* (i.e. *meaning*) of the expression, e.g. whether it is doing a reduction or a element-wise transformation.

To express the *semantics* of an expression, *Devectorize* establishes a type hierarchy in `src/texpr.jl`

. The hierarchy can be briefly summarized as follows

```
TExpr
-- TEWise # everything can serve as an element-wise argument
-- TScalar # everything that is sure to be a scalar
-- TNum # numerical literals, e.g. 1, 2.0, ...
-- TScalarSym # a symbol that is known to be a scalar (e.g. result of a full reduction)
-- TRefScalar1 # a[i]
-- TRefScalar2 # a[i,j]
-- TSym # a symbol that refers to a variable (can be an array or a scalar)
-- TRef
-- TRef1D # a[:]
-- TRef2D # a[:,:]
-- TRefCol # a[:,j]
-- TRefRow # a[i,:]
-- TMap # element-wise map, e.g. sin(a), a + b, a + b .* c, ...
-- TReduc # full reduction, e.g. sum(a), sum(a + b .* c), ...
-- TColwiseReduc # column-wise reduction
-- TRowwiseReduc # row-wise reduction
-- TAssign # asssignment, e.g. a = sin(x), r[:,i] = a + cos(x[:,j]), ...
-- TBlock # a block of expressions
```

The function `texpr`

(also defined in `src/texpr.jl`

) takes an instance of `Expr`

as an argument, analyzes it. If the expression is recognized, it returns a typed expression (i.e. an instance of `TExpr`

), otherwise, it raises an error (to be more specific, throws an exception of type `DeError`

.)

The analysis performed in `texpr`

relies on the semantic information provided by the functions in `src/fun_traits.jl`

. These functions can tell you `sin`

is an element-wise mapping that takes one argument, while `sum`

is a reduction. They also tell you result type information, e.g. the element type of `a + b`

is `promote_type(eltype(a), eltype(b))`

, while that for `.==`

is `Bool`

.

### Contexts

To make the framework extensible, *Devectorize* introduces the notion of *Context*, which refers to a specific setting in which the codes are generated (e.g. CPU, SIMD, CUDA, OpenCL, etc)

The abstract type *EvalContext* (in `src/compile_base.jl`

) is used as the super class of all contexts. In current version, *Devectorize* provides a specific context type, namely `ScalarContext`

, in which expressions are mapped to de-vectorized for-loops.

In future, other contexts might be introduced (e.g. SIMD and CUDA), thus providing users options to choose specific ways to emit the evaluation code for their expressions.

### Compilation

The function `compile`

takes two arguments: a context and a typed expression, and returns a the generated codes. Generally, this function is a driver, which actually delegates the code generation to two functions: `compose_init`

(for generating codes for initialization) and `compose_main`

(for generating the main loops). These two functions are provided by specific back-ends.

To reduce the complexity of back-end implementation, the `compile`

function performs some preprocessing, which includes

- translates blocks and hybrid expressions into a sequence of basic expressions
- identifies trivial assignments (i.e.
`a = b`

), and simply emits it (as`a = b`

). Note that this simply bounds the name`a`

to the object referred by`b`

, which does not involve any*real computation*. - take precautions to prevent potential alias problems. For example, it translates
`a = b + sin(a)`

into two statements,`tmp = b + sin(a)`

, and a trivial assignment`a = tmp`

. The temporary name is generated using`gensym`

to avoid collision with other names.

After this processing, the back-end can be implemented in a much simpler way, without taking into account such intricacies.

The functions to generate codes for `ScalarContext`

are in `src/scalar_backend.jl`

.

### Code composition

The routines in `src/scalar_backend.jl`

uses *recursive kernel composition* to generate loop kernels.

Take the expression `a + b .* c`

for example. It first generates `get_value(a, i)`

, `get_value(b, i)`

, and `get_value(c, i)`

for the terminals `a`

, `b`

, and `c`

. Here, `get_value`

is an overloaded function to ensure correct behavior for different cases (e.g. `a`

can be either a scalar or an array).

For `b .* c`

, it takes the generated kernel for `b`

and `c`

(as above), combines them with the operator `.*`

, and then emits `.*(get_value(b, i), get_value(c, i)`

. Likewise, for `a + b .* c`

, it emits `+(get_value(a, i), .*(get_value(b, i), get_value(c, i))`

.

The `compose_main`

function will generate a loop that uses the generated kernel as the loop body.