Tim Disney

Hygiene in sweet.js

The most important feature of sweet.js is hygiene. Hygiene prevents variable names inside of macros from clashing with variables in the surrounding code. It’s what gives macros the power to actually be syntactic abstractions by hiding implementation details and allowing you to use a hygienic macro anywhere in your code.

For hygiene to work, sweet.js must rename variables. Recently several people have asked me why sweet.js renames all the variables. Wouldn’t it be better and cleaner to only rename the variables that macros introduce?

The tl;dr is “because hygiene” but let’s unpack that a little.

Hygiene Part 1 (Binding)

The part of hygiene most people intuitively grok is keeping track of the variable bindings that a macro introduces. For example, the swap macro creates a tmp variable that should only be bound inside of the macro:

macro swap {
  rule { ($a, $b) } => {
    var tmp = $a;
    $a = $b;
    $b = tmp;
  }
}

var tmp = 10;
var b = 20;
swap (tmp, b)

Hygiene keeps the two tmp bindings distinct by renaming them both:

var tmp$1 = 10;
var b$2 = 20;

var tmp$3 = tmp$1;
tmp$1 = b$2;
b$2 = tmp$3;

This is the point where most people ask, “Why bother renaming the variables outside of the macro?” Can’t you just rename the bindings created by the macro? Wouldn’t it be cleaner for the expansion to just be something like:

var tmp = 10;
var b = 20;

var tmp$1 = tmp;
tmp = b;
b = tmp$1;

Hygiene Part 2 (Reference)

The complication comes in with variable references. The body of a macro can contain references to bindings declared outside of the macro, and those references must be consistent no matter the context in which the macro is invoked.

Some code to clarify. Let’s say you have a macro that uses a random number function:

var random = function(seed) { /* ... */ }
let m = macro {
    rule {()} => {
        var n = random(42);
        // ...
    }
}

This macro needs to refer to random in any context in which it gets invoked. But that context could have a different binding for random!

function foo() {
    var random = 42;
    m ()
}

Hygiene needs to keep the two random bindings different. So sweet.js will expand this into something like:

var random$1 = function(seed$4) { /* ... */ }
function foo() {
    var random$2 = 42;
    var n$3 = random$1(42);
    // ...
}

Note that there is no way for hygiene to do this if it only renamed identifiers inside of macros, since both random bindings were declared outside of the macro. Hygiene is necessarily a whole-program transformation.

(P.S. If this feels sort of like a closure, you’re on to something: one of the early techniques that led to modern hygiene algorithms was called syntactic closures.)

Strictly speaking, the hygiene algorithm is still conservative. Variable bindings declared outside of a macro that are never referenced by a macro don’t really need to be renamed. However, modifying the hygiene algorithm to rename exactly what needs to be renamed seems pretty difficult (especially to do efficiently). If anyone knows techniques for this, definitely let me know (or even better, submit a pull request).

How to read macros

In my last post I gave a little overview of sweet.js, the hygienic macro system I built over the summer. Today I want to write a little bit about what makes sweet.js possible and why we haven’t really seen a macro system for JavaScript before now. I gave hints at some of this in my intern talk, but now we can finally do a deep dive!

Basics

First, let’s take a look at compilers 101:

[Figure: Parser Pipeline — code → lexer → token array → parser → AST]

The traditional way you build a compiler front end is to first write a lexer and a parser. Code (a string) is fed to the lexer, which produces an array of tokens, which gets fed to the parser, which produces an Abstract Syntax Tree (AST).

For example,

if (foo > bar) {
    alert("w00t!");
}

gets lexed into something like:

["if", "(", "foo", ">", "bar", ")", "{" ...]

The lexer is basically responsible for throwing away unnecessary whitespace and grouping identifiers, strings, and numbers into discrete chunks (i.e., tokens). The array of tokens is then parsed into an AST that might look something like this:

// output of esprima
{
    "type": "Program",
    "body": [
        {
            "type": "IfStatement",
            "test": {
                "type": "BinaryExpression",
                "operator": ">",
                "left": {
                    "type": "Identifier",
                    "name": "foo"
                },
                "right": {
                    "type": "Identifier",
                    "name": "bar"
                }
            },
            "consequent": {
                "type": "BlockStatement",
                "body": [{
                        "type": "ExpressionStatement",
                        "expression": {
                            "type": "CallExpression",
                            "callee": {
                                "type": "Identifier",
                                "name": "alert"
                            },
                            "arguments": [{
                                    "type": "Literal",
                                    "value": "w00t!",
                                    "raw": "\"w00t!\""
                                }]
                        }
                    }]
            },
            "alternate": null
        }
    ]
}

The AST gives you the structure necessary to do code optimization/generation etc.
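
For reference, an AST like the one above can be produced directly with esprima (assuming it’s installed from npm):

// parse the example snippet and print its AST as JSON
var esprima = require("esprima");
var ast = esprima.parse('if (foo > bar) { alert("w00t!"); }');
console.log(JSON.stringify(ast, null, 4));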

So where can we fit macros into this picture? Which representation is best for macros to do their stuff?

Well, by the time we get to an AST it’s too late, since parsers only understand a fixed grammar (well, technically there is research on adaptive/extensible grammars, but that way leads to madness!). Obviously the raw code as a string is too unstructured for macros, so how about the array of tokens produced by the lexer?

Tokens are fine for cpp #define-style macros, but we want moar power! And, as it turns out, plain tokens aren’t going to cut it for us. Consider this simple macro that provides a concise way to define functions:

macro def {
    case $name $params $body => {
        function $name $params $body
    }
}
def add(a, b) {
    return a + b;
}

which should be expanded into:

function add(a, b) {
    return a + b;
}

Critically, note that the macro needs to match $params with (a, b) and $body with { return a + b; }. However, we don’t have enough structure to do this with just the token array ["def", "add", "(", "a", ",", "b", ...]: we need to match the delimiters ((), {}, and []) first!

If you remember your compilers class (or went to Wikipedia), delimiter matching is what separates context-free languages (what parsers recognize) from regular languages (what lexers recognize).

This is one of the reasons why macros are big in the lisp family of languages (scheme, racket, clojure, etc.). S-expressions (with (all (those (parentheses)))) are already fully delimited, so it becomes almost trivial to do delimiter matching. Some people say this is due to homoiconicity, but as Dave Herman pointed out, homoiconicity isn’t really the point. It’s not that the lisp family is homoiconic but rather that the nature of s-expressions makes it easy to implement read, which is necessary for macros.

read is the crucial function that gives a little bit more structure to the array of tokens by matching up all those delimiters. Now instead of just a flat array of tokens we are going to get a read tree:

["def", "add", {
    type: "()",
    innerTokens: ["a", ",", "b"]
    }, {
    type: "{}",
    innerTokens: ["return", "a", "+", "b"]
}]

Note that this doesn’t have all the structure of a full AST; it just knows about tokens and delimiters, not expressions and statements. So now our def macro pattern variables will match up like so:

$params -> {
    type: "()",
    innerTokens: ["a", ",", "b"]
}
$body -> {
    type: "{}",
    innerTokens: ["return", "a", "+", "b"]
}
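
As a rough illustration, a minimal delimiter-matching read might look like this (a hypothetical sketch, not sweet.js’s actual reader — and note it punts entirely on the / problem discussed below):

// A toy reader: fold a flat token array into a read tree by
// matching up the (), {}, and [] delimiters.
var matching = { "(": ")", "{": "}", "[": "]" };
var delimType = { ")": "()", "}": "{}", "]": "[]" };

function read(tokens) {
    var stack = [[]];   // stack of in-progress token trees
    var closers = [];   // closing delimiters we expect to see
    tokens.forEach(function(tok) {
        if (matching[tok]) {
            // open delimiter: start collecting inner tokens
            stack.push([]);
            closers.push(matching[tok]);
        } else if (tok === closers[closers.length - 1]) {
            // matching close delimiter: wrap up the inner tokens
            closers.pop();
            var inner = stack.pop();
            stack[stack.length - 1].push({ type: delimType[tok], innerTokens: inner });
        } else {
            stack[stack.length - 1].push(tok);
        }
    });
    return stack[0];
}

read(["def", "add", "(", "a", ",", "b", ")", "{", "return", "a", "+", "b", "}"]);
// => the read tree shown above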

Great! Now our pipeline looks something like:

[Figure: Pipeline With Read — code → lexer → token array → read → read tree → parser → AST]

The Devil in the Details

OK, so this is all well and good, but then why haven’t we seen people implement read and build a macro system for JavaScript before now?

It turns out that there’s this really annoying token (for potential macro implementers) in JavaScript: /.

Here’s the problem: depending on context, / can mean two different things: the divide operator or the beginning of a regular expression literal.

10 / 2              // 5
/foo/.test("foo")   // true

(Well, technically I guess / can also mean the start of a comment, but that case is always easy to figure out, since // always means a line comment.)

So how do we disambiguate between divide and regex? It turns out that the way a normal parser (like esprima, for example) does it is by running the lexer and parser together and resolving the ambiguity via the current parsing context. In other words, as the parser works through each production, it calls out to the lexer with a flag saying what context it is in. Depending on that context, the lexer will either lex / as a divide token or as a regular expression literal.
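
A hedged sketch of what that flag-passing looks like, with hypothetical names (this is not esprima’s actual code, and it ignores regex character classes and flags for simplicity):

// Lex a "/" at position idx of src; the parser tells the lexer (via
// regexAllowed) whether a regex literal is allowed in the current context.
function lexSlash(src, idx, regexAllowed) {
    if (!regexAllowed) {
        return { type: "Punctuator", value: "/" };
    }
    // scan forward for the closing, unescaped "/" of a regex literal
    var end = idx + 1;
    while (end < src.length && src[end] !== "/") {
        if (src[end] === "\\") end++;   // skip escaped characters
        end++;
    }
    return { type: "RegularExpression", value: src.slice(idx, end + 1) };
}

lexSlash("10 / 2", 3, false);            // => { type: "Punctuator", value: "/" }
lexSlash('/foo/.test("foo")', 0, true);  // => { type: "RegularExpression", value: "/foo/" }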

But, we can’t use the parsing context in read because we don’t have any parsing context yet!

So, we somehow need to separate the lexer/reader from the parser.

Now you might think we could get away with just leaving / as an ambiguous token (say, a divOrRegex token) to be handled by the parser once all the macros have been expanded away, but consider this code fragment we might want to read:

... { /foo}bar/ ...
// as a token array this would be
// [... "{", "/", "foo", "}", "bar", "/", ...]

Remember that the entire point of read is to do delimiter matching, so should we match the } with the opening { or as part of a regular expression literal (remember /}/ is a valid regex that matches a single })? It completely depends on our interpretation of /!

Therefore, in our lexer/reader we must disambiguate the meaning of / without the help of the parser. So how do we do that?

This is the hard technical problem that Paul Stansifer (who also designed the Rust macro system) solved this summer, unlocking the power of JavaScript macros for us all!

The basic idea is that when you see a / as you are reading, you just look back a couple of tokens, and a small fixed set of tokens will determine unambiguously whether / should be a divide or the start of a regex literal. Figuring out exactly how far back and which tokens to look for requires working through all the different cases in the JavaScript grammar, which is hard but done!

A snippet of this algorithm goes something like:

if tok is /
    if tok-1 is )
        look back to matching (
        if identifier before ( in "if" "while"
                                  "for" "with"
            tok is start of regex literal
        else
            tok is divide
    ...

For example, if we have:

if (foo + 24 > bar) /baz/.test("foo")

When we see the /, we note that the previous token was ), so we find its matching ( and check the token before it. Since that token is if, the / must be the beginning of a regular expression literal.

What’s really cool here is that when we need to disambiguate /, we’ve already been reading up to that point, so (foo + 24 > bar) is a single token (the () token with inner tokens foo, +, 24, >, and bar), and checking the token before the parens is literally as simple as tokens[idx-2] === "if". By creating the read tree as we go along, we don’t need to carry lookbehind state in a complicated way; in fact, in the worst case, we only have to look back 5 tokens.
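
Here’s a hedged sketch of that check in JavaScript, assuming toks is the read tree built so far (illustration only — the real algorithm covers many more cases):

// Decide whether a "/" we just hit starts a regex literal, by looking
// back at the tokens already read. Only the ")" case is shown.
function isRegexAfterParens(toks) {
    var prev = toks[toks.length - 1];
    if (prev && prev.type === "()") {
        // the parens are one read-tree token, so "the token before the
        // matching (" is just one more step back
        var before = toks[toks.length - 2];
        return before === "if" || before === "while" ||
               before === "for" || before === "with";
    }
    return false;   // other cases of the full algorithm elided
}

// reading: if (foo + 24 > bar) /baz/.test("foo")
isRegexAfterParens(["if", { type: "()", innerTokens: ["foo", "+", "24", ">", "bar"] }]);
// => true, so the "/" starts a regex literal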

If you want to read more about how this works, I’ve got the entire algorithm pseudo-coded up here and the actual JavaScript implementation in these two relatively short functions.

Hygienic Macros for JavaScript

Been slow to embloggen this (start of the quarter, etc.), but my summer intern project was released a little while ago at sweetjs.org. Sweet.js is a hygienic macro compiler for JavaScript that takes JavaScript written with macros and produces normal JavaScript you can run in your browser or on node. It’s an experiment to see if macros can work well for JavaScript.

Macros allow you to extend the syntax of JavaScript by writing macro definitions that perform syntax transformations, allowing you to do cool things like add sweet syntax for var destructuring or even Haskell-like do notation.

The idea is to provide a middle ground between no syntax extension and having to build an entire compile-to-js language like CoffeeScript, which can’t compose with other syntax extensions.

I talk a bit about the motivation and design in this presentation I gave at the end of my internship.

The language that most directly inspires sweet.js is, of course, scheme. I’m not really much of a schemer myself but have always admired the fancy work going on in that world, and macros are pretty fancy. At the start of the summer I was still somewhat of a macro newbie, but being at Mozilla was fantastic since I was able to draw on and learn from two well-versed scheme macrologists: Dave Herman (whose PhD thesis was on macros) and Paul Stansifer (who has been developing the Rust macro system).

Sweet.js is still in very early stages, with lots of bugs and missing features, but I think it shows some real promise. Let me know what you think!

Memory Checking in Low-Level JavaScript

So this past month I’ve been helping out on the Low-Level JavaScript (LLJS) project. Bit of a change for me since I usually hide out at as high a level as I can possibly get :)

LLJS is an experiment in adding low-level features like pointers and manual memory management to a dialect of JavaScript. These low-level features are pretty cool and can get you some nice performance wins, but they also come at a cost since you are always one null pointer dereference or memory leak away from oblivion.

One way C/C++ programmers have handled this danger is by using an analysis tool like Valgrind to detect memory errors that happen while the program is running. Since memory checking has proven to be useful in the C world, we figured it might be helpful for LLJS. So, I’ve hacked up a Valgrind-like memory analysis for LLJS. It can detect four kinds of memory errors:

  • reads/writes to unallocated memory (pointers we haven’t malloc‘ed)
  • reads from undefined memory (malloc‘ed memory we haven’t written to)
  • bad frees (calling free on memory that hasn’t been malloc‘ed)
  • leaks (any unfreed memory at the end of the program run)

How to use

Go grab the latest version of LLJS from its GitHub repo. Then add the following snippet to the end of your script:

let m = require('memory');
console.log(m.memcheck.report());

This pulls in the memory module and prints out the result of the memory checking. (This snippet is for node.js; use load if you’re on SpiderMonkey, or grab memory off the global object if you’re in the browser.)

Then compile your LLJS code with the -m flag:

bin/ljc -m -o your_file.js your_file.ljs

and run it:

node --harmony-proxies your_file.js

As you can see, the memory checker uses Proxies, which node/V8 hides behind a flag (no flag needed on SpiderMonkey).

For example, the following script:

extern console;

function test_unallocated() {
  let uint *x;
  // not allocated yet!
  return *x;
}
test_unallocated();

function test_leak() {
  // leak!
  let uint *leak = new uint;
}
test_leak();

function test_free() {
  let uint *x = new uint;
  delete x;
  delete x;
}
test_free();

let m = require('memory');
console.log(m.memcheck.report());

will print out the following report:

Unallocated:
0 at:
    test_unallocated:3:0

Undefined:
0 at:
    test_unallocated:3:0

Bad frees:
8184 at:
    test_free:16:0

Leaks:
8200 at:
    test_leak:10:0

How it works

Memory allocation in LLJS is implemented as a big typed array, where pointers are really just indexes into that array.

To do memory checking, we create a second shadow memory typed array that stores the allocation status of each byte of real memory (whether the byte has been allocated, initialized, etc.). Calls to malloc and free change the allocation status of the bytes in shadow memory. The real memory array is wrapped in a Proxy that intercepts each get and set operation and checks the shadow memory for a possible error.

For example, say we allocate an int and then do some pointer arithmetic and dereference unallocated memory:

let int *x = new int;
*(x+1)

When the dereference of x+1 happens, the Proxy wrapping real memory will check in shadow memory to see if x+1 has been allocated. Since it hasn’t, an unallocated memory access error is recorded.

This approach is basically a simplified version of how Valgrind does memory checking.
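
Here’s a hedged sketch of the idea using the modern Proxy API (every name here is illustrative; the real LLJS implementation differs, and at the time it used the older harmony proxies API):

// Shadow memory tracks the allocation status of each byte of "real" memory.
var UNALLOCATED = 0, ALLOCATED = 1, DEFINED = 2;

var memory = new Uint8Array(1024);
var shadow = new Uint8Array(1024);
var errors = [];

// Only intercept numeric (byte index) property accesses.
function byteIndex(prop) {
    return typeof prop === "string" && /^\d+$/.test(prop) ? Number(prop) : null;
}

// Wrap real memory in a Proxy that consults shadow memory on every access.
var checked = new Proxy(memory, {
    get: function(target, prop) {
        var idx = byteIndex(prop);
        if (idx !== null) {
            if (shadow[idx] === UNALLOCATED) errors.push("read of unallocated memory at " + idx);
            else if (shadow[idx] === ALLOCATED) errors.push("read of undefined memory at " + idx);
        }
        return target[prop];
    },
    set: function(target, prop, value) {
        var idx = byteIndex(prop);
        if (idx !== null) {
            if (shadow[idx] === UNALLOCATED) errors.push("write to unallocated memory at " + idx);
            else shadow[idx] = DEFINED;   // a write initializes the byte
        }
        target[prop] = value;
        return true;
    }
});

shadow[8] = ALLOCATED;   // pretend malloc handed out the byte at index 8
checked[8];              // read of malloc'ed-but-unwritten memory: recorded
checked[9] = 1;          // write to memory we never allocated: recorded
console.log(errors);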

Stable node.js now runs contracts.coffee!

Took almost a year but now the stable version of node.js (0.8.0+) has a recent enough version of V8 to run contracts.coffee! The much needed proxy support is still behind a flag so you’ll need to supply --harmony_proxies and --harmony-collections as usual.

Also, the latest version of contracts.coffee (0.3.1) is now up on npm and includes a few bug fixes.
