We previously left TLS disabled in MODE=tiny in order to keep the
smallest example programs (e.g. life.com) just 16kb in size. Doing that
proved error prone, so TLS is now always enabled. This change uses
hacks to ensure that enabling it won't increase life.com's size.

This change also fixes a bug on NetBSD, where signal handlers would
break thread-local storage if SA_SIGINFO was being used. This looks
like it might be a bug in NetBSD itself, but it has a simple workaround.
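
For anyone wanting to check the NetBSD behavior by hand, the sketch
below shows one way the symptom could be probed. It is not the test or
the workaround from this change. The names tls_counter, handler_saw,
on_signal, and tls_survives_siginfo_handler are hypothetical, and it
assumes a C11 compiler with _Thread_local plus POSIX sigaction().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

// Hypothetical names, chosen for illustration only.
static _Thread_local volatile sig_atomic_t tls_counter;
static volatile sig_atomic_t handler_saw;

// Three-argument handler form, which is what SA_SIGINFO selects.
static void on_signal(int sig, siginfo_t *si, void *ctx) {
  (void)sig;
  (void)si;
  (void)ctx;
  handler_saw = tls_counter;      // read a thread-local inside the handler
  tls_counter = tls_counter + 1;  // write it back, still inside the handler
}

// Returns 1 if the thread-local was readable and writable from the
// SA_SIGINFO handler and kept its value afterwards, 0 if not, and -1
// if the handler could not be installed.
int tls_survives_siginfo_handler(void) {
  struct sigaction sa, old;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = on_signal;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGUSR1, &sa, &old) == -1) return -1;
  tls_counter = 41;
  handler_saw = 0;
  raise(SIGUSR1);  // in a single-threaded caller, the handler runs before raise() returns
  sigaction(SIGUSR1, &old, 0);  // restore the previous disposition
  return handler_saw == 41 && tls_counter == 42;
}
```

Calling tls_survives_siginfo_handler() from main() and checking for a
return value of 1 is enough to exercise the path described above.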

This change simplifies the thread-local storage support code. On
Windows and Mac OS X, the startup latency of __enable_tls() has been
reduced from 30ms to 1ms. On Windows, TLS memory accesses will now go
much faster due to better self-modifying code, which avoids a function
call and acquires our thread information block pointer in a single
instruction.