From 41cc053419d57b38f8bab71004488bcaad00036f Mon Sep 17 00:00:00 2001
From: Justine Tunney
Date: Sun, 21 Jul 2024 20:47:24 -0700
Subject: [PATCH] Add more content to APE specification

---
 ape/specification.md | 408 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 408 insertions(+)

diff --git a/ape/specification.md b/ape/specification.md
index 95afcc12e..94e8de308 100644
--- a/ape/specification.md
+++ b/ape/specification.md
@@ -269,3 +269,411 @@ regcomp(&rx,
 
 For further details, see the canonical implementation in
 `cosmopolitan/tool/build/assimilate.c`.
+
+## Static Linking
+
+Actually Portable Executables are always statically linked. This
+revision of the specification does not define any facility for storing
+code in dynamic shared objects.
+
+Cosmopolitan Libc provides a solution that enables APE binaries to have
+limited access to dlopen(). By manually loading a platform-specific
+executable and asking the OS-specific libc's dlopen() to load
+OS-specific libraries, it becomes possible to use GPUs and GUIs. This
+has worked great for AI projects like llamafile.
+
+There is no way for an Actually Portable Executable to interact with
+OS-specific dynamic shared object extension modules to programming
+languages. For example, a Lua interpreter compiled as an Actually
+Portable Executable would have no way of linking extension libraries
+downloaded from the LuaRocks package manager. This is primarily because
+different OSes define incompatible ABIs.
+
+While it was possible to polyglot PE+ELF+MachO to create multi-OS
+executables, it simply isn't possible to do that same thing for
+DLL+DYLIB+SO. Therefore, in order to have DSOs, APE would need to either
+choose one of the existing formats or invent one of its own, and then
+develop its own parallel ecosystem of extension software. In the future,
+the APE specification may expand to encompass this. However, the focus to
+date has been exclusively on executables with limited dlopen() support.
+
+## Application Binary Interface (ABI)
+
+APE binaries use the System V ABI, as defined by:
+
+- [System V ABI - AMD64 Architecture Processor Supplement](https://gitlab.com/x86-psABIs/x86-64-ABI)
+- AARCH64 has a uniform consensus defined by ARM Limited
+
+There are, however, a few changes we've had to make.
+
+### Thread Local Storage
+
+#### aarch64
+
+Here's the TLS memory layout on aarch64:
+
+```
+           x28
+        %tpidr_el0
+            │
+            │    _Thread_local
+        ┌───┼───┬──────────┬──────────┐
+        │tib│dtv│  .tdata  │  .tbss   │
+        ├───┴───┴──────────┴──────────┘
+        │
+    __get_tls()
+```
+
+The ARM64 code in Actually Portable Executables uses the `x28` register
+to store the address of the thread information block. All aarch64 code
+linked into these executables SHOULD be compiled with `-ffixed-x28`
+which is supported by both Clang and GCC.
+
+The runtime library for an Actually Portable Executable MAY choose to
+use `tpidr_el0` instead, if OSes like MacOS aren't being targeted. For
+example, if the goal is to create a Linux-only fat binary linker program
+for Musl Libc, then choosing to use the existing `tpidr_el0` convention
+would be a friction-free alternative.
+
+It's not possible for an APE runtime that targets the full range of OSes
+to use the `tpidr_el0` register for TLS because Apple won't allow it. On
+MacOS ARM64 systems, this register can only be used by a runtime to
+implement the `sched_getcpu()` system call. It's reserved by MacOS.
+
+#### x86-64
+
+Here's the TLS memory layout on x86_64:
+
+```
+                          __get_tls()
+                              │
+                             %fs OpenBSD/NetBSD
+            _Thread_local     │
+    ┌───┬──────────┬──────────┼───┐
+    │pad│  .tdata  │  .tbss   │tib│
+    └───┴──────────┴──────────┼───┘
+                              │
+   Linux/FreeBSD/Windows/Mac %gs
+```
+
+Quite possibly the greatest challenge in getting Actually Portable
+Executable working has been overcoming the incompatibilities between
+OSes in how thread-local storage works on x86-64. The AMD64
+architecture defines two special segment registers.
+Every OS uses one of these segment registers
+to implement TLS support. However, not all OSes agree on which register
+to use. Some OSes grant userspace the power to define either of these
+registers to hold any value that is desired. Some OSes only effectively
+allow a single one of them to be changed. Lastly, some OSes, e.g.
+Windows, claim ownership of the memory layout these registers point
+towards too.
+
+Here's a breakdown of how much power is granted to userspace runtimes by
+each OS when it comes to changing amd64 segment registers.
+
+|         | %fs          | %gs          |
+|---------|--------------|--------------|
+| Linux   | unrestricted | unrestricted |
+| MacOS   | inaccessible | unrestricted |
+| Windows | inaccessible | restricted   |
+| FreeBSD | unrestricted | unrestricted |
+| NetBSD  | unrestricted | broken       |
+| OpenBSD | unrestricted | inaccessible |
+
+Therefore, regardless of which register we choose, some OSes are
+going to be incompatible.
+
+APE binaries are always built with a Linux compiler. So another issue
+arises in the fact that our Linux-flavored GCC and Clang toolchains
+(which are used to produce cross-OS binaries) are also only capable of
+producing TLS instructions that use the %fs convention.
+
+To solve these challenges, the `cosmocc` compiler will rewrite binary
+objects after they've been compiled by GCC, so that the `%gs` register
+is used, rather than `%fs`. Morphing x86-64 binaries after they've been
+compiled is normally difficult, due to the complexity of the machine
+instruction language. However, GCC provides `-mno-tls-direct-seg-refs`
+which greatly reduces the complexity of this task. This flag forgoes
+some optimizations to make the generated code simpler. Rather than doing
+clever arithmetic with `%fs` prefixes, the compiler will always generate
+the thread information block address load as a separate instruction.
+
+```c
+// Change AMD code to use %gs:0x30 instead of %fs:0
+// We assume -mno-tls-direct-seg-refs has been used
+// READ64LE() loads a little-endian 64-bit word from memory
+static void ChangeTlsFsToGs(unsigned char *p, size_t n) {
+  if (n < 9) return;  // too short to hold the 9-byte instruction
+  unsigned char *e = p + n - 9;
+  while (p <= e) {
+    // we're checking for the following expression:
+    //   0144 == p[0] &&                // %fs
+    //   0110 == (p[1] & 0373) &&       // rex.w (and ignore rex.r)
+    //   (0213 == p[2] ||               // mov reg/mem → reg (word-sized)
+    //    0003 == p[2]) &&              // add reg/mem → reg (word-sized)
+    //   0004 == (p[3] & 0307) &&       // mod/rm (4,reg,0) means sib → reg
+    //   0045 == p[4] &&                // sib (5,4,0) → (rbp,rsp,0) → disp32
+    //   0000 == p[5] &&                // displacement (von Neumann endian)
+    //   0000 == p[6] &&                // displacement
+    //   0000 == p[7] &&                // displacement
+    //   0000 == p[8]                   // displacement
+    uint64_t w = READ64LE(p) & READ64LE("\377\373\377\307\377\377\377\377");
+    if ((w == READ64LE("\144\110\213\004\045\000\000\000") ||
+         w == READ64LE("\144\110\003\004\045\000\000\000")) &&
+        !p[8]) {
+      p[0] = 0145;  // change %fs to %gs
+      p[5] = 0x30;  // change 0 to 0x30
+      p += 9;
+    } else {
+      ++p;
+    }
+  }
+}
+```
+
+By favoring `%gs` we've now ensured friction-free compatibility for the
+APE runtime on MacOS, Linux, and FreeBSD which are all able to conform
+easily to this convention. However, additional work needs to be done at
+runtime when an APE program is started on Windows, OpenBSD, and NetBSD.
+On these platforms, all executable pages must be faulted and morphed to
+fix up the TLS instructions.
+
+On OpenBSD and NetBSD, this is as simple as undoing the example
+operation above. Earlier at compile-time we turned `%fs` into `%gs`.
+Now, at runtime, `%gs` must be turned back into `%fs`. Since the
+executable is morphing itself, this is easier said than done.
+
+OpenBSD for example enforces a `W^X` invariant. Code that's executing
+can't modify itself at the same time. The way Cosmopolitan solves this
+is by defining a special part of the binary called `.text.privileged`.
+This section is aligned to page boundaries.
+A GNU ld linker script is
+used to ensure that code which morphs code is placed into this section,
+through the use of a header-defined cosmo-specific keyword `privileged`.
+Additionally, the `fixupobj` program is used by the Cosmo build system
+to ensure that compiled objects don't contain privileged functions that
+call non-privileged functions. Needless to say, `mprotect()` needs to be
+a privileged function, so that it can be used to disable the execute bit
+on all other parts of the executable except for the privileged section,
+so that those parts can be made writable. Once this has been done, code
+can change.
+
+On Windows the displacement bytes of the TLS instruction are changed to
+use the `%gs:0x1480+i*8` ABI where `i` is a number assigned by the WIN32
+`TlsAlloc()` API. This avoids the need to call `TlsGetValue()` which is
+implemented this exact same way under the hood. Even though 0x1480 isn't
+explicitly documented by MSDN, this ABI is believed to be stable because
+MSVC generates binaries that use this offset directly. The only caveat
+is that `TlsAlloc()` must be called as early in the runtime init as
+possible, to ensure an index less than 64 is returned.
+
+### Thread Information Block (TIB)
+
+The Actually Portable Executable Thread Information Block (TIB) is
+defined by this version of the specification as follows:
+
+- The 64-bit TIB self-pointer is stored at offset 0x00.
+- The 64-bit TIB self-pointer is stored at offset 0x30.
+- The 32-bit `errno` value is stored at offset 0x3c.
+
+All other parts of the thread information block should be considered
+unspecified and therefore reserved for future specifications.
+
+The APE thread information block is aligned on a 64-byte boundary.
+
+Cosmopolitan Libc v3.5.8 (c. 2024-07-21) currently implements a thread
+information block that's 512 bytes in size.
+
+### Foreign Function Calls
+
+Even though APE programs always use the System V ABI, there arises the
+occasional need to interface with foreign functions, e.g.
+WIN32. The
+`__attribute__((__ms_abi__))` annotation introduced by GCC v6 is used
+for this purpose.
+
+The ability to change a function's ABI on a case-by-case basis is,
+surprisingly enough, supported by GCC, Clang, NVCC, and even the AMD HIP
+compilers for both UNIX systems and Windows. All of these compilers
+support both the System V ABI and the Microsoft x64 ABI.
+
+APE binaries will actually favor the Microsoft ABI even when running on
+UNIX OSes for certain dlopen() use-cases. For example, if we control the
+code to a CUDA module, which we compile on each OS separately from our
+main APE binary, then any function that's inside the APE binary whose
+pointer may be passed into a foreign module SHOULD be compiled to use
+the Microsoft ABI. This is because in practice the OS-specific module
+may need to be compiled by MSVC, where MS ABI is the *only* ABI, which
+forces our UNIX programs to partially conform. Thankfully, all UNIX
+compilers support doing it on a case-by-case basis.
+
+### Char Signedness
+
+Actually Portable Executable defines `char` as signed.
+
+Therefore conformant APE software MUST use `-fsigned-char` when building
+code for aarch64, as well as any other architecture that (unlike x86-64)
+would otherwise define `char` as being `unsigned char` by default.
+
+This decision was one of the cases where it made sense to offer a more
+consistent runtime experience for fat multi-arch binaries. However, you
+SHOULD still write code to assume `char` can go either way. But if all
+you care about is using APE, then you MAY assume `char` is signed.
+
+### Long Double
+
+On AMD64 platforms, APE binaries define `long double` as 80-bit.
+
+On ARM64 platforms, APE binaries define `long double` as 128-bit.
+
+We accept inconsistency in this case, because hardware acceleration is
+far more valuable than stylistic consistency in the case of mathematics.
+
+One challenge arises on AMD64 for supporting `long double` across OSes.
+Unlike UNIX systems, the Windows Executive on x86-64 initializes the x87
+FPU to have double (64-bit) precision rather than 80-bit. That's because
+code compiled by MSVC treats `long double` as though it were `double` to
+prefer always using the more modern SSE instructions. However, System V
+requires genuine 80-bit `long double` support on AMD64.
+
+Therefore, if an APE program detects that it's been started on a Windows
+x86-64 system, then it SHOULD use the following assembly to initialize
+the x87 FPU in System V ABI mode.
+
+```asm
+        fldcw   1f(%rip)
+        .rodata
+        .balign 2
+// 8087 FPU Control Word
+// IM: Invalid Operation ─────────────────────┐
+// DM: Denormal Operand ─────────────────────┐│
+// ZM: Zero Divide ─────────────────────────┐││
+// OM: Overflow ───────────────────────────┐│││
+// UM: Underflow ─────────────────────────┐││││
+// PM: Precision ────────────────────────┐│││││
+// PC: Precision Control ─────────────┐  ││││││
+//  {float,∅,double,long double}      │  ││││││
+// RC: Rounding Control ────────────┐ │  ││││││
+//  {even, →-∞, →+∞, →0}            │┌┤  ││││││
+//                                 ┌┤││  ││││││
+//                                d││││rr││││││
+1:      .short  0b00000000000000000001101111111
+        .previous
+```
+
+## Executable File Alignment
+
+Actually Portable Executable is a statically-linked flat executable file
+format that is, as a thing in itself, agnostic to file alignments. For
+example, the shell script payload at the beginning of the file and its
+statements have no such requirements. Alignment requirements are however
+imposed by the executable formats that APE wraps.
+
+1. ELF requires that file offsets be congruent with virtual addresses
+   modulo the CPU page size. So when we add a shell script to the start
+   of an executable, we need to round up to the page size in order to
+   maintain ELF's invariant. No such roundup is required on the program
+   segments, however, once the invariant is restored.
+   ELF loaders will
+   happily map program headers from arbitrary file intervals (which may
+   overlap) onto arbitrary virtual intervals (which don't need to be
+   contiguous). In order to do that, the loaders will generally use
+   UNIX's mmap() function which needs to have both page aligned
+   addresses and file offsets, even though the ELF program headers
+   themselves do not. Since program headers start and stop at
+   potentially any byte, ELF loaders tease the intervals specified by
+   program headers into conforming to mmap() requirements by rounding
+   out intervals as necessary in order to ensure that both the mmap()
+   size and offset parameters are page-size aligned. This means with
+   ELF, we never need to insert any empty space into a file when we
+   don't want to; we can simply allow the offset to drift apart from the
+   virtual offset.
+
+2. PE doesn't care about congruency and instead specifies a second kind
+   of alignment. The minimum alignment of files is 512 because that's
+   what MS-DOS used. Where things get hairy is with PE's SizeOfHeaders
+   which has complex requirements. When the PE image base needs to be
+   skewed, Windows imposes a separate 64kb alignment requirement on the
+   image base. Therefore an APE executable's `__executable_start` should
+   be aligned on at least a 64kb address.
+
+3. Apple's Mach-O format is the strictest of them all. While both ELF
+   and PE are defined in such a way that invites great creativity, XNU
+   will simply refuse to load an executable that does anything creative
+   with alignment. All loaded segments need to both start and end on a
+   page aligned address. XNU also wants segments to be contiguous,
+   similar to portable executable, except it applies to both the file
+   and virtual spaces, which must follow the same structure.
+
+Actually Portable Executables must conform to the strictest requirements
+demanded by the support vector.
+Therefore an APE binary that has headers
+for all three of the above executable formats MUST conform to the Apple
+way of doing things. GNU ld linker scripts aren't very good at producing
+ELF binaries that rigidly conform to this simple naive layout. There are
+so many ways things can go wrong, where third party code might slip its
+own custom section name in-between the linker script sections that are
+explicitly defined, thereby causing ELF's powerful features to manifest
+and the resulting content overlapping. The best `ld` flag that helps is
+`--orphan-handling=error` which can help with explaining such mysteries.
+
+While Cosmopolitan was originally defined to just use stock GNU tools,
+this proved intractable over time, and the project has been evolving in
+the direction of building its own. Inventing the `apelink` program was
+what enabled the project to achieve multi-architecture binaries whereas
+previously it was only possible to do multi-OS binaries. In the future,
+our hope is that a fast power linker like Mold can be adapted to produce
+fat APE binaries directly from object files in one pass.
+
+## Position Independent Code
+
+APE doesn't currently support position independent executable formats.
+This is because APE was originally written for the GNU linker, where PIC
+and PIE were after-thoughts and never fully incorporated with the older
+more powerful linker script techniques upon which APE relies. Future
+iterations of this specification are intended to converge on modern
+standards, as our tooling becomes developed enough to support it.
+
+However this only applies to the wrapped executable formats themselves.
+While our convention to date has been to always load ELF programs at the
+4mb mark, this is not guaranteed across OSes and architectures. Programs
+should have no expectations that a program will be loaded to any given
+address.
+For example, Cosmo currently implements APE on AARCH64 as
+loading executables to a starting address of 0x000800000000. This
+address occupies a sweet spot of requirements.
+
+## Address Space
+
+In order to create a single binary that supports as many platforms as
+possible without needing to be recompiled, there's a very narrow range
+of addresses that can be used. That range is somewhere between 32 bits
+and 39 bits.
+
+- Embedded devices that claim to be 64-bit will oftentimes only support
+  a virtual address space that's 39 bits in size.
+
+- We can't load executable images on AARCH64 beneath 0x100000000 (4gb)
+  because Apple forbids doing that, possibly in an effort to enforce a
+  best practice for spotting 32-bit to 64-bit transition bugs. Please
+  note that this restriction only applies to Apple ARM64 systems. The
+  x86-64 version of XNU will happily load APE binaries to 0x00400000.
+
+- The AMD64 architecture on desktops and servers can usually be counted
+  upon to provide a 47-bit address space. The Linux Kernel for instance
+  grants each userspace program full dominion over addresses 0x00200000
+  through 0x00007fffffffffff provided the hardware supports this. On
+  modern workstations supporting Intel and AMD's new PML5T feature which
+  virtualizes memory using a radix trie that's five layers deep, Linux
+  is able to offer userspace its choice of fixed addresses from
+  0x00200000 through 0x00ffffffffffffff. The only exception to this rule
+  we've encountered so far is that Windows 7 and Windows Vista behaved
+  similar to embedded devices in reducing the number of va bits.
+
+## Page Size
+
+APE software MUST be page size agnostic. For many years the industry had
+converged on a strong consensus of having a page size that's 4096 bytes.
+However this convention was never guaranteed. New computers have become
+extremely popular, such as Apple Silicon, that use a 16kb page size.
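
As a minimal sketch of what page size agnosticism can look like in C, the snippet below asks the OS for the page size at runtime through the POSIX `sysconf(_SC_PAGESIZE)` call rather than hard-coding 4096. The helper names `ape_page_size()` and `ape_round_up()` are hypothetical, chosen for illustration, and are not part of any Cosmopolitan API. The 4096 fallback is likewise an assumption for the unlikely case where `sysconf()` fails.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

// Hypothetical helper: returns the page size reported by the OS.
// Falls back to 4096 (an assumption) if sysconf() reports an error.
static long ape_page_size(void) {
  long pagesz = sysconf(_SC_PAGESIZE);
  return pagesz > 0 ? pagesz : 4096;
}

// Hypothetical helper: rounds n up to the next multiple of align.
// align must be a nonzero power of two, e.g. 4096 or 16384.
static uint64_t ape_round_up(uint64_t n, uint64_t align) {
  assert(align && !(align & (align - 1)));
  return (n + align - 1) & ~(align - 1);
}
```

A caller would size buffers with something like `ape_round_up(len, ape_page_size())`, so the same binary computes 8192 on a 4kb system and 16384 on a 16kb system for a 5000 byte request.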
+
+In addition to being page size agnostic, APE software that cares about
+working correctly on Windows needs to be aware of the concept of
+allocation granularity. While the page size on Windows is generally 4kb
+in size, memory mappings can only be created on addresses that are
+aligned to the system allocation granularity, which is generally 64kb.
+If you use a function like mmap() with Cosmopolitan Libc, then the
+`addr` and `offset` parameters need to be aligned to
+`sysconf(_SC_GRANSIZE)` or else your software won't work on Windows.
+Windows has other limitations too, such as lacking the ability to carve
+or punch holes in mappings.
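
The alignment rule above can be sketched in C as follows. This is an illustration under stated assumptions, not the Cosmopolitan implementation: `mapping_granularity()` and `split_offset()` are hypothetical names, and the code assumes `_SC_GRANSIZE` is visible as a preprocessor macro when building with Cosmopolitan Libc. On libcs that don't define it, the sketch falls back to the ordinary page size.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

// Hypothetical helper: alignment that mmap() addresses and offsets
// should honor. Uses _SC_GRANSIZE when the libc defines it (assumed to
// be a macro under Cosmopolitan Libc), otherwise the page size.
static long mapping_granularity(void) {
#ifdef _SC_GRANSIZE
  long gran = sysconf(_SC_GRANSIZE);
#else
  long gran = sysconf(_SC_PAGESIZE);
#endif
  return gran > 0 ? gran : 65536;  // 64kb fallback is an assumption
}

// Hypothetical helper: splits a file offset into an aligned base that
// can be passed as mmap()'s offset, plus the leftover skew to add to
// the returned pointer. gran must be a nonzero power of two.
static void split_offset(uint64_t off, uint64_t gran,
                         uint64_t *base, uint64_t *skew) {
  assert(gran && !(gran & (gran - 1)));
  *base = off & ~(gran - 1);
  *skew = off - *base;
}
```

For instance, with a 64kb granularity a request for file offset 70000 would map from offset 65536 and then skip 4464 bytes into the mapping.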