Add more content to APE specification

2025-07-04 02:08:30 +00:00 · 2024-07-21 20:47:24 -07:00 · 2024-07-21 20:47:24 -07:00 · 41cc053419
commit 41cc053419
parent 5d2d9e9640
1 changed files with 408 additions and 0 deletions
--- a/ape/specification.md
+++ b/ape/specification.md
@@ -269,3 +269,411 @@ regcomp(&rx,
 For further details, see the canonical implementation in
 `cosmopolitan/tool/build/assimilate.c`.
 ## Static Linking
 Actually Portable Executables are always statically linked. This
 revision of the specification does not define any facility for storing
 code in dynamic shared objects.
 Cosmopolitan Libc provides a solution that enables APE binaries to have
 limited access to dlopen(). By manually loading a platform-specific
 executable and asking the OS-specific libc's dlopen() to load
 OS-specific libraries, it becomes possible to use GPUs and GUIs. This
 has worked great for AI projects like llamafile.
 There is no way for an Actually Portable Executable to interact with
 the OS-specific dynamic shared object extension modules of programming
 languages. For example, a Lua interpreter compiled as an Actually
 Portable Executable would have no way of linking extension libraries
 downloaded from the LuaRocks package manager. This is primarily because
 different OSes define incompatible ABIs.
 While it was possible to polyglot PE+ELF+MachO to create multi-OS
 executables, it simply isn't possible to do that same thing for
 DLL+DYLIB+SO. Therefore, in order to have DSOs, APE would need to either
 choose one of the existing formats or invent one of its own, and then
 develop its own parallel ecosystem of extension software. In the future,
 the APE specification may expand to encompass this. However, the focus
 to date has been exclusively on executables with limited dlopen() support.
 ## Application Binary Interface (ABI)
 APE binaries use the System V ABI, as defined by:
 - [System V ABI - AMD64 Architecture Processor Supplement](https://gitlab.com/x86-psABIs/x86-64-ABI)
 - AARCH64 has a uniform consensus defined by ARM Limited, namely the
   Procedure Call Standard for the Arm 64-bit Architecture (AAPCS64)
 There are however a few changes we've had to make.
 ### Thread Local Storage
 #### aarch64
 Here's the TLS memory layout on aarch64:
 ```
           x28
        %tpidr_el0
            │
            │    _Thread_local
        ┌───┼───┬──────────┬──────────┐
        │tib│dtv│  .tdata  │  .tbss   │
        ├───┴───┴──────────┴──────────┘
        │
    __get_tls()
 ```
 The ARM64 code in actually portable executables uses the `x28` register
 to store the address of the thread information block. All aarch64 code
 linked into these executables SHOULD be compiled with `-ffixed-x28`
 which is supported by both Clang and GCC.
 The runtime library for an actually portable executable MAY choose to
 use `tpidr_el0` instead, if OSes like MacOS aren't being targeted. For
 example, if the goal is to create a Linux-only fat binary linker program
 for Musl Libc, then choosing to use the existing `tpidr_el0` convention
 would be a friction-free alternative.
 It's not possible for an APE runtime that targets the full range of OSes
 to use the `tpidr_el0` register for TLS because Apple won't allow it. On
 MacOS ARM64 systems, this register can only be used by a runtime to
 implement the `sched_getcpu()` system call. It's reserved by MacOS.
 #### x86-64
 Here's the TLS memory layout on x86_64:
 ```
                          __get_tls()
                              │
                             %fs OpenBSD/NetBSD
           _Thread_local      │
    ┌───┬──────────┬──────────┼───┐
    │pad│  .tdata  │  .tbss   │tib│
    └───┴──────────┴──────────┼───┘
                              │
   Linux/FreeBSD/Windows/Mac %gs
 ```
 Quite possibly the greatest challenge in getting Actually Portable
 Executable working has been overcoming the incompatibilities between
 OSes in how thread-local storage works on x86-64. The AMD64
 architecture defines two
 special segment registers. Every OS uses one of these segment registers
 to implement TLS support. However not all OSes agree on which register
 to use. Some OSes grant userspace the power to define either of these
 registers to hold any value that is desired. Some OSes only effectively
 allow a single one of them to be changed. Lastly, some OSes, e.g.
 Windows, claim ownership of the memory layout these registers point
 towards too.
 Here's a breakdown on how much power is granted to userspace runtimes by
 each OS when it comes to changing amd64 segment registers.
 |         | %fs          | %gs          |
 |---------|--------------|--------------|
 | Linux   | unrestricted | unrestricted |
 | MacOS   | inaccessible | unrestricted |
 | Windows | inaccessible | restricted   |
 | FreeBSD | unrestricted | unrestricted |
 | NetBSD  | unrestricted | broken       |
 | OpenBSD | unrestricted | inaccessible |
 Therefore, regardless of which register we choose, some OSes are
 going to be incompatible.
 APE binaries are always built with a Linux compiler. So another issue
 arises in the fact that our Linux-flavored GCC and Clang toolchains
 (which are used to produce cross-OS binaries) are also only capable of
 producing TLS instructions that use the %fs convention.
 To solve these challenges, the `cosmocc` compiler will rewrite binary
 objects after they've been compiled by GCC, so that the `%gs` register
 is used, rather than `%fs`. Morphing x86-64 binaries after they've been
 compiled is normally difficult, due to the complexity of the machine
 instruction language. However GCC provides `-mno-tls-direct-seg-refs`
 which greatly reduces the complexity of this task. This flag forgoes
 some optimizations to make the generated code simpler. Rather than doing
 clever arithmetic with `%fs` prefixes, the compiler will always generate
 the thread information block address load as a separate instruction.
 ```c
 // Change AMD code to use %gs:0x30 instead of %fs:0
 // We assume -mno-tls-direct-seg-refs has been used
 static void ChangeTlsFsToGs(unsigned char *p, size_t n) {
  unsigned char *e = p + n - 9;
  while (p <= e) {
    // we're checking for the following expression:
    //   0144 == p[0] &&           // %fs
    //   0110 == (p[1] & 0373) &&  // rex.w (and ignore rex.r)
    //   (0213 == p[2] ||          // mov reg/mem → reg (word-sized)
    //   0003 == p[2]) &&          // add reg/mem → reg (word-sized)
    //   0004 == (p[3] & 0307) &&  // mod/rm (4,reg,0) means sib → reg
    //   0045 == p[4] &&           // sib (5,4,0) → (rbp,rsp,0) → disp32
    //   0000 == p[5] &&           // displacement (von Neumann endian)
    //   0000 == p[6] &&           // displacement
    //   0000 == p[7] &&           // displacement
    //   0000 == p[8]              // displacement
    uint64_t w = READ64LE(p) & READ64LE("\377\373\377\307\377\377\377\377");
    if ((w == READ64LE("\144\110\213\004\045\000\000\000") ||
         w == READ64LE("\144\110\003\004\045\000\000\000")) &&
        !p[8]) {
      p[0] = 0145;  // change %fs to %gs
      p[5] = 0x30;  // change 0 to 0x30
      p += 9;
    } else {
      ++p;
    }
  }
 }
 ```
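
 The byte-by-byte expression in the comments and the masked 64-bit
 compare are two spellings of the same test. Here's a minimal sketch that
 puts them side by side. `Read64le()` is a hypothetical stand-in for the
 `READ64LE` macro (which this document doesn't define), and the
 `IsFsTlsLoad*()` names are ours for illustration, not Cosmopolitan's.

 ```c
 #include <assert.h>
 #include <stdint.h>

 // Hypothetical stand-in for READ64LE: assembles eight bytes into a
 // little-endian 64-bit word, independent of host byte order.
 static uint64_t Read64le(const void *ptr) {
   const unsigned char *p = ptr;
   uint64_t w = 0;
   for (int i = 7; i >= 0; --i) w = (w << 8) | p[i];
   return w;
 }

 // Byte-by-byte form of the predicate from the comments above. Returns 1
 // if the nine bytes at p encode a 64-bit `mov` or `add` from %fs:0.
 static int IsFsTlsLoad(const unsigned char *p) {
   return p[0] == 0144 &&                    // %fs segment prefix
          (p[1] & 0373) == 0110 &&           // rex.w (ignoring rex.r)
          (p[2] == 0213 || p[2] == 0003) &&  // mov or add
          (p[3] & 0307) == 0004 &&           // modrm: sib follows, any reg
          p[4] == 0045 &&                    // sib: disp32, no base/index
          !p[5] && !p[6] && !p[7] && !p[8];  // displacement is zero
 }

 // Masked-word form: one load, one AND, two compares, plus the ninth byte.
 static int IsFsTlsLoadMasked(const unsigned char *p) {
   uint64_t w = Read64le(p) & Read64le("\377\373\377\307\377\377\377\377");
   int hit = (w == Read64le("\144\110\213\004\045\000\000\000") ||
              w == Read64le("\144\110\003\004\045\000\000\000")) &&
             !p[8];
   assert(hit == IsFsTlsLoad(p));  // both forms must agree
   return hit;
 }
 ```

 For instance, the bytes `64 48 8b 04 25 00 00 00 00` (that's
 `mov %fs:0,%rax`) satisfy both predicates, whereas the same instruction
 with a `65` prefix does not.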
 By favoring `%gs` we've now ensured friction-free compatibility for the
 APE runtime on MacOS, Linux, and FreeBSD which are all able to conform
 easily to this convention. However additional work needs to be done at
 runtime when an APE program is started on Windows, OpenBSD, and NetBSD.
 On these platforms, all executable pages must be faulted and morphed to
 fix up the TLS instructions.
 On OpenBSD and NetBSD, this is as simple as undoing the example
 operation above. Earlier at compile-time we turned `%fs` into `%gs`.
 Now, at runtime, `%gs` must be turned back into `%fs`. Since the
 executable is morphing itself, this is easier said than done.
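
 The reverse rewrite can be sketched as follows. This is an illustration
 rather than the canonical runtime code: `ChangeTlsGsToFs()` is a
 hypothetical name, and we assume the only instructions that need
 undoing are the nine-byte `%gs:0x30` loads produced at compile-time.

 ```c
 #include <assert.h>
 #include <stddef.h>

 // Hypothetical inverse of ChangeTlsFsToGs(): turns `%gs:0x30` TLS loads
 // back into `%fs:0`. Returns the number of instructions patched.
 static size_t ChangeTlsGsToFs(unsigned char *p, size_t n) {
   size_t patched = 0;
   assert(p || !n);
   if (n < 9) return 0;  // too short to hold one nine-byte instruction
   for (unsigned char *e = p + n - 9; p <= e;) {
     if (p[0] == 0145 &&                    // %gs segment prefix
         (p[1] & 0373) == 0110 &&           // rex.w (ignoring rex.r)
         (p[2] == 0213 || p[2] == 0003) &&  // mov or add
         (p[3] & 0307) == 0004 &&           // modrm: sib follows, any reg
         p[4] == 0045 &&                    // sib: disp32, no base/index
         p[5] == 0x30 && !p[6] && !p[7] && !p[8]) {
       p[0] = 0144;  // change %gs back to %fs
       p[5] = 0x00;  // change 0x30 back to 0
       p += 9;
       ++patched;
     } else {
       ++p;
     }
   }
   return patched;
 }
 ```

 The early return guards against buffers shorter than one instruction,
 where computing `p + n - 9` would otherwise step before the start of
 the buffer.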
 OpenBSD for example enforces a `W^X` invariant. Code that's executing
 can't modify itself at the same time. The way Cosmopolitan solves this
 is by defining a special part of the binary called `.text.privileged`.
 This section is aligned to page boundaries. A GNU ld linker script is
 used to ensure that code which morphs code is placed into this section,
 through the use of a header-defined cosmo-specific keyword `privileged`.
 Additionally, the `fixupobj` program is used by the Cosmo build system
 to ensure that compiled objects don't contain privileged functions that
 call non-privileged functions. Needless to say, `mprotect()` needs to be
 a privileged function, so that it can be used to disable the execute bit
 on all other parts of the executable except for the privileged section,
 thereby making them writable. Once this has been done, code can change.
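
 The protection dance can be sketched with plain POSIX calls. This is a
 simplified illustration, not Cosmopolitan's implementation:
 `PatchCodeByte()` is a hypothetical helper, and it assumes the caller
 itself lives on a different page than the one being patched (which is
 the job `.text.privileged` does in the real runtime).

 ```c
 #define _DEFAULT_SOURCE  // exposes MAP_ANONYMOUS to callers on glibc
 #include <assert.h>
 #include <stdint.h>
 #include <sys/mman.h>
 #include <unistd.h>

 // Hypothetical helper: trades the execute bit for the write bit on the
 // page holding addr, patches one byte, then restores read+execute. At no
 // point is the page both writable and executable.
 static int PatchCodeByte(unsigned char *addr, unsigned char value) {
   uintptr_t pagesz = (uintptr_t)sysconf(_SC_PAGESIZE);
   assert(pagesz && !(pagesz & (pagesz - 1)));
   void *page = (void *)((uintptr_t)addr & -pagesz);
   if (mprotect(page, pagesz, PROT_READ | PROT_WRITE)) return -1;
   *addr = value;
   return mprotect(page, pagesz, PROT_READ | PROT_EXEC);
 }
 ```

 Note that the page size is queried at runtime rather than hard-coded,
 since the rounding must be correct on both 4kb and 16kb systems.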
 On Windows the displacement bytes of the TLS instruction are changed to
 use the `%gs:0x1480+i*8` ABI where `i` is a number assigned by the WIN32
 `TlsAlloc()` API. This avoids the need to call `TlsGetValue()` which is
 implemented this exact same way under the hood. Even though 0x1480 isn't
 explicitly documented by MSDN, this ABI is believed to be stable because
 MSVC generates binaries that use this offset directly. The only caveat
 is that `TlsAlloc()` must be called as early in the runtime init as
 possible, to ensure an index less than 64 is returned.
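
 The displacement arithmetic is small enough to show directly. In this
 sketch the macro and function names are hypothetical; the only facts
 taken from the text above are the `0x1480` base, the 8-byte stride, and
 the requirement that the index be less than 64.

 ```c
 #include <assert.h>
 #include <stdint.h>

 #define WIN_TLS_SLOTS_BASE 0x1480  // base offset cited above
 #define WIN_TLS_SLOTS_MAX 64       // indexes reachable via this array

 // Returns the %gs-relative displacement for TlsAlloc() index i, or -1 if
 // the index is too large to be addressed this way.
 static int32_t WinTlsDisplacement(uint32_t i) {
   if (i >= WIN_TLS_SLOTS_MAX) return -1;
   return WIN_TLS_SLOTS_BASE + (int32_t)i * 8;
 }

 // Stores disp into the four little-endian displacement bytes (insn[5]
 // through insn[8]) of a nine-byte TLS instruction.
 static void PatchDisplacement(unsigned char *insn, int32_t disp) {
   assert(disp >= 0);
   for (int k = 0; k < 4; ++k) {
     insn[5 + k] = (unsigned char)((uint32_t)disp >> (8 * k));
   }
 }
 ```

 For example, index 1 yields a displacement of `0x1488`, which is stored
 as the bytes `88 14 00 00`.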
 ### Thread Information Block (TIB)
 The Actually Portable Executable Thread Information Block (TIB) is
 defined by this version of the specification as follows:
 - The 64-bit TIB self-pointer is stored at offset 0x00.
 - A second copy of the 64-bit TIB self-pointer is stored at offset 0x30.
 - The 32-bit `errno` value is stored at offset 0x3c.
 All other parts of the thread information block should be considered
 unspecified and therefore reserved for future specifications.
 The APE thread information block is aligned on a 64-byte boundary.
 Cosmopolitan Libc v3.5.8 (c. 2024-07-21) currently implements a thread
 information block that's 512 bytes in size.
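
 The specified fields can be written down as a C structure. This is a
 sketch covering only the three offsets listed above: the name
 `ApeTibSketch` and its field names are hypothetical, the reserved
 arrays merely pad out the unspecified bytes, and a real TIB is larger.

 ```c
 #include <assert.h>
 #include <stdalign.h>
 #include <stddef.h>
 #include <stdint.h>

 struct ApeTibSketch {
   alignas(64) uint64_t self;  // 0x00: pointer to this structure
   uint8_t reserved0[0x28];    // 0x08: unspecified
   uint64_t self2;             // 0x30: pointer to this structure
   uint8_t reserved1[0x04];    // 0x38: unspecified
   int32_t err;                // 0x3c: errno
 };

 static_assert(offsetof(struct ApeTibSketch, self) == 0x00, "self at 0x00");
 static_assert(offsetof(struct ApeTibSketch, self2) == 0x30, "self at 0x30");
 static_assert(offsetof(struct ApeTibSketch, err) == 0x3c, "errno at 0x3c");
 static_assert(alignof(struct ApeTibSketch) == 64, "64-byte alignment");

 // Fills in the fields this specification defines.
 static void ApeTibInit(struct ApeTibSketch *tib) {
   tib->self = (uint64_t)(uintptr_t)tib;
   tib->self2 = (uint64_t)(uintptr_t)tib;
   tib->err = 0;
 }
 ```

 The static assertions make the compiler reject any edit that shifts a
 specified field away from its offset.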
 ### Foreign Function Calls
 Even though APE programs always use the System V ABI, there arises the
 occasional need to interface with foreign functions, e.g. WIN32. The
 `__attribute__((__ms_abi__))` annotation introduced by GCC v6 is used
 for this purpose.
 The ability to change a function's ABI on a case-by-case basis is
 surprisingly enough supported by GCC, Clang, NVCC, and even the AMD HIP
 compilers for both UNIX systems and Windows. All of these compilers
 support both the System V ABI and the Microsoft x64 ABI.
 APE binaries will actually favor the Microsoft ABI even when running on
 UNIX OSes for certain dlopen() use-cases. For example, if we control the
 code to a CUDA module, which we compile on each OS separately from our
 main APE binary, then any function that's inside the APE binary whose
 pointer may be passed into a foreign module SHOULD be compiled to use
 the Microsoft ABI. This is because in practice the OS-specific module
 may need to be compiled by MSVC, where MS ABI is the *only* ABI, which
 forces our UNIX programs to partially conform. Thankfully, all UNIX
 compilers support doing it on a case-by-case basis.
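
 Here's a minimal sketch of what doing it on a case-by-case basis looks
 like. The `MS_ABI` macro and the function names are hypothetical, and
 `ForeignModuleApply()` merely stands in for code that would really live
 in a separately compiled OS-specific module.

 ```c
 #include <assert.h>

 #if defined(__x86_64__) && defined(__GNUC__)
 #define MS_ABI __attribute__((__ms_abi__))
 #else
 #define MS_ABI  // the attribute only applies to x86-64
 #endif

 // Pointer type for a function the foreign module calls with the MS ABI.
 typedef int(MS_ABI *binop_f)(int, int);

 // Lives inside the APE binary, yet is callable from MSVC-built code.
 static int MS_ABI AddInts(int a, int b) {
   return a + b;
 }

 // Stand-in for the OS-specific module, e.g. a GPU library.
 static int MS_ABI ForeignModuleApply(binop_f f, int a, int b) {
   assert(f);
   return f(a, b);
 }
 ```

 Putting the attribute in the pointer type means the compiler will
 complain if a System V function is passed where a Microsoft ABI
 callback is expected.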
 ### Char Signedness
 Actually Portable Executable defines `char` as signed.
 Therefore conformant APE software MUST use `-fsigned-char` when building
 code for aarch64, as well as any other architecture that (unlike x86-64)
 would otherwise define `char` as being `unsigned char` by default.
 This decision was one of the cases where it made sense to offer a more
 consistent runtime experience for fat multi-arch binaries. However you
 SHOULD still write code to assume `char` can go either way. But if all
 you care about is using APE, then you CAN assume `char` is signed.
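
 Code that works either way usually just widens through `unsigned char`
 before treating a `char` as a number. The sketch below shows that
 idiom. `BUILDING_FOR_APE` is a hypothetical macro a project might
 define for itself, and the function names are illustrative.

 ```c
 #include <assert.h>
 #include <limits.h>

 // Hypothetical opt-in check for projects that only target APE.
 #ifdef BUILDING_FOR_APE
 static_assert(CHAR_MIN < 0, "APE code must be built with -fsigned-char");
 #endif

 // Reports how this compiler defines plain char.
 static int CharIsSigned(void) {
   return CHAR_MIN < 0;
 }

 // Yields 0..255 regardless of whether char is signed.
 static unsigned ByteValue(char c) {
   return (unsigned char)c;
 }

 // Counts bytes >= 0x80 without depending on char signedness.
 static int CountHighBytes(const char *s) {
   int n = 0;
   for (; *s; ++s) n += ByteValue(*s) >= 0x80;
   return n;
 }
 ```

 Comparing `*s >= 0x80` directly would always be false where `char` is
 signed, which is the kind of bug the cast avoids.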
 ### Long Double
 On AMD64 platforms, APE binaries define `long double` as 80-bit.
 On ARM64 platforms, APE binaries define `long double` as 128-bit.
 We accept inconsistency in this case, because hardware acceleration is
 far more valuable than stylistic consistency in the case of mathematics.
 One challenge arises on AMD64 for supporting `long double` across OSes.
 Unlike UNIX systems, the Windows Executive on x86-64 initializes the x87
 FPU to have double (64-bit) precision rather than 80-bit. That's because
 code compiled by MSVC treats `long double` as though it were `double` to
 prefer always using the more modern SSE instructions. However System V
 requires genuine 80-bit `long double` support on AMD64.
 Therefore, if an APE program detects that it's been started on a Windows
 x86-64 system, then it SHOULD use the following assembly to initialize
 the x87 FPU in System V ABI mode.
 ```asm
 	fldcw	1f(%rip)
 	.rodata
 	.balign	2
 //	8087 FPU Control Word
 //	 IM: Invalid Operation ───────────────┐
 //	 DM: Denormal Operand ───────────────┐│
 //	 ZM: Zero Divide ───────────────────┐││
 //	 OM: Overflow ─────────────────────┐│││
 //	 UM: Underflow ───────────────────┐││││
 //	 PM: Precision ──────────────────┐│││││
 //	 PC: Precision Control ───────┐  ││││││
 //	  {float,∅,double,long double}│  ││││││
 //	 RC: Rounding Control ──────┐ │  ││││││
 //	  {even, →-∞, →+∞, →0}      │┌┤  ││││││
 //	                           ┌┤││  ││││││
 //	                          d││││rr││││││
 1:	.short	0b00000000000000000001101111111
 	.previous
 ```
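
 The control word above is `0x037f`. To make the diagram concrete,
 here's a small sketch that decodes its fields. The function and enum
 names are hypothetical, and `0x027f` is used only as an illustrative
 value with the precision control field set to double.

 ```c
 #include <assert.h>
 #include <stdint.h>

 // Precision control encodings, in the order the diagram lists them.
 enum { kX87Float = 0, kX87Reserved = 1, kX87Double = 2, kX87LongDouble = 3 };

 // PC: bits 8 and 9 of the control word.
 static unsigned X87PrecisionControl(uint16_t cw) {
   return (cw >> 8) & 3;
 }

 // RC: bits 10 and 11 of the control word (0 means round to even).
 static unsigned X87RoundingControl(uint16_t cw) {
   return (cw >> 10) & 3;
 }

 // IM, DM, ZM, OM, UM, PM: the six exception mask bits.
 static unsigned X87ExceptionMasks(uint16_t cw) {
   return cw & 0x3f;
 }

 static_assert(((0x037f >> 8) & 3) == kX87LongDouble,
               "0x037f selects long double precision");
 ```

 In other words, loading `0x037f` masks all six exceptions, rounds to
 even, and selects the 80-bit precision that System V expects.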
 ## Executable File Alignment
 Actually Portable Executable is a statically-linked flat executable file
 format that is, as a thing in itself, agnostic to file alignments. For
 example, the shell script payload at the beginning of the file and its
 statements have no such requirements. Alignment requirements are however
 imposed by the executable formats that APE wraps.
 1. ELF requires that file offsets be congruent with virtual addresses
   modulo the CPU page size. So when we add a shell script to the start
   of an executable, we need to round up to the page size in order to
   maintain ELF's invariant. No such roundup is required on the program
   segments themselves once the invariant is restored. ELF loaders will
   happily map program headers from arbitrary file intervals (which may
   overlap) onto arbitrary virtual intervals (which don't need to be
   contiguous). In order to do that, the loaders will generally use
   UNIX's mmap() function, which needs to have both page aligned
   addresses and file offsets, even though the ELF program headers
   themselves do not. Since program headers start and stop at
   potentially any byte, ELF loaders tease the intervals specified by
   program headers into conforming to mmap() requirements by rounding
   out intervals as necessary in order to ensure that both the mmap()
   size and offset parameters are page-size aligned. This means with
   ELF, we never need to insert any empty space into a file when we
   don't want to; we can simply allow the offset to drift apart from the
   virtual offset.
 2. PE doesn't care about congruency and instead specifies a second kind
   of alignment. The minimum alignment of files is 512 because that's
   what MS-DOS used. Where things get hairy is with PE's SizeOfHeaders
   which has complex requirements. When the PE image base needs to be
   skewed, Windows imposes a separate 64kb alignment requirement on the
   image base. Therefore an APE executable's `__executable_start` should
   be aligned on at least a 64kb address.
 3. Apple's Mach-O format is the strictest of them all. While both ELF
   and PE are defined in such a way that invites great creativity, XNU
   will simply refuse to load an executable that does anything creative
   with alignment. All loaded segments need to both start and end on a
   page aligned address. XNU also wants segments to be contiguous,
   similar to Portable Executable, except it applies to both the file and
   virtual spaces, which must follow the same structure.
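
 The rounding that ELF loaders perform in the first item can be sketched
 as below. This is an illustration of the idea, not code from any
 particular loader; `MapRequest`, `IsCongruent()`, and
 `RoundOutSegment()` are hypothetical names.

 ```c
 #include <assert.h>
 #include <stdint.h>

 struct MapRequest {
   uint64_t addr;    // page-aligned virtual address for mmap()
   uint64_t offset;  // page-aligned file offset for mmap()
   uint64_t size;    // page-aligned length for mmap()
 };

 // ELF's invariant: file offset and virtual address must be congruent
 // modulo the page size.
 static int IsCongruent(uint64_t vaddr, uint64_t offset, uint64_t pagesz) {
   return ((vaddr - offset) & (pagesz - 1)) == 0;
 }

 // Rounds a program header's byte interval outward to page boundaries.
 static struct MapRequest RoundOutSegment(uint64_t vaddr, uint64_t offset,
                                          uint64_t filesz, uint64_t pagesz) {
   assert(pagesz && !(pagesz & (pagesz - 1)));
   assert(IsCongruent(vaddr, offset, pagesz));
   uint64_t skew = vaddr & (pagesz - 1);  // bytes past the page start
   struct MapRequest r;
   r.addr = vaddr - skew;
   r.offset = offset - skew;
   r.size = (skew + filesz + pagesz - 1) & -pagesz;
   return r;
 }
 ```

 Because the same skew is subtracted from both the address and the
 offset, congruency is all a segment needs; neither value has to be
 aligned in the file itself.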
 Actually Portable Executables must conform to the strictest requirements
 demanded by the support vector. Therefore an APE binary that has headers
 for all three of the above executable formats MUST conform to the Apple
 way of doing things. GNU ld linker scripts aren't very good at producing
 ELF binaries that rigidly conform to this simple naive layout. There are
 so many ways things can go wrong, where third party code might slip its
 own custom section name in-between the linker script sections that are
 explicitly defined, thereby causing ELF's powerful features to manifest
 and the resulting content to overlap. The most helpful `ld` flag is
 `--orphan-handling=error`, which can help explain such mysteries.
 While Cosmopolitan was originally defined to just use stock GNU tools,
 this proved intractable over time, and the project has been evolving in
 the direction of building its own. Inventing the `apelink` program was
 what enabled the project to achieve multi-architecture binaries whereas
 previously it was only possible to do multi-OS binaries. In the future,
 our hope is that a fast power linker like Mold can be adapted to produce
 fat APE binaries directly from object files in one pass.
 ## Position Independent Code
 APE doesn't currently support position independent executable formats.
 This is because APE was originally written for the GNU linker, where PIC
 and PIE were afterthoughts and never fully incorporated with the older
 more powerful linker script techniques upon which APE relies. Future
 iterations of this specification are intended to converge on modern
 standards, as our tooling becomes developed enough to support it.
 However this only applies to the wrapped executable formats themselves.
 While our convention to date has been to always load ELF programs at the
 4mb mark, this is not guaranteed across OSes and architectures. Programs
 should have no expectations that a program will be loaded to any given
 address. For example, Cosmo currently implements APE on AARCH64 as
 loading executables to a starting address of 0x000800000000. This
 address occupies a sweet spot of requirements.
 ## Address Space
 In order to create a single binary that supports as many platforms as
 possible without needing to be recompiled, there's a very narrow range
 of addresses that can be used. That range is somewhere between 32 bits
 and 39 bits.
 - Embedded devices that claim to be 64-bit will oftentimes only support
  a virtual address space that's 39 bits in size.
 - We can't load executable images on AARCH64 beneath 0x100000000 (4gb)
  because Apple forbids doing that, possibly in an effort to enforce a
  best practice for spotting 32-bit to 64-bit transition bugs. Please
  note that this restriction only applies to Apple ARM64 systems. The
  x86-64 version of XNU will happily load APE binaries to 0x00400000.
 - The AMD64 architecture on desktops and servers can usually be counted
  upon to provide a 47-bit address space. The Linux Kernel for instance
  grants each userspace program full dominion over addresses 0x00200000
  through 0x00007fffffffffff provided the hardware supports this. On
  modern workstations supporting Intel and AMD's new PML5T feature which
  virtualizes memory using a radix trie that's five layers deep, Linux
  is able to offer userspace its choice of fixed addresses from
  0x00200000 through 0x00ffffffffffffff. The only exception to this rule
  we've encountered so far is that Windows 7 and Windows Vista behaved
  similarly to embedded devices in reducing the number of virtual
  address bits.
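
 The constraints above boil down to a range check. In this sketch the
 macro and function names are hypothetical; the 4gb floor and the 39-bit
 ceiling come from the bullets above, and the check is written for the
 AARCH64 case where both apply.

 ```c
 #include <assert.h>
 #include <stdint.h>

 #define APE_ARM64_MIN_BASE 0x100000000ull  // 4gb floor on Apple ARM64
 #define APE_PORTABLE_VA_BITS 39            // smallest va size cited above

 // Returns 1 if [addr, addr+size) fits in an address space of `bits` bits.
 static int FitsVaBits(uint64_t addr, uint64_t size, int bits) {
   assert(0 < bits && bits < 64);
   uint64_t limit = 1ull << bits;
   return size <= limit && addr <= limit - size;
 }

 // Returns 1 if an image at base satisfies both AARCH64 constraints.
 static int IsPortableArm64Base(uint64_t base, uint64_t size) {
   return base >= APE_ARM64_MIN_BASE &&
          FitsVaBits(base, size, APE_PORTABLE_VA_BITS);
 }
 ```

 The AARCH64 load address of 0x000800000000 mentioned earlier passes
 this check, whereas 0x00400000 fails it for being beneath 4gb.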
 ## Page Size
 APE software MUST be page size agnostic. For many years the industry had
 converged on a strong consensus of having a page size that's 4096 bytes.
 However this convention was never guaranteed. New computers have become
 extremely popular, such as Apple Silicon, that use a 16kb page size.
 In addition to being page size agnostic, APE software that cares about
 working correctly on Windows needs to be aware of the concept of
 allocation granularity. While the page size on Windows is generally 4kb
 in size, memory mappings can only be created on addresses that are
 aligned to the system allocation granularity, which is generally 64kb.
 If you use a function like mmap() with Cosmopolitan Libc, then the
 `addr` and `offset` parameters need to be aligned to
 `sysconf(_SC_GRANSIZE)` or else your software won't work on Windows.
 Windows has other limitations too, such as lacking the ability to carve
 or punch holes in mappings.
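
 Here's a minimal sketch of granularity-aware alignment. The helper
 names are hypothetical. We assume `_SC_GRANSIZE` is visible to the
 preprocessor when building with Cosmopolitan Libc; on other libcs the
 sketch falls back to the page size.

 ```c
 #include <assert.h>
 #include <stdint.h>
 #include <unistd.h>

 // Queries the alignment that mmap() addresses and offsets must honor.
 static long GetMapGranularity(void) {
 #ifdef _SC_GRANSIZE
   return sysconf(_SC_GRANSIZE);  // assumed Cosmopolitan-specific name
 #else
   return sysconf(_SC_PAGESIZE);  // fallback for other libcs
 #endif
 }

 // Rounds x up to the next multiple of gran, which must be a power of two.
 static uint64_t RoundUpToGranularity(uint64_t x, uint64_t gran) {
   assert(gran && !(gran & (gran - 1)));
   return (x + gran - 1) & -gran;
 }

 // Returns 1 if both mmap() parameters are aligned to gran.
 static int IsMapAligned(const void *addr, uint64_t offset, uint64_t gran) {
   return !(((uintptr_t)addr | offset) & (gran - 1));
 }
 ```

 Querying the value at runtime, instead of hard-coding 4096 or 65536,
 is what keeps the same binary correct across 4kb, 16kb, and 64kb
 systems.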