It's now possible to use execve() when the parent process isn't built by
cosmo. In such cases, the current process will kill all threads and then
linger around, waiting for the newly created process to die, and then we
propagate its exit code to the parent. This should help bazel and others
Allocating private anonymous memory is now 5x faster on Windows. This is
thanks to VirtualAlloc() which is faster than the file mapping APIs. The
fork() function also now goes 30% faster, since we are able to avoid the
VirtualProtect() calls on mappings in most cases now.
Fixes#1253
Function trace logs will report stack usage accurately. It won't include
the argv/environ block. Our clone() polyfill is now simpler and does not
use as much stack memory. Function call tracing on x86 is now faster too
This change adds tests for the new memory manager code particularly with
its windows support. Function call tracing now works reliably on Silicon
since our function hooker was missing new Apple self-modifying code APIs
Many tests that were disabled a long time ago on aarch64 are reactivated
by this change, now that arm support is on equal terms with x86. There's
been a lot of places where ftrace could cause deadlocks, which have been
hunted down across all platforms thanks to new tests. A bug in Windows's
kill() function has been identified.
This change makes fork() go nearly as fast as sys_fork() on UNIX. As for
Windows this change shaves about 4-5ms off fork() + wait() latency. This
is accomplished by using WriteProcessMemory() from the parent process to
setup the address space of a suspended process; it is better than a pipe
This change fixes a bug where nsync waiter objects would leak. It'd mean
that long-running programs like runitd would run out of file descriptors
on NetBSD where waiter objects have ksem file descriptors. On other OSes
this bug is mostly harmless since the worst that can happen with a futex
is to leak a little bit of ram. The bug was caused because tib_nsync was
sneaking back in after the finalization code had cleared it. This change
refactors the thread exiting code to handle nsync teardown appropriately
and in making this change I found another issue, which is that user code
which is buggy, and tries to exit without joining joinable threads which
haven't been detached, would result in a deadlock. That doesn't sound so
bad, except the main thread is a joinable thread. So this deadlock would
be triggered in ways that put libc at fault. So we now auto-join threads
and libc will log a warning to --strace when that happens for any thread
On Windows, mmap() now chooses addresses transactionally. It reduces the
risk of badness when interacting with the WIN32 memory manager. We don't
throw darts anymore. There is also no more retry limit, since we recover
from mystery maps more gracefully. The subroutine for combining adjacent
maps has been rewritten for clarity. The print maps subroutine is better
This change goes to great lengths to perfect the stack overflow code. On
Windows you can now longjmp() out of a crash signal handler. Guard pages
previously weren't being restored properly by the signal handler. That's
fixed, so on Windows you can now handle a stack overflow multiple times.
Great thought has been put into selecting the perfect SIGSTKSZ constants
so you can save sigaltstack() memory. You can now use kprintf() with 512
bytes of stack available. The guard pages beneath the main stack are now
recorded in the memory manager.
This change fixes getcontext() so it works right with the %rax register.
This change doubles the performance of thread spawning. That's thanks to
our new stack manager, which allows us to avoid zeroing stacks. It gives
us 15µs spawns rather than 30µs spawns on Linux. Also, pthread_exit() is
faster now, since it doesn't need to acquire the pthread GIL. On NetBSD,
that helps us avoid allocating too many semaphores. Even if that happens
we're now able to survive semaphores running out and even memory running
out, when allocating *NSYNC waiter objects. I found a lot more rare bugs
in the POSIX threads runtime that could cause things to crash, if you've
got dozens of threads all spawning and joining dozens of threads. I want
cosmo to be world class production worthy for 2025 so happy holidays all
This change introduces a new deadlock detector for Cosmo's POSIX threads
implementation. Error check mutexes will now track a DAG of nested locks
and report EDEADLK when a deadlock is theoretically possible. These will
occur rarely, but it's important for production hardening your code. You
don't even need to change your mutexes to use the POSIX error check mode
because `cosmocc -mdbg` will enable error checking on mutexes by default
globally. When cycles are found, an error message showing your demangled
symbols describing the strongly connected component are printed and then
the SIGTRAP is raised, which means you'll also get a backtrace if you're
using ShowCrashReports() too. This new error checker is so low-level and
so pure that it's able to verify the relationships of every libc runtime
lock, including those locks upon which the mutex implementation depends.
It's now possible with cosmo and redbean, to deliver a signal to a child
process after it has called execve(). However the executed program needs
to be compiled using cosmocc. The cosmo runtime WinMain() implementation
now intercepts a _COSMO_PID environment variable that's set by execve().
It ensures the child process will use the same C:\ProgramData\cosmo\sigs
file, which is where kill() will place the delivered signal. We are able
to do this on Windows even better than NetBSD, which has a bug with this
Fixes#1334
On Windows, sometimes fork() could crash with message likes:
fork() ViewOrDie(170000) failed with win32 error 487
This is due to a bug in our file descriptor inheritance. We have cursors
which are shared between processes. They let us track the file positions
of read() and write() operations. At startup they were being mmap()ed to
memory addresses that were assigned by WIN32. That's bad because Windows
likes to give us memory addresses beneath the program image in the first
4mb range that are likely to conflict with other assignments. That ended
up causing problems because fork() needs to be able to assume that a map
will be possible to resurrect at the same address. But for one reason or
another, Windows libraries we don't control could sneak allocations into
the memory space that overlap with these mappings. This change solves it
by choosing a random memory address instead when mapping cursor objects.
We were too zealous about security before by only setting the owner bits
and that would cause issues for projects like redbean that check "other"
bits to determine if it's safe to serve a file. Since that doesn't exist
on Windows, it's better to have things work than not work. So what we'll
do instead is return modes like 0664 for files and 0775 for directories.
This change gets rsync working without any warning or errors. On Windows
we now create a bunch of C:\var\sig\x\y.pid shared memory files, so sigs
can be delivered between processes. WinMain() creates this file when the
process starts. If the program links signaling system calls then we make
a thread at startup too, which allows asynchronous delivery each quantum
and cancelation points can spot these signals potentially faster on wait
See #1240
- wcsstr() is now linearly complex
- strstr16() is now linearly complex
- strstr() is now vectorized on aarch64 (10x)
- strstr() now uses KMP on pathological cases
- memmem() is now vectorized on aarch64 (10x)
- memmem() now uses KMP on pathological cases
- Disable shared_ptr::owner_before until fixed
- Make iswlower(), iswupper() consistent with glibc
- Remove figure space from iswspace() implementation
- Include line and paragraph separator in iswcntrl()
- Use Musl wcwidth(), iswalpha(), iswpunct(), towlower(), towupper()
C++ code compiles very slowly with cosmocc, possibly because we're using
LLVM LIBCXX with GCC, and LLVM doesn't work as hard to make GCC go fast.
Therefore, it should be possible, to ask cosmocc to favor Clang over GCC
under the hood. On llamafile, my intention's to use this to make certain
files, e.g. llama.cpp/common.cpp, go from taking 17 seconds to 5 seconds
This new -mclang flag isn't ready for production yet since there's still
the question of how to get Clang to generate SJLJ exception code. If you
use this, then it's recommended you also pass -fno-exceptions.
The tradeoff is we're adding a 121mb binary to the cosmocc distribution.
There are no plans as of yet to fully migrate to Clang since GCC is very
good and has always treated us well.
It turns out sched_getcpu() didn't work on many platforms. So the system
call now has tests and is well documented. We now employ new workarounds
on platforms where it isn't supported in our malloc() implementation. It
was previously the case that malloc() was only scalable on Linux/Windows
for x86-64. Now the other platforms are scalable too.
This is a breaking change. It defines the new environment variable named
_COSMO_FDS_V2 which is used for inheriting non-stdio file descriptors on
execve() or posix_spawn(). No effort has been spent thus far integrating
with the older variable. If a new binary launches the older ones or vice
versa they'll only be able to pass stdin / stdout / stderr to each other
therefore it's important that you upgrade all your cosmo binaries if you
depend on this functionality. You'll be glad you did because inheritance
of file descriptors is more aligned with the POSIX standard than before.
This change ensures that if a file descriptor for an open disk file gets
shared by multiple processes within a process tree, then lseek() changes
will be visible across processes, and read() / write() are synchronized.
Note this only applies to Windows, because UNIX kernels already do this.
We now have implement all of Musl's localization code, the same way that
Musl implements localization. You may need setlocale(LC_ALL, "C.UTF-8"),
just in case anything stops working as expected.
The cosmocc.zip toolchain will now include four builds of the libcosmo.a
runtime libraries. You can pass the -mdbg flag if you want to debug your
cosmopolitan runtime. You can pass the -moptlinux flag if you don't want
windows code lurking in your binary. See tool/cosmocc/README.md for more
details on how these flags may be used and their important implications.
- CLOCK_THREAD_CPUTIME_ID
- CLOCK_PROCESS_CPUTIME_ID
Cosmo now supports the above constants universally across supported OSes
therefore it's now safe to let programs detect their presence w/ #ifdefs
Cosmopolitan Libc once called this important function although somewhere
along the way, possibly in a refactoring, it got removed and __tls_alloc
has always been zero ever since.
The memory leak detector was crashing. When using gc() you shouldn't use
the CheckForMemoryLeaks() function from inside the same function, due to
how it runs the atexit handlers.
Cosmopolitan now supports mremap(), which is only supported on Linux and
NetBSD. First, it allows memory mappings to be relocated without copying
them; this can dramatically speed up data structures like std::vector if
the array size grows larger than 256kb. The mremap() system call is also
10x faster than munmap() when shrinking large memory mappings.
There's now two functions, getpagesize() and getgransize() which help to
write portable code that uses mmap(MAP_FIXED). Alternative sysconf() may
be called with our new _SC_GRANSIZE. The madvise() system call now has a
better wrapper with improved documentation.
It's now possible to create thousands of thousands of sparse independent
memory mappings, without any slowdown. The memory manager is better with
tracking memory protection now, particularly on Windows in a precise way
that can be restored during fork(). You now have the highest quality mem
manager possible. It's even better than some OSes like XNU, where mmap()
is implemented as an O(n) operation which means sadly things aren't much
improved over there. With this change the llamafile HTTP server endpoint
at /tokenize with a prompt of 50 tokens is now able to handle 2.6m r/sec
This change reduces o/tiny/examples/life from 44kb to 24kb in size since
it avoids linking mmap() when unnecessary. This is important, to helping
cosmo not completely lose touch with its roots.
- Ensure SIGTHR isn't blocked in newly created threads
- Use TIB rather than thread_local for thread atexits
- Make POSIX thread keys atomic within thread
- Don't bother logging prctl() to --strace
- Log thread destructor names to --strace