Commit graph

264 commits

Author SHA1 Message Date
Justine Tunney
42a3bb729a
Make execve() linger when it can't spoof parent
It's now possible to use execve() when the parent process isn't built by
cosmo. In such cases, the current process will kill all threads and then
linger around, waiting for the newly created process to die, and then we
propagate its exit code to the parent. This should help bazel and others

Allocating private anonymous memory is now 5x faster on Windows. This is
thanks to VirtualAlloc() which is faster than the file mapping APIs. The
fork() function also now goes 30% faster, since we are able to avoid the
VirtualProtect() calls on mappings in most cases now.

Fixes #1253
2025-01-04 21:13:37 -08:00
Justine Tunney
c97a858470
Remove missing definitions 2025-01-04 00:20:45 -08:00
Justine Tunney
e939659b70
Fix ordering of pthread_create(pthread_t *thread)
This change fixes a bug where signal_latency_async_test would flake less
than 1/1000 of the time. What was happening was pthread_kill(sender_thr)
would return EFAULT. This was because pthread_create() was not returning
the thread object pointer until after clone() had been called. So it was
actually possible for the main thread to stall after calling clone() and
during that time the receiver would launch and receive a signal from the
sender thread, and then fail when it tried to send a pong. I thought I'd
use a barrier at first, in the test, to synchronize thread creation, but
I firmly believe that pthread_create() was to blame and now that's fixed
2025-01-03 17:34:29 -08:00
Justine Tunney
ed6d133a27
Use tgkill() on Linux and FreeBSD
This eliminates the chance of rare bugs when thread IDs are recycled.
2025-01-03 17:27:13 -08:00
Justine Tunney
662e7b217f
Remove pthread_setcanceltype() from non-dbg strace 2025-01-02 22:25:29 -08:00
Justine Tunney
a15958edc6
Remove some legacy cruft
Function trace logs will report stack usage accurately. It won't include
the argv/environ block. Our clone() polyfill is now simpler and does not
use as much stack memory. Function call tracing on x86 is now faster too
2025-01-02 18:44:07 -08:00
Justine Tunney
fde03f8487
Remove leaf attribute where appropriate
This change fixes a bug where gcc assumed thread synchronization such as
pthread_cond_wait() wouldn't alter static variables, because the headers
were using __attribute__((__leaf__)) inappropriately.
2025-01-02 08:07:15 -08:00
Justine Tunney
0b3c81dd4e
Make fork() go 30% faster
This change makes fork() go nearly as fast as sys_fork() on UNIX. As for
Windows this change shaves about 4-5ms off fork() + wait() latency. This
is accomplished by using WriteProcessMemory() from the parent process to
setup the address space of a suspended process; it is better than a pipe
2025-01-01 04:59:38 -08:00
Justine Tunney
98c5847727
Fix fork waiter leak in nsync
This change fixes a bug where nsync waiter objects would leak. It'd mean
that long-running programs like runitd would run out of file descriptors
on NetBSD where waiter objects have ksem file descriptors. On other OSes
this bug is mostly harmless since the worst that can happen with a futex
is to leak a little bit of ram. The bug was caused because tib_nsync was
sneaking back in after the finalization code had cleared it. This change
refactors the thread exiting code to handle nsync teardown appropriately
and in making this change I found another issue, which is that user code
which is buggy, and tries to exit without joining joinable threads which
haven't been detached, would result in a deadlock. That doesn't sound so
bad, except the main thread is a joinable thread. So this deadlock would
be triggered in ways that put libc at fault. So we now auto-join threads
and libc will log a warning to --strace when that happens for any thread
2024-12-31 01:30:13 -08:00
Justine Tunney
379cd77078
Improve memory manager and signal handling
On Windows, mmap() now chooses addresses transactionally. It reduces the
risk of badness when interacting with the WIN32 memory manager. We don't
throw darts anymore. There is also no more retry limit, since we recover
from mystery maps more gracefully. The subroutine for combining adjacent
maps has been rewritten for clarity. The print maps subroutine is better

This change goes to great lengths to perfect the stack overflow code. On
Windows you can now longjmp() out of a crash signal handler. Guard pages
previously weren't being restored properly by the signal handler. That's
fixed, so on Windows you can now handle a stack overflow multiple times.
Great thought has been put into selecting the perfect SIGSTKSZ constants
so you can save sigaltstack() memory. You can now use kprintf() with 512
bytes of stack available. The guard pages beneath the main stack are now
recorded in the memory manager.

This change fixes getcontext() so it works right with the %rax register.
2024-12-27 01:33:00 -08:00
Justine Tunney
36e5861b0c
Reduce stack virtual memory consumption on Linux 2024-12-25 20:58:08 -08:00
Justine Tunney
2de3845b25
Build tool for hunting down flakes 2024-12-24 11:36:16 -08:00
Justine Tunney
93e22c581f
Reduce pthread memory usage 2024-12-24 10:30:59 -08:00
Justine Tunney
ec2db4e40e
Avoid pthread_rwlock_wrlock() starvation 2024-12-24 10:30:11 -08:00
Justine Tunney
55b7aa1632
Allow user to override pthread mutex and cond 2024-12-23 21:57:52 -08:00
Justine Tunney
c8e10eef30
Make bulk_free() go faster 2024-12-23 20:31:57 -08:00
Justine Tunney
624573207e
Make threads faster and more reliable
This change doubles the performance of thread spawning. That's thanks to
our new stack manager, which allows us to avoid zeroing stacks. It gives
us 15µs spawns rather than 30µs spawns on Linux. Also, pthread_exit() is
faster now, since it doesn't need to acquire the pthread GIL. On NetBSD,
that helps us avoid allocating too many semaphores. Even if that happens
we're now able to survive semaphores running out and even memory running
out, when allocating *NSYNC waiter objects. I found a lot more rare bugs
in the POSIX threads runtime that could cause things to crash, if you've
got dozens of threads all spawning and joining dozens of threads. I want
cosmo to be world class production worthy for 2025 so happy holidays all
2024-12-21 22:13:00 -08:00
Justine Tunney
af7bd80430
Eliminate cyclic locks in runtime
This change introduces a new deadlock detector for Cosmo's POSIX threads
implementation. Error check mutexes will now track a DAG of nested locks
and report EDEADLK when a deadlock is theoretically possible. These will
occur rarely, but it's important for production hardening your code. You
don't even need to change your mutexes to use the POSIX error check mode
because `cosmocc -mdbg` will enable error checking on mutexes by default
globally. When cycles are found, an error message showing your demangled
symbols describing the strongly connected component are printed and then
the SIGTRAP is raised, which means you'll also get a backtrace if you're
using ShowCrashReports() too. This new error checker is so low-level and
so pure that it's able to verify the relationships of every libc runtime
lock, including those locks upon which the mutex implementation depends.
2024-12-16 22:25:12 -08:00
Justine Tunney
2d43d400c6
Support process shared pthread_rwlock
Cosmo now has a non-nsync implementation of POSIX read-write locks. It's
possible to call pthread_rwlockattr_setpshared in PTHREAD_PROCESS_SHARED
mode. Furthermore, if cosmo is built with PTHREAD_USE_NSYNC set to zero,
then Cosmo shouldn't use nsync at all. That's helpful if you want to not
link any Apache 2.0 licensed code.
2024-12-13 03:00:06 -08:00
Justine Tunney
9ddbfd921e
Introduce cosmo_futex_wait and cosmo_futex_wake
Cosmopolitan Futexes are now exposed as a public API.
2024-11-22 11:25:15 -08:00
Justine Tunney
fef24d622a
Work around copy_file_range() bug in eCryptFs
When programs like ar.ape and compile.ape are run on eCryptFs partitions
on Linux, copy_file_range() will fail with EINVAL which is wrong because
eCryptFs which doesn't support this system call, should raise EOPNOTSUPP

See https://github.com/jart/cosmopolitan/discussions/1305
2024-09-29 16:35:38 -07:00
Justine Tunney
dd8c4dbd7d
Write more tests for signal handling
There's now a much stronger level of assurance that signaling on Windows
will be atomic, low-latency, low tail latency, and shall never deadlock.
2024-09-21 05:24:56 -07:00
Justine Tunney
949c398327
Clean up more code 2024-09-15 02:45:16 -07:00
Justine Tunney
462ba6909e
Speed up unnamed POSIX semaphores
When sem_wait() used its futexes it would always use process shared mode
which can be problematic on platforms like Windows, where that causes it
to use the slow futex polyfill. Now when sem_init() is called in private
mode that'll be passed along so we can use a faster WaitOnAddress() call
2024-09-13 06:25:27 -07:00
Justine Tunney
fbdf9d028c
Rewrite Windows poll()
We can now await signals, files, pipes, and console simultaneously. This
change also gives a deeper review and testing to changes made yesterday.
2024-09-10 20:04:02 -07:00
Justine Tunney
a0a404a431
Fix issues with previous commit 2024-09-10 01:59:46 -07:00
Justine Tunney
2f48a02b44
Make recursive mutexes faster
Recursive mutexes now go as fast as normal mutexes. The tradeoff is they
are no longer safe to use in signal handlers. However you can still have
signal safe mutexes if you set your mutex to both recursive and pshared.
You can also make functions that use recursive mutexes signal safe using
sigprocmask to ensure recursion doesn't happen due to any signal handler

The impact of this change is that, on Windows, many functions which edit
the file descriptor table rely on recursive mutexes, e.g. open(). If you
develop your app so it uses pread() and pwrite() then your app should go
very fast when performing a heavily multithreaded and contended workload

For example, when scaling to 40+ cores, *NSYNC mutexes can go as much as
1000x faster (in CPU time) than the naive recursive lock implementation.
Now recursive will use *NSYNC under the hood when it's possible to do so
2024-09-10 00:08:59 -07:00
Justine Tunney
95fee8614d
Test recursive mutex code more 2024-09-09 00:19:23 -07:00
Justine Tunney
dd8544c3bd
Delve into clock rabbit hole
The worst issue I had with consts.sh for clock_gettime is how it defined
too many clocks. So I looked into these clocks all day to figure out how
how they overlap in functionality. I discovered counter-intuitive things
such as how CLOCK_MONOTONIC should be CLOCK_UPTIME on MacOS and BSD, and
that CLOCK_BOOTTIME should be CLOCK_MONOTONIC on MacOS / BSD. Windows 10
also has some incredible new APIs, that let us simplify clock_gettime().

  - Linux CLOCK_REALTIME         -> GetSystemTimePreciseAsFileTime()
  - Linux CLOCK_MONOTONIC        -> QueryUnbiasedInterruptTimePrecise()
  - Linux CLOCK_MONOTONIC_RAW    -> QueryUnbiasedInterruptTimePrecise()
  - Linux CLOCK_REALTIME_COARSE  -> GetSystemTimeAsFileTime()
  - Linux CLOCK_MONOTONIC_COARSE -> QueryUnbiasedInterruptTime()
  - Linux CLOCK_BOOTTIME         -> QueryInterruptTimePrecise()

Documentation on the clock crew has been added to clock_gettime() in the
docstring and in redbean's documentation too. You can read that to learn
interesting facts about eight essential clocks that survived this purge.
This is original research you will not find on Google, OpenAI, or Claude

I've tested this change by porting *NSYNC to become fully clock agnostic
since it has extensive tests for spotting irregularities in time. I have
also included these tests in the default build so they no longer need to
be run manually. Both CLOCK_REALTIME and CLOCK_MONOTONIC are good across
the entire amd64 and arm64 test fleets.
2024-09-04 01:32:46 -07:00
Justine Tunney
3c61a541bd
Introduce pthread_condattr_setclock()
This is one of the few POSIX APIs that was missing. It lets you choose a
monotonic clock for your condition variables. This might improve perf on
some platforms. It might also grant more flexibility with NTP configs. I
know Qt is one project that believes it needs this. To introduce this, I
needed to change some the *NSYNC APIs, to support passing a clock param.
There's also new benchmarks, demonstrating Cosmopolitan's supremacy over
many libc implementations when it comes to mutex performance. Cygwin has
an alarmingly bad pthread_mutex_t implementation. It is so bad that they
would have been significantly better off if they'd used naive spinlocks.
2024-09-02 23:45:42 -07:00
Justine Tunney
90460ceb3c
Make Cosmo mutexes competitive with Apple Libc
While we have always licked glibc and musl libc on gnu/systemd sadly the
Apple Libc implementation of pthread_mutex_t is better than ours. It may
be due to how the XNU kernel and M2 microprocessor are in league when it
comes to scheduling processes and the NSYNC behavior is being penalized.
We can solve this by leaning more heavily on ulock using Drepper's algo.
It's kind of ironic that Linux's official mutexes work terribly on Linux
but almost as good as Apple Libc if used on MacOS.
2024-09-02 19:03:11 -07:00
Justine Tunney
2ec413b5a9
Fix bugs in poll(), select(), ppoll(), and pselect()
poll() and select() now delegate to ppoll() and pselect() for assurances
that both polyfill implementations are correct and well-tested. Poll now
polyfills XNU and BSD quirks re: the hanndling of POLLNVAL and the other
similar status flags. This change resolves a misunderstanding concerning
how select(exceptfds) is intended to map to POLPRI. We now use E2BIG for
bouncing requests that exceed the 64 handle limit on Windows. With pipes
and consoles on Windows our poll impl will now report POLLHUP correctly.

Issues with Windows path generation have been fixed. For example, it was
problematic on Windows to say: posix_spawn_file_actions_addchdir_np("/")
due to the need to un-UNC paths in some additional places. Calling fstat
on UNC style volume path handles will now work. posix_spawn now supports
simulating the opening of /dev/null and other special paths on Windows.

Cosmopolitan no longer defines epoll(). I think wepoll is a nice project
for using epoll() on Windows socket handles. However we need generalized
file descriptor support to make epoll() for Windows work well enough for
inclusion in a C library. It's also not worth having epoll() if we can't
get it to work on XNU and BSD OSes which provide different abstractions.
Even epoll() on Linux isn't that great of an abstraction since it's full
of footguns. Last time I tried to get it to be useful I had little luck.
Considering how long it took to get poll() and select() to be consistent
across platforms, we really have no business claiming to have epoll too.
While it'd be nice to have fully implemented, the only software that use
epoll() are event i/o libraries used by things like nodejs. Event i/o is
not the best paradigm for handling i/o; threads make so much more sense.
2024-09-02 00:29:52 -07:00
Justine Tunney
11d9fb521d
Make atomics faster on aarch64
This change implements the compiler runtime for ARM v8.1 ISE atomics and
gets rid of the mandatory -mno-outline-atomics flag. It can dramatically
speed things up, on newer ARM CPUs, as indicated by the changed lines in
test/libc/thread/footek_test.c. In llamafile dispatching on hwcap atomic
also shaved microseconds off synchronization barriers.
2024-08-16 11:14:46 -07:00
Justine Tunney
31194165d2
Remove .internal from more header filenames 2024-08-04 12:52:25 -07:00
Justine Tunney
bb815eafaf
Update Musl Libc code
We now have implement all of Musl's localization code, the same way that
Musl implements localization. You may need setlocale(LC_ALL, "C.UTF-8"),
just in case anything stops working as expected.
2024-07-30 22:51:29 -07:00
Justine Tunney
cf1559c448
Remove __threaded variable 2024-07-28 23:43:30 -07:00
Justine Tunney
e18fe1e112
Freshen build/bootstrap/cocmd
See https://news.ycombinator.com/item?id=41055121
2024-07-27 23:22:11 -07:00
Justine Tunney
18a620cc1a
Make some improvements of little consequence 2024-07-27 08:20:18 -07:00
Justine Tunney
59692b0882
Make spinlocks faster (take two)
This change is green on x86 and arm test fleet.
2024-07-26 00:45:24 -07:00
Justine Tunney
02e1cbcd00
Revert "Make spin locks go faster"
This reverts commit c8e25d811c.
2024-07-25 22:24:32 -07:00
Justine Tunney
c8e25d811c
Make spin locks go faster 2024-07-25 17:37:11 -07:00
Justine Tunney
d3a13e8d70
Improve lock hierarchy
- NetBSD no longer needs a spin lock to create semaphores
- Windows fork() now locks process manager in correct order
2024-07-24 16:05:48 -07:00
Justine Tunney
7ba9a73840
Remove more _Atomic keywords from public headers
It's been thirteen years and C++ still hasn't implemented this wonderful
simple builtin keyword. In C++23 a solution was provided for making this
work in C++ which is libcxx's stdatomic.h. Including that header schleps
in literally 253 unique header files!! Many of the header files it needs
are libc header files like pthread.h where we need to have the _Atomic()
keyword, but since <atomic> depends on pthreads we can't have it include
the <stdatomic.h> header that defines _Atomic for C++ users, and instead
we simply make the type non-atomic, hoping and praying only C code shall
use those internal data structures. This just shows how STL clowns can't
be trusted to define the innermost primitives of a language. They should
instead be focusing on being the best at algorithms and data structures.
2024-07-24 13:56:03 -07:00
Justine Tunney
5dd7ddb9ea
Remove bad defines from early days of project
These definitions were causing issues with building LLVM. It is possible
they also caused crashes we've seen with our MacOS ARM64 OpenMP support.
2024-07-24 12:11:21 -07:00
Justine Tunney
e398f3887c
Make more improvements to threads and mappings
- NetBSD should now have faster synchronization
- POSIX barriers may now be shared across processes
- An edge case with memory map tracking has been fixed
- Grand Central Dispatch is no longer used on MacOS ARM64
- POSIX mutexes in normal mode now use futexes across processes
2024-07-24 01:19:54 -07:00
Justine Tunney
6e809ee49b
Add unit test for process shared conditions 2024-07-22 18:48:54 -07:00
Justine Tunney
61c36c1dd6
Allow pthread_condattr_setpshared() to set shared 2024-07-22 18:41:45 -07:00
Justine Tunney
0a9a6f86bb
Support process shared condition variables 2024-07-22 16:35:29 -07:00
Justine Tunney
3de6632be6
Graduate some clock_gettime() constants to #define
- CLOCK_THREAD_CPUTIME_ID
- CLOCK_PROCESS_CPUTIME_ID

Cosmo now supports the above constants universally across supported OSes
therefore it's now safe to let programs detect their presence w/ #ifdefs
2024-07-22 07:14:35 -07:00
Justine Tunney
5d2d9e9640
Add back missing TlsAlloc() call
Cosmopolitan Libc once called this important function although somewhere
along the way, possibly in a refactoring, it got removed and __tls_alloc
has always been zero ever since.
2024-07-21 20:45:27 -07:00