Commit graph

44 commits

Author SHA1 Message Date
Justine Tunney
2f48a02b44
Make recursive mutexes faster
Recursive mutexes now go as fast as normal mutexes. The tradeoff is they
are no longer safe to use in signal handlers. However you can still have
signal safe mutexes if you set your mutex to both recursive and pshared.
You can also make functions that use recursive mutexes signal safe using
sigprocmask to ensure recursion doesn't happen due to any signal handler

The impact of this change is that, on Windows, many functions which edit
the file descriptor table rely on recursive mutexes, e.g. open(). If you
develop your app so it uses pread() and pwrite() then your app should go
very fast when performing a heavily multithreaded and contended workload

For example, when scaling to 40+ cores, *NSYNC mutexes can go as much as
1000x faster (in CPU time) than the naive recursive lock implementation.
Now recursive will use *NSYNC under the hood when it's possible to do so
2024-09-10 00:08:59 -07:00
Justine Tunney
90460ceb3c
Make Cosmo mutexes competitive with Apple Libc
While we have always licked glibc and musl libc on gnu/systemd sadly the
Apple Libc implementation of pthread_mutex_t is better than ours. It may
be due to how the XNU kernel and M2 microprocessor are in league when it
comes to scheduling processes and the NSYNC behavior is being penalized.
We can solve this by leaning more heavily on ulock using Drepper's algo.
It's kind of ironic that Linux's official mutexes work terribly on Linux
but almost as good as Apple Libc if used on MacOS.
2024-09-02 19:03:11 -07:00
Justine Tunney
59692b0882
Make spinlocks faster (take two)
This change is green on x86 and arm test fleet.
2024-07-26 00:45:24 -07:00
Justine Tunney
02e1cbcd00
Revert "Make spin locks go faster"
This reverts commit c8e25d811c.
2024-07-25 22:24:32 -07:00
Justine Tunney
c8e25d811c
Make spin locks go faster 2024-07-25 17:37:11 -07:00
Justine Tunney
e398f3887c
Make more improvements to threads and mappings
- NetBSD should now have faster synchronization
- POSIX barriers may now be shared across processes
- An edge case with memory map tracking has been fixed
- Grand Central Dispatch is no longer used on MacOS ARM64
- POSIX mutexes in normal mode now use futexes across processes
2024-07-24 01:19:54 -07:00
Justine Tunney
0a9a6f86bb
Support process shared condition variables 2024-07-22 16:35:29 -07:00
Justine Tunney
86d884cce2
Get rid of .internal.h convention in LIBC_INTRIN 2024-07-19 19:38:00 -07:00
Justine Tunney
8c645fa1ee
Make mmap() scalable
It's now possible to create thousands of thousands of sparse independent
memory mappings, without any slowdown. The memory manager is better with
tracking memory protection now, particularly on Windows in a precise way
that can be restored during fork(). You now have the highest quality mem
manager possible. It's even better than some OSes like XNU, where mmap()
is implemented as an O(n) operation which means sadly things aren't much
improved over there. With this change the llamafile HTTP server endpoint
at /tokenize with a prompt of 50 tokens is now able to handle 2.6m r/sec
2024-07-05 23:26:00 -07:00
Justine Tunney
464858dbb4
Fix bugs with new memory manager
This fixes a regression in mmap(MAP_FIXED) on Windows caused by a recent
revision. This change also fixes ZipOS so it no longer needs a MAP_FIXED
mapping to open files from the PKZIP store. The memory mapping mutex was
implemented incorrectly earlier which meant that ftrace and strace could
cause cause crashes. This lock and other recursive mutexes are rewritten
so that it should be provable that recursive mutexes in cosmopolitan are
asynchronous signal safe.
2024-06-29 10:53:57 -07:00
Jōshin
e16a7d8f3b
flip et / noet in modelines
`et` means `expandtab`.

```sh
rg 'vi: .* :vi' -l -0 | \
  xargs -0 sed -i '' 's/vi: \(.*\) et\(.*\)  :vi/vi: \1 xoet\2:vi/'
rg 'vi: .*  :vi' -l -0 | \
  xargs -0 sed -i '' 's/vi: \(.*\)noet\(.*\):vi/vi: \1et\2  :vi/'
rg 'vi: .*  :vi' -l -0 | \
  xargs -0 sed -i '' 's/vi: \(.*\)xoet\(.*\):vi/vi: \1noet\2:vi/'
```
2023-12-07 22:17:11 -05:00
Jōshin
394d998315
Fix vi modelines (#989)
At least in neovim, `│vi:` is not recognized as a modeline because it
has no preceding whitespace. After fixing this, opening a file yields
an error because `net` is not an option. (`noet`, however, is.)
2023-12-05 14:37:54 -08:00
Justine Tunney
fadb64a2bf
Introduce pthread_rwlock_try{rd,wr}lock
This also changes recursive mutexes to favor cpu over scheduler yield.
2023-10-31 22:13:08 -07:00
Justine Tunney
3ffc17c50e
Add Cosmopolitan to uname() 2023-09-21 23:51:55 -07:00
Justine Tunney
ec480f5aa0
Make improvements
- Every unit test now passes on Apple Silicon. The final piece of this
  puzzle was porting our POSIX threads cancelation support, since that
  works differently on ARM64 XNU vs. AMD64. Our semaphore support on
  Apple Silicon is also superior now compared to AMD64, thanks to the
  grand central dispatch library which lets *NSYNC locks go faster.

- The Cosmopolitan runtime is now more stable, particularly on Windows.
  To do this, thread local storage is mandatory at all runtime levels,
  and the innermost packages of the C library is no longer being built
  using ASAN. TLS is being bootstrapped with a 128-byte TIB during the
  process startup phase, and then later on the runtime re-allocates it
  either statically or dynamically to support code using _Thread_local.
  fork() and execve() now do a better job cooperating with threads. We
  can now check how much stack memory is left in the process or thread
  when functions like kprintf() / execve() etc. call alloca(), so that
  ENOMEM can be raised, reduce a buffer size, or just print a warning.

- POSIX signal emulation is now implemented the same way kernels do it
  with pthread_kill() and raise(). Any thread can interrupt any other
  thread, regardless of what it's doing. If it's blocked on read/write
  then the killer thread will cancel its i/o operation so that EINTR can
  be returned in the mark thread immediately. If it's doing a tight CPU
  bound operation, then that's also interrupted by the signal delivery.
  Signal delivery works now by suspending a thread and pushing context
  data structures onto its stack, and redirecting its execution to a
  trampoline function, which calls SetThreadContext(GetCurrentThread())
  when it's done.

- We're now doing a better job managing locks and handles. On NetBSD we
  now close semaphore file descriptors in forked children. Semaphores on
  Windows can now be canceled immediately, which means mutexes/condition
  variables will now go faster. Apple Silicon semaphores can be canceled
  too. We're now using Apple's pthread_yield() funciton. Apple _nocancel
  syscalls are now used on XNU when appropriate to ensure pthread_cancel
  requests aren't lost. The MbedTLS library has been updated to support
  POSIX thread cancelations. See tool/build/runitd.c for an example of
  how it can be used for production multi-threaded tls servers. Handles
  on Windows now leak less often across processes. All i/o operations on
  Windows are now overlapped, which means file pointers can no longer be
  inherited across dup() and fork() for the time being.

- We now spawn a thread on Windows to deliver SIGCHLD and wakeup wait4()
  which means, for example, that posix_spawn() now goes 3x faster. POSIX
  spawn is also now more correct. Like Musl, it's now able to report the
  failure code of execve() via a pipe although our approach favors using
  shared memory to do that on systems that have a true vfork() function.

- We now spawn a thread to deliver SIGALRM to threads when setitimer()
  is used. This enables the most precise wakeups the OS makes possible.

- The Cosmopolitan runtime now uses less memory. On NetBSD for example,
  it turned out the kernel would actually commit the PT_GNU_STACK size
  which caused RSS to be 6mb for every process. Now it's down to ~4kb.
  On Apple Silicon, we reduce the mandatory upstream thread size to the
  smallest possible size to reduce the memory overhead of Cosmo threads.
  The examples directory has a program called greenbean which can spawn
  a web server on Linux with 10,000 worker threads and have the memory
  usage of the process be ~77mb. The 1024 byte overhead of POSIX-style
  thread-local storage is now optional; it won't be allocated until the
  pthread_setspecific/getspecific functions are called. On Windows, the
  threads that get spawned which are internal to the libc implementation
  use reserve rather than commit memory, which shaves a few hundred kb.

- sigaltstack() is now supported on Windows, however it's currently not
  able to be used to handle stack overflows, since crash signals are
  still generated by WIN32. However the crash handler will still switch
  to the alt stack, which is helpful in environments with tiny threads.

- Test binaries are now smaller. Many of the mandatory dependencies of
  the test runner have been removed. This ensures many programs can do a
  better job only linking the the thing they're testing. This caused the
  test binaries for LIBC_FMT for example, to decrease from 200kb to 50kb

- long double is no longer used in the implementation details of libc,
  except in the APIs that define it. The old code that used long double
  for time (instead of struct timespec) has now been thoroughly removed.

- ShowCrashReports() is now much tinier in MODE=tiny. Instead of doing
  backtraces itself, it'll just print a command you can run on the shell
  using our new `cosmoaddr2line` program to view the backtrace.

- Crash report signal handling now works in a much better way. Instead
  of terminating the process, it now relies on SA_RESETHAND so that the
  default SIG_IGN behavior can terminate the process if necessary.

- Our pledge() functionality has now been fully ported to AARCH64 Linux.
2023-09-18 21:04:47 -07:00
Justine Tunney
00084577a3
Improve posix_spawn() some more 2023-09-12 08:58:57 -07:00
Justine Tunney
0d748ad58e
Fix warnings
This change fixes Cosmopolitan so it has fewer opinions about compiler
warnings. The whole repository had to be cleaned up to be buildable in
-Werror -Wall mode. This lets us benefit from things like strict const
checking. Some actual bugs might have been caught too.
2023-09-01 20:50:18 -07:00
Justine Tunney
648bf6555c
Reimport zip into third party 2022-10-16 13:39:41 -07:00
Justine Tunney
60cb435cb4
Implement pthread_atfork()
If threads are being used, then fork() will now acquire and release and
runtime locks so that fork() may be safely used from threads. This also
makes vfork() thread safe, because pthread mutexes will do nothing when
the process is a child of vfork(). More torture tests have been written
to confirm this all works like a charm. Additionally:

- Invent hexpcpy() api
- Rename nsync_malloc_() to kmalloc()
- Complete posix named semaphore implementation
- Make pthread_create() asynchronous signal safe
- Add rm, rmdir, and touch to command interpreter builtins
- Invent sigisprecious() and modify sigset functions to use it
- Add unit tests for posix_spawn() attributes and fix its bugs

One unresolved problem is the reclaiming of *NSYNC waiter memory in the
forked child processes, within apps which have threads waiting on locks
2022-10-16 12:25:13 -07:00
Justine Tunney
654ceaba7d
Clean up threading code some more 2022-09-13 20:17:34 -07:00
Justine Tunney
6f7d0cb1c3
Pay off more technical debt
This makes breaking changes to add underscores to many non-standard
function names provided by the c library. MODE=tiny is now tinier and
we now use smaller locks that are better for tiny apps in this mode.
Some headers have been renamed to be in the same folder as the build
package, so it'll be easier to know which build dependency is needed.
Certain old misguided interfaces have been removed. Intel intrinsics
headers are now listed in libc/isystem (but not in the amalgamation)
to help further improve open source compatibility. Header complexity
has also been reduced. Lastly, more shell scripts are now available.
2022-09-12 23:36:56 -07:00
Justine Tunney
b5cb71ab84
Use *NSYNC for POSIX threads locking APIs
Condition variables, barriers, and r/w locks now work very well.
2022-09-11 11:04:50 -07:00
Justine Tunney
cfcf5918bc
Rewrite recursive mutex code 2022-09-10 09:18:52 -07:00
Justine Tunney
155b378a39
Tidy up the threading implementation
The organization of the source files is now much more rational.
Old experiments that didn't work out are now deleted. Naming of
things like files is now more intuitive.
2022-09-10 02:56:25 -07:00
Justine Tunney
4339d9f15e
Add pthread attributes and other libc functions 2022-09-07 05:28:32 -07:00
Justine Tunney
7ff0ea8c13
Make pthread mutexes more scalable
pthread_mutex_lock() now uses a better algorithm which goes much faster
in multithreaded environments that have lock contention. This comes at
the cost of adding some fixed-cost overhead to mutex invocations. That
doesn't matter for Cosmopolitan because our core libraries all encode
locking operations as NOP instructions when in single-threaded mode.
Overhead only applies starting the moment you first call clone().
2022-09-05 15:57:51 -07:00
Justine Tunney
9be364d40a
Implement POSIX threads API 2022-09-05 08:27:15 -07:00
Justine Tunney
83d41e4588 Clean up some code 2022-08-20 12:32:51 -07:00
Justine Tunney
05b8f82371 Fold LIBC_BITS into LIBC_INTRIN 2022-08-11 12:13:18 -07:00
Justine Tunney
10fd8bdb70 Unbloat the build
This change resurrects ae5d06dc53
2022-08-11 00:15:29 -07:00
Justine Tunney
c1d99676c4 Revert "Unbloat build config"
This reverts commit ae5d06dc53.
2022-08-10 12:44:56 -07:00
Justine Tunney
ae5d06dc53 Unbloat build config
- 10.5% reduction of o//depend dependency graph
- 8.8% reduction in latency of make command
- Fix issue with temporary file cleanup

There's a new -w option in compile.com that turns off the recent
Landlock output path workaround for "good commands" which do not
unlink() the output file like GNU tooling does.

Our new GNU Make unveil sandboxing appears to have zero overhead
in the grand scheme of things. Full builds are pretty fast since
the only thing that's actually slowed us down is probably libcxx

    make -j16 MODE=rel
    RL: took 85,732,063µs wall time
    RL: ballooned to 323,612kb in size
    RL: needed 828,560,521µs cpu (11% kernel)
    RL: caused 39,080,670 page faults (99% memcpy)
    RL: 350,073 context switches (72% consensual)
    RL: performed 0 reads and 11,494,960 write i/o operations

pledge() and unveil() no longer consider ENOSYS to be an error.
These functions have also been added to Python's cosmo module.

This change also removes some WIN32 APIs and System Five magnums
which we're not using and it's doubtful anyone else would be too
2022-08-10 04:43:09 -07:00
Justine Tunney
4f4889ddf7 Use futexes on OpenBSD and improve threading 2022-07-17 19:59:49 -07:00
Justine Tunney
fbc053e018 Make fixes and improvements
- Introduce __assert_disable global
- Improve strsignal() thread safety
- Make system call tracing thread safe
- Fix SO_RCVTIMEO / SO_SNDTIMEO on Windows
- Refactor DescribeFoo() functions into one place
- Fix fork() on Windows when TLS and MAP_STACK exist
- Round upwards in setsockopt(SO_RCVTIMEO) on Windows
- Disable futexes on OpenBSD which seem extremely broken
- Implement a better kludge for monotonic time on Windows
2022-06-25 21:09:09 -07:00
Justine Tunney
a4601a24d3 Perform some code cleanup 2022-06-23 10:21:07 -07:00
Justine Tunney
b8e9f71d33 Add futexes for OpenBSD 2022-06-21 07:17:05 -07:00
Justine Tunney
d5312b60f7 Make improvements to locking
This change makes pthread_mutex_lock() as fast as _spinlock() by
default. Thread instability issues on NetBSD have been resolved.
Improvements made to gdtoa thread code. Crash reporting will now
synchronize between threads in a slightly better way.
2022-06-19 01:30:12 -07:00
Justine Tunney
c06ffd458c Write some lock contention tests 2022-06-16 09:06:09 -07:00
Justine Tunney
e466dd0553 Add torture test for zipos file descriptors
This change hardens the code for opening /zip/ files using the system
call interface. Thread safety and signal safety has been improved for
file descriptors in general. We now document fixed addresses that are
needed for low level allocations.
2022-06-15 16:29:49 -07:00
Justine Tunney
a3865ecc3c Make more fixes and improvements
- Fix Makefile flaking due to ZIPOBJ_FLAGS generation
- Make printf() floating point and gdtoa thread safe
- Polish up the runit / runitd programs some more
- Prune some more makefile dependencies
2022-06-13 11:02:13 -07:00
Justine Tunney
8b72490431 Make mutex calling code 10x tinier
Calls to lock/unlock functions are now NOPs by default. The first time
clone() is called, they get turned into CALL instructions. Doing this
caused funcctions like fputc() to shrink from 85 bytes to 45+4 bytes.
Since the ANSI solution of `(__threaded && lock())` inlines os much
superfluous binary content into functions all over the place.
2022-06-12 20:17:12 -07:00
Justine Tunney
8cdec62f5b Apply even more fixups
- Finish cleaning up the stdio unlocked APIs
- Make __cxa_finalize() properly thread safe
- Don't log locks if threads aren't being used
- Add some more mutex guards to places using _mmi
- Specific lock names now appear in the --ftrace logs
- Fix mkdeps.com generating invalid Makefiles sometimes
- Simplify and fix bugs in the test runner infrastructure
- Fix issue where sometimes some functions wouldn't be logged
2022-06-12 11:57:00 -07:00
Justine Tunney
4ddfc47d6e Make some more fixups 2022-06-12 09:37:17 -07:00
Justine Tunney
c260345e06 Make locks more reliable
This change switches most of the core locks to be re-entrant, in order
to reduce the chance of deadlocking code that does, clever things with
asynchronous signal handlers. This change implements it it in pthreads
so we're one step closer to having a standardized threading primitives
2022-06-11 02:07:20 -07:00