Commit graph

94 commits

Author SHA1 Message Date
Justine Tunney
af7bd80430
Eliminate cyclic locks in runtime
This change introduces a new deadlock detector for Cosmo's POSIX threads
implementation. Error check mutexes will now track a DAG of nested locks
and report EDEADLK when a deadlock is theoretically possible. These will
occur rarely, but it's important for production hardening your code. You
don't even need to change your mutexes to use the POSIX error check mode
because `cosmocc -mdbg` will enable error checking on mutexes by default
globally. When cycles are found, an error message showing your demangled
symbols describing the strongly connected component are printed and then
the SIGTRAP is raised, which means you'll also get a backtrace if you're
using ShowCrashReports() too. This new error checker is so low-level and
so pure that it's able to verify the relationships of every libc runtime
lock, including those locks upon which the mutex implementation depends.
2024-12-16 22:25:12 -08:00
Justine Tunney
9ddbfd921e
Introduce cosmo_futex_wait and cosmo_futex_wake
Cosmopolitan Futexes are now exposed as a public API.
2024-11-22 11:25:15 -08:00
Justine Tunney
dcf9596620
Make more fixups and quality assurance 2024-10-07 15:29:53 -07:00
Justine Tunney
12cc2de22e
Make contended mutexes 30% faster on aarch64
On Raspberry Pi 5, benchmark_mu_contended takes 359µs in *NSYNC upstream
and in Cosmopolitan it takes 272µs.
2024-09-26 09:24:25 -07:00
Justine Tunney
dd8c4dbd7d
Write more tests for signal handling
There's now a much stronger level of assurance that signaling on Windows
will be atomic, low-latency, low tail latency, and shall never deadlock.
2024-09-21 05:24:56 -07:00
Justine Tunney
87a6669900
Make more Windows socket fixes and improvements
This change makes send() / sendto() always block on Windows. It's needed
because poll(POLLOUT) doesn't guarantee a socket is immediately writable
on Windows, and it caused rsync to fail because it made that assumption.
The only exception is when a SO_SNDTIMEO is specified which will EAGAIN.

Tests are added confirming MSG_WAITALL and MSG_NOSIGNAL work as expected
on all our supported OSes. Most of the platform-specific MSG_FOO magnums
have been deleted, with the exception of MSG_FASTOPEN. Your --strace log
will now show MSG_FOO flags as symbols rather than numbers.

I've also removed cv_wait_example_test because it's 0.3% flaky with Qemu
under system load since it depends on a process being readily scheduled.
2024-09-18 20:29:42 -07:00
Justine Tunney
fbdf9d028c
Rewrite Windows poll()
We can now await signals, files, pipes, and console simultaneously. This
change also gives a deeper review and testing to changes made yesterday.
2024-09-10 20:04:02 -07:00
Justine Tunney
cceddd21b2
Reduce latency of poll() on Windows
When polling sockets poll() can now let you know about an event in about
10µs rather than 10ms. If you're not polling sockets then poll() reports
console events now in microseconds instead of milliseconds.
2024-09-10 04:12:21 -07:00
Justine Tunney
2f48a02b44
Make recursive mutexes faster
Recursive mutexes now go as fast as normal mutexes. The tradeoff is they
are no longer safe to use in signal handlers. However you can still have
signal safe mutexes if you set your mutex to both recursive and pshared.
You can also make functions that use recursive mutexes signal safe using
sigprocmask to ensure recursion doesn't happen due to any signal handler

The impact of this change is that, on Windows, many functions which edit
the file descriptor table rely on recursive mutexes, e.g. open(). If you
develop your app so it uses pread() and pwrite() then your app should go
very fast when performing a heavily multithreaded and contended workload

For example, when scaling to 40+ cores, *NSYNC mutexes can go as much as
1000x faster (in CPU time) than the naive recursive lock implementation.
Now recursive will use *NSYNC under the hood when it's possible to do so
2024-09-10 00:08:59 -07:00
Justine Tunney
95fee8614d
Test recursive mutex code more 2024-09-09 00:19:23 -07:00
Justine Tunney
dd8544c3bd
Delve into clock rabbit hole
The worst issue I had with consts.sh for clock_gettime is how it defined
too many clocks. So I looked into these clocks all day to figure out how
how they overlap in functionality. I discovered counter-intuitive things
such as how CLOCK_MONOTONIC should be CLOCK_UPTIME on MacOS and BSD, and
that CLOCK_BOOTTIME should be CLOCK_MONOTONIC on MacOS / BSD. Windows 10
also has some incredible new APIs, that let us simplify clock_gettime().

  - Linux CLOCK_REALTIME         -> GetSystemTimePreciseAsFileTime()
  - Linux CLOCK_MONOTONIC        -> QueryUnbiasedInterruptTimePrecise()
  - Linux CLOCK_MONOTONIC_RAW    -> QueryUnbiasedInterruptTimePrecise()
  - Linux CLOCK_REALTIME_COARSE  -> GetSystemTimeAsFileTime()
  - Linux CLOCK_MONOTONIC_COARSE -> QueryUnbiasedInterruptTime()
  - Linux CLOCK_BOOTTIME         -> QueryInterruptTimePrecise()

Documentation on the clock crew has been added to clock_gettime() in the
docstring and in redbean's documentation too. You can read that to learn
interesting facts about eight essential clocks that survived this purge.
This is original research you will not find on Google, OpenAI, or Claude

I've tested this change by porting *NSYNC to become fully clock agnostic
since it has extensive tests for spotting irregularities in time. I have
also included these tests in the default build so they no longer need to
be run manually. Both CLOCK_REALTIME and CLOCK_MONOTONIC are good across
the entire amd64 and arm64 test fleets.
2024-09-04 01:32:46 -07:00
Justine Tunney
3c61a541bd
Introduce pthread_condattr_setclock()
This is one of the few POSIX APIs that was missing. It lets you choose a
monotonic clock for your condition variables. This might improve perf on
some platforms. It might also grant more flexibility with NTP configs. I
know Qt is one project that believes it needs this. To introduce this, I
needed to change some the *NSYNC APIs, to support passing a clock param.
There's also new benchmarks, demonstrating Cosmopolitan's supremacy over
many libc implementations when it comes to mutex performance. Cygwin has
an alarmingly bad pthread_mutex_t implementation. It is so bad that they
would have been significantly better off if they'd used naive spinlocks.
2024-09-02 23:45:42 -07:00
Justine Tunney
90460ceb3c
Make Cosmo mutexes competitive with Apple Libc
While we have always licked glibc and musl libc on gnu/systemd sadly the
Apple Libc implementation of pthread_mutex_t is better than ours. It may
be due to how the XNU kernel and M2 microprocessor are in league when it
comes to scheduling processes and the NSYNC behavior is being penalized.
We can solve this by leaning more heavily on ulock using Drepper's algo.
It's kind of ironic that Linux's official mutexes work terribly on Linux
but almost as good as Apple Libc if used on MacOS.
2024-09-02 19:03:11 -07:00
Justine Tunney
884d89235f
Harden against aba problem 2024-08-26 20:01:55 -07:00
Justine Tunney
111ec9a989
Fix bug we added to *NSYNC a while ago
This is believed to fix a crash, that's possible in nsync_waiter_free_()
when you call pthread_cond_timedwait(), or nsync_cv_wait_with_deadline()
where an assertion can fail. Thanks ipv4.games for helping me find this!
2024-08-26 12:25:50 -07:00
Justine Tunney
31194165d2
Remove .internal from more header filenames 2024-08-04 12:52:25 -07:00
Justine Tunney
8d8aecb6d9
Avoid legacy instruction penalties on x86 2024-07-31 01:02:38 -07:00
Justine Tunney
642e9cb91a
Introduce cosmocc flags -mdbg -mtiny -moptlinux
The cosmocc.zip toolchain will now include four builds of the libcosmo.a
runtime libraries. You can pass the -mdbg flag if you want to debug your
cosmopolitan runtime. You can pass the -moptlinux flag if you don't want
windows code lurking in your binary. See tool/cosmocc/README.md for more
details on how these flags may be used and their important implications.
2024-07-26 05:10:25 -07:00
Justine Tunney
d3a13e8d70
Improve lock hierarchy
- NetBSD no longer needs a spin lock to create semaphores
- Windows fork() now locks process manager in correct order
2024-07-24 16:05:48 -07:00
Justine Tunney
5dd7ddb9ea
Remove bad defines from early days of project
These definitions were causing issues with building LLVM. It is possible
they also caused crashes we've seen with our MacOS ARM64 OpenMP support.
2024-07-24 12:11:21 -07:00
Justine Tunney
e398f3887c
Make more improvements to threads and mappings
- NetBSD should now have faster synchronization
- POSIX barriers may now be shared across processes
- An edge case with memory map tracking has been fixed
- Grand Central Dispatch is no longer used on MacOS ARM64
- POSIX mutexes in normal mode now use futexes across processes
2024-07-24 01:19:54 -07:00
Justine Tunney
86d884cce2
Get rid of .internal.h convention in LIBC_INTRIN 2024-07-19 19:38:00 -07:00
Justine Tunney
f590e96abd
Work around QEMU bugs 2024-07-07 15:42:46 -07:00
Justine Tunney
8c645fa1ee
Make mmap() scalable
It's now possible to create thousands of thousands of sparse independent
memory mappings, without any slowdown. The memory manager is better with
tracking memory protection now, particularly on Windows in a precise way
that can be restored during fork(). You now have the highest quality mem
manager possible. It's even better than some OSes like XNU, where mmap()
is implemented as an O(n) operation which means sadly things aren't much
improved over there. With this change the llamafile HTTP server endpoint
at /tokenize with a prompt of 50 tokens is now able to handle 2.6m r/sec
2024-07-05 23:26:00 -07:00
Justine Tunney
15ea0524b3
Reduce code size of mandatory runtime
This change reduces o/tiny/examples/life from 44kb to 24kb in size since
it avoids linking mmap() when unnecessary. This is important, to helping
cosmo not completely lose touch with its roots.
2024-07-04 02:50:20 -07:00
Justine Tunney
d461c6f47d
Do more quality assurance work 2024-06-24 06:53:49 -07:00
Justine Tunney
d1d4388201
Delete ASAN
It hasn't been helpful enough to be justify the maintenance burden. What
actually does help is mprotect(), kprintf(), --ftrace and --strace which
can always be counted upon to work correctly. We aren't losing much with
this change. Support for ASAN on AARCH64 was never implemented. Applying
ASAN to the core libc runtimes was disabled many months ago. If there is
some way to have an ASAN runtime for user programs that is less invasive
we can potentially consider reintroducing support. But now is premature.
2024-06-22 05:45:49 -07:00
Justine Tunney
6ffed14b9c
Rewrite memory manager
Actually Portable Executable now supports Android. Cosmo's old mmap code
required a 47 bit address space. The new implementation is very agnostic
and supports both smaller address spaces (e.g. embedded) and even modern
56-bit PML5T paging for x86 which finally came true on Zen4 Threadripper

Cosmopolitan no longer requires UNIX systems to observe the Windows 64kb
granularity; i.e. sysconf(_SC_PAGE_SIZE) will now report the host native
page size. This fixes a longstanding POSIX conformance issue, concerning
file mappings that overlap the end of file. Other aspects of conformance
have been improved too, such as the subtleties of address assignment and
and the various subtleties surrounding MAP_FIXED and MAP_FIXED_NOREPLACE

On Windows, mappings larger than 100 megabytes won't be broken down into
thousands of independent 64kb mappings. Support for MAP_STACK is removed
by this change; please use NewCosmoStack() instead.

Stack overflow avoidance is now being implemented using the POSIX thread
APIs. Please use GetStackBottom() and GetStackAddr(), instead of the old
error-prone GetStackAddr() and HaveStackMemory() APIs which are removed.
2024-06-22 05:45:11 -07:00
Justine Tunney
a6baba1b07
Stop using .com extension in monorepo
The WIN32 CreateProcess() function does not require an .exe or .com
suffix in order to spawn an executable. Now that we have Cosmo bash
we're no longer so dependent on the cmd.exe prompt.
2024-03-03 03:12:19 -08:00
Justine Tunney
957c61cbbf
Release Cosmopolitan v3.3
This change upgrades to GCC 12.3 and GNU binutils 2.42. The GNU linker
appears to have changed things so that only a single de-duplicated str
table is present in the binary, and it gets placed wherever the linker
wants, regardless of what the linker script says. To cope with that we
need to stop using .ident to embed licenses. As such, this change does
significant work to revamp how third party licenses are defined in the
codebase, using `.section .notice,"aR",@progbits`.

This new GCC 12.3 toolchain has support for GNU indirect functions. It
lets us support __target_clones__ for the first time. This is used for
optimizing the performance of libc string functions such as strlen and
friends so far on x86, by ensuring AVX systems favor a second codepath
that uses VEX encoding. It shaves some latency off certain operations.
It's a useful feature to have for scientific computing for the reasons
explained by the test/libcxx/openmp_test.cc example which compiles for
fifteen different microarchitectures. Thanks to the upgrades, it's now
also possible to use newer instruction sets, such as AVX512FP16, VNNI.

Cosmo now uses the %gs register on x86 by default for TLS. Doing it is
helpful for any program that links `cosmo_dlopen()`. Such programs had
to recompile their binaries at startup to change the TLS instructions.
That's not great, since it means every page in the executable needs to
be faulted. The work of rewriting TLS-related x86 opcodes, is moved to
fixupobj.com instead. This is great news for MacOS x86 users, since we
previously needed to morph the binary every time for that platform but
now that's no longer necessary. The only platforms where we need fixup
of TLS x86 opcodes at runtime are now Windows, OpenBSD, and NetBSD. On
Windows we morph TLS to point deeper into the TIB, based on a TlsAlloc
assignment, and on OpenBSD/NetBSD we morph %gs back into %fs since the
kernels do not allow us to specify a value for the %gs register.

OpenBSD users are now required to use APE Loader to run Cosmo binaries
and assimilation is no longer possible. OpenBSD kernel needs to change
to allow programs to specify a value for the %gs register, or it needs
to stop marking executable pages loaded by the kernel as mimmutable().

This release fixes __constructor__, .ctor, .init_array, and lastly the
.preinit_array so they behave the exact same way as glibc.

We no longer use hex constants to define math.h symbols like M_PI.
2024-02-20 13:27:59 -08:00
Justine Tunney
2ab9e9f7fd
Make improvements
- Introduce portable sched_getcpu() api
- Support GCC's __target_clones__ feature
- Make fma() go faster on x86 in default mode
- Remove some asan checks from core libraries
- WinMain() now ensures $HOME and $USER are defined
2024-02-12 10:23:00 -08:00
Justine Tunney
c1e18e7903
Restore MODE=dbg support
We recently broke MODE=dbg support when we added C++ exception support.
This change adds the missing UBSAN interfaces, needed to get it working
again. Some of the ASAN checking in the SJLJ guts needed to be disabled
since I doubt anyone's combined the two features until now.
2024-01-26 23:07:18 -08:00
Jōshin
3a8e01a77a
more modeline errata (#1019)
Somehow or another, I previously had missed `BUILD.mk` files.

In the process I found a few straggler cases where the modeline was
different from the file, including one very involved manual fix where a
file had been treated like it was ts=2 and ts=8 on separate occasions.

The commit history in the PR shows the gory details; the BUILD.mk was
automated, everything else was mostly manual.
2023-12-16 23:07:10 -05:00
Jōshin
2fc507c98f
Fix more vi modelines (#1006)
* modelines: tw -> sw

shiftwidth, not textwidth.

* space-surround modelines

* fix irregular modelines

* Fix modeline in titlegen.c
2023-12-13 02:28:11 -05:00
Jōshin
e16a7d8f3b
flip et / noet in modelines
`et` means `expandtab`.

```sh
rg 'vi: .* :vi' -l -0 | \
  xargs -0 sed -i '' 's/vi: \(.*\) et\(.*\)  :vi/vi: \1 xoet\2:vi/'
rg 'vi: .*  :vi' -l -0 | \
  xargs -0 sed -i '' 's/vi: \(.*\)noet\(.*\):vi/vi: \1et\2  :vi/'
rg 'vi: .*  :vi' -l -0 | \
  xargs -0 sed -i '' 's/vi: \(.*\)xoet\(.*\):vi/vi: \1noet\2:vi/'
```
2023-12-07 22:17:11 -05:00
Jōshin
394d998315
Fix vi modelines (#989)
At least in neovim, `│vi:` is not recognized as a modeline because it
has no preceding whitespace. After fixing this, opening a file yields
an error because `net` is not an option. (`noet`, however, is.)
2023-12-05 14:37:54 -08:00
Justine Tunney
fa20edc44d
Reduce header complexity
- Remove most __ASSEMBLER__ __LINKER__ ifdefs
- Rename libc/intrin/bits.h to libc/serialize.h
- Block pthread cancelation in fchmodat() polyfill
- Remove `clang-format off` statements in third_party
2023-11-28 14:39:42 -08:00
Justine Tunney
96f979dfc5
Rename makefiles BUILD.mk
This way they appear at the top of directory listings.
2023-11-28 11:21:08 -08:00
Justine Tunney
751d20d98d
Fix nsync_mu_unlock_slow_() on Apple Silicon
We torture test dlmalloc() in test/libc/stdio/memory_test.c. That test
was crashing on occasion on Apple M1 microprocessors when dlmalloc was
using *NSYNC locks. It was relatively easy to spot the cause, which is
this one particular compare and swap operation, which needed to change
to use sequentially-consistent ordering rather than an acquire barrier
2023-11-13 11:07:13 -08:00
Justine Tunney
d0ad2694ed
Iterate more on recent changes 2023-11-11 00:28:22 -08:00
Justine Tunney
241f949540
Use dynamic memory for *NSYNC waiters 2023-11-10 01:42:06 -08:00
Justine Tunney
d7917ea076
Make win32 i/o signals atomic and longjmp() safe 2023-11-04 20:33:29 -07:00
Justine Tunney
d259e573b6
Remove *NSYNC WIN32 semaphores
This implementation doesn't work as well as WIN32 futexes. This code
path was only added back when we were having issues with set context
however that's been solved so we can go back to the much better code
2023-11-01 07:18:58 -07:00
Justine Tunney
c9fecf3a55
Make improvements
- You can now run `make -j8 toolchain` on Windows
- You can now run `make -j` on MacOS ARM64 and BSD OSes
- You can now use our Emacs dev environment on MacOS/Windows
- Fix bug where the x16 register was being corrupted by --ftrace
- The programs under build/bootstrap/ are updated as fat binaries
- The Makefile now explains how to download cosmocc-0.0.12 toolchain
- The build scripts under bin/ now support "cosmo" branded toolchains
- stat() now goes faster on Windows (shaves 100ms off `make` latency)
- Code cleanup and added review on the Windows signal checking code
- posix_spawnattr_setrlimit() now works around MacOS ARM64 bugs
- Landlock Make now favors posix_spawn() on non-Linux/OpenBSD
- posix_spawn() now has better --strace logging on Windows
- fstatat() can now avoid EACCES in more cases on Windows
- fchmod() can now change the readonly bit on Windows
2023-10-15 16:45:00 -07:00
Justine Tunney
2db2f40a98
Rewrite special file handling on Windows
This change gets GNU grep working. What caused it to not work, is it
wouldn't write to an output file descriptor when its dev/ino equaled
/dev/null's. So now we invent special dev/ino values for these files
2023-10-14 02:53:34 -07:00
Justine Tunney
49b0eaa69f
Improve threading and i/o routines
- On Windows connect() can now be interrupted by a signal; connect() w/
  O_NONBLOCK will now raise EINPROGRESS; and connect() with SO_SNDTIMEO
  will raise ETIMEDOUT after the interval has elapsed.

- We now get the AcceptEx(), ConnectEx(), and TransmitFile() functions
  from the WIN32 API the officially blessed way, using WSAIoctl().

- Do nothing on Windows when fsync() is called on a directory handle.
  This was raising EACCES earlier becaues GENERIC_WRITE is required on
  the handle. It's possible to FlushFileBuffers() a directory handle if
  it's opened with write access but MSDN doesn't document what it does.
  If you have any idea, please let us know!

- Prefer manual reset event objects for read() and write() on Windows.

- Do some code cleanup on our dlmalloc customizations.

- Fix errno type error in Windows blocking routines.

- Make the futex polyfill simpler and faster.
2023-10-12 23:13:04 -07:00
Justine Tunney
3b4dbc9fdd
Make some more fixes
This change deletes mkfifo() so that GNU Make on Windows will work in
parallel mode using its pipe-based implementation. There's an example
called greenbean2 now, which shows how to build a scalable web server
for Windows with 10k+ threads. The accuracy of clock_nanosleep is now
significantly improved on Linux.
2023-10-09 12:22:00 -07:00
Justine Tunney
791f79fcb3
Make improvements
- We now serialize the file descriptor table when spawning / executing
  processes on Windows. This means you can now inherit more stuff than
  just standard i/o. It's needed by bash, which duplicates the console
  to file descriptor #255. We also now do a better job serializing the
  environment variables, so you're less likely to encounter E2BIG when
  using your bash shell. We also no longer coerce environ to uppercase

- execve() on Windows now remotely controls its parent process to make
  them spawn a replacement for itself. Then it'll be able to terminate
  immediately once the spawn succeeds, without having to linger around
  for the lifetime as a shell process for proxying the exit code. When
  process worker thread running in the parent sees the child die, it's
  given a handle to the new child, to replace it in the process table.

- execve() and posix_spawn() on Windows will now provide CreateProcess
  an explicit handle list. This allows us to remove handle locks which
  enables better fork/spawn concurrency, with seriously correct thread
  safety. Other codebases like Go use the same technique. On the other
  hand fork() still favors the conventional WIN32 inheritence approach
  which can be a little bit messy, but is *controlled* by guaranteeing
  perfectly clean slates at both the spawning and execution boundaries

- sigset_t is now 64 bits. Having it be 128 bits was a mistake because
  there's no reason to use that and it's only supported by FreeBSD. By
  using the system word size, signal mask manipulation on Windows goes
  very fast. Furthermore @asyncsignalsafe funcs have been rewritten on
  Windows to take advantage of signal masking, now that it's much more
  pleasant to use.

- All the overlapped i/o code on Windows has been rewritten for pretty
  good signal and cancelation safety. We're now able to ensure overlap
  data structures are cleaned up so long as you don't longjmp() out of
  out of a signal handler that interrupted an i/o operation. Latencies
  are also improved thanks to the removal of lots of "busy wait" code.
  Waits should be optimal for everything except poll(), which shall be
  the last and final demon we slay in the win32 i/o horror show.

- getrusage() on Windows is now able to report RUSAGE_CHILDREN as well
  as RUSAGE_SELF, thanks to aggregation in the process manager thread.
2023-10-08 08:59:53 -07:00
Justine Tunney
85f64f3851
Make futexes 100x better on x86 MacOS
Thanks to @autumnjolitz (in #876) the Cosmopolitan codebase is now
acquainted with Apple's outstanding ulock system calls which offer
something much closer to futexes than Grand Central Dispatch which
wasn't quite as good, since its wait function can't be interrupted
by signals (therefore necessitating a busy loop) and it also needs
semaphore objects to be created and freed. Even though ulock is an
internal Apple API, strictly speaking, the benefits of futexes are
so great that it's worth the risk for now especially since we have
the GCD implementation still as a quick escape hatch if it changes

Here's why this change is important for x86 XNU users. Cosmo has a
suboptimal polyfill when the operating system doesn't offer an API
that let's us implement futexes properly. Sadly we had to use that
on X86 XNU until now. The polyfill works using clock_nanosleep, to
poll the futex in a busy loop with exponential backoff. On XNU x86
clock_nanosleep suffers from us not being able to use a fast clock
gettime implementation, which had a compounding effect that's made
the polyfill function even more poorly. On X86 XNU we also need to
polyfill sched_yield() using select(), which made things even more
troublesome. Now that we have futexes we don't have any busy loops
anymore for both condition variables and thread joining so optimal
performance is attained. To demonstrate, consider these benchmarks

Before:

    $ ./lockscale_test.com -b
    consumed 38.8377   seconds real time and
              0.087131 seconds cpu time

After:

    $ ./lockscale_test.com -b
    consumed 0.007955 seconds real time and
             0.011515 seconds cpu time

Fixes #876
2023-10-03 15:15:43 -07:00
Justine Tunney
ee8a861635
Fix semaphore deadlock on Apple Silicon 2023-10-03 09:26:46 -07:00