cosmopolitan

mirror of https://github.com/jart/cosmopolitan.git synced 2025-01-31 11:37:35 +00:00

Author	SHA1	Message	Date
Justine Tunney	98c5847727	Fix fork waiter leak in nsync This change fixes a bug where nsync waiter objects would leak. It'd mean that long-running programs like runitd would run out of file descriptors on NetBSD where waiter objects have ksem file descriptors. On other OSes this bug is mostly harmless since the worst that can happen with a futex is to leak a little bit of ram. The bug was caused because tib_nsync was sneaking back in after the finalization code had cleared it. This change refactors the thread exiting code to handle nsync teardown appropriately and in making this change I found another issue, which is that user code which is buggy, and tries to exit without joining joinable threads which haven't been detached, would result in a deadlock. That doesn't sound so bad, except the main thread is a joinable thread. So this deadlock would be triggered in ways that put libc at fault. So we now auto-join threads and libc will log a warning to --strace when that happens for any thread	2024-12-31 01:30:13 -08:00
Justine Tunney	dd8544c3bd	Delve into clock rabbit hole The worst issue I had with consts.sh for clock_gettime is how it defined too many clocks. So I looked into these clocks all day to figure out how they overlap in functionality. I discovered counter-intuitive things such as how CLOCK_MONOTONIC should be CLOCK_UPTIME on MacOS and BSD, and that CLOCK_BOOTTIME should be CLOCK_MONOTONIC on MacOS / BSD. Windows 10 also has some incredible new APIs that let us simplify clock_gettime(). - Linux CLOCK_REALTIME -> GetSystemTimePreciseAsFileTime() - Linux CLOCK_MONOTONIC -> QueryUnbiasedInterruptTimePrecise() - Linux CLOCK_MONOTONIC_RAW -> QueryUnbiasedInterruptTimePrecise() - Linux CLOCK_REALTIME_COARSE -> GetSystemTimeAsFileTime() - Linux CLOCK_MONOTONIC_COARSE -> QueryUnbiasedInterruptTime() - Linux CLOCK_BOOTTIME -> QueryInterruptTimePrecise() Documentation on the clock crew has been added to clock_gettime() in the docstring and in redbean's documentation too. You can read that to learn interesting facts about eight essential clocks that survived this purge. This is original research you will not find on Google, OpenAI, or Claude. I've tested this change by porting *NSYNC to become fully clock agnostic since it has extensive tests for spotting irregularities in time. I have also included these tests in the default build so they no longer need to be run manually. Both CLOCK_REALTIME and CLOCK_MONOTONIC are good across the entire amd64 and arm64 test fleets.	2024-09-04 01:32:46 -07:00
Justine Tunney	3c61a541bd	Introduce pthread_condattr_setclock() This is one of the few POSIX APIs that was missing. It lets you choose a monotonic clock for your condition variables. This might improve perf on some platforms. It might also grant more flexibility with NTP configs. I know Qt is one project that believes it needs this. To introduce this, I needed to change some of the *NSYNC APIs to support passing a clock param. There are also new benchmarks demonstrating Cosmopolitan's supremacy over many libc implementations when it comes to mutex performance. Cygwin has an alarmingly bad pthread_mutex_t implementation. It is so bad that they would have been significantly better off if they'd used naive spinlocks.	2024-09-02 23:45:42 -07:00
Justine Tunney	e398f3887c	Make more improvements to threads and mappings - NetBSD should now have faster synchronization - POSIX barriers may now be shared across processes - An edge case with memory map tracking has been fixed - Grand Central Dispatch is no longer used on MacOS ARM64 - POSIX mutexes in normal mode now use futexes across processes	2024-07-24 01:19:54 -07:00
Justine Tunney	8c645fa1ee	Make mmap() scalable It's now possible to create thousands of thousands of sparse independent memory mappings, without any slowdown. The memory manager is better with tracking memory protection now, particularly on Windows in a precise way that can be restored during fork(). You now have the highest quality mem manager possible. It's even better than some OSes like XNU, where mmap() is implemented as an O(n) operation which means sadly things aren't much improved over there. With this change the llamafile HTTP server endpoint at /tokenize with a prompt of 50 tokens is now able to handle 2.6m r/sec	2024-07-05 23:26:00 -07:00
Justine Tunney	957c61cbbf	Release Cosmopolitan v3.3 This change upgrades to GCC 12.3 and GNU binutils 2.42. The GNU linker appears to have changed things so that only a single de-duplicated str table is present in the binary, and it gets placed wherever the linker wants, regardless of what the linker script says. To cope with that we need to stop using .ident to embed licenses. As such, this change does significant work to revamp how third party licenses are defined in the codebase, using `.section .notice,"aR",@progbits`. This new GCC 12.3 toolchain has support for GNU indirect functions. It lets us support __target_clones__ for the first time. This is used for optimizing the performance of libc string functions such as strlen and friends so far on x86, by ensuring AVX systems favor a second codepath that uses VEX encoding. It shaves some latency off certain operations. It's a useful feature to have for scientific computing for the reasons explained by the test/libcxx/openmp_test.cc example which compiles for fifteen different microarchitectures. Thanks to the upgrades, it's now also possible to use newer instruction sets, such as AVX512FP16, VNNI. Cosmo now uses the %gs register on x86 by default for TLS. Doing it is helpful for any program that links `cosmo_dlopen()`. Such programs had to recompile their binaries at startup to change the TLS instructions. That's not great, since it means every page in the executable needs to be faulted. The work of rewriting TLS-related x86 opcodes, is moved to fixupobj.com instead. This is great news for MacOS x86 users, since we previously needed to morph the binary every time for that platform but now that's no longer necessary. The only platforms where we need fixup of TLS x86 opcodes at runtime are now Windows, OpenBSD, and NetBSD. 
On Windows we morph TLS to point deeper into the TIB, based on a TlsAlloc assignment, and on OpenBSD/NetBSD we morph %gs back into %fs since the kernels do not allow us to specify a value for the %gs register. OpenBSD users are now required to use APE Loader to run Cosmo binaries and assimilation is no longer possible. OpenBSD kernel needs to change to allow programs to specify a value for the %gs register, or it needs to stop marking executable pages loaded by the kernel as mimmutable(). This release fixes __constructor__, .ctor, .init_array, and lastly the .preinit_array so they behave the exact same way as glibc. We no longer use hex constants to define math.h symbols like M_PI.	2024-02-20 13:27:59 -08:00
Jōshin	2fc507c98f	Fix more vi modelines (#1006) * modelines: tw -> sw shiftwidth, not textwidth. * space-surround modelines * fix irregular modelines * Fix modeline in titlegen.c	2023-12-13 02:28:11 -05:00
Jōshin	e16a7d8f3b	flip et / noet in modelines `et` means `expandtab`.
```sh
rg 'vi: .* :vi' -l -0 | \
  xargs -0 sed -i '' 's/vi: \(.*\) et\(.*\) :vi/vi: \1 xoet\2:vi/'
rg 'vi: .* :vi' -l -0 | \
  xargs -0 sed -i '' 's/vi: \(.*\)noet\(.*\):vi/vi: \1et\2 :vi/'
rg 'vi: .* :vi' -l -0 | \
  xargs -0 sed -i '' 's/vi: \(.*\)xoet\(.*\):vi/vi: \1noet\2:vi/'
```
	2023-12-07 22:17:11 -05:00
Jōshin	394d998315	Fix vi modelines (#989) At least in neovim, `│vi:` is not recognized as a modeline because it has no preceding whitespace. After fixing this, opening a file yields an error because `net` is not an option. (`noet`, however, is.)	2023-12-05 14:37:54 -08:00
Justine Tunney	fa20edc44d	Reduce header complexity - Remove most __ASSEMBLER__ __LINKER__ ifdefs - Rename libc/intrin/bits.h to libc/serialize.h - Block pthread cancelation in fchmodat() polyfill - Remove `clang-format off` statements in third_party	2023-11-28 14:39:42 -08:00
Justine Tunney	d0ad2694ed	Iterate more on recent changes	2023-11-11 00:28:22 -08:00
Justine Tunney	d259e573b6	Remove *NSYNC WIN32 semaphores This implementation doesn't work as well as WIN32 futexes. This code path was only added back when we were having issues with set context however that's been solved so we can go back to the much better code	2023-11-01 07:18:58 -07:00
Justine Tunney	49b0eaa69f	Improve threading and i/o routines - On Windows connect() can now be interrupted by a signal; connect() w/ O_NONBLOCK will now raise EINPROGRESS; and connect() with SO_SNDTIMEO will raise ETIMEDOUT after the interval has elapsed. - We now get the AcceptEx(), ConnectEx(), and TransmitFile() functions from the WIN32 API the officially blessed way, using WSAIoctl(). - Do nothing on Windows when fsync() is called on a directory handle. This was raising EACCES earlier because GENERIC_WRITE is required on the handle. It's possible to FlushFileBuffers() a directory handle if it's opened with write access but MSDN doesn't document what it does. If you have any idea, please let us know! - Prefer manual reset event objects for read() and write() on Windows. - Do some code cleanup on our dlmalloc customizations. - Fix errno type error in Windows blocking routines. - Make the futex polyfill simpler and faster.	2023-10-12 23:13:04 -07:00
Justine Tunney	3b4dbc9fdd	Make some more fixes This change deletes mkfifo() so that GNU Make on Windows will work in parallel mode using its pipe-based implementation. There's an example called greenbean2 now, which shows how to build a scalable web server for Windows with 10k+ threads. The accuracy of clock_nanosleep is now significantly improved on Linux.	2023-10-09 12:22:00 -07:00
Justine Tunney	791f79fcb3	Make improvements - We now serialize the file descriptor table when spawning / executing processes on Windows. This means you can now inherit more stuff than just standard i/o. It's needed by bash, which duplicates the console to file descriptor #255. We also now do a better job serializing the environment variables, so you're less likely to encounter E2BIG when using your bash shell. We also no longer coerce environ to uppercase. - execve() on Windows now remotely controls its parent process to make them spawn a replacement for itself. Then it'll be able to terminate immediately once the spawn succeeds, without having to linger around for the lifetime as a shell process for proxying the exit code. When process worker thread running in the parent sees the child die, it's given a handle to the new child, to replace it in the process table. - execve() and posix_spawn() on Windows will now provide CreateProcess an explicit handle list. This allows us to remove handle locks which enables better fork/spawn concurrency, with seriously correct thread safety. Other codebases like Go use the same technique. On the other hand fork() still favors the conventional WIN32 inheritance approach which can be a little bit messy, but is controlled by guaranteeing perfectly clean slates at both the spawning and execution boundaries. - sigset_t is now 64 bits. Having it be 128 bits was a mistake because there's no reason to use that and it's only supported by FreeBSD. By using the system word size, signal mask manipulation on Windows goes very fast. Furthermore @asyncsignalsafe funcs have been rewritten on Windows to take advantage of signal masking, now that it's much more pleasant to use. - All the overlapped i/o code on Windows has been rewritten for pretty good signal and cancelation safety. We're now able to ensure overlap data structures are cleaned up so long as you don't longjmp() out of a signal handler that interrupted an i/o operation. 
Latencies are also improved thanks to the removal of lots of "busy wait" code. Waits should be optimal for everything except poll(), which shall be the last and final demon we slay in the win32 i/o horror show. - getrusage() on Windows is now able to report RUSAGE_CHILDREN as well as RUSAGE_SELF, thanks to aggregation in the process manager thread.	2023-10-08 08:59:53 -07:00
Justine Tunney	85f64f3851	Make futexes 100x better on x86 MacOS Thanks to @autumnjolitz (in #876) the Cosmopolitan codebase is now acquainted with Apple's outstanding ulock system calls which offer something much closer to futexes than Grand Central Dispatch which wasn't quite as good, since its wait function can't be interrupted by signals (therefore necessitating a busy loop) and it also needs semaphore objects to be created and freed. Even though ulock is an internal Apple API, strictly speaking, the benefits of futexes are so great that it's worth the risk for now especially since we have the GCD implementation still as a quick escape hatch if it changes. Here's why this change is important for x86 XNU users. Cosmo has a suboptimal polyfill when the operating system doesn't offer an API that lets us implement futexes properly. Sadly we had to use that on X86 XNU until now. The polyfill works using clock_nanosleep, to poll the futex in a busy loop with exponential backoff. On XNU x86 clock_nanosleep suffers from us not being able to use a fast clock gettime implementation, which had a compounding effect that's made the polyfill function even more poorly. On X86 XNU we also need to polyfill sched_yield() using select(), which made things even more troublesome. Now that we have futexes we don't have any busy loops anymore for both condition variables and thread joining so optimal performance is attained. To demonstrate, consider these benchmarks: Before: $ ./lockscale_test.com -b consumed 38.8377 seconds real time and 0.087131 seconds cpu time After: $ ./lockscale_test.com -b consumed 0.007955 seconds real time and 0.011515 seconds cpu time Fixes #876	2023-10-03 15:15:43 -07:00
Justine Tunney	ec480f5aa0	Make improvements - Every unit test now passes on Apple Silicon. The final piece of this puzzle was porting our POSIX threads cancelation support, since that works differently on ARM64 XNU vs. AMD64. Our semaphore support on Apple Silicon is also superior now compared to AMD64, thanks to the grand central dispatch library which lets *NSYNC locks go faster. - The Cosmopolitan runtime is now more stable, particularly on Windows. To do this, thread local storage is mandatory at all runtime levels, and the innermost packages of the C library are no longer being built using ASAN. TLS is being bootstrapped with a 128-byte TIB during the process startup phase, and then later on the runtime re-allocates it either statically or dynamically to support code using _Thread_local. fork() and execve() now do a better job cooperating with threads. We can now check how much stack memory is left in the process or thread when functions like kprintf() / execve() etc. call alloca(), so that we can raise ENOMEM, reduce a buffer size, or just print a warning. - POSIX signal emulation is now implemented the same way kernels do it with pthread_kill() and raise(). Any thread can interrupt any other thread, regardless of what it's doing. If it's blocked on read/write then the killer thread will cancel its i/o operation so that EINTR can be returned in the mark thread immediately. If it's doing a tight CPU bound operation, then that's also interrupted by the signal delivery. Signal delivery works now by suspending a thread and pushing context data structures onto its stack, and redirecting its execution to a trampoline function, which calls SetThreadContext(GetCurrentThread()) when it's done. - We're now doing a better job managing locks and handles. On NetBSD we now close semaphore file descriptors in forked children. Semaphores on Windows can now be canceled immediately, which means mutexes/condition variables will now go faster. 
Apple Silicon semaphores can be canceled too. We're now using Apple's pthread_yield() function. Apple _nocancel syscalls are now used on XNU when appropriate to ensure pthread_cancel requests aren't lost. The MbedTLS library has been updated to support POSIX thread cancelations. See tool/build/runitd.c for an example of how it can be used for production multi-threaded tls servers. Handles on Windows now leak less often across processes. All i/o operations on Windows are now overlapped, which means file pointers can no longer be inherited across dup() and fork() for the time being. - We now spawn a thread on Windows to deliver SIGCHLD and wakeup wait4() which means, for example, that posix_spawn() now goes 3x faster. POSIX spawn is also now more correct. Like Musl, it's now able to report the failure code of execve() via a pipe although our approach favors using shared memory to do that on systems that have a true vfork() function. - We now spawn a thread to deliver SIGALRM to threads when setitimer() is used. This enables the most precise wakeups the OS makes possible. - The Cosmopolitan runtime now uses less memory. On NetBSD for example, it turned out the kernel would actually commit the PT_GNU_STACK size which caused RSS to be 6mb for every process. Now it's down to ~4kb. On Apple Silicon, we reduce the mandatory upstream thread size to the smallest possible size to reduce the memory overhead of Cosmo threads. The examples directory has a program called greenbean which can spawn a web server on Linux with 10,000 worker threads and have the memory usage of the process be ~77mb. The 1024 byte overhead of POSIX-style thread-local storage is now optional; it won't be allocated until the pthread_setspecific/getspecific functions are called. On Windows, the threads that get spawned which are internal to the libc implementation use reserve rather than commit memory, which shaves a few hundred kb. 
- sigaltstack() is now supported on Windows, however it's currently not able to be used to handle stack overflows, since crash signals are still generated by WIN32. However the crash handler will still switch to the alt stack, which is helpful in environments with tiny threads. - Test binaries are now smaller. Many of the mandatory dependencies of the test runner have been removed. This ensures many programs can do a better job only linking the thing they're testing. This caused the test binaries for LIBC_FMT for example, to decrease from 200kb to 50kb. - long double is no longer used in the implementation details of libc, except in the APIs that define it. The old code that used long double for time (instead of struct timespec) has now been thoroughly removed. - ShowCrashReports() is now much tinier in MODE=tiny. Instead of doing backtraces itself, it'll just print a command you can run on the shell using our new `cosmoaddr2line` program to view the backtrace. - Crash report signal handling now works in a much better way. Instead of terminating the process, it now relies on SA_RESETHAND so that the default SIG_DFL behavior can terminate the process if necessary. - Our pledge() functionality has now been fully ported to AARCH64 Linux.	2023-09-18 21:04:47 -07:00
Justine Tunney	a359de7893	Get rid of kmalloc() This changes *NSYNC to allocate waiters on the stack so our locks don't need to depend on dynamic memory. This makes our runtime simpler, and it also fixes bugs with thread cancellation support.	2023-09-11 21:56:00 -07:00
Justine Tunney	bcf9af94bf	Get threads working well on MacOS Arm64 - Now using 10x better GCD semaphores - We now generate Linux-like thread ids - We now use fast system clock / sleep libraries - The APE M1 loader now generates Linux-like stacks	2023-06-04 01:57:10 -07:00
Justine Tunney	c995838e5c	Make improvements - Clean up sigaction() code - Add a port scanner example - Introduce a ParseCidr() API - Clean up our futex abstraction code - Fix a harmless integer overflow in ParseIp() - Use kernel semaphores on NetBSD to make threads much faster	2022-11-07 02:26:06 -08:00
Justine Tunney	022536cab6	Make futexes cancellable by pthreads	2022-11-04 18:36:34 -07:00
Justine Tunney	8111462789	Add posix semaphores support There's still some bugs to work out on Windows and OpenBSD.	2022-10-14 09:21:02 -07:00
Justine Tunney	9b7c8db846	Perform some code maintenance - Change IDT code so kprintf() isn't mandatory dependency - Document current intentions around pthread_cancel() - Make _npassert() an _unassert() in MODE=tiny	2022-10-09 13:00:46 -07:00
Justine Tunney	672ccda37c	Clean up some sleep code	2022-10-08 03:00:48 -07:00
Justine Tunney	134ffee519	Change support vector to Windows 8+ Doing this makes binaries tinier, since we don't need to have all the extra code for supporting a 32-bit address space. It also benefits us because we're able to use WIN32 futexes, which makes locking simpler. `b69f3d2488` is what officially ended our Windows 7 support. This change is merely a formalization. You can use old versions of Cosmo now and forevermore if you need Windows 7 since our repository is hermetic and vendors all its dependencies. Won't fix #617	2022-09-15 03:55:05 -07:00
Justine Tunney	b5cb71ab84	Use *NSYNC for POSIX threads locking APIs Condition variables, barriers, and r/w locks now work very well.	2022-09-11 11:04:50 -07:00

26 commits