cosmopolitan

mirror of https://github.com/jart/cosmopolitan.git synced 2025-07-08 12:18:31 +00:00

Author	SHA1	Message	Date
Justine Tunney	2f48a02b44	Make recursive mutexes faster Recursive mutexes now go as fast as normal mutexes. The tradeoff is they are no longer safe to use in signal handlers. However you can still have signal safe mutexes if you set your mutex to both recursive and pshared. You can also make functions that use recursive mutexes signal safe using sigprocmask to ensure recursion doesn't happen due to any signal handler The impact of this change is that, on Windows, many functions which edit the file descriptor table rely on recursive mutexes, e.g. open(). If you develop your app so it uses pread() and pwrite() then your app should go very fast when performing a heavily multithreaded and contended workload For example, when scaling to 40+ cores, NSYNC mutexes can go as much as 1000x faster (in CPU time) than the naive recursive lock implementation. Now recursive will use NSYNC under the hood when it's possible to do so	2024-09-10 00:08:59 -07:00
Justine Tunney	610c951f71	Fix the build	2024-08-26 16:44:05 -07:00
Justine Tunney	863c704684	Add string similarity function	2024-08-17 16:45:07 -07:00
Justine Tunney	1671283f1a	Avoid clobbering errno	2024-08-15 23:54:14 -07:00
Justine Tunney	0a79c6961f	Make malloc scalable on all platforms It turns out sched_getcpu() didn't work on many platforms. So the system call now has tests and is well documented. We now employ new workarounds on platforms where it isn't supported in our malloc() implementation. It was previously the case that malloc() was only scalable on Linux/Windows for x86-64. Now the other platforms are scalable too.	2024-08-15 23:32:53 -07:00
Justine Tunney	31194165d2	Remove .internal from more header filenames	2024-08-04 12:52:25 -07:00
Justine Tunney	8d8aecb6d9	Avoid legacy instruction penalties on x86	2024-07-31 01:02:38 -07:00
Justine Tunney	cf1559c448	Remove __threaded variable	2024-07-28 23:43:30 -07:00
Justine Tunney	cdfcee51ca	Properly serialize fork() operations

This change solves an issue where many threads attempting to spawn forks at once would cause fork() performance to degrade with the thread count. Things got real nasty on NetBSD, which slowed down the whole test fleet, because there's no vfork() and we're forced to use fork() in our server.

    threads  count  task
          1   1062  fork+exit+wait
          2    668  fork+exit+wait
          4     66  fork+exit+wait
          8     19  fork+exit+wait
         16     22  fork+exit+wait
         32     16  fork+exit+wait

Things are now much less bad on NetBSD, but not great, since it does not have futexes; we rely on its semaphore file descriptors to do conditions

    threads  count  task
          1   1085  fork+exit+wait
          2    842  fork+exit+wait
          4    532  fork+exit+wait
          8    400  fork+exit+wait
         16    276  fork+exit+wait
         32     66  fork+exit+wait

With OpenBSD which also lacks vfork(), things were just as bad as NetBSD

    threads  count  task
          1    584  fork+exit+wait
          2    687  fork+exit+wait
          4    206  fork+exit+wait
          8     24  fork+exit+wait
         16     33  fork+exit+wait
         32     26  fork+exit+wait

But since OpenBSD has futexes fork() works terrifically thanks to *NSYNC

    threads  count  task
          1    525  fork+exit+wait
          2    580  fork+exit+wait
          4    451  fork+exit+wait
          8    479  fork+exit+wait
         16    408  fork+exit+wait
         32    373  fork+exit+wait

This issue would most likely only manifest itself, when pthread_atfork() callers manage to slip a spin lock into the outermost position of fork's list of locks. Since fork() is very slow, a spin lock can be devastating

Needless to say vfork() rules and anyone who says differently is kidding themselves. Look at what a FreeBSD 14.1 virtual machine with equal specs can do over the course of three hundred milliseconds.

    threads  count  task
          1   2559  vfork+exit+wait
          2   5389  vfork+exit+wait
          4  34933  vfork+exit+wait
          8  43273  vfork+exit+wait
         16  49648  vfork+exit+wait
         32  40247  vfork+exit+wait

So it's a shame that so few OSes support vfork(). It creates an unsavory situation, where someone wanting to build a server that spawns processes would be better served to not use threads and favor a multiprocess model	2024-07-27 08:23:44 -07:00
Justine Tunney	efb3a34608	Fix package.sh build error	2024-07-26 06:57:24 -07:00
Justine Tunney	86d884cce2	Get rid of .internal.h convention in LIBC_INTRIN	2024-07-19 19:38:00 -07:00
Justine Tunney	1ff037df3c	Add some documentation	2024-07-19 04:46:26 -07:00
Justine Tunney	f7780de24b	Make realloc() go 100x faster on Linux/NetBSD Cosmopolitan now supports mremap(), which is only supported on Linux and NetBSD. First, it allows memory mappings to be relocated without copying them; this can dramatically speed up data structures like std::vector if the array size grows larger than 256kb. The mremap() system call is also 10x faster than munmap() when shrinking large memory mappings. There's now two functions, getpagesize() and getgransize() which help to write portable code that uses mmap(MAP_FIXED). Alternative sysconf() may be called with our new _SC_GRANSIZE. The madvise() system call now has a better wrapper with improved documentation.	2024-07-07 12:40:30 -07:00
Justine Tunney	01587de761	Simplify memory manager	2024-07-05 05:47:15 -07:00
Justine Tunney	c4c812c154	Introduce ctl::set and ctl::map We now have a C++ red-black tree implementation that implements standard template library compatible APIs while compiling 10x faster than libcxx. It's not as beautiful as the red-black tree implementation in Plinko but this will get the job done and the test proves it upholds all invariants This change also restores CheckForMemoryLeaks() support and fixes a real actual bug I discovered with Doug Lea's dlmalloc_inspect_all() function.	2024-06-23 22:27:11 -07:00
Justine Tunney	388e236360	Revert misguided dlmalloc optimization	2024-06-22 09:55:02 -07:00
Justine Tunney	d1d4388201	Delete ASAN It hasn't been helpful enough to justify the maintenance burden. What actually does help is mprotect(), kprintf(), --ftrace and --strace which can always be counted upon to work correctly. We aren't losing much with this change. Support for ASAN on AARCH64 was never implemented. Applying ASAN to the core libc runtimes was disabled many months ago. If there is some way to have an ASAN runtime for user programs that is less invasive we can potentially consider reintroducing support. But now is premature.	2024-06-22 05:45:49 -07:00
Justine Tunney	6ffed14b9c	Rewrite memory manager Actually Portable Executable now supports Android. Cosmo's old mmap code required a 47 bit address space. The new implementation is very agnostic and supports both smaller address spaces (e.g. embedded) and even modern 56-bit PML5T paging for x86 which finally came true on Zen4 Threadripper Cosmopolitan no longer requires UNIX systems to observe the Windows 64kb granularity; i.e. sysconf(_SC_PAGE_SIZE) will now report the host native page size. This fixes a longstanding POSIX conformance issue, concerning file mappings that overlap the end of file. Other aspects of conformance have been improved too, such as the subtleties of address assignment and the various subtleties surrounding MAP_FIXED and MAP_FIXED_NOREPLACE On Windows, mappings larger than 100 megabytes won't be broken down into thousands of independent 64kb mappings. Support for MAP_STACK is removed by this change; please use NewCosmoStack() instead. Stack overflow avoidance is now being implemented using the POSIX thread APIs. Please use GetStackBottom() and GetStackAddr(), instead of the old error-prone GetStackAddr() and HaveStackMemory() APIs which are removed.	2024-06-22 05:45:11 -07:00
Justine Tunney	cc2c1893c5	Fix some nits	2024-06-05 04:05:49 -07:00
Justine Tunney	3609f65de3	Make malloc() go 200x faster If pthread_create() is linked into the binary, then the cosmo runtime will create an independent dlmalloc arena for each core. Whenever the malloc() function is used it will index `g_heaps[sched_getcpu() / 2]` to find the arena with the greatest hyperthread / numa locality. This may be configured via an environment variable. For example if you say `export COSMOPOLITAN_HEAP_COUNT=1` then you can restore the old ways. Your process may be configured to have anywhere between 1 - 128 heaps We need this revision because it makes multithreaded C++ applications faster. For example, an HTTP server I'm working on that makes extreme use of the STL went from 16k to 2000k requests per second, after this change was made. To understand why, try out the malloc_test benchmark which calls malloc() + realloc() in a loop across many threads, which sees a 250x improvement in process clock time and 200x on wall time The tradeoff is this adds ~25ns of latency to individual malloc calls compared to MODE=tiny, once the cosmo runtime has transitioned into a fully multi-threaded state. If you don't need malloc() to be scalable then cosmo provides many options for you. For starters the heap count variable above can be set to put the process back in single heap mode plus you can go even faster still, if you include tinymalloc.inc like many of the programs in tool/build/.. are already doing since that'll shave tens of kb off your binary footprint too. There's also MODE=tiny which is configured to use just 1 plain old dlmalloc arena by default Another tradeoff is we need more memory now (except in MODE=tiny), to track the provenance of memory allocation. This is so allocations can be freely shared across threads, and because OSes can reschedule code to different CPUs at any time.	2024-06-05 02:02:14 -07:00
Justine Tunney	07cef612c3	Make dlmalloc 2.4x faster for multithreading This change adds a TLS freelist for small dynamic memory allocations. Cosmopolitan's TIB is now 512 bytes in size. Single-threaded malloc() performance isn't impacted by this, until pthread_create() is called. Single-threaded programs may also want to consider using: #include "libc/mem/tinymalloc.inc" Which will shave 30k off the executable size and sometimes go faster.	2024-05-28 11:18:34 -07:00
Justine Tunney	f029375d39	Introduce MAP_HUGETLB	2024-05-24 11:44:44 -07:00
Justine Tunney	8bfd56b59e	Rename _bsr/_bsf to bsr/bsf Now that these functions are behind _COSMO_SOURCE there's no reason for having the ugly underscore anymore. To use these functions, you need to pass -mcosmo to cosmocc.	2024-03-04 17:33:26 -08:00
Justine Tunney	957c61cbbf	Release Cosmopolitan v3.3 This change upgrades to GCC 12.3 and GNU binutils 2.42. The GNU linker appears to have changed things so that only a single de-duplicated str table is present in the binary, and it gets placed wherever the linker wants, regardless of what the linker script says. To cope with that we need to stop using .ident to embed licenses. As such, this change does significant work to revamp how third party licenses are defined in the codebase, using `.section .notice,"aR",@progbits`. This new GCC 12.3 toolchain has support for GNU indirect functions. It lets us support __target_clones__ for the first time. This is used for optimizing the performance of libc string functions such as strlen and friends so far on x86, by ensuring AVX systems favor a second codepath that uses VEX encoding. It shaves some latency off certain operations. It's a useful feature to have for scientific computing for the reasons explained by the test/libcxx/openmp_test.cc example which compiles for fifteen different microarchitectures. Thanks to the upgrades, it's now also possible to use newer instruction sets, such as AVX512FP16, VNNI. Cosmo now uses the %gs register on x86 by default for TLS. Doing it is helpful for any program that links `cosmo_dlopen()`. Such programs had to recompile their binaries at startup to change the TLS instructions. That's not great, since it means every page in the executable needs to be faulted. The work of rewriting TLS-related x86 opcodes, is moved to fixupobj.com instead. This is great news for MacOS x86 users, since we previously needed to morph the binary every time for that platform but now that's no longer necessary. The only platforms where we need fixup of TLS x86 opcodes at runtime are now Windows, OpenBSD, and NetBSD. 
On Windows we morph TLS to point deeper into the TIB, based on a TlsAlloc assignment, and on OpenBSD/NetBSD we morph %gs back into %fs since the kernels do not allow us to specify a value for the %gs register. OpenBSD users are now required to use APE Loader to run Cosmo binaries and assimilation is no longer possible. OpenBSD kernel needs to change to allow programs to specify a value for the %gs register, or it needs to stop marking executable pages loaded by the kernel as mimmutable(). This release fixes __constructor__, .ctor, .init_array, and lastly the .preinit_array so they behave the exact same way as glibc. We no longer use hex constants to define math.h symbols like M_PI.	2024-02-20 13:27:59 -08:00
Justine Tunney	1a28e35c62	Use good locks in dlmalloc Using mere spin locks causes runitd.com to go painstakingly slow on NetBSD for reasons that aren't clear yet.	2023-12-28 04:57:36 -08:00
Jōshin	3a8e01a77a	more modeline errata (#1019 ) Somehow or another, I previously had missed `BUILD.mk` files. In the process I found a few straggler cases where the modeline was different from the file, including one very involved manual fix where a file had been treated like it was ts=2 and ts=8 on separate occasions. The commit history in the PR shows the gory details; the BUILD.mk was automated, everything else was mostly manual.	2023-12-16 23:07:10 -05:00
Jōshin	2fc507c98f	Fix more vi modelines (#1006 ) * modelines: tw -> sw shiftwidth, not textwidth. * space-surround modelines * fix irregular modelines * Fix modeline in titlegen.c	2023-12-13 02:28:11 -05:00
Jōshin	e16a7d8f3b	flip et / noet in modelines `et` means `expandtab`. ```sh rg 'vi: .* :vi' -l -0 \| \ xargs -0 sed -i '' 's/vi: \(.\) et\(.\) :vi/vi: \1 xoet\2:vi/' rg 'vi: .* :vi' -l -0 \| \ xargs -0 sed -i '' 's/vi: \(.\)noet\(.\):vi/vi: \1et\2 :vi/' rg 'vi: .* :vi' -l -0 \| \ xargs -0 sed -i '' 's/vi: \(.\)xoet\(.\):vi/vi: \1noet\2:vi/' ```	2023-12-07 22:17:11 -05:00
Jōshin	394d998315	Fix vi modelines (#989 ) At least in neovim, `│vi:` is not recognized as a modeline because it has no preceding whitespace. After fixing this, opening a file yields an error because `net` is not an option. (`noet`, however, is.)	2023-12-05 14:37:54 -08:00
Justine Tunney	fa20edc44d	Reduce header complexity - Remove most __ASSEMBLER__ __LINKER__ ifdefs - Rename libc/intrin/bits.h to libc/serialize.h - Block pthread cancelation in fchmodat() polyfill - Remove `clang-format off` statements in third_party	2023-11-28 14:39:42 -08:00
Justine Tunney	96f979dfc5	Rename makefiles BUILD.mk This way they appear at the top of directory listings.	2023-11-28 11:21:08 -08:00
Justine Tunney	751d20d98d	Fix nsync_mu_unlock_slow_() on Apple Silicon We torture test dlmalloc() in test/libc/stdio/memory_test.c. That test was crashing on occasion on Apple M1 microprocessors when dlmalloc was using *NSYNC locks. It was relatively easy to spot the cause, which is this one particular compare and swap operation, which needed to change to use sequentially-consistent ordering rather than an acquire barrier	2023-11-13 11:07:13 -08:00
Justine Tunney	bed77186c3	Use simple locks in dlmalloc	2023-11-12 09:00:49 -08:00
Justine Tunney	d2f49ca175	Improve mkdeps Our makefile generator now accepts badly formatted include lines. It's now more hermetic with better error checking in the cosmo repo, and it can be configured to not be hermetic at all.	2023-11-10 04:14:27 -08:00
Justine Tunney	49b0eaa69f	Improve threading and i/o routines - On Windows connect() can now be interrupted by a signal; connect() w/ O_NONBLOCK will now raise EINPROGRESS; and connect() with SO_SNDTIMEO will raise ETIMEDOUT after the interval has elapsed. - We now get the AcceptEx(), ConnectEx(), and TransmitFile() functions from the WIN32 API the officially blessed way, using WSAIoctl(). - Do nothing on Windows when fsync() is called on a directory handle. This was raising EACCES earlier because GENERIC_WRITE is required on the handle. It's possible to FlushFileBuffers() a directory handle if it's opened with write access but MSDN doesn't document what it does. If you have any idea, please let us know! - Prefer manual reset event objects for read() and write() on Windows. - Do some code cleanup on our dlmalloc customizations. - Fix errno type error in Windows blocking routines. - Make the futex polyfill simpler and faster.	2023-10-12 23:13:04 -07:00
Justine Tunney	ec480f5aa0	Make improvements - Every unit test now passes on Apple Silicon. The final piece of this puzzle was porting our POSIX threads cancelation support, since that works differently on ARM64 XNU vs. AMD64. Our semaphore support on Apple Silicon is also superior now compared to AMD64, thanks to the grand central dispatch library which lets *NSYNC locks go faster. - The Cosmopolitan runtime is now more stable, particularly on Windows. To do this, thread local storage is mandatory at all runtime levels, and the innermost packages of the C library is no longer being built using ASAN. TLS is being bootstrapped with a 128-byte TIB during the process startup phase, and then later on the runtime re-allocates it either statically or dynamically to support code using _Thread_local. fork() and execve() now do a better job cooperating with threads. We can now check how much stack memory is left in the process or thread when functions like kprintf() / execve() etc. call alloca(), so that ENOMEM can be raised, reduce a buffer size, or just print a warning. - POSIX signal emulation is now implemented the same way kernels do it with pthread_kill() and raise(). Any thread can interrupt any other thread, regardless of what it's doing. If it's blocked on read/write then the killer thread will cancel its i/o operation so that EINTR can be returned in the mark thread immediately. If it's doing a tight CPU bound operation, then that's also interrupted by the signal delivery. Signal delivery works now by suspending a thread and pushing context data structures onto its stack, and redirecting its execution to a trampoline function, which calls SetThreadContext(GetCurrentThread()) when it's done. - We're now doing a better job managing locks and handles. On NetBSD we now close semaphore file descriptors in forked children. Semaphores on Windows can now be canceled immediately, which means mutexes/condition variables will now go faster. 
Apple Silicon semaphores can be canceled too. We're now using Apple's pthread_yield() function. Apple _nocancel syscalls are now used on XNU when appropriate to ensure pthread_cancel requests aren't lost. The MbedTLS library has been updated to support POSIX thread cancelations. See tool/build/runitd.c for an example of how it can be used for production multi-threaded tls servers. Handles on Windows now leak less often across processes. All i/o operations on Windows are now overlapped, which means file pointers can no longer be inherited across dup() and fork() for the time being. - We now spawn a thread on Windows to deliver SIGCHLD and wakeup wait4() which means, for example, that posix_spawn() now goes 3x faster. POSIX spawn is also now more correct. Like Musl, it's now able to report the failure code of execve() via a pipe although our approach favors using shared memory to do that on systems that have a true vfork() function. - We now spawn a thread to deliver SIGALRM to threads when setitimer() is used. This enables the most precise wakeups the OS makes possible. - The Cosmopolitan runtime now uses less memory. On NetBSD for example, it turned out the kernel would actually commit the PT_GNU_STACK size which caused RSS to be 6mb for every process. Now it's down to ~4kb. On Apple Silicon, we reduce the mandatory upstream thread size to the smallest possible size to reduce the memory overhead of Cosmo threads. The examples directory has a program called greenbean which can spawn a web server on Linux with 10,000 worker threads and have the memory usage of the process be ~77mb. The 1024 byte overhead of POSIX-style thread-local storage is now optional; it won't be allocated until the pthread_setspecific/getspecific functions are called. On Windows, the threads that get spawned which are internal to the libc implementation use reserve rather than commit memory, which shaves a few hundred kb. 
- sigaltstack() is now supported on Windows, however it's currently not able to be used to handle stack overflows, since crash signals are still generated by WIN32. However the crash handler will still switch to the alt stack, which is helpful in environments with tiny threads. - Test binaries are now smaller. Many of the mandatory dependencies of the test runner have been removed. This ensures many programs can do a better job only linking the thing they're testing. This caused the test binaries for LIBC_FMT for example, to decrease from 200kb to 50kb - long double is no longer used in the implementation details of libc, except in the APIs that define it. The old code that used long double for time (instead of struct timespec) has now been thoroughly removed. - ShowCrashReports() is now much tinier in MODE=tiny. Instead of doing backtraces itself, it'll just print a command you can run on the shell using our new `cosmoaddr2line` program to view the backtrace. - Crash report signal handling now works in a much better way. Instead of terminating the process, it now relies on SA_RESETHAND so that the default SIG_IGN behavior can terminate the process if necessary. - Our pledge() functionality has now been fully ported to AARCH64 Linux.	2023-09-18 21:04:47 -07:00
Justine Tunney	0d748ad58e	Fix warnings This change fixes Cosmopolitan so it has fewer opinions about compiler warnings. The whole repository had to be cleaned up to be buildable in -Werror -Wall mode. This lets us benefit from things like strict const checking. Some actual bugs might have been caught too.	2023-09-01 20:50:18 -07:00
Justine Tunney	c776a32f75	Replace COSMO define with _COSMO_SOURCE This change might cause ABI breakages for /opt/cosmos. It's needed to help us better conform to header declaration practices.	2023-08-13 20:55:04 -07:00
Justine Tunney	18bb5888e1	Make more fixes and improvements - Remove PAGESIZE constant - Fix realloc() documentation - Fix ttyname_r() error reporting - Make forking more reliable on Windows - Make execvp() a few microseconds faster - Make system() a few microseconds faster - Tighten up the socket-related magic numbers - Loosen restrictions on mmap() offset alignment - Improve GetProgramExecutableName() with getenv("_") - Use mkstemp() as basis for mktemp(), tmpfile(), tmpfd() - Fix flakes in pthread_cancel_test, unix_test, fork_test - Fix recently introduced futex stack overflow regression - Let sockets be passed as stdio to subprocesses on Windows - Improve security of bind() on Windows w/ SO_EXCLUSIVEADDRUSE	2023-07-29 18:44:15 -07:00
Justine Tunney	83341a4269	Remove hints from Windows imports	2023-07-27 14:09:07 -07:00
Justine Tunney	7e0a09feec	Mint APE Loader v1.5 This change ports APE Loader to Linux AARCH64, so that Raspberry Pi users can run programs like redbean, without the executable needing to modify itself. Progress has also slipped into this change on the issue of making progress better conforming to user expectations and industry standards regarding which symbols we're allowed to declare	2023-07-26 13:54:49 -07:00
Justine Tunney	e0c2b91b3e	Remove _Hide keyword It never did anything and isn't worthwhile as documentation.	2023-07-24 08:34:58 -07:00
Justine Tunney	42ba9901e4	Fix some behavioral issues on Windows	2023-07-09 09:59:22 -07:00
Justine Tunney	d7c79f43ef	Clean up more code - Found some bugs in LLVM compiler-rt library - The useless LIBC_STUBS package is now deleted - Improve the overflow checking story even further - Get chibicc tests working in MODE=dbg mode again - The libc/isystem/ headers now have correctly named guards	2023-06-18 01:00:05 -07:00
Justine Tunney	8ff48201ca	Rewrite .zip.o file linker This change takes an entirely new approach to the incremental linking of pkzip executables. The assets created by zipobj.com are now treated like debug data. After a .com.dbg is compiled, fixupobj.com should be run, so it can apply fixups to the offsets and move the zip directory to the end of the file. Since debug data doesn't get objcopy'd, a new tool has been introduced called zipcopy.com which should be run after objcopy whenever a .com file is created. This is all automated by the `cosmocc` toolchain which is rapidly becoming the new recommended approach. This change also introduces the new C23 checked arithmetic macros.	2023-06-10 09:29:44 -07:00
Justine Tunney	eb40cb371d	Get --ftrace working on aarch64 This change implements a new approach to function call logging, that's based on the GCC flag: -fpatchable-function-entry. Read the commentary in build/config.mk to learn how it works.	2023-06-05 23:35:31 -07:00
Justine Tunney	d04430f4ef	Get LIBC_MEM and LIBC_STDIO building with aarch64	2023-05-10 04:20:47 -07:00
Justine Tunney	dd04aeba1c	Increase stack size to 128k and guard size to 16k This improves our compatibility with Apple M1.	2022-12-18 22:58:29 -08:00
Justine Tunney	bf7843833f	Rename hidden keyword to _Hide	2022-11-08 12:55:28 -08:00
Justine Tunney	e522aa3a07	Make more threading improvements - ASAN memory morgue is now lockless - Make C11 atomics header more portable - Rewrote pthread keys support to be lockless - Simplify Python's unicode table unpacking code - Make crash report write(2) closer to being atomic - Make it possible to strace/ftrace a single thread - ASAN now checks nul-terminated strings fast and properly - Windows fork() now restores TLS memory of calling thread	2022-11-01 23:28:26 -07:00
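
Commit 2f48a02b44 says that functions using recursive mutexes can be made signal safe by using sigprocmask so that recursion cannot happen through a signal handler. The following is a minimal sketch of that idea using only standard POSIX calls; it is not code from the Cosmopolitan repository. The names `g_lock`, `g_counter`, `init_recursive_lock` and `guarded_increment` are hypothetical. In a multithreaded program, pthread_sigmask() would be the counterpart to use in place of sigprocmask().

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>

// Hypothetical state shared between normal code and signal handlers.
static pthread_mutex_t g_lock;
static int g_counter;

// Initializes g_lock as a recursive mutex. Returns 0 on success,
// otherwise the error number reported by the pthread call that failed.
int init_recursive_lock(void) {
  pthread_mutexattr_t attr;
  int rc = pthread_mutexattr_init(&attr);
  if (rc) return rc;
  rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
  if (!rc) rc = pthread_mutex_init(&g_lock, &attr);
  pthread_mutexattr_destroy(&attr);
  return rc;
}

// Increments the counter while holding the lock. Every signal is blocked
// for the duration of the critical section, so a handler that also takes
// g_lock cannot run and re-enter it while this function holds it.
int guarded_increment(void) {
  sigset_t all, old;
  sigfillset(&all);
  sigprocmask(SIG_BLOCK, &all, &old);
  pthread_mutex_lock(&g_lock);
  pthread_mutex_lock(&g_lock);  // ordinary same-thread recursion still works
  int value = ++g_counter;
  pthread_mutex_unlock(&g_lock);
  pthread_mutex_unlock(&g_lock);
  sigprocmask(SIG_SETMASK, &old, 0);  // restore the caller's signal mask
  return value;
}
```

Calling init_recursive_lock() once at startup and then guarded_increment() from regular code keeps the lock out of reach of handlers during the critical section. The commit's other option, setting the mutex to both recursive and pshared, is specific to Cosmopolitan and is not shown here.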

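
Commit 751d20d98d describes changing one compare and swap operation to use sequentially-consistent ordering rather than an acquire barrier. As an illustration of what that choice looks like with C11 atomics, here is a small sketch of a try-lock. It is not the *NSYNC code; the lock word `g_word` and the functions `try_acquire` and `release` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

// Hypothetical lock word: 0 means free, 1 means held.
static atomic_int g_word;

// Attempts to take the lock with a single compare and swap. The success
// ordering is memory_order_seq_cst; swapping in memory_order_acquire here
// would give the weaker variant. The failure ordering is relaxed because
// nothing is read on failure.
bool try_acquire(void) {
  int expected = 0;
  return atomic_compare_exchange_strong_explicit(
      &g_word, &expected, 1, memory_order_seq_cst, memory_order_relaxed);
}

// Releases the lock so a later try_acquire() can succeed.
void release(void) {
  atomic_store_explicit(&g_word, 0, memory_order_release);
}
```

A second try_acquire() fails until release() is called.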

110 commits
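
Commit ec480f5aa0 mentions that crash report signal handling now relies on SA_RESETHAND. A minimal sketch of how that flag behaves under POSIX follows: the handler runs once, and the signal's disposition is reset to the default when the handler is entered, so a repeat of the signal gets the default action. This is an illustration and not the ShowCrashReports() implementation; `install_one_shot`, `has_default_disposition` and `hits` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

// Number of times the handler has run.
static volatile sig_atomic_t g_hits;

static void on_signal(int sig) {
  (void)sig;
  ++g_hits;
}

// Installs a one-shot handler for sig. Returns 0 on success, -1 on error.
int install_one_shot(int sig) {
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = on_signal;
  sa.sa_flags = SA_RESETHAND;  // disposition resets on handler entry
  sigemptyset(&sa.sa_mask);
  return sigaction(sig, &sa, 0);
}

// Returns 1 if sig currently has the default disposition, otherwise 0.
int has_default_disposition(int sig) {
  struct sigaction cur;
  if (sigaction(sig, 0, &cur)) return 0;
  return cur.sa_handler == SIG_DFL;
}

// Returns how many times the handler has been invoked.
int hits(void) {
  return g_hits;
}
```

After install_one_shot(SIGUSR1) and a single raise(SIGUSR1), hits() reports one invocation and has_default_disposition(SIGUSR1) reports that the default action is back in place.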