957c61cbbf
This change upgrades to GCC 12.3 and GNU binutils 2.42. The GNU linker appears to have changed things so that only a single de-duplicated str table is present in the binary, and it gets placed wherever the linker wants, regardless of what the linker script says. To cope with that, we need to stop using .ident to embed licenses. As such, this change does significant work to revamp how third-party licenses are defined in the codebase, using `.section .notice,"aR",@progbits`.
This new GCC 12.3 toolchain has support for GNU indirect functions, which lets us support __target_clones__ for the first time. So far this is used on x86 to optimize the performance of libc string functions such as strlen and friends, by ensuring AVX systems favor a second codepath that uses VEX encoding. It shaves some latency off certain operations. It's a useful feature to have for scientific computing, for the reasons explained by the test/libcxx/openmp_test.cc example, which compiles for fifteen different microarchitectures. Thanks to the upgrades, it's now also possible to use newer instruction sets, such as AVX512FP16 and VNNI.
Cosmo now uses the %gs register on x86 by default for TLS. This helps any program that links `cosmo_dlopen()`. Such programs had to recompile their binaries at startup to change the TLS instructions. That's not great, since it means every page in the executable needs to be faulted. The work of rewriting TLS-related x86 opcodes is moved to fixupobj.com instead. This is great news for macOS x86 users, since we previously needed to morph the binary every time for that platform, but now that's no longer necessary. The only platforms where we need fixup of TLS x86 opcodes at runtime are now Windows, OpenBSD, and NetBSD. On Windows we morph TLS to point deeper into the TIB, based on a TlsAlloc assignment, and on OpenBSD/NetBSD we morph %gs back into %fs since the kernels do not allow us to specify a value for the %gs register.
OpenBSD users are now required to use APE Loader to run Cosmo binaries, and assimilation is no longer possible. The OpenBSD kernel needs to change to allow programs to specify a value for the %gs register, or it needs to stop marking executable pages loaded by the kernel as mimmutable(). This release fixes __constructor__, .ctor, .init_array, and lastly the .preinit_array so they behave the exact same way as glibc. We no longer use hex constants to define math.h symbols like M_PI.
- test
- alloc.c
- as.c
- as.main.c
- asm.c
- BUILD.mk
- cast.c
- chibicc.c
- chibicc.h
- chibicc.main.c
- codegen.c
- dox1.c
- dox2.c
- file.c
- file.h
- fpclassify.c
- hashmap.c
- help.txt
- kw.c
- kw.gperf
- kw.h
- kw.inc
- NOTICE
- parse.c
- preprocess.c
- printast.c
- pybind.c
- README.cosmo
- README.md
- strarray.c
- tokenize.c
- type.c
- unicode.c
chibicc: A Small C Compiler
(The old master has moved to historical/old branch. This is a new one uploaded in September 2020.)
chibicc is yet another small C compiler that implements most C11 features. Even though it still probably falls into the "toy compilers" category just like other small compilers do, chibicc can compile several real-world programs, including Git, SQLite and libpng, without making modifications to the compiled programs. Generated executables of these programs pass their corresponding test suites. So, chibicc actually supports a wide variety of C11 features and is able to compile hundreds of thousands of lines of real-world C code correctly.
chibicc is developed as the reference implementation for a book I'm currently writing about C compilers and low-level programming. The book covers this vast topic with an incremental approach: in the first chapter, readers implement a "compiler" that accepts just a single number as a "language", which then gains one feature at a time in each section of the book until the language the compiler accepts matches what the C11 spec specifies. I took this incremental approach from the paper by Abdulaziz Ghuloum.
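To make that first step concrete, here is a minimal sketch of a "compiler" whose entire input language is a single integer. It is an illustration under assumptions, not code taken from the book or from this repository: the function name `compile_number`, the buffer-based interface, and the choice of x86-64 AT&T-syntax output are all hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: "compiles" a source string that must consist of one
// decimal integer into x86-64 assembly (AT&T syntax) for a main() that
// returns that integer. Returns 0 on success, -1 on bad input or if the
// output buffer is too small.
int compile_number(const char *src, char *out, size_t cap) {
  char *end;
  errno = 0;
  long value = strtol(src, &end, 10);
  if (end == src || *end != '\0' || errno == ERANGE)
    return -1;  // empty input, trailing junk, or out-of-range number

  int n = snprintf(out, cap,
                   "  .globl main\n"
                   "main:\n"
                   "  mov $%ld, %%rax\n"
                   "  ret\n",
                   value);
  return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

On x86-64 Linux, assembling and linking the text produced for `42` should give a program that exits with status 42. Each later step would then widen the accepted language a little, for example by allowing `+` and `-` between numbers.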
Each commit of this project corresponds to a section of the book. For this purpose, not only the final state of the project but each commit was carefully written with readability in mind. Readers should be able to learn how a C language feature can be implemented just by reading one or a few commits of this project. For example, this is how while, [], ?:, and thread-local variables are implemented. If you have plenty of spare time, it might be fun to read it from the first commit.
If you like this project, please consider purchasing a copy of the book when it becomes available! 😀 I publish the source code here to give people early access to it, because I was planning to do that anyway with a permissive open-source license after publishing the book. If I don't charge for the source code, it doesn't make much sense to me to keep it private. I hope to publish the book in 2021.
I pronounce chibicc as chee bee cee cee. "chibi" means "mini" or "small" in Japanese. "cc" stands for C compiler.
Status
Features that are often missing in a small compiler but supported by chibicc include (but are not limited to):
- Preprocessor
- long double (x87 80-bit floating point numbers)
- Bit-field
- alloca()
- Variable-length array
- Thread-local variable
- Atomic variable
- Common symbol
- Designated initializer
- L, u, U and u8 string literals
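As an illustration of what several of the listed features look like in source form, the following sketch uses a bit-field, a designated initializer, a variable-length array, a thread-local variable, and an atomic variable. It is plain C11 with hypothetical names (`Flags`, `sum_to`, `record_hit`, and so on); it has not been verified against chibicc itself, so treat it as an example of the language features rather than as a chibicc test case.

```c
// Bit-field: two fields packed into one unsigned int.
struct Flags {
  unsigned ready : 1;
  unsigned mode : 3;
};

// Thread-local and atomic variables (C11 keywords).
static _Thread_local int tls_counter;
static _Atomic int hits;

// Variable-length array: the array size is only known at run time.
// The caller must pass n > 0.
int sum_to(int n) {
  int values[n];
  int total = 0;
  for (int i = 0; i < n; i++)
    values[i] = i + 1;
  for (int i = 0; i < n; i++)
    total += values[i];
  return total;
}

// Designated initializer: fields are set by name, in any order.
int flags_mode(void) {
  struct Flags f = {.mode = 5, .ready = 1};
  return f.mode;
}

// Bumps the per-thread counter and the shared atomic counter, and returns
// the per-thread count.
int record_hit(void) {
  tls_counter++;
  hits++;  // atomic read-modify-write on an _Atomic object
  return tls_counter;
}

// Reads the shared counter.
int total_hits(void) {
  return hits;
}
```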
chibicc does not support digraphs, trigraphs, complex numbers, K&R-style function prototypes, or inline assembly.
chibicc outputs a simple but nice error message when it finds an error in source code.
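One common shape for such a message is the offending source line followed by a caret under the error location. The sketch below shows that idea only; the exact format chibicc prints is not specified here, and the helper name `format_error` and its parameters are hypothetical.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical helper: writes "<file>:<line>: <source line>" followed by a
// second line with a caret under byte offset `pos` of `src`, then `msg`.
// `src` is the whole input text. Returns the snprintf() result.
int format_error(char *out, size_t cap, const char *file, const char *src,
                 size_t pos, const char *msg) {
  // Find the start and end of the line containing src[pos].
  const char *loc = src + pos;
  const char *begin = loc;
  while (begin > src && begin[-1] != '\n')
    begin--;
  const char *end = loc;
  while (*end != '\0' && *end != '\n')
    end++;

  // Count newlines before the line to get its 1-based line number.
  int line_no = 1;
  for (const char *p = src; p < begin; p++)
    if (*p == '\n')
      line_no++;

  // Width of the "file:line: " prefix, so the caret lines up with the text.
  int indent = snprintf(NULL, 0, "%s:%d: ", file, line_no);
  int column = (int)(loc - begin);

  return snprintf(out, cap, "%s:%d: %.*s\n%*s^ %s\n", file, line_no,
                  (int)(end - begin), begin, indent + column, "", msg);
}
```

The caret position is counted in bytes, so lines containing tabs or multi-byte characters would need extra handling to stay aligned.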
There's no optimization pass. chibicc emits terrible code which is probably at least twice as slow as GCC's output. I have a plan to add an optimization pass once the frontend is done.
Internals
chibicc consists of the following stages:
- Tokenize: A tokenizer takes a string as input, breaks it into a list of tokens, and returns them.
- Preprocess: A preprocessor takes a list of tokens as input and outputs a new list of macro-expanded tokens. It interprets preprocessor directives while expanding macros.
- Parse: A recursive descent parser constructs abstract syntax trees from the output of the preprocessor. It also adds a type to each AST node.
- Codegen: A code generator emits assembly text for the given AST nodes.
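The flow between these stages can be sketched with a toy language of integers joined by `+` and `-`. This is a simplified illustration, not chibicc's code: all names (`tokenize`, `parse`, `gen`, `compile_expr`, `Out`) are hypothetical, the preprocess stage is omitted because the toy language has no macros, and the parser does not attach types to nodes.

```c
#include <ctype.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

// Tokenize: break the input string into a flat array of tokens.
typedef enum { TK_NUM, TK_PUNCT, TK_EOF } TokenKind;
typedef struct { TokenKind kind; long val; char op; } Token;

// Returns the number of tokens written (including TK_EOF), or -1 on error.
int tokenize(const char *p, Token *toks, int cap) {
  int n = 0;
  if (cap < 1) return -1;
  while (*p) {
    if (isspace((unsigned char)*p)) { p++; continue; }
    if (n >= cap - 1) return -1;  // keep one slot free for TK_EOF
    if (isdigit((unsigned char)*p)) {
      char *end;
      long val = strtol(p, &end, 10);
      toks[n++] = (Token){TK_NUM, val, 0};
      p = end;
    } else if (*p == '+' || *p == '-') {
      toks[n++] = (Token){TK_PUNCT, 0, *p};
      p++;
    } else {
      return -1;  // character outside the toy language
    }
  }
  toks[n++] = (Token){TK_EOF, 0, 0};
  return n;
}

// Parse: build an AST with recursive descent. Grammar:
//   expr = num (("+" | "-") num)*
typedef enum { ND_NUM, ND_ADD, ND_SUB } NodeKind;
typedef struct Node { NodeKind kind; long val; struct Node *lhs, *rhs; } Node;

static Node *new_node(NodeKind kind, long val, Node *lhs, Node *rhs) {
  Node *nd = calloc(1, sizeof(Node));
  if (!nd) abort();
  nd->kind = kind; nd->val = val; nd->lhs = lhs; nd->rhs = rhs;
  return nd;
}

// Returns NULL on a syntax error. Nodes are never freed in this sketch.
Node *parse(const Token *tok) {
  if (tok->kind != TK_NUM) return NULL;
  Node *node = new_node(ND_NUM, tok->val, NULL, NULL);
  tok++;
  while (tok->kind == TK_PUNCT) {
    NodeKind kind = tok->op == '+' ? ND_ADD : ND_SUB;
    tok++;
    if (tok->kind != TK_NUM) return NULL;
    node = new_node(kind, 0, node, new_node(ND_NUM, tok->val, NULL, NULL));
    tok++;
  }
  return tok->kind == TK_EOF ? node : NULL;
}

// Codegen: walk the AST and append x86-64 (AT&T syntax) text to a buffer.
typedef struct { char *buf; size_t cap, len; } Out;

static void emit(Out *o, const char *fmt, ...) {
  if (o->len >= o->cap) return;  // buffer already full: drop the text
  va_list ap;
  va_start(ap, fmt);
  int n = vsnprintf(o->buf + o->len, o->cap - o->len, fmt, ap);
  va_end(ap);
  if (n > 0) o->len += (size_t)n;
}

// Emits code that leaves the value of `nd` in %rax.
static void gen(const Node *nd, Out *o) {
  if (nd->kind == ND_NUM) {
    emit(o, "  mov $%ld, %%rax\n", nd->val);
    return;
  }
  gen(nd->rhs, o);
  emit(o, "  push %%rax\n");
  gen(nd->lhs, o);
  emit(o, "  pop %%rdi\n");
  emit(o, "  %s %%rdi, %%rax\n", nd->kind == ND_ADD ? "add" : "sub");
}

// Driver: run the stages in order. Returns 0 on success, -1 on failure.
int compile_expr(const char *src, char *out, size_t cap) {
  Token toks[64];
  if (tokenize(src, toks, 64) < 0) return -1;
  Node *ast = parse(toks);
  if (!ast) return -1;
  Out o = {out, cap, 0};
  emit(&o, "  .globl main\nmain:\n");
  gen(ast, &o);
  emit(&o, "  ret\n");
  return o.len < cap ? 0 : -1;  // -1 if the output was truncated
}
```

Calling `compile_expr("1 + 2 - 3", buf, sizeof buf)` fills `buf` with assembly for a `main` that evaluates the expression on a small push/pop stack machine. Because each stage only consumes the previous stage's output, a stage can be replaced or extended without touching the others.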
Contributing
When I find a bug in this compiler, I go back to the original commit that introduced the bug and rewrite the commit history as if there were no such bug from the beginning. This is an unusual way of fixing bugs, but as a part of a book, it is important to keep every commit bug-free.
Thus, I do not take pull requests in this repo. You can send me a pull request if you find a bug, but it is very likely that I will read your patch and then apply that to my previous commits by rewriting history. I'll credit your name somewhere, but your changes will be rewritten by me before being submitted to this repository.
Also, please assume that I will occasionally force-push my local repository to this public one to rewrite history. If you clone this project and make local commits on top of it, your changes will have to be rebased by hand when I force-push new commits.
About the Author
I'm Rui Ueyama. I'm the creator of 8cc, which is a hobby C compiler, and also the original creator of the current version of the LLVM lld linker, which is a production-quality linker used by various operating systems and large-scale build systems.
References
- tcc: A small C compiler written by Fabrice Bellard. I learned a lot from this compiler, but the designs of tcc and chibicc are different. In particular, tcc is a one-pass compiler, while chibicc is a multi-pass one.
- lcc: Another small C compiler. The creators wrote a book about the internals of lcc, which I found a good resource to see how a compiler is implemented.