english : use typos to fix comments and logs (#4354)

This commit is contained in:
Richard Kiss 2023-12-12 01:53:36 -08:00 committed by GitHub
parent 6138963fb2
commit 9494d7c477
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
17 changed files with 34 additions and 34 deletions

View file

@ -51,7 +51,7 @@ def bytes_to_unicode():
The reversible bpe codes work on unicode strings.
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
This is a signficant percentage of your normal, say, 32K bpe vocab.
This is a significant percentage of your normal, say, 32K bpe vocab.
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
And avoids mapping to whitespace/control characters the bpe code barfs on.
"""