Commit Graph

5 Commits

Author SHA1 Message Date
Theodore Ts'o 7fb6413336 unicode: update to Unicode 12.1.0 final
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
2019-05-12 13:26:08 -04:00
Masahiro Yamada 28ba53c076 unicode: refactor the rule for regenerating utf8data.h
scripts/mkutf8data is used only when regenerating utf8data.h,
which never happens in the normal kernel build. However, it is
irrespectively built if CONFIG_UNICODE is enabled.

Moreover, there is no good reason for it to reside in the scripts/
directory since it is only used in fs/unicode/.

Hence, move it from scripts/ to fs/unicode/.

In some cases, we bypass build artifacts in the normal build. The
conventional way to do so is to surround the code with ifdef REGENERATE_*.

For example,

 - 7373f4f83c ("kbuild: add implicit rules for parser generation")
 - 6aaf49b495 ("crypto: arm,arm64 - Fix random regeneration of S_shipped")

I rewrote the rule in a more kbuild'ish style.

In the normal build, utf8data.h is just shipped from the check-in file.

$ make
  [ snip ]
  SHIPPED fs/unicode/utf8data.h
  CC      fs/unicode/utf8-norm.o
  CC      fs/unicode/utf8-core.o
  CC      fs/unicode/utf8-selftest.o
  AR      fs/unicode/built-in.a

If you want to generate utf8data.h based on UCD, put *.txt files into
fs/unicode/, then pass REGENERATE_UTF8DATA=1 from the command line.
The mkutf8data tool will be automatically compiled to generate the
utf8data.h from the *.txt files.

$ make REGENERATE_UTF8DATA=1
  [ snip ]
  HOSTCC  fs/unicode/mkutf8data
  GEN     fs/unicode/utf8data.h
  CC      fs/unicode/utf8-norm.o
  CC      fs/unicode/utf8-core.o
  CC      fs/unicode/utf8-selftest.o
  AR      fs/unicode/built-in.a

I renamed the check-in utf8data.h to utf8data.h_shipped so that this
will work for the out-of-tree build.

You can update it based on the latest UCD like this:

$ make REGENERATE_UTF8DATA=1 fs/unicode/
$ cp fs/unicode/utf8data.h fs/unicode/utf8data.h_shipped

Also, I added entries to .gitignore and dontdiff.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-28 13:45:36 -04:00
Gabriel Krisman Bertazi 1215d239e7 unicode: update unicode database unicode version 12.1.0
Regenerate utf8data.h based on the latest UCD files and run tests
against the latest version.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25 13:59:17 -04:00
Olaf Weber a8384c6879 unicode: reduce the size of utf8data[]
Remove the Hangul decompositions from the utf8data trie, and do
algorithmic decomposition to calculate them on the fly. To store the
decomposition the caller of utf8lookup()/utf8nlookup() must provide a
12-byte buffer, which is used to synthesize a leaf with the
decomposition. This significantly reduces the size of the utf8data[]
array.

Changes made by Gabriel:
  Rebase to mainline
  Fix checkpatch errors
  Extract robustness fixes and merge back to original mkutf8data.c patch
  Regenerate utf8data.h

Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25 13:49:18 -04:00
Gabriel Krisman Bertazi 955405d117 unicode: introduce UTF-8 character database
The decomposition and casefolding of UTF-8 characters are described in a
prefix tree in utf8data.h, which is a generate from the Unicode
Character Database (UCD), published by the Unicode Consortium, and
should not be edited by hand.  The structures in utf8data.h are meant to
be used for lookup operations by the unicode subsystem, when decoding a
utf-8 string.

mkutf8data.c is the source for a program that generates utf8data.h. It
was written by Olaf Weber from SGI and originally proposed to be merged
into Linux in 2014.  The original proposal performed the compatibility
decomposition, NFKD, but the current version was modified by me to do
canonical decomposition, NFD, as suggested by the community.  The
changes from the original submission are:

  * Rebase to mainline.
  * Fix out-of-tree-build.
  * Update makefile to build 11.0.0 ucd files.
  * drop references to xfs.
  * Convert NFKD to NFD.
  * Merge back robustness fixes from original patch. Requested by
    Dave Chinner.

The original submission is archived at:

<https://linux-xfs.oss.sgi.narkive.com/Xx10wjVY/rfc-unicode-utf-8-support-for-xfs>

The utf8data.h file can be regenerated using the instructions in
fs/unicode/README.utf8data.

- Notes on the update from 8.0.0 to 11.0:

The structure of the ucd files and special cases have not experienced
any changes between versions 8.0.0 and 11.0.0.  8.0.0 saw the addition
of Cherokee LC characters, which is an interesting case for
case-folding.  The update is accompanied by new tests on the test_ucd
module to catch specific cases.  No changes to mkutf8data script were
required for the updates.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-04-25 13:38:44 -04:00