I am looking into adding a tokenizer to our indexer. English and similar languages are easy to handle:
- Normalize using text/unicode/norm.
- Tokenize by splitting on whitespace and punctuation (rough sketch below).
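A minimal sketch of that pipeline, assuming the indexer is in Go and the norm package is golang.org/x/text/unicode/norm; the lowercasing and the split rule are illustrative choices, not a final design:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"

	"golang.org/x/text/unicode/norm"
)

// tokenize NFKC-normalizes and lowercases the input, then splits on anything
// that is not a letter or a digit. Good enough for English and similar languages.
func tokenize(s string) []string {
	s = strings.ToLower(norm.NFKC.String(s))
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

func main() {
	fmt.Println(tokenize("Café déjà-vu: the 2nd TEST!"))
	// Output: [café déjà vu the 2nd test]
}
```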
However, this will not work for non-segmented languages such as Chinese, where words are written without delimiters between them. For example, “这是斯坦福中文分词器测试” (“this is a test of the Stanford Chinese word segmenter”) should tokenize into “这 是 斯坦福 中文 分词器 测试”. See example here.
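To pin the requirement down, here it is as a Go test. Segment is a hypothetical placeholder for whichever segmenter backend we end up choosing:

```go
package tokenizer

import (
	"reflect"
	"testing"
)

// Segment is a placeholder for whichever Chinese word segmenter we wire in
// (CoreNLP, ICU, ...). It should split a sentence into terms.
func Segment(s string) []string {
	panic("not implemented: needs a real segmenter backend")
}

func TestSegmentChinese(t *testing.T) {
	got := Segment("这是斯坦福中文分词器测试")
	want := []string{"这", "是", "斯坦福", "中文", "分词器", "测试"}
	if !reflect.DeepEqual(got, want) {
		t.Errorf("Segment() = %v, want %v", got, want)
	}
}
```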
Candidate libs that I am considering:
- NLTK (Python). I would rather not use it; I would prefer a Go, Java, or C++ library.
- OpenNLP (Java). This doesn’t seem to support Chinese. See list of models here.
- Stanford CoreNLP (Java). Supports 6 different languages including Chinese (a sketch of calling its HTTP server follows this list).
- ICU (C++ and Java). Its BreakIterator does dictionary-based word segmentation for CJK.
- Some others such as SyntaxNet, http://spacy.io/…
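If we go with CoreNLP, one low-friction way to call it from a Go indexer is its bundled HTTP server. This is a rough sketch, assuming a server is already running locally with the Chinese properties file loaded; the port, annotator list, and error handling are assumptions to be checked against the CoreNLP docs:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"strings"
)

// corenlpResp mirrors only the part of the CoreNLP server's JSON output we need.
type corenlpResp struct {
	Sentences []struct {
		Tokens []struct {
			Word string `json:"word"`
		} `json:"tokens"`
	} `json:"sentences"`
}

// segment posts text to a locally running CoreNLP server (assumed to be
// started with the Chinese properties file) and returns the token words.
func segment(text string) ([]string, error) {
	props := `{"annotators":"tokenize,ssplit","outputFormat":"json"}`
	u := "http://localhost:9000/?properties=" + url.QueryEscape(props)
	resp, err := http.Post(u, "text/plain; charset=utf-8", strings.NewReader(text))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	var r corenlpResp
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	var terms []string
	for _, s := range r.Sentences {
		for _, t := range s.Tokens {
			terms = append(terms, t.Word)
		}
	}
	return terms, nil
}

func main() {
	terms, err := segment("这是斯坦福中文分词器测试")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(strings.Join(terms, " ")) // expected: 这 是 斯坦福 中文 分词器 测试
}
```

ICU, by contrast, would be linked in directly (e.g., ICU4C via cgo, or ICU4J on the JVM), which would avoid running a separate server process.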
My current inclination is to use either Stanford CoreNLP or ICU.
(My usual comment: Java has a faster GC and more libraries. I can easily get by without goroutines, channels, or other nice language features and styles; those are wants, not needs. I would place more value on the amount of support and libraries we can use, i.e., on getting the actual work done. Examples: lock-free hash maps, NLP libraries, and more in the future…)