Tokenizer for indexer

jchiu · September 14, 2016, 9:10am

I am looking into having a tokenizer for our indexer. English and similar languages are easy to work with:

Normalize using golang.org/x/text/unicode/norm.
Tokenize using strings.Fields.

However, this will not work for non-segmented languages such as Chinese, where “terms” are concatenated without spaces, e.g., “这是斯坦福中文分词器测试” (“This is a Stanford Chinese word segmenter test”) should be split into its individual words rather than kept as a single token. See example here.
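
To make the limitation concrete, here is a minimal sketch of the whitespace approach. It assumes lowercasing as a stand-in for normalization so that it depends only on the standard library; the `tokenize` helper is a hypothetical name, not part of any existing code.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize is a hypothetical helper: it lowercases the input and splits it
// on Unicode whitespace, which is what strings.Fields does.
func tokenize(s string) []string {
	return strings.Fields(strings.ToLower(s))
}

func main() {
	// Space-delimited text splits into one token per word.
	fmt.Println(len(tokenize("This is a tokenizer test")))
	// Chinese text has no spaces, so the whole sentence stays one token.
	fmt.Println(len(tokenize("这是斯坦福中文分词器测试")))
}
```

The English sentence yields five tokens and the Chinese sentence yields one, which is why whitespace splitting alone is not enough for an index over CJK text.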

Candidate libs that I am considering:

NLTK. This is a Python lib. I don’t quite want to use it. Would prefer a Go or Java or C++ library.
OpenNLP (Java). This doesn’t seem to support Chinese. See list of models here.
Stanford CoreNLP (Java). Supports 6 different languages including Chinese.
ICU. This comes in both C++ and Java.
Some others such as SyntaxNet, http://spacy.io/…

~~My current inclination is to use Stanford CoreNLP.~~
My current inclination is to use ICU.

(My usual comment: Java has faster GC and more libs. I can easily get by without goroutines, channels, or nice language features and styles that are wants rather than needs. I would place more value on the amount of support and the libraries that we can use, i.e., getting the actual work done. Examples: lock-free hash maps, NLP libraries, and more in the future…)

mrjn · September 14, 2016, 9:36am

Java is also a big memory hog, and it needs to be installed on every machine on which you want to run a Java program. It has no particular benefits for concurrency. And it forces OOP at every nook and corner, creating an explosion of files. In fact, Java is full of wants, while Go is designed for needs.

If I had to, I’d go with C++ over Java any day. But, honestly, Go is a much nicer language to write in than both of these older ones.

You have a point, though, about the lack of libraries in Go. That is a problem that only time will fix, because adoption takes time to build. But, by that logic, we should stay away from anything new.

Alright. Going past standard rant.

If CoreNLP is Java based, we can’t use it in Go. Have you looked into Bleve to see how they do this?

Update: Also, there should be some C-based libs that we could use.

jchiu · September 14, 2016, 9:47am

Indeed, I stay away from very new things, like the biggest airplanes. I agree on the part about the memory usage. I do prefer C++ over Java and I must say C++ has made a lot of improvements over the years compared to Java. But Java 8 is starting to look better in terms of being less verbose. Programming languages are not stagnant. They do improve over time. As you like to say: Don’t reinvent, iterate. I think it is the same for programming languages. Hope we keep up to date with advancements in C++.

Ok, I think I have gotten everything off my chest. Will be quiet about this, I promise this time.

Bleve is just doing whitespace tokenizing, with some cleverer whitespace handling via regex. See https://github.com/blevesearch/bleve/blob/master/analysis/tokenizers/regexp_tokenizer/regexp_tokenizer.go. I don’t think it is capable of segmenting CJK text. Should we do the same for now?
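
For reference, a regex-based tokenizer can be sketched in a few lines with the standard `regexp` package. This is not Bleve's code; the pattern and the `regexTokenize` name are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordPattern matches runs of Unicode letters and digits. The pattern is an
// assumption for this sketch, not the one Bleve ships with.
var wordPattern = regexp.MustCompile(`[\p{L}\p{N}]+`)

// regexTokenize is a hypothetical helper that returns every match of
// wordPattern, so punctuation and whitespace act as separators.
func regexTokenize(s string) []string {
	return wordPattern.FindAllString(s, -1)
}

func main() {
	// Punctuation is dropped and each run of letters or digits is a token.
	fmt.Println(regexTokenize("hello, world-42"))
	// Han characters are all letters, so the sentence is one long match.
	fmt.Println(len(regexTokenize("这是斯坦福中文分词器测试")))
}
```

This handles punctuation better than splitting on whitespace, but an unbroken run of Han characters still comes back as a single token.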

I can continue looking.

mrjn · September 14, 2016, 9:51am

We need to get you something so your heart can feel the Go love! So, you can be one of us! (one of us! one of us! one of us!) @pulkit, can we send something to @jchiu – A Go tshirt and a nice little gopher?

We should keep looking. We need to get CJK in there if we can. I’m sure there are C/C++ implementations that we could call from Go via cgo. Or, figure out how hard it is to build this in Go, and you can probably build it for the entire Go community. Wouldn’t that be a bit of machine learning in there?

Actually, regarding iteration, using the same logic, we could probably just iterate on Neo4J!

jchiu · September 14, 2016, 9:53am

I did buy my wife a small gopher soft toy back when I was still at Google. She loves it.

mrjn · September 14, 2016, 9:54am

Maybe we should hire her!

jchiu · September 14, 2016, 9:57am

As promised, I shall not rant any further and shall not take the bait haha… (with regards to iterating on Neo4j)

mrjn · September 14, 2016, 9:59am

haha… Maybe we also need to send you a Dgraph shirt.

pulkit · September 14, 2016, 10:39am

We can think of getting a Go t-shirt printed.

pawan · September 15, 2016, 2:14am

We can explore using a C++ library. If that’s easy to embed within Dgraph and works well, then great. I suppose writing our own Go library would be time-consuming, but it would be a great contribution to the community. Though we should only do that if we can’t use an already stable one.

kostub · September 15, 2016, 5:46am

@jchiu have you looked at yanyiwu/gojieba (a Golang version of the “结巴” (Jieba) Chinese word segmenter) or awsong/MMSEGO (the MMSEG Chinese word splitting algorithm in Go)? Both are on GitHub and are written in Go. You’d still need something different for JK.
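
To give a feel for what dictionary-based word splitting involves, below is a minimal sketch of forward maximum matching, the simplest technique in this family. It is neither gojieba's nor MMSEGO's actual algorithm (MMSEG adds disambiguation rules on top of maximum matching), and the dictionary, the `segment` name, and the `maxLen` parameter are hypothetical.

```go
package main

import "fmt"

// segment performs forward maximum matching: at each position it takes the
// longest dictionary word (up to maxLen runes) that starts there, and falls
// back to a single rune when nothing matches.
func segment(text string, dict map[string]bool, maxLen int) []string {
	runes := []rune(text)
	var out []string
	for i := 0; i < len(runes); {
		n := maxLen
		if i+n > len(runes) {
			n = len(runes) - i
		}
		// Shrink the candidate until it is a known word or one rune long.
		for n > 1 && !dict[string(runes[i:i+n])] {
			n--
		}
		out = append(out, string(runes[i:i+n]))
		i += n
	}
	return out
}

func main() {
	// Hypothetical toy dictionary; a real segmenter ships a large lexicon.
	dict := map[string]bool{
		"这是": true, "斯坦福": true, "中文": true, "分词器": true, "测试": true,
	}
	fmt.Println(segment("这是斯坦福中文分词器测试", dict, 3))
	// prints [这是 斯坦福 中文 分词器 测试]
}
```

The quality of the result depends entirely on the dictionary, and Japanese and Korean would each need their own lexicon and rules.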

jchiu · September 16, 2016, 1:44am

Essentially we have two options.

Option 1: Use ICU

ICU4C is a mature C++ library with a lot of functionality. There are Go wrappers around ICU, but I haven’t found any that allow you to embed it. Embedding seems challenging. Yes, embedding brings a lot of convenience to the user, but in this case, as ICU is pretty popular, the user might already have it. (I have also tried out their segmenter on some Chinese text and it seems to work fine.)

Option 2: Native Go libs

I haven’t found any Go lib that supports a large number of languages. Chances are that we have to take in quite a number of dependencies to support a comfortable number of languages. Also, they might not be as mature and as well-maintained as ICU.

My current take

I am leaning towards using ICU and asking the user to install that dependency for now, then look into how to embed.

mrjn · September 16, 2016, 2:51am

SGTM! Let’s set this for v0.5.

system · November 28, 2017, 1:00am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
