Tokenizer for indexer

@core-devs

I am looking into having a tokenizer for our indexer. English and similar languages are easy to work with: splitting on whitespace and punctuation is usually enough.

However, this will not work for non-segmented languages such as Chinese where “terms” are concatenated together, e.g., “这是斯坦福中文分词器测试” should tokenize into “这 是 斯坦福 中文 分词器 测试.” See example here.
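To make the problem concrete, here is a minimal Go sketch (not code from our indexer; `whitespaceTokens` is a hypothetical helper name) showing that plain whitespace splitting handles an English sentence but returns the whole Chinese sentence as a single token:

```go
package main

import (
	"fmt"
	"strings"
)

// whitespaceTokens splits text on Unicode whitespace.
// Hypothetical helper, for illustration only.
func whitespaceTokens(text string) []string {
	return strings.Fields(text)
}

func main() {
	// English: words are separated by spaces, so this yields 5 tokens.
	fmt.Println(len(whitespaceTokens("this is a tokenizer test")))

	// Chinese: there is no whitespace, so the sentence comes back as 1 token.
	fmt.Println(len(whitespaceTokens("这是斯坦福中文分词器测试")))
}
```

Any tokenizer we pick needs to do better than this on the second input.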

Candidate libs that I am considering:

  • NLTK. This is a Python lib. I don’t quite want to use it. Would prefer a Go or Java or C++ library.
  • OpenNLP (Java). This doesn’t seem to support Chinese. See list of models here.
  • Stanford CoreNLP (Java). Supports 6 different languages including Chinese.
  • ICU. This comes in both C++ and Java.
  • Some others, such as SyntaxNet and spaCy (http://spacy.io/).

My current inclination is to use Stanford CoreNLP.
Update: My current inclination is now to use ICU.

(My usual comment: Java has faster GC and more libs. I can easily get by without goroutines, channels, or nice language features and styles that are wants but not needs. I would place more value on the amount of support and the libraries that we can use, i.e., getting the actual work done. Examples: lock-free hash maps, NLP libraries, and more in the future…)

Java is also a big memory hog, and it needs to be installed on every machine that you want to run a Java program on. It has no particular benefits for concurrency, and it forces OOP into every nook and corner, creating an explosion of files. In fact, Java is full of wants, while Go is designed for needs.

If I had to, I’d go with C++ over Java any day. But, honestly, Go is a much nicer language to write in than both of these older ones.

You have a point, though, about the lack of libraries in Go. That is a problem that only time will fix, because adoption takes time to build. But, by that logic, we should stay away from anything new.

Alright. Going past standard rant.

If CoreNLP is Java based, we can’t use it in Go. Have you looked into Bleve to see how they do this?

Update: Also there should be some C based libs, that we could use.

Indeed, I stay away from very new things, like the biggest airplanes. I agree on the part about the memory usage. I do prefer C++ over Java and I must say C++ has made a lot of improvements over the years compared to Java. But Java 8 is starting to look better in terms of being less verbose. Programming languages are not stagnant. They do improve over time. As you like to say: Don’t reinvent, iterate. I think it is the same for programming languages. Hope we keep up to date with advancements in C++.

Ok, I think I have gotten everything off my chest. I will be quiet about this, I promise this time.

Bleve is just doing whitespace tokenizing. It does some cleverer whitespace splitting with a regex. See https://github.com/blevesearch/bleve/blob/master/analysis/tokenizers/regexp_tokenizer/regexp_tokenizer.go. I don’t think it is capable of segmenting CJK text. Should we do the same for now?
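For reference, a regex-based tokenizer along those lines could look roughly like the sketch below. This is not Bleve's code or API; the pattern and the `regexTokens` name are my own assumptions. It treats each maximal run of Unicode letters and digits as a token, which drops punctuation for English but still leaves a Chinese sentence as one long token:

```go
package main

import (
	"fmt"
	"regexp"
)

// wordRun matches maximal runs of Unicode letters and digits.
// The pattern is an assumption for illustration, not Bleve's default.
var wordRun = regexp.MustCompile(`[\p{L}\p{N}]+`)

// regexTokens returns every non-overlapping match of wordRun in text.
func regexTokens(text string) []string {
	return wordRun.FindAllString(text, -1)
}

func main() {
	// Punctuation is dropped; letters and digits form the tokens.
	fmt.Println(regexTokens("Hello, world!"))

	// Han characters are all letters, so the sentence is one unbroken run.
	fmt.Println(len(regexTokens("这是斯坦福中文分词器测试")))
}
```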

I can continue looking.

We need to get you something so your heart can feel the Go love! So, you can be one of us! (one of us! one of us! one of us!) @pulkit, can we send something to @jchiu – A Go tshirt and a nice little gopher?

We should keep looking. We need to get CJK in there if we can. I’m sure there are C/C++ implementations that we could call from Go via cgo. Or, figure out how hard it is to build this in Go, and you can probably build it for the entire Go community. Wouldn’t that be a bit of machine learning in there?


Actually, regarding iteration, using the same logic, we could probably just iterate on Neo4j!

I did buy my wife a small gopher soft toy back when I was still at Google. She loves it :slight_smile:

Maybe we should hire her!

As promised, I shall not rant any further and shall not take the bait haha… (with regards to iterating on Neo4j)

haha… Maybe we also need to send you a Dgraph shirt :stuck_out_tongue:.

We can think of getting a Go T-shirt printed.

We can explore using a C++ library. If that’s easy to embed within Dgraph and works well, then great. I suppose writing our own Go library would be time consuming, but would be a great contribution to the community. Though we should only do that if we can’t use an already stable one.

@jchiu have you looked at https://github.com/yanyiwu/gojieba or https://github.com/awsong/MMSEGO? They seem to be Chinese word splitting algorithms written in Go. You’d still need something different for JK.
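For anyone unfamiliar with dictionary-based word splitting, the basic idea can be sketched as greedy forward maximum matching: at each position, take the longest dictionary word that starts there, and fall back to a single character. This is only a simplified illustration, not the code of either library above; the `segment` function and the toy dictionary are made up for the example, and real segmenters add large dictionaries plus disambiguation rules or statistical models on top.

```go
package main

import "fmt"

// segment performs greedy forward maximum matching over the runes of text.
// maxLen is the longest dictionary word length (in runes) to try.
func segment(text string, dict map[string]bool, maxLen int) []string {
	if maxLen < 1 {
		maxLen = 1 // always consume at least one rune per step
	}
	runes := []rune(text)
	var tokens []string
	for i := 0; i < len(runes); {
		end := i + maxLen
		if end > len(runes) {
			end = len(runes)
		}
		// Shrink the candidate until it is a dictionary word or a single rune.
		for end > i+1 && !dict[string(runes[i:end])] {
			end--
		}
		tokens = append(tokens, string(runes[i:end]))
		i = end
	}
	return tokens
}

func main() {
	// Toy dictionary, made up for this example.
	dict := map[string]bool{"斯坦福": true, "中文": true, "分词器": true, "测试": true}
	fmt.Println(segment("这是斯坦福中文分词器测试", dict, 3))
	// prints: [这 是 斯坦福 中文 分词器 测试]
}
```

With this toy dictionary the output matches the segmentation quoted in the first post, but greedy matching picks the wrong split on ambiguous text, which is part of why a mature library is attractive.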

Essentially we have two options.

Option 1: Use ICU

ICU4C is a mature C++ library with a lot of functionality. There are Go wrappers around ICU, but I haven’t found any that allow you to embed it. Embedding seems challenging. Yes, embedding brings a lot of convenience to the user, but in this case, as ICU is pretty popular, the user might already have it. (I have also tried out its segmenter on some Chinese text and it seems to work fine.)

Option 2: Native Go libs

I haven’t found any Go lib that supports a large number of languages. Chances are that we have to take in quite a number of dependencies to support a comfortable number of languages. Also, they might not be as mature and as well-maintained as ICU.

My current take

I am leaning towards using ICU and asking the user to install that dependency for now, then looking into how to embed it later.

SGTM! Let’s set this for v0.5.
