Strings or byte array

Byte arrays can create some expected errors or bugs that can sometimes be hard to track down. Sometimes you might forget to do a copy because you forgot your own code or the code is not written by you.

The main advantage of a byte array is that it is mutable. However, right after building the contents, I think it is good to copy it into a string. Yes, there is some loss in speed, but compared to disk accesses and network calls, this operation is just a memcpy and should be highly optimized and negligible in cost.

Furthermore, we often need the byte array in strings, e.g., print out some debug info. You might be casting back and forth between string and []byte as the code base grows. I think it will be good to just stick to strings as much as possible, and use byte arrays only when you are building something.

May I convert some byte arrays to strings? I might be able to drop some byte copying or do some other optimizations. Would you all support such a change? @core-devs

To me, taking an extreme example, this sounds like, “hey, sometimes we have to convert between strings and integers, so we should just store all integers as strings”.

We go to pretty extreme lengths to avoid memory copying, so anything which adds a bit of memcpy here and a bit there can easily undermine whatever we’re doing to avoid such copies.

Also, why? Wouldn’t understanding slices be a better investment of time, so you know when to copy and when not to.

I understand slices. It is just that sometimes people forget and as the code gets more complicated, people might make mistakes. In particular, different places in the code might write to the same byte array and crazy things might happen. It is beyond just “understanding slices”, but being very careful about how the byte arrays are propagated and how they may be reused and shared. Another thing is that slice tricks on byte arrays might make these errors hard to detect. These tricks might be applied in some function you are using, or some other people’s code, or random places which you might not expect. They look like different byte arrays but they are referencing the same thing.

The other point is that there are several instances in the code where we cast byte array to strings, and who knows, if we’re not careful, it gets cast back again. Each cast is a copy. Why not just cast it to the most convenient and safest form as early as possible and reuse it?

The conversion between strings and integers is somewhat different. The reason is that the performance differences between ints and strings is huge for normal int operations. However, for byte arrays and strings, the performance gain is mainly from building the content itself. Other than that, there is little difference in performance…

Hmm I took a shower and thought about this.

If we are referencing value data from flatbuffers, we don’t want to copy. That makes sense.

There are not that many places where I really want to convert byte arrays to strings. One place is in tokenizing code and I have already made some changes there. There is a copying of byte arrays there anyway, so I thought we might as well use strings. Another place is posting lists’ keys.

Hmm… I honestly don’t buy the arguments about complexity of code, due to byte arrays / slices. Almost all the “tricks” that we do with byte arrays are coming out of a single page of slice tricks; and if we don’t use them, then we’d have to pay that cost in memory allocation. Also, byte slice is a primitive Go data type, and we shouldn’t discriminate against it in any way possible. Because, if you start thinking of standard data types as inherently complex and bug-inviting, then that sort of argument easily extends to any other data structure that one might dislike.

If you find places where we’re doing this double casting, then I agree with your point. We should avoid that. But, that is a consideration on a case by case basis.

Regarding keys, you have to convert them to byte arrays to pass it onto RocksDB, and you get byte arrays back from them. So, why convert them to strings? That is exactly the sort of double casting that you want to avoid.

Actually I don’t feel strongly about this change.

I did consider the double copying if we make key a string, but I suspect we can do it without an additional copy there.

Slice tricks is trivial stuff. But trivial stuff can lead to complexity. What I was cautioning against is also trivial stuff, namely passing mutable data around when there is not that much gain and/or when the gain tends to be erased somewhere else. There are more functions that work on strings but not byte arrays, e.g., printf. We might be casting byte arrays to strings and not caching the result. One example is casting of task.Query or some other flatbuffers’s attr to a string at many places in the code. These strings are not too long, so maybe we don’t have to be too concerned about it.

Once again, this is just something that occurs to me recently. I don’t feel strongly about it and will pursue other interesting directions.

Looking at the indexing code, I see where you’re coming from. Having [][]byte is definitely not great. And in that case, your work to have them as []string makes sense.

1 Like

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.