Make the BERT protocol encoding aware #24

tenderlove · 2016-04-14T01:23:15Z

This PR makes the BERT protocol aware of Ruby encodings. I have updated the Encoder and the Decoder as follows:

Decoder

The decoder now looks for a magic header and changes parsing styles based on the header it finds. If the header is MAGIC, then it knows that the encoder didn't add any encoding information and uses the old decoder. If the header is VERSION2, then it knows that the encoder added string encoding information and it adds the encoding information to the string.

Encoder

The encoder is split in to version 1 and version 2 encoding. It defaults to version 1, and version 2 is enabled with a feature flag. The reason I did this is so that we can deploy the updated bert gem to both the client and the server, then flip the feature flag on for the client, then flip it on for the server. That way we can upgrade without downtime.

Here is an IRB session to demonstrate the encoding support:

[aaron@TC bert (encoding)]$ irb -I lib:ext
irb(main):001:0> require 'bert'
=> true
irb(main):002:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語")).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語"))
=> "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"
irb(main):004:0> BERT::Encode.version = :v2
=> :v2
irb(main):005:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語")).encoding
=> #<Encoding:UTF-8>
irb(main):006:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語"))
=> "日本語"
irb(main):007:0>

/cc @brianmario @tmm1 @charliesome @piki

this way we can configure the callbacks to something else at runtime.

This adds an encoding field after the string so that you can apply an encoding to the string sent across the wire.

This commit makes the encoder default to version 1 of the BERT encoding scheme, but allows you to turn on version 2 via a feature flag.

piki · 2016-04-14T03:24:44Z

This is a good direction. Any measurements for how much of a performance hit we take with any sort of representative workload? This change essentially doubles the network, RAM, and CPU cost of short strings. How much difference does it make in a real RPC scenario?

If performance is at all an issue, we could indicate common encodings (I expect to see a lot of UTF-8) with something cheaper than a second String.

brianmario · 2016-04-14T04:48:08Z

If performance is at all an issue, we could indicate common encodings (I expect to see a lot of UTF-8) with something cheaper than a second String.

In mochilo we used genperf to create a perfect hash lookup table mapping Ruby encoding names to integer values. I know using the actual encoding id from ruby itself isn't deterministic (depends on load order or something?) but maybe we could do something similar to mochilo one-time and as Ruby adds new encodings we can add them to the end of the list as new ids?

That way we could use a single byte to represent the encoding and map it to/from the encoding name on the server and client at [de]serialization time? That is, if performance is a big issue here...

tenderlove · 2016-04-14T16:58:53Z

It looks like finding an encoding by name is just a hash lookup (given the encoding has been loaded once already), so I don't think we need to worry about the speed of finding the encoding object.

I agree that representing the encoding as an integer would reduce the payload size, but I don't know if it matters. Using MIME would require even more network chatter than this. If we could guarantee that all strings in one payload were of the same type, then we could really reduce data. But since strings in Ruby are allowed to have different encodings, I don't know if that's something we can / want to enforce.

piki · 2016-04-14T17:29:45Z

It's the number and collective size of allocations that concerns me. Grabbing a list of all the refs in a repository, or all the refs on a particular branch, could be hundreds of thousands of small strings. This change just doubles the serialization, deserialization, and allocation costs. An int would definitely be cheaper.

tenderlove · 2016-04-14T17:51:05Z

@piki do you know if payloads usually have homogenous encodings? If stuff is usually UTF-8, maybe we could only add extra encoding information if it's not UTF-8. If strings are usually UTF-8, then we'd only incur overhead for "special" strings.

carlosmn · 2016-04-14T18:16:44Z

I would expect 90% of the cases to be utf8 strings, since that's the encoding we use (assume) when we extract Git data via rugged, or to be raw binary like from spawning a child process.

Having a way to specify both of these via a single int should help a lot with the smaller strings (and would still allow us to specify a different encoding if it's neither of these).

This commit adds two new types, one for unicode strings and one for other encoded strings. Unocide strings have no extra wire protocol overhead, where "other" strings send the encoding name along with the string.

tenderlove · 2016-04-14T22:17:10Z

Ok, I added two new types to the protocol. One type is for UTF-8 strings, and one type is for "other" encoded strings. Strings tagged as ASCII-8BIT will use the BIN type, so there's no change in protocol overhead for them. The only strings that have additional overhead now are strings that are tagged with some encoding other than UTF8, US-ASCII, and ASCII-8BIT.

Here's a demo of UTF8 strings:

[aaron@TC bert (encoding)]$ irb -I lib
irb(main):001:0> require 'bert'
=> true
irb(main):002:0> BERT::Encoder.encode("日本語").bytesize
=> 15
irb(main):003:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語")).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語"))
=> "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"
irb(main):005:0> BERT::Encode.version = :v2
=> :v2
irb(main):006:0> BERT::Encoder.encode("日本語").bytesize
=> 15
irb(main):007:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語")).encoding
=> #<Encoding:UTF-8>
irb(main):008:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語"))
=> "日本語"
irb(main):009:0>

Binary strings:

[aaron@TC bert (encoding)]$ irb -I lib -rbert
irb(main):001:0> BERT::Encoder.encode("日本語".b).bytesize
=> 15
irb(main):002:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語".b)).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> BERT::Encode.version = :v2
=> :v2
irb(main):004:0> BERT::Encoder.encode("日本語".b).bytesize
=> 15
irb(main):005:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語".b)).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):006:0>

US-ASCII:

[aaron@TC bert (encoding)]$ irb -I lib -rbert
irb(main):001:0> BERT::Encoder.encode("teelo".encode('US-ASCII')).bytesize
=> 11
irb(main):002:0> BERT::Decoder.decode(BERT::Encoder.encode("teelo".encode('US-ASCII'))).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> BERT::Encode.version = :v2
=> :v2
irb(main):004:0> BERT::Encoder.encode("teelo".encode('US-ASCII')).bytesize
=> 11
irb(main):005:0> BERT::Decoder.decode(BERT::Encoder.encode("teelo".encode('US-ASCII'))).encoding
=> #<Encoding:UTF-8>
irb(main):006:0>

And finally Shift_JIS:

[aaron@TC bert (encoding)]$ irb -I lib -rbert
irb(main):001:0> BERT::Encoder.encode("日本語".encode('Shift_JIS')).bytesize
=> 12
irb(main):002:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語".encode('Shift_JIS'))).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> BERT::Encode.version = :v2
=> :v2
irb(main):004:0> BERT::Encoder.encode("日本語".encode('Shift_JIS')).bytesize
=> 26
irb(main):005:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語".encode('Shift_JIS'))).encoding
=> #<Encoding:Shift_JIS>
irb(main):006:0>

Note that US-ASCII gets lifted to UTF8, but I think that doesn't matter, and the Shift_JIS string is a bit longer when encoded because it embeds the encoding name.

brianmario · 2016-04-14T22:24:41Z

Nice work!

Note that US-ASCII gets lifted to UTF8, but I think that doesn't matter

👍 to that

piki · 2016-04-14T23:28:01Z

I like it. 👍

haileys · 2016-04-15T00:30:17Z

Looks great ✨

carlosmn · 2016-04-15T17:22:20Z

ext/bert/c/decode.c

@@ -14,9 +15,12 @@
 #define ERL_BIN           109
 #define ERL_SMALL_BIGNUM  110
 #define ERL_LARGE_BIGNUM  111
+#define ERL_ENC_STRING    112
+#define ERL_UNICODE_STRING 113


IIUC this is the first time we're diverging from erlang's External Term Format. And as said format does have these values (for NEW_FUN_EXT and EXPORT_EXT resp.), I think a comment explaining that we knowingly diverge here for version 2 would be useful.

Are there other (available) values we should use instead?

Adherence to erlang formats isn't important for GitHub's uses of BERT -- which all involve Ruby -- but I could imagine it being an issue if anyone else is still using BERT with erlang.

Assuming the online docs are up to date, numbers from 120 onwards are free, but they could probably add new types at any point.

We could reduce the divergence by using 107 (STRING_EXT) for utf8 strings whose length fits within 2 bytes and go with ENC_STRING if they're bigger, but we'd still re-use or hope we never find new types useful for ENC_STRING.

The types we're making unavailable aren't something that I think make sense to transfer between ruby and erlang, so I don't think anybody will miss them. But even so making an note of the rationale in the code would be useful.

I chose these values because they are used for the offset in the callbacks struct. Would adding a comment and removing the ERL prefix be enough?

The use in the offset makes it a bit trickier. How about we use ERLEXT_ and have a comment saying these are specific to our own v2? My concern here is coming back to the code and having to figure out how the C, ruby and spec match up so leaving a note would be polite :)

Done. I added a comment! :D

The two new types are extensions, so this commit adds a comment documenting what these extensions are for (namely so that we can support string encodings over the wire).

tenderlove added 3 commits April 13, 2016 17:05

move the decoder callbacks to the buf pointer

3520318

this way we can configure the callbacks to something else at runtime.

make BERT aware of string encodings

70fd1ff

This adds an encoding field after the string so that you can apply an encoding to the string sent across the wire.

default encoder to version 1, enable version 2 with a flag

4e78dc4

This commit makes the encoder default to version 1 of the BERT encoding scheme, but allows you to turn on version 2 via a feature flag.

add two new types, unicode strings, and other encoded strings

aa084e7

This commit adds two new types, one for unicode strings and one for other encoded strings. Unocide strings have no extra wire protocol overhead, where "other" strings send the encoding name along with the string.

tenderlove force-pushed the encoding branch from 8b72794 to 46a0e43 Compare April 14, 2016 22:06

tenderlove force-pushed the encoding branch from 46a0e43 to 132b21b Compare April 14, 2016 22:30

reduce diff

31ab659

tenderlove force-pushed the encoding branch from 132b21b to 31ab659 Compare April 14, 2016 22:32

carlosmn reviewed Apr 15, 2016
View reviewed changes

tenderlove mentioned this pull request Apr 15, 2016

Make BERT encoding aware github/bert#1

Merged

add a comment about ERL types

3113c6f

The two new types are extensions, so this commit adds a comment documenting what these extensions are for (namely so that we can support string encodings over the wire).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make the BERT protocol encoding aware #24

Make the BERT protocol encoding aware #24

tenderlove commented Apr 14, 2016

piki commented Apr 14, 2016

brianmario commented Apr 14, 2016

tenderlove commented Apr 14, 2016

piki commented Apr 14, 2016

tenderlove commented Apr 14, 2016

carlosmn commented Apr 14, 2016

tenderlove commented Apr 14, 2016 •

edited

Loading

brianmario commented Apr 14, 2016

piki commented Apr 14, 2016

haileys commented Apr 15, 2016

carlosmn Apr 15, 2016

piki Apr 15, 2016

carlosmn Apr 15, 2016

tenderlove Apr 15, 2016

carlosmn Apr 15, 2016

tenderlove Apr 18, 2016

Make the BERT protocol encoding aware #24

Are you sure you want to change the base?

Make the BERT protocol encoding aware #24

Conversation

tenderlove commented Apr 14, 2016

Decoder

Encoder

piki commented Apr 14, 2016

brianmario commented Apr 14, 2016

tenderlove commented Apr 14, 2016

piki commented Apr 14, 2016

tenderlove commented Apr 14, 2016

carlosmn commented Apr 14, 2016

tenderlove commented Apr 14, 2016 • edited Loading

brianmario commented Apr 14, 2016

piki commented Apr 14, 2016

haileys commented Apr 15, 2016

carlosmn Apr 15, 2016

Choose a reason for hiding this comment

piki Apr 15, 2016

Choose a reason for hiding this comment

carlosmn Apr 15, 2016

Choose a reason for hiding this comment

tenderlove Apr 15, 2016

Choose a reason for hiding this comment

carlosmn Apr 15, 2016

Choose a reason for hiding this comment

tenderlove Apr 18, 2016

Choose a reason for hiding this comment

tenderlove commented Apr 14, 2016 •

edited

Loading