Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the BERT protocol encoding aware #24

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

tenderlove
Copy link

This PR makes the BERT protocol aware of Ruby encodings. I have updated the Encoder and the Decoder as follows:

Decoder

The decoder now looks for a magic header and changes parsing styles based on the header it finds. If the header is MAGIC, then it knows that the encoder didn't add any encoding information and uses the old decoder. If the header is VERSION2, then it knows that the encoder added string encoding information and it adds the encoding information to the string.

Encoder

The encoder is split in to version 1 and version 2 encoding. It defaults to version 1, and version 2 is enabled with a feature flag. The reason I did this is so that we can deploy the updated bert gem to both the client and the server, then flip the feature flag on for the client, then flip it on for the server. That way we can upgrade without downtime.

Here is an IRB session to demonstrate the encoding support:

[aaron@TC bert (encoding)]$ irb -I lib:ext
irb(main):001:0> require 'bert'
=> true
irb(main):002:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語")).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語"))
=> "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"
irb(main):004:0> BERT::Encode.version = :v2
=> :v2
irb(main):005:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語")).encoding
=> #<Encoding:UTF-8>
irb(main):006:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語"))
=> "日本語"
irb(main):007:0>

/cc @brianmario @tmm1 @charliesome @piki

this way we can configure the callbacks to something else at runtime.
This adds an encoding field after the string so that you can apply an
encoding to the string sent across the wire.
This commit makes the encoder default to version 1 of the BERT encoding
scheme, but allows you to turn on version 2 via a feature flag.
@piki
Copy link

piki commented Apr 14, 2016

This is a good direction. Any measurements for how much of a performance hit we take with any sort of representative workload? This change essentially doubles the network, RAM, and CPU cost of short strings. How much difference does it make in a real RPC scenario?

If performance is at all an issue, we could indicate common encodings (I expect to see a lot of UTF-8) with something cheaper than a second String.

@brianmario
Copy link

If performance is at all an issue, we could indicate common encodings (I expect to see a lot of UTF-8) with something cheaper than a second String.

In mochilo we used genperf to create a perfect hash lookup table mapping Ruby encoding names to integer values. I know using the actual encoding id from ruby itself isn't deterministic (depends on load order or something?) but maybe we could do something similar to mochilo one-time and as Ruby adds new encodings we can add them to the end of the list as new ids?

That way we could use a single byte to represent the encoding and map it to/from the encoding name on the server and client at [de]serialization time? That is, if performance is a big issue here...

@tenderlove
Copy link
Author

It looks like finding an encoding by name is just a hash lookup (given the encoding has been loaded once already), so I don't think we need to worry about the speed of finding the encoding object.

I agree that representing the encoding as an integer would reduce the payload size, but I don't know if it matters. Using MIME would require even more network chatter than this. If we could guarantee that all strings in one payload were of the same type, then we could really reduce data. But since strings in Ruby are allowed to have different encodings, I don't know if that's something we can / want to enforce.

@piki
Copy link

piki commented Apr 14, 2016

It's the number and collective size of allocations that concerns me. Grabbing a list of all the refs in a repository, or all the refs on a particular branch, could be hundreds of thousands of small strings. This change just doubles the serialization, deserialization, and allocation costs. An int would definitely be cheaper.

@tenderlove
Copy link
Author

@piki do you know if payloads usually have homogenous encodings? If stuff is usually UTF-8, maybe we could only add extra encoding information if it's not UTF-8. If strings are usually UTF-8, then we'd only incur overhead for "special" strings.

@carlosmn
Copy link

I would expect 90% of the cases to be utf8 strings, since that's the encoding we use (assume) when we extract Git data via rugged, or to be raw binary like from spawning a child process.

Having a way to specify both of these via a single int should help a lot with the smaller strings (and would still allow us to specify a different encoding if it's neither of these).

This commit adds two new types, one for unicode strings and one for
other encoded strings.  Unocide strings have no extra wire protocol
overhead, where "other" strings send the encoding name along with the
string.
@tenderlove
Copy link
Author

tenderlove commented Apr 14, 2016

Ok, I added two new types to the protocol. One type is for UTF-8 strings, and one type is for "other" encoded strings. Strings tagged as ASCII-8BIT will use the BIN type, so there's no change in protocol overhead for them. The only strings that have additional overhead now are strings that are tagged with some encoding other than UTF8, US-ASCII, and ASCII-8BIT.

Here's a demo of UTF8 strings:

[aaron@TC bert (encoding)]$ irb -I lib
irb(main):001:0> require 'bert'
=> true
irb(main):002:0> BERT::Encoder.encode("日本語").bytesize
=> 15
irb(main):003:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語")).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語"))
=> "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"
irb(main):005:0> BERT::Encode.version = :v2
=> :v2
irb(main):006:0> BERT::Encoder.encode("日本語").bytesize
=> 15
irb(main):007:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語")).encoding
=> #<Encoding:UTF-8>
irb(main):008:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語"))
=> "日本語"
irb(main):009:0>

Binary strings:

[aaron@TC bert (encoding)]$ irb -I lib -rbert
irb(main):001:0> BERT::Encoder.encode("日本語".b).bytesize
=> 15
irb(main):002:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語".b)).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> BERT::Encode.version = :v2
=> :v2
irb(main):004:0> BERT::Encoder.encode("日本語".b).bytesize
=> 15
irb(main):005:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語".b)).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):006:0>

US-ASCII:

[aaron@TC bert (encoding)]$ irb -I lib -rbert
irb(main):001:0> BERT::Encoder.encode("teelo".encode('US-ASCII')).bytesize
=> 11
irb(main):002:0> BERT::Decoder.decode(BERT::Encoder.encode("teelo".encode('US-ASCII'))).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> BERT::Encode.version = :v2
=> :v2
irb(main):004:0> BERT::Encoder.encode("teelo".encode('US-ASCII')).bytesize
=> 11
irb(main):005:0> BERT::Decoder.decode(BERT::Encoder.encode("teelo".encode('US-ASCII'))).encoding
=> #<Encoding:UTF-8>
irb(main):006:0>

And finally Shift_JIS:

[aaron@TC bert (encoding)]$ irb -I lib -rbert
irb(main):001:0> BERT::Encoder.encode("日本語".encode('Shift_JIS')).bytesize
=> 12
irb(main):002:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語".encode('Shift_JIS'))).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> BERT::Encode.version = :v2
=> :v2
irb(main):004:0> BERT::Encoder.encode("日本語".encode('Shift_JIS')).bytesize
=> 26
irb(main):005:0> BERT::Decoder.decode(BERT::Encoder.encode("日本語".encode('Shift_JIS'))).encoding
=> #<Encoding:Shift_JIS>
irb(main):006:0>

Note that US-ASCII gets lifted to UTF8, but I think that doesn't matter, and the Shift_JIS string is a bit longer when encoded because it embeds the encoding name.

@brianmario
Copy link

Nice work!

Note that US-ASCII gets lifted to UTF8, but I think that doesn't matter

👍 to that

@piki
Copy link

piki commented Apr 14, 2016

I like it. 👍

@haileys
Copy link

haileys commented Apr 15, 2016

Looks great ✨

@@ -14,9 +15,12 @@
#define ERL_BIN 109
#define ERL_SMALL_BIGNUM 110
#define ERL_LARGE_BIGNUM 111
#define ERL_ENC_STRING 112
#define ERL_UNICODE_STRING 113

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is the first time we're diverging from erlang's External Term Format. And as said format does have these values (for NEW_FUN_EXT and EXPORT_EXT resp.), I think a comment explaining that we knowingly diverge here for version 2 would be useful.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other (available) values we should use instead?

Adherence to erlang formats isn't important for GitHub's uses of BERT -- which all involve Ruby -- but I could imagine it being an issue if anyone else is still using BERT with erlang.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the online docs are up to date, numbers from 120 onwards are free, but they could probably add new types at any point.

We could reduce the divergence by using 107 (STRING_EXT) for utf8 strings whose length fits within 2 bytes and go with ENC_STRING if they're bigger, but we'd still re-use or hope we never find new types useful for ENC_STRING.

The types we're making unavailable aren't something that I think make sense to transfer between ruby and erlang, so I don't think anybody will miss them. But even so making an note of the rationale in the code would be useful.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose these values because they are used for the offset in the callbacks struct. Would adding a comment and removing the ERL prefix be enough?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use in the offset makes it a bit trickier. How about we use ERLEXT_ and have a comment saying these are specific to our own v2? My concern here is coming back to the code and having to figure out how the C, ruby and spec match up so leaving a note would be polite :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I added a comment! :D

The two new types are extensions, so this commit adds a comment
documenting what these extensions are for (namely so that we can support
string encodings over the wire).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants