Make the BERT protocol encoding aware #24
Conversation
This way we can configure the callbacks to something else at runtime.
This adds an encoding field after the string so that you can apply an encoding to the string sent across the wire.
This commit makes the encoder default to version 1 of the BERT encoding scheme, but allows you to turn on version 2 via a feature flag.
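To make the idea concrete, here is a minimal pure-Ruby sketch of that framing. It is illustrative only, not the gem's actual serializer: the string's bytes are followed by a length-prefixed encoding name, and the decoder re-applies the encoding with force_encoding.

```ruby
# Illustrative framing only: length-prefixed bytes, then a length-prefixed
# encoding name -- roughly the shape of "an encoding field after the string".
def encode_with_encoding(str)
  name = str.encoding.name
  [str.bytesize, str.b, name.bytesize, name].pack("Na*Na*")
end

def decode_with_encoding(payload)
  len      = payload.unpack1("N")
  bytes    = payload.byteslice(4, len)
  name_len = payload.byteslice(4 + len, 4).unpack1("N")
  name     = payload.byteslice(8 + len, name_len)
  bytes.force_encoding(name)
end

round_tripped = decode_with_encoding(encode_with_encoding("héllo"))
round_tripped.encoding # => #<Encoding:UTF-8>
```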
This is a good direction. Any measurements for how much of a performance hit we take with any sort of representative workload? This change essentially doubles the network, RAM, and CPU cost of short strings. How much difference does it make in a real RPC scenario? If performance is at all an issue, we could indicate common encodings (I expect to see a lot of …)
In mochilo we handled this with a single byte to represent the encoding, mapping it to/from the encoding name on the server and client at [de]serialization time. Could we do the same here? That is, if performance is a big issue here...
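A hedged sketch of that idea — the table below is hypothetical, not mochilo's actual mapping — showing how a one-byte ID could stand in for the encoding name:

```ruby
# Hypothetical one-byte encoding IDs; client and server would have to agree
# on this table so the encoding name never travels over the wire.
ENCODING_IDS = {
  "UTF-8"      => 0,
  "ASCII-8BIT" => 1,
  "US-ASCII"   => 2,
  "Shift_JIS"  => 3,
}.freeze
ENCODING_NAMES = ENCODING_IDS.invert.freeze

def encoding_id(str)
  ENCODING_IDS.fetch(str.encoding.name)    # serialize: name -> one byte
end

def encoding_for(id)
  Encoding.find(ENCODING_NAMES.fetch(id))  # deserialize: byte -> Encoding
end

encoding_id("表".encode("Shift_JIS"))  # => 3
encoding_for(3)                        # => #<Encoding:Shift_JIS>
```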
It looks like finding an encoding by name is just a hash lookup (given the encoding has been loaded once already), so I don't think we need to worry about the speed of finding the encoding object. I agree that representing the encoding as an integer would reduce the payload size, but I don't know if it matters. Using MIME would require even more network chatter than this. If we could guarantee that all strings in one payload were of the same type, then we could really reduce data. But since strings in Ruby are allowed to have different encodings, I don't know if that's something we can / want to enforce.
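For reference, this is the lookup being described; once an encoding has been loaded, looking it up by name with Encoding.find is cheap and returns the same singleton object each time:

```ruby
enc = Encoding.find("Shift_JIS")        # looks up the encoding object by name
"foo".force_encoding(enc).encoding      # => #<Encoding:Shift_JIS>
Encoding.find("Shift_JIS").equal?(enc)  # => true, same object every time
```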
It's the number and collective size of allocations that concerns me. Grabbing a list of all the refs in a repository, or all the refs on a particular branch, could be hundreds of thousands of small strings. This change just doubles the serialization, deserialization, and allocation costs. An int would definitely be cheaper.
@piki do you know if payloads usually have homogeneous encodings? If stuff is usually UTF-8, maybe we could only add extra encoding information if it's not UTF-8. If strings are usually UTF-8, then we'd only incur overhead for "special" strings.
I would expect 90% of the cases to be utf8 strings, since that's the encoding we use (assume) when we extract Git data via rugged, or to be raw binary like from spawning a child process. Having a way to specify both of these via a single int should help a lot with the smaller strings (and would still allow us to specify a different encoding if it's neither of these).
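A sketch of the split being suggested (the symbols here are placeholders, not the final protocol constants): UTF-8-compatible and binary strings get compact types with no name on the wire, and anything else pays for its encoding name.

```ruby
# Placeholder type tags; the real protocol uses numeric type bytes.
def string_type_for(str)
  case str.encoding
  when Encoding::UTF_8, Encoding::US_ASCII
    :unicode_string                      # no extra overhead on the wire
  when Encoding::ASCII_8BIT
    :bin                                 # no extra overhead on the wire
  else
    [:enc_string, str.encoding.name]     # carries the encoding name
  end
end

string_type_for("refs/heads/master")       # => :unicode_string
string_type_for("\xDE\xAD\xBE\xEF".b)      # => :bin
string_type_for("表".encode("Shift_JIS"))  # => [:enc_string, "Shift_JIS"]
```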
This commit adds two new types, one for unicode strings and one for other encoded strings. Unicode strings have no extra wire protocol overhead, while "other" strings send the encoding name along with the string.
Ok, I added two new types to the protocol. One type is for UTF-8 strings, and one type is for "other" encoded strings. Strings tagged as ASCII-8BIT will use the BIN type, so there's no change in protocol overhead for them. The only strings that have additional overhead now are strings that are tagged with some encoding other than UTF8, US-ASCII, and ASCII-8BIT. Here's a demo of UTF8 strings:
Binary strings:
US-ASCII:
And finally Shift_JIS:
Note that US-ASCII gets lifted to UTF8, but I think that doesn't matter, and the Shift_JIS string is a bit longer when encoded because it embeds the encoding name.
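A rough reconstruction of the kind of demo referred to above, assuming a hypothetical `BERT::Encode.version = 2` switch for the v2 encoder (the real flag in this PR may be named differently):

```ruby
require 'bert'

BERT::Encode.version = 2  # hypothetical switch for the encoding-aware format

utf8  = BERT.encode(["hello"])                     # UTF-8: no extra bytes
bin   = BERT.encode(["hello".b])                   # ASCII-8BIT: BIN type, no extra bytes
ascii = BERT.encode(["hello".encode("US-ASCII")])  # lifted to the UTF-8 type
sjis  = BERT.encode(["表".encode("Shift_JIS")])    # carries "Shift_JIS" on the wire

[utf8.bytesize, bin.bytesize, ascii.bytesize, sjis.bytesize]
# the first three stay as small as before; only the Shift_JIS payload grows,
# by roughly the length of the encoding name
```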
Nice work!
👍 to that
I like it. 👍
Looks great ✨
```diff
@@ -14,9 +15,12 @@
 #define ERL_BIN 109
 #define ERL_SMALL_BIGNUM 110
 #define ERL_LARGE_BIGNUM 111
+#define ERL_ENC_STRING 112
+#define ERL_UNICODE_STRING 113
```
IIUC this is the first time we're diverging from erlang's External Term Format. And as said format does have these values (for NEW_FUN_EXT and EXPORT_EXT resp.), I think a comment explaining that we knowingly diverge here for version 2 would be useful.
Are there other (available) values we should use instead?
Adherence to erlang formats isn't important for GitHub's uses of BERT -- which all involve Ruby -- but I could imagine it being an issue if anyone else is still using BERT with erlang.
Assuming the online docs are up to date, numbers from 120 onwards are free, but they could probably add new types at any point. We could reduce the divergence by using 107 (STRING_EXT) for utf8 strings whose length fits within 2 bytes and go with ENC_STRING if they're bigger, but we'd still be re-using a value for ENC_STRING, or hoping no new type ever lands on it.
The types we're making unavailable aren't something that I think makes sense to transfer between ruby and erlang, so I don't think anybody will miss them. But even so, making a note of the rationale in the code would be useful.
I chose these values because they are used for the offset in the callbacks struct. Would adding a comment and removing the ERL prefix be enough?
The use in the offset makes it a bit trickier. How about we use ERLEXT_ and have a comment saying these are specific to our own v2? My concern here is coming back to the code and having to figure out how the C, Ruby, and spec match up, so leaving a note would be polite :)
Done. I added a comment! :D
The two new types are extensions, so this commit adds a comment documenting what these extensions are for (namely so that we can support string encodings over the wire).
This PR makes the BERT protocol aware of Ruby encodings. I have updated the Encoder and the Decoder as follows:
Decoder

The decoder now looks for a magic header and changes parsing styles based on the header it finds. If the header is MAGIC, then it knows that the encoder didn't add any encoding information and uses the old decoder. If the header is VERSION2, then it knows that the encoder added string encoding information, and it adds the encoding information to the string.

Encoder

The encoder is split into version 1 and version 2 encoding. It defaults to version 1, and version 2 is enabled with a feature flag. The reason I did this is so that we can deploy the updated bert gem to both the client and the server, then flip the feature flag on for the client, then flip it on for the server. That way we can upgrade without downtime.
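A sketch of the decoder-side dispatch described above; MAGIC and VERSION2 come from the description, but the VERSION2 byte value and the per-version decode helpers are illustrative stand-ins, not the names in the diff.

```ruby
MAGIC    = 131  # Erlang's external term format version byte, the original header
VERSION2 = 132  # stand-in value for the new encoding-aware header

def decode(payload)
  # Peek at the first byte of the payload and pick a parsing style.
  case payload.getbyte(0)
  when MAGIC    then decode_v1(payload)  # old format: no encoding information
  when VERSION2 then decode_v2(payload)  # new format: strings carry their encoding
  else raise ArgumentError, "unknown BERT header byte: #{payload.getbyte(0)}"
  end
end
# decode_v1 / decode_v2 stand for the old and new parsers respectively.
```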
Here is an IRB session to demonstrate the encoding support:
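Roughly, the kind of session this refers to, again with a hypothetical `BERT::Encode.version = 2` switch standing in for the real feature flag:

```ruby
require 'bert'

BERT::Encode.version = 2  # hypothetical flag; the PR gates v2 behind a feature flag

sjis = "表".encode("Shift_JIS")
BERT.decode(BERT.encode(sjis)).encoding     # => #<Encoding:Shift_JIS>
BERT.decode(BERT.encode("hello")).encoding  # => #<Encoding:UTF-8>

# Payloads produced by the old (v1) encoder still decode, because the decoder
# dispatches on the magic header rather than assuming a single format.
```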
/cc @brianmario @tmm1 @charliesome @piki