This does not match behavior of Huggingface's Python version #18
Also, I believe you're misunderstanding token_type_ids (see https://huggingface.co/transformers/v3.2.0/glossary.html#token-type-ids): "Some models, like XLNetModel use an additional token represented by a 2." So for BERT it is normally 0 or 1. You are just returning the 0, 1, 2, 3, ... index of each string in the submitted array, which is not what token_type_ids represents. If you look at my Python example below, you will see the output is all 0 for token_type_ids.
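For reference, here is a minimal C# sketch of what token_type_ids (segment ids) conventionally mean for BERT-style models. This is an illustrative, hypothetical helper, not part of this library's API: every token of a single sentence gets 0, and only the second sentence of a sentence pair gets 1.

```csharp
// Illustrative sketch (hypothetical helper, not part of BertTokenizers):
// BERT segment ids mark which sentence of a pair a token belongs to.
// For a single sentence every position is 0; for a pair, positions after
// the first [SEP] (the second sentence and its trailing [SEP]) are 1.
static long[] BuildTokenTypeIds(int firstSegmentLength, int totalLength)
{
    var ids = new long[totalLength]; // defaults to all 0
    for (int i = firstSegmentLength; i < totalLength; i++)
        ids[i] = 1; // second segment of a sentence pair
    return ids;
}

// Single sentence of 8 tokens -> 0,0,0,0,0,0,0,0 (matches the Python output below)
// Pair of 4 + 4 tokens        -> 0,0,0,0,1,1,1,1
```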
Easy to fix #1
So in my (limited) experience, the C# version's Encode method accepts one sentence at a time; I couldn't get it to work with multiple sentences at once. I found this a helpful resource as well: https://onnxruntime.ai/docs/tutorials/csharp/bert-nlp-csharp-console-app.html. Although I agree with your sentiment, I feel like this one piece is holding me back from really diving into using the models in C#.
Normally you do not even need token_type_ids in C#. They are only needed for training, which is easier done in Python; you will have a hard time training in C#. Inference is much easier, since you do not need loss functions or optimizers, and you want it to be fast, so you are probably better off running it in C# than in Python.
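To illustrate how little is needed for inference, here is a hedged sketch using Microsoft.ML.OnnxRuntime. The model path and the input names (input_ids, attention_mask, token_type_ids) are assumptions that depend on how the model was exported to ONNX:

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;        // NuGet: Microsoft.ML.OnnxRuntime
using Microsoft.ML.OnnxRuntime.Tensors;

// Assumed: a BERT-style model exported to ONNX with these input names.
using var session = new InferenceSession("model.onnx");

long[] inputIds = { 101, 2577, 2003, 1996, 2190, 2711, 2412, 102 };
long[] attentionMask = { 1, 1, 1, 1, 1, 1, 1, 1 };
long[] tokenTypeIds = new long[inputIds.Length]; // all 0 for a single sentence

var shape = new[] { 1, inputIds.Length }; // batch of 1
var inputs = new List<NamedOnnxValue>
{
    NamedOnnxValue.CreateFromTensor("input_ids", new DenseTensor<long>(inputIds, shape)),
    NamedOnnxValue.CreateFromTensor("attention_mask", new DenseTensor<long>(attentionMask, shape)),
    NamedOnnxValue.CreateFromTensor("token_type_ids", new DenseTensor<long>(tokenTypeIds, shape)),
};

using var results = session.Run(inputs); // no loss functions or optimizers needed
var lastHiddenState = results.First().AsTensor<float>();
```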
Not sure if you've experienced this as well, but the Tokenize method also hangs when presented with irregular text. For example, this input will cause the Tokenize method to hang and ultimately produce an OOM exception:
Obviously an extreme example, but the same occurs for English text with accent marks (think someone's name), etc. Currently trying to find a way to address this.
Made some improvements to the TokenizeSubwords method in TokenizerBase to improve general resiliency, starting from this signature: `private IEnumerable<(string Token, int VocabularyIndex)> TokenizeSubwords(string word)`. A sketch of the idea is below.
Will try to get an MR opened up for this. Are you looking into fixing this issue as well @gevorgter?
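For anyone hitting the same hang: the failure mode is a WordPiece loop that makes no progress when no vocabulary prefix matches the remaining characters. Here is a hedged sketch of the idea, assuming a `_vocabulary` dictionary mapping token strings to indices and the standard `##` continuation prefix; it is not the library's actual implementation:

```csharp
// Assumed context: private readonly Dictionary<string, int> _vocabulary;
// Sketch only: guarantees forward progress by shrinking the candidate and
// falling back to [UNK] for the whole word when nothing matches, instead of
// retrying forever (the behavior that caused the hang and eventual OOM).
private IEnumerable<(string Token, int VocabularyIndex)> TokenizeSubwords(string word)
{
    var results = new List<(string Token, int VocabularyIndex)>();
    int start = 0;
    while (start < word.Length)
    {
        string match = null;
        int matchIndex = -1;
        for (int end = word.Length; end > start; end--)
        {
            var candidate = word.Substring(start, end - start);
            if (start > 0) candidate = "##" + candidate; // WordPiece continuation marker
            if (_vocabulary.TryGetValue(candidate, out var index))
            {
                match = candidate;
                matchIndex = index;
                start = end; // consume the matched characters
                break;
            }
        }
        if (match is null)
            return new[] { ("[UNK]", _vocabulary["[UNK]"]) }; // bail out, never loop
        results.Add((match, matchIndex));
    }
    return results;
}
```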
What's the status on these? Is there a PR or a fork with the changes? I'm working with the NuGet package. After a fair bit of wrapper work, I got it generating 384-dimensional vectors from the ONNX export of the all-MiniLM-L6-v2 model. However, the vectors do not match the ones produced in Python. I'm wondering whether the difference comes from the tokenizer, from ONNX, or from my code.
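One common reason vectors from a raw ONNX export of all-MiniLM-L6-v2 don't match Python: sentence-transformers applies attention-mask-weighted mean pooling over last_hidden_state and then L2-normalizes, so that post-processing has to be replicated on the C# side. A hedged sketch, assuming a single sequence with hidden size 384 flattened row-major:

```csharp
using System;
using System.Linq;

// Sketch: replicate sentence-transformers' pooling for all-MiniLM-L6-v2.
// Assumes lastHiddenState is [seqLen, 384] flattened row-major for one sequence.
static float[] MeanPoolAndNormalize(float[] lastHiddenState, long[] attentionMask, int seqLen, int hidden = 384)
{
    var pooled = new float[hidden];
    int realTokens = 0;
    for (int t = 0; t < seqLen; t++)
    {
        if (attentionMask[t] == 0) continue; // ignore padding positions
        realTokens++;
        for (int h = 0; h < hidden; h++)
            pooled[h] += lastHiddenState[t * hidden + h];
    }
    for (int h = 0; h < hidden; h++)
        pooled[h] /= Math.Max(realTokens, 1);

    // L2-normalize, as sentence-transformers does for this model
    var norm = Math.Sqrt(pooled.Sum(x => (double)x * x));
    for (int h = 0; h < hidden; h++)
        pooled[h] = (float)(pooled[h] / norm);
    return pooled;
}
```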
From @gevorgter's ticket NMZivkovic#18.
Inspired by this library (thanks for the great work @NMZivkovic!), I built FastBertTokenizer (NuGet). I invested in ensuring that the encoding results of FastBertTokenizer match those of Hugging Face's Python and Rust tokenizers in all practical cases. Unit tests verify this against a 15k-document Wikipedia corpus that includes right-to-left languages, exotic scripts, and more. Aside from edge cases around characters that cannot be encoded using a given vocabulary (e.g. when encountering multiple Assamese characters in a row, my implementation sometimes emits [UNK] more often than Hugging Face would; conversely, Hugging Face sometimes emits [UNK] for words in Greek letters even though they could be encoded with the given vocabulary, which my implementation does), I'm able to achieve equal encodings. If you give it a try and stumble upon an issue, please let me know. Note that while my implementation does support batched encoding, it does not return a ...
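For context, FastBertTokenizer's README showed usage along these lines at the time this thread was written; treat the exact API (e.g. `LoadFromHuggingFaceAsync`) as something to verify against the current package docs:

```csharp
using System;
using FastBertTokenizer;

var tok = new BertTokenizer();
// Downloads the vocabulary/config for the given Hugging Face model id.
await tok.LoadFromHuggingFaceAsync("bert-base-uncased");

var (inputIds, attentionMask, tokenTypeIds) = tok.Encode("Lorem ipsum dolor sit amet.");
Console.WriteLine(string.Join(", ", inputIds.ToArray()));
```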
Nice. I had used a Hugging Face microservice instead, but your package should help the C# ML ecosystem.
I would expect the tokenizer's behavior to match the Python version; otherwise it will be hard to convert samples from Python to .NET.
Example. Python code:

```python
from transformers import BertTokenizer

MODEL_NAME = "distilbert-base-uncased"

sentence1 = "George Is the best person ever"
sentence2 = "George Is the best person ever"
sentence3 = "George Is the best person ever"
sentences = [sentence1, sentence2, sentence3]

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
train_encodings = tokenizer(sentences, truncation=True, padding=True, max_length=512)
print(train_encodings)
```

Output:

```python
{'input_ids': [[101, 2577, 2003, 1996, 2190, 2711, 2412, 102], [101, 2577, 2003, 1996, 2190, 2711, 2412, 102], [101, 2577, 2003, 1996, 2190, 2711, 2412, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}
```
C# code:

```csharp
var tokenizer = new BertUncasedBaseTokenizer();

var sentence1 = "George Is the best person ever";
var sentence2 = "George Is the best person ever";
var sentence3 = "George Is the best person ever";
var sentences = new string[] { sentence1, sentence2, sentence3 };

var encoded = tokenizer.Encode(30, sentences);
```

The type of `encoded` is `System.Collections.Generic.List<(long InputIds, long TokenTypeIds, long AttentionMask)>` (Count = 30). I would have expected `List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>>`.
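Given the flat return type shown above, a hedged workaround (relying on Encode accepting a `params string[]` argument, so it can be called with a single sentence) is to encode one sentence at a time to recover the nested structure:

```csharp
using System.Linq;

// Workaround sketch: call Encode per sentence so each sentence keeps its own
// (InputIds, TokenTypeIds, AttentionMask) list instead of one flattened list.
var perSentence = sentences
    .Select(s => tokenizer.Encode(30, s))
    .ToList();
// perSentence is List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>>
```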