Merge pull request #14 from NMZivkovic/feature/custom-vocab
Classes for custom vocabulary
Showing 5 changed files with 30 additions and 4 deletions.
@@ -67,7 +67,7 @@ While working with BERT Models from Huggingface in combination with ML.NET, I st
I documented them in [here](https://rubikscode.net/2021/10/25/using-huggingface-transformers-with-ml-net/).</br>
However, the biggest challenge by far was that I needed to implement my own tokenizer and pair them with the correct vocabulary.
So, I decided to extend it and publish my implementation as a NuGet package and an open-source project.
More info about this project can be found in this [blog post](https://rubikscode.net/2021/11/01/bert-tokenizers-for-ml-net/)
More info about this project can be found in this [blog post](https://rubikscode.net/2021/11/01/bert-tokenizers-for-ml-net/). </br>

This repository contains tokenizers for following models:<br />
· BERT Base<br />
@@ -77,6 +77,8 @@ This repository contains tokenizers for following models:<br />
· BERT Base Uncased<br />
· BERT Large Uncased<br />

There are also classes that let you load your own vocabulary.

<p align="right">(<a href="#top">back to top</a>)</p>

### Built With
@@ -194,6 +196,7 @@ [email protected]</br>
## Acknowledgments

* Gianluca Bertani - Performance Improvements
* [Paul Calot](https://github.com/PaulCalot) - First Token bugfix

<p align="right">(<a href="#top">back to top</a>)</p>

@@ -0,0 +1,11 @@
using BERTTokenizers.Base;

namespace BERTTokenizers
{
    public class BertUnasedCustomVocabulary : CasedTokenizer // "Unased" is presumably a typo for "Uncased"; note that the base class is CasedTokenizer
    {
        public BertUnasedCustomVocabulary(string vocabularyFilePath) : base(vocabularyFilePath) { }

    }

}
@@ -0,0 +1,10 @@
using BERTTokenizers.Base;

namespace BERTTokenizers
{
    public class BertCasedCustomVocabulary : CasedTokenizer
    {
        public BertCasedCustomVocabulary(string vocabularyFilePath) : base(vocabularyFilePath) { }

    }
}
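
The two classes added in this commit only expose a constructor that takes a vocabulary file path, so a usage sketch can stay small. The example below assumes the BERTTokenizers NuGet package is referenced by the project. The path `Vocabularies/my_vocab.txt` and the class name `CustomVocabularyExample` are hypothetical and chosen only for illustration. No tokenizer methods are called, because the diff does not show which members the base class exposes.

```csharp
using System;
using System.IO;
using BERTTokenizers;

public static class CustomVocabularyExample
{
    public static void Main()
    {
        // Hypothetical location of a custom vocabulary file; replace with your own.
        var vocabularyFilePath = Path.Combine("Vocabularies", "my_vocab.txt");

        // Check for the file up front so a wrong path gives a clear message.
        if (!File.Exists(vocabularyFilePath))
        {
            Console.Error.WriteLine($"Vocabulary file not found: {vocabularyFilePath}");
            return;
        }

        // Constructor signature taken from the diff: a single vocabulary file path.
        var tokenizer = new BertCasedCustomVocabulary(vocabularyFilePath);

        Console.WriteLine($"Created {tokenizer.GetType().Name} from {vocabularyFilePath}");
    }
}
```

As committed, `BertUnasedCustomVocabulary` takes the same constructor argument and derives from the same base class, so it can be swapped into this sketch without other changes.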