Hugging Face Language Models Implementation #2350

Closed
wants to merge 9 commits into from


f4str
Collaborator

@f4str f4str commented Dec 14, 2023

Description

This PR implements language models under a new art.estimators.language_modeling submodule. So far, only Hugging Face language models with a PyTorch back-end are supported. They are exposed through HuggingFaceLanguageModel, a generic estimator that can run basic functionality on any Hugging Face model.

The new estimator takes a Hugging Face model and tokenizer and, until attacks and defenses for language models are implemented, acts as a basic ART wrapper around them. The estimator currently supports only the following tasks:

  • Tokenizing strings into inputs that can be fed into the model
  • Encoding strings to tokens
  • Decoding tokens to strings
  • Running inference on the model using a string input (with auto tokenization)
  • Running text generation on the model using a string input (with auto tokenization)

Inference on the estimator simply returns the output dictionary produced by running inference on the Hugging Face model. The estimator does not yet support training or loss gradients, as these are more complex features. Once this PR is merged, additional issues will be created to implement training and loss gradients, which will be done as separate PRs.
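The behaviour described above can be illustrated with a minimal, dependency-free sketch of the wrapper pattern. This is not the code in this PR: the class name LanguageModelWrapper, its method names, and the ToyTokenizer and ToyModel stand-ins are all hypothetical and exist only so that the example runs without downloading a model. The wrapper only assumes the usual Hugging Face calling conventions (tokenizer(text), tokenizer.encode, tokenizer.decode, model(**inputs), model.generate(**inputs)).

```python
from typing import Any, Dict, List


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a Hugging Face tokenizer."""

    def __init__(self, vocab: List[str]) -> None:
        self.id_to_token = list(vocab)
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}

    def encode(self, text: str) -> List[int]:
        return [self.token_to_id[tok] for tok in text.split()]

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.id_to_token[i] for i in ids)

    def __call__(self, text: str) -> Dict[str, List[int]]:
        ids = self.encode(text)
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}


class ToyModel:
    """Hypothetical model: the next token is always (token id + 1) mod vocab size."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size

    def __call__(self, input_ids: List[int], attention_mask: Any = None) -> Dict[str, Any]:
        # One row of one-hot "logits" per input position.
        logits = []
        for tok in input_ids:
            row = [0.0] * self.vocab_size
            row[(tok + 1) % self.vocab_size] = 1.0
            logits.append(row)
        return {"logits": logits}

    def generate(self, input_ids: List[int], attention_mask: Any = None,
                 max_new_tokens: int = 3) -> List[int]:
        # Greedy continuation; the prompt ids are kept in the output.
        ids = list(input_ids)
        for _ in range(max_new_tokens):
            ids.append((ids[-1] + 1) % self.vocab_size)
        return ids


class LanguageModelWrapper:
    """Hypothetical wrapper; NOT the HuggingFaceLanguageModel API from this PR."""

    def __init__(self, model: Any, tokenizer: Any) -> None:
        self.model = model
        self.tokenizer = tokenizer

    def tokenize(self, text: str) -> Dict[str, Any]:
        return self.tokenizer(text)

    def encode(self, text: str) -> List[int]:
        return self.tokenizer.encode(text)

    def decode(self, ids: List[int]) -> str:
        return self.tokenizer.decode(ids)

    def predict(self, text: str) -> Dict[str, Any]:
        # Auto-tokenize, then return the model's output dictionary unchanged.
        return self.model(**self.tokenize(text))

    def generate(self, text: str, **kwargs: Any) -> str:
        # Auto-tokenize, generate token ids, and decode them back to a string.
        return self.decode(self.model.generate(**self.tokenize(text), **kwargs))

    def fit(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError("Training is not supported yet.")

    def loss_gradient(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError("Loss gradients are not supported yet.")


vocab = ["the", "cat", "sat", "down"]
lm = LanguageModelWrapper(ToyModel(len(vocab)), ToyTokenizer(vocab))

print(lm.encode("the cat"))                      # [0, 1]
print(lm.decode([2, 3]))                         # sat down
print(lm.predict("the cat")["logits"])           # [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(lm.generate("the cat", max_new_tokens=2))  # the cat sat down
```

With a real Hugging Face model and tokenizer, the tokenizer would typically be called with return_tensors="pt" and the model output would hold tensors rather than Python lists. The sketch leaves that out to stay runnable without PyTorch.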

A demo notebook in notebooks/hugging_face_language_model.ipynb was created to illustrate the usage.

Fixes #2336

Type of change

Please check all relevant options.

  • Improvement (non-breaking)
  • Bug fix (non-breaking)
  • New feature (non-breaking)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Testing

Please describe the tests that you ran to verify your changes. Consider listing any relevant details of your test configuration.

  • Unit tests for the HuggingFaceLanguageModel estimator.

Test Configuration:

  • OS
  • Python version
  • ART version or commit number
  • TensorFlow / Keras / PyTorch / MXNet version

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • My changes have been tested using both CPU and GPU devices

@f4str f4str changed the base branch from main to dev_1.17.0 December 14, 2023 02:41
@codecov-commenter

codecov-commenter commented Dec 14, 2023

Codecov Report

Attention: Patch coverage is 34.25414% with 119 lines in your changes missing coverage. Please review.

Project coverage is 77.69%. Comparing base (403623c) to head (38f4429).

Files with missing lines | Patch % | Lines
art/estimators/language_modeling/hugging_face.py | 23.22% | 119 Missing ⚠️
Additional details and impacted files


@@              Coverage Diff               @@
##           dev_1.18.0    #2350      +/-   ##
==============================================
+ Coverage       73.14%   77.69%   +4.55%     
==============================================
  Files             327      330       +3     
  Lines           30205    30386     +181     
  Branches         5589     5634      +45     
==============================================
+ Hits            22094    23609    +1515     
+ Misses           6807     5429    -1378     
- Partials         1304     1348      +44     
Files with missing lines | Coverage Δ
art/estimators/__init__.py | 100.00% <100.00%> (ø)
art/estimators/language_modeling/__init__.py | 100.00% <100.00%> (ø)
art/estimators/language_modeling/language_model.py | 100.00% <100.00%> (ø)
art/estimators/language_modeling/hugging_face.py | 23.22% <23.22%> (ø)

... and 94 files with indirect coverage changes

@f4str f4str marked this pull request as ready for review December 14, 2023 04:59
@beat-buesser beat-buesser self-requested a review December 14, 2023 12:24
@beat-buesser beat-buesser self-assigned this Dec 14, 2023
@f4str f4str force-pushed the hf-language-models branch from 8d7b586 to 38f4429 Compare January 18, 2024 23:53
@f4str f4str changed the base branch from dev_1.17.0 to dev_1.18.0 January 18, 2024 23:53
@jetlime

jetlime commented Jul 8, 2024

Hey @f4str, what's the status of this PR? Should it be picked up?

@beat-buesser beat-buesser deleted the branch Trusted-AI:dev_1.18.0 October 1, 2024 15:57
Development

Successfully merging this pull request may close these issues.

Implement HuggingFace Language Modeling Estimators
4 participants