
beamsearch.py script is broken #11

Open
msintaha opened this issue Oct 6, 2022 · 16 comments

@msintaha

msintaha commented Oct 6, 2022

Hi @jiang719 @lin-tan

We have been able to train the model, but the inference step fails. When running src/tester/generator.py, we keep getting the same error in beamsearch.py while generating the hypotheses, regardless of whether we run on CPU or GPU. The full traceback is below.

/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/functional.py:1960: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
  File "src/tester/generator.orig.py", line 134, in <module>
    generate_gpt_conut(vocab_file, model_file, input_file, identifier_txt_file, identifier_token_file, output_file, beam_size)
  File "src/tester/generator.py", line 89, in generate_gpt_conut
    generator.generate(output_file)
  File "src/tester/generator.py", line 39, in generate
    hypothesis = self.beamsearch.generate_gpt_conut(sample)
  File "/scratch/st-amesbah-1/cure-debug/cure/src/tester/beamsearch.py", line 570, in generate_gpt_conut
    logits = self.model.decode(
  File "/scratch/st-amesbah-1/cure-debug/cure/src/tester/beamsearch.py", line 114, in decode
    logits = self.model.decoder(
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/scratch/st-amesbah-1/cure-debug/cure/src/tester/../models/gpt_conut.py", line 313, in forward
    embed = share_embed_model.transformer(
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nnashid/.local/lib/python3.8/site-packages/transformers/modeling_openai.py", line 429, in forward
    inputs_embeds = self.tokens_embed(input_ids)
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
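
For context, this IndexError is what torch's embedding lookup raises whenever an input token id is greater than or equal to the number of rows in the embedding table, which is why a vocabulary/model mismatch is the usual suspect. A minimal sketch reproducing it (the 50,057 figure is the vocabulary size of the shared GPT model mentioned later in this thread):

import torch
import torch.nn as nn

# Embedding table with 50,057 rows, like the shared GPT vocabulary.
embed = nn.Embedding(num_embeddings=50057, embedding_dim=768)

ok = torch.tensor([[0, 1, 50056]])   # all ids < 50057
print(embed(ok).shape)               # works: torch.Size([1, 3, 768])

bad = torch.tensor([[0, 1, 50057]])  # id 50057 is out of range
embed(bad)                           # raises IndexError: index out of range in self
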
@nashid

nashid commented Oct 9, 2022

@jiang719 we are stuck with this problem. We trained the GPT-CoNuT model and ran inference, but during inference we keep getting the above error. We would really appreciate your insight into it.

@jiang719
Collaborator

  1. This is likely a problem with the vocabulary. Are you using the pre-trained GPT model I shared when training your own GPT-CoNuT model?

  2. It could also be a problem with the data format. Make sure you follow the three steps in CURE/data/data/prepare_testing_data.py to prepare the test data in the required format.

@nashid

nashid commented Oct 10, 2022

We have trained GPT-CoNuT with our own dataset. My colleague @msintaha has already gone through the dataset-creation steps to make sure we follow the same format, but we will cross-check again on our side.

@jiang719 thanks for your feedback, we really appreciate it.

@jiang719
Collaborator

Regarding the first possible cause: when you trained your own GPT-CoNuT model, did you only change train_file and valid_file in src/trainer/gpt_conut_trainer.py and keep vocab_file and gpt_file unchanged? If so, the model should be fine and the problem is more likely in the test data.

You could share one test instance in the input_file in src/tester/generator.py, and its corresponding line in the identifier_txt_file and identifier_token_file so I can see if it looks correct.

@msintaha
Author

Yes, here you go

input_bpe.txt

app . use ( express . static ( path . join ( _ _ dirname + $STRING$ ) ) ) ; <CTX> var _ = require ( $STRING$ ) ; var express = require ( $STRING$ ) ; var app = express ( ) ; var http = require ( $STRING$ ) . Server ( app ) ; var path = require ( $STRING$ ) ; var io = require ( $STRING$ ) ( http ) ; const PORT = process . env . PORT || $NUMBER$ ; var users = [ ] ; app . use ( express . static ( path . join ( _ _ dirname + $STRING$ ) ) ) ; app . get ( $STRING$ , ( req , res ) = > { res . send ( JSON . stringify ( users ) ) ; } ) ; app . get ( $STRING$ , ( req , res ) = > { res . send ( JSON . stringify ( _ . find ( users , ( user ) = > user . id == == req . query . user CaMeL Id ) ) ) ; } ) ; app . get ( $STRING$ , ( req , res ) = > { console . log ( $STRING$ ) ; res . send CaMeL File ( path . join ( _ _ dirname + $STRING$ ) ) ; } ) ; io . on ( $STRING$ , ( socket ) = > { console . log ( ` a user connected : $ { socket . id } ` ) ; socket . on ( $STRING$ , ( player ) = > { users . push ( player ) ; socket . broadcast . emit ( $STRING$ , users ) ; } ) ; socket . on ( $STRING$ , ( payload ) = > { var user = _ . find ( users , ( user ) = > user . id == == payload . user CaMeL Id ) ; user . life = payload . life ; socket . broadcast . emit ( $STRING$ , users ) ; } ) ; } ) ; http . listen ( PORT , ( ) = > { console . log ( $STRING$ ) ; } ) ;@@ 	

identifier.txt

send if throw 1 return post code http ++ express , Server exports var function router Router static ] msg err extends $NUMBER$ PORT from __dirname ) getElementById > find obj _ 0xffffffff catch io async class type content get JSON options continue document 0x7f push switch || env id use break ! res + body [ listen user connect ; result else PropTypes error for T const 0 typeof sendFile - app Route undefined key payload import on } React req ( connection < while console join e false broadcast i life value users emit length host 0x1f style name set state message do = $STRING$ action userId log await of data in url => socket <<unk>> true module path config node stringify process done new axios query { . player require === :

identifier.tokens

send <SEP> if <SEP> throw <SEP> 1 <SEP> return <SEP> post <SEP> code <SEP> http <SEP> ++ <SEP> express <SEP> , <SEP> Server <SEP> exports <SEP> var <SEP> function <SEP> router <SEP> Router <SEP> static <SEP> ] <SEP> msg <SEP> err <SEP> extends <SEP> $NUMBER$ <SEP> PORT <SEP> from <SEP> _ _ dirname <SEP> ) <SEP> get CaMeL Element CaMeL By CaMeL Id <SEP> > <SEP> find <SEP> obj <SEP> _ <SEP> 0 xffffffff <SEP> catch <SEP> io <SEP> async <SEP> class <SEP> type <SEP> content <SEP> get <SEP> JSON <SEP> options <SEP> continue <SEP> document <SEP> 0 x $NUMBER$ f <SEP> push <SEP> switch <SEP> || <SEP> env <SEP> id <SEP> use <SEP> break <SEP> ! <SEP> res <SEP> + <SEP> body <SEP> [ <SEP> listen <SEP> user <SEP> connect <SEP> ; <SEP> result <SEP> else <SEP> Prop CaMeL Types <SEP> error <SEP> for <SEP> T <SEP> const <SEP> 0 <SEP> typeof <SEP> send CaMeL File <SEP> - <SEP> app <SEP> Route <SEP> undefined <SEP> key <SEP> payload <SEP> import <SEP> on <SEP> } <SEP> React <SEP> req <SEP> ( <SEP> connection <SEP> < <SEP> while <SEP> console <SEP> join <SEP> e <SEP> false <SEP> broadcast <SEP> i <SEP> life <SEP> value <SEP> users <SEP> emit <SEP> length <SEP> host <SEP> 0 x 1 f <SEP> style <SEP> name <SEP> set <SEP> state <SEP> message <SEP> do <SEP> = <SEP> $STRING$ <SEP> action <SEP> user CaMeL Id <SEP> log <SEP> await <SEP> of <SEP> data <SEP> in <SEP> url <SEP> = > <SEP> socket <SEP> <<unk>> <SEP> true <SEP> module <SEP> path <SEP> config <SEP> node <SEP> stringify <SEP> process <SEP> done <SEP> new <SEP> axios <SEP> query <SEP> { <SEP> . <SEP> player <SEP> require <SEP> == == = <SEP> :

@jiang719
Collaborator

@msintaha It looks like you only ran the prepare_cure_input function.

There are two remaining steps:

  1. Run subword-nmt to tokenize these lines into subwords.
  2. Run clean_testing_bpe to finalize the input files.

Please check the readme file under CURE/data/data; the Prepare Test Input section shows the steps. If possible, I recommend integrating these three steps into your own script (a rough sketch follows).
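
A rough sketch of what such an integrated script could look like. The function names prepare_cure_input and clean_testing_bpe come from CURE/data/data/prepare_testing_data.py, but their argument lists are assumptions here, and the file/directory names are placeholders; adapt them to the real signatures in the repository:

import subprocess
# Assumed import; prepare_testing_data.py lives under CURE/data/data.
from prepare_testing_data import prepare_cure_input, clean_testing_bpe

# Step 1: build the raw test inputs (input.txt, identifier.txt, identifier.tokens).
prepare_cure_input('buggy_lines.txt', 'test_dir/')

# Step 2: tokenize into subwords with the same BPE codes used for training.
for src, dst in [('test_dir/input.txt', 'test_dir/input_bpe.txt'),
                 ('test_dir/identifier.tokens', 'test_dir/identifier_bpe.tokens')]:
    subprocess.run(f'subword-nmt apply-bpe -c subword.txt < {src} > {dst}',
                   shell=True, check=True)

# Step 3: finalize the BPE files into the format generator.py expects.
clean_testing_bpe('test_dir/input_bpe.txt', 'test_dir/identifier_bpe.tokens')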

@msintaha
Author

I have actually run those as well, using the generated subword.txt; this is mentioned at the end of the prepare_cure_input script.

@msintaha
Author

First I generated the vocab using
subword-nmt learn-joint-bpe-and-vocab --input training_tokenize.txt -s 50000 -o subword.txt --write-vocabulary vocabulary.txt

Then I ran:

subword-nmt apply-bpe -c subword.txt < training_tokenize.txt > training_bpe.txt
subword-nmt apply-bpe -c subword.txt < input.txt > input_bpe.txt
subword-nmt apply-bpe -c subword.txt < validation_tokenize.txt > validation_bpe.txt
subword-nmt apply-bpe -c subword.txt < identifier.tokens > identifier_bpe.tokens

@jiang719
Collaborator

> First I generated the vocab using subword-nmt learn-joint-bpe-and-vocab --input training_tokenize.txt -s 50000 -o subword.txt --write-vocabulary vocabulary.txt
>
> Then I ran:
>
> subword-nmt apply-bpe -c subword.txt < training_tokenize.txt > training_bpe.txt
> subword-nmt apply-bpe -c subword.txt < input.txt > input_bpe.txt
> subword-nmt apply-bpe -c subword.txt < validation_tokenize.txt > validation_bpe.txt
> subword-nmt apply-bpe -c subword.txt < identifier.tokens > identifier_bpe.tokens

Then you should have a file called identifier_bpe.tokens, which should not contain <SEP>; that is the input to generator.py.

But now I assume the problem is the vocabulary: since you trained your own subword-nmt model, the vocabulary file also changed. How many unique lines do you have in your own vocabulary.txt?

If you change the vocabulary file, you will need to re-train the GPT first (i.e., re-train a new Hugging Face GPT model), since the one I shared only recognizes the 50,057-entry vocabulary in data/vocabulary/vocabulary.txt. If your new vocabulary file contains more entries, that will cause the index-out-of-range error.
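
One way to sanity-check this on the data side is a small sketch like the one below. It assumes the token is the first field on each line of vocabulary.txt (subword-nmt writes "token count" per line) and that the BPE test files sit in the current directory; file names are the ones from the commands above:

# Print the test-time vocabulary size (it must match the embedding table the
# GPT checkpoint was trained with) and flag any test subwords outside it.
with open('vocabulary.txt') as f:
    vocab = [line.split()[0] for line in f if line.strip()]
print('vocabulary size:', len(vocab))   # the shared GPT expects 50,057 entries

known = set(vocab)
for name in ('input_bpe.txt', 'identifier_bpe.tokens'):
    with open(name) as f:
        for lineno, line in enumerate(f, 1):
            missing = [tok for tok in line.split() if tok not in known]
            if missing:
                print(f'{name}:{lineno} tokens not in vocabulary:', missing[:10])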

@msintaha
Author

msintaha commented Oct 12, 2022

We have 46,247 lines in the vocabulary.txt. And yes, the generated identifier_bpe.tokens file does not contain <SEP>.

@jiang719
Collaborator

That looks reasonable. Could you wrap the call to generate_gpt_conut in a try/except and see if it crashes for every input or just some?

Another possibility is that the input exceeds the maximum length (1024 tokens) set for the GPT model, but that would only make the long inputs crash.
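
For reference, a minimal sketch of such a wrapper around the beam-search call from generator.py; generate_safely is a hypothetical helper, and beamsearch stands in for the BeamSearch object generator.py already builds:

def generate_safely(beamsearch, samples):
    # Run beam search per sample, logging failures instead of aborting the run,
    # so you can see whether every input fails or only some (e.g. over-long ones).
    hypotheses = []
    for i, sample in enumerate(samples):
        try:
            hypotheses.append(beamsearch.generate_gpt_conut(sample))
        except IndexError as e:
            print(f'sample {i} failed: {e}')
    return hypotheses
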

@msintaha
Author

The maximum input length is within 1022 tokens. We have wrapped it in a try/except block, and it crashes on all the inputs.

@ozzydong

Hi @lin-tan, I just cloned your code and tried to run it with the pre-trained model you shared, but it always fails with the following error.
D:\Python\python.exe E:/cure/CURE/src/tester/generator.py
50061
Traceback (most recent call last):
  File "E:/cure/CURE/src/tester/generator.py", line 135, in <module>
    generate_gpt_conut(vocab_file, model_file, input_file, identifier_txt_file, identifier_token_file, output_file, beam_size)
  File "E:/cure/CURE/src/tester/generator.py", line 63, in generate_gpt_conut
    model_file, map_location='cpu'
  File "D:\Python\lib\site-packages\torch\serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "D:\Python\lib\site-packages\torch\serialization.py", line 930, in _legacy_load
    result = unpickler.load()
  File "D:\Python\lib\site-packages\torch\serialization.py", line 746, in find_class
    return super().find_class(mod_name, name)
ModuleNotFoundError: No module named 'transformers.configuration_openai'
I would really appreciate your insight into this error. :)
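
One likely cause, as an assumption not confirmed in this thread: the checkpoint was pickled under a transformers release older than 4.0, where transformers.configuration_openai and transformers.modeling_openai existed at the top level; in transformers >= 4.0 those modules moved under transformers.models.openai, so unpickling fails. Either pin an older transformers release, or alias the old module paths before the torch.load call in generator.py, roughly like this:

import sys
import transformers.models.openai.configuration_openai as configuration_openai
import transformers.models.openai.modeling_openai as modeling_openai

# Make the pre-4.0 module paths resolvable so the legacy pickle can find them,
# then load the checkpoint as generator.py already does.
sys.modules['transformers.configuration_openai'] = configuration_openai
sys.modules['transformers.modeling_openai'] = modeling_openai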

@studypython33

@ozzydong I also met the same problem. Have you solved it? Thank you.

@BaiGeiQiShi

@studypython33 I met the same problem too. Have you solved it? Thanks in advance.

