
beamsearch.py script is broken #11

Open
msintaha opened this issue Oct 6, 2022 · 16 comments

@msintaha

msintaha commented Oct 6, 2022

Hi @jiang719 @lin-tan

We have been able to train the model, but the inference step fails. When running src/tester/generator.py, we keep getting the same error in beamsearch.py while generating the hypotheses, regardless of whether we run on CPU or GPU. The full traceback is below.

/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/functional.py:1960: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
  File "src/tester/generator.orig.py", line 134, in <module>
    generate_gpt_conut(vocab_file, model_file, input_file, identifier_txt_file, identifier_token_file, output_file, beam_size)
  File "src/tester/generator.py", line 89, in generate_gpt_conut
    generator.generate(output_file)
  File "src/tester/generator.py", line 39, in generate
    hypothesis = self.beamsearch.generate_gpt_conut(sample)
  File "/scratch/st-amesbah-1/cure-debug/cure/src/tester/beamsearch.py", line 570, in generate_gpt_conut
    logits = self.model.decode(
  File "/scratch/st-amesbah-1/cure-debug/cure/src/tester/beamsearch.py", line 114, in decode
    logits = self.model.decoder(
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/scratch/st-amesbah-1/cure-debug/cure/src/tester/../models/gpt_conut.py", line 313, in forward
    embed = share_embed_model.transformer(
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nnashid/.local/lib/python3.8/site-packages/transformers/modeling_openai.py", line 429, in forward
    inputs_embeds = self.tokens_embed(input_ids)
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
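
For context, this IndexError is what torch's embedding lookup raises whenever an input token id is greater than or equal to the number of rows in the embedding table, which is why a vocabulary/model mismatch is the usual suspect. A minimal sketch reproducing it (the 50,057 figure is the vocabulary size of the shared GPT model mentioned later in this thread):

import torch
import torch.nn as nn

# Embedding table with 50,057 rows, like the shared GPT vocabulary.
embed = nn.Embedding(num_embeddings=50057, embedding_dim=768)

ok = torch.tensor([[0, 1, 50056]])   # all ids < 50057
print(embed(ok).shape)               # works: torch.Size([1, 3, 768])

bad = torch.tensor([[0, 1, 50057]])  # id 50057 is out of range
embed(bad)                           # raises IndexError: index out of range in self
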
@nashid

nashid commented Oct 9, 2022

@jiang719 we are stuck with this problem. We trained the GPT-CoNuT model and ran inference, but during inference we keep getting the above error. We would really appreciate your insight into it.

@jiang719
Collaborator

  1. This is likely a problem with the vocabulary. Are you using the pre-trained GPT model I shared when training your own GPT-CoNuT model?

  2. It could also be a problem with the data format. Make sure you follow the three steps in CURE/data/data/prepare_testing_data.py to prepare the test data in the required format.

@nashid

nashid commented Oct 10, 2022

We have trained GPT-CoNuT with our own dataset. My colleague @msintaha has already gone through the dataset-creation steps to make sure we follow the same format, but we will cross-check again on our side.

@jiang719 thanks for your feedback, we really appreciate it.

@jiang719
Collaborator

Regarding the first possible cause: when you trained your own GPT-CoNuT model, did you only change train_file and valid_file in src/trainer/gpt_conut_trainer.py and keep vocab_file and gpt_file unchanged? If so, the model should be fine and the problem is more likely in the test data.

You could share one test instance in the input_file in src/tester/generator.py, and its corresponding line in the identifier_txt_file and identifier_token_file so I can see if it looks correct.

@msintaha
Author

Yes, here you go

input_bpe.txt

app . use ( express . static ( path . join ( _ _ dirname + $STRING$ ) ) ) ; <CTX> var _ = require ( $STRING$ ) ; var express = require ( $STRING$ ) ; var app = express ( ) ; var http = require ( $STRING$ ) . Server ( app ) ; var path = require ( $STRING$ ) ; var io = require ( $STRING$ ) ( http ) ; const PORT = process . env . PORT || $NUMBER$ ; var users = [ ] ; app . use ( express . static ( path . join ( _ _ dirname + $STRING$ ) ) ) ; app . get ( $STRING$ , ( req , res ) = > { res . send ( JSON . stringify ( users ) ) ; } ) ; app . get ( $STRING$ , ( req , res ) = > { res . send ( JSON . stringify ( _ . find ( users , ( user ) = > user . id == == req . query . user CaMeL Id ) ) ) ; } ) ; app . get ( $STRING$ , ( req , res ) = > { console . log ( $STRING$ ) ; res . send CaMeL File ( path . join ( _ _ dirname + $STRING$ ) ) ; } ) ; io . on ( $STRING$ , ( socket ) = > { console . log ( ` a user connected : $ { socket . id } ` ) ; socket . on ( $STRING$ , ( player ) = > { users . push ( player ) ; socket . broadcast . emit ( $STRING$ , users ) ; } ) ; socket . on ( $STRING$ , ( payload ) = > { var user = _ . find ( users , ( user ) = > user . id == == payload . user CaMeL Id ) ; user . life = payload . life ; socket . broadcast . emit ( $STRING$ , users ) ; } ) ; } ) ; http . listen ( PORT , ( ) = > { console . log ( $STRING$ ) ; } ) ;@@ 	

identifier.txt

send if throw 1 return post code http ++ express , Server exports var function router Router static ] msg err extends $NUMBER$ PORT from __dirname ) getElementById > find obj _ 0xffffffff catch io async class type content get JSON options continue document 0x7f push switch || env id use break ! res + body [ listen user connect ; result else PropTypes error for T const 0 typeof sendFile - app Route undefined key payload import on } React req ( connection < while console join e false broadcast i life value users emit length host 0x1f style name set state message do = $STRING$ action userId log await of data in url => socket <<unk>> true module path config node stringify process done new axios query { . player require === :

identifier.tokens

send <SEP> if <SEP> throw <SEP> 1 <SEP> return <SEP> post <SEP> code <SEP> http <SEP> ++ <SEP> express <SEP> , <SEP> Server <SEP> exports <SEP> var <SEP> function <SEP> router <SEP> Router <SEP> static <SEP> ] <SEP> msg <SEP> err <SEP> extends <SEP> $NUMBER$ <SEP> PORT <SEP> from <SEP> _ _ dirname <SEP> ) <SEP> get CaMeL Element CaMeL By CaMeL Id <SEP> > <SEP> find <SEP> obj <SEP> _ <SEP> 0 xffffffff <SEP> catch <SEP> io <SEP> async <SEP> class <SEP> type <SEP> content <SEP> get <SEP> JSON <SEP> options <SEP> continue <SEP> document <SEP> 0 x $NUMBER$ f <SEP> push <SEP> switch <SEP> || <SEP> env <SEP> id <SEP> use <SEP> break <SEP> ! <SEP> res <SEP> + <SEP> body <SEP> [ <SEP> listen <SEP> user <SEP> connect <SEP> ; <SEP> result <SEP> else <SEP> Prop CaMeL Types <SEP> error <SEP> for <SEP> T <SEP> const <SEP> 0 <SEP> typeof <SEP> send CaMeL File <SEP> - <SEP> app <SEP> Route <SEP> undefined <SEP> key <SEP> payload <SEP> import <SEP> on <SEP> } <SEP> React <SEP> req <SEP> ( <SEP> connection <SEP> < <SEP> while <SEP> console <SEP> join <SEP> e <SEP> false <SEP> broadcast <SEP> i <SEP> life <SEP> value <SEP> users <SEP> emit <SEP> length <SEP> host <SEP> 0 x 1 f <SEP> style <SEP> name <SEP> set <SEP> state <SEP> message <SEP> do <SEP> = <SEP> $STRING$ <SEP> action <SEP> user CaMeL Id <SEP> log <SEP> await <SEP> of <SEP> data <SEP> in <SEP> url <SEP> = > <SEP> socket <SEP> <<unk>> <SEP> true <SEP> module <SEP> path <SEP> config <SEP> node <SEP> stringify <SEP> process <SEP> done <SEP> new <SEP> axios <SEP> query <SEP> { <SEP> . <SEP> player <SEP> require <SEP> == == = <SEP> :

@jiang719
Collaborator

@msintaha It looks like you only ran the prepare_cure_input function.

There are two remaining steps:

  1. Run subword-nmt to tokenize these lines into subwords.
  2. Run clean_testing_bpe to finalize the input files.

Please check the readme file under CURE/data/data; the Prepare Test Input section shows the steps. If possible, I recommend integrating these three steps into your own script (a rough sketch follows).
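
A rough sketch of what such an integrated script could look like. The function names prepare_cure_input and clean_testing_bpe come from CURE/data/data/prepare_testing_data.py, but their argument lists are assumptions here, and the file/directory names are placeholders; adapt them to the real signatures in the repository:

import subprocess
# Assumed import; prepare_testing_data.py lives under CURE/data/data.
from prepare_testing_data import prepare_cure_input, clean_testing_bpe

# Step 1: build the raw test inputs (input.txt, identifier.txt, identifier.tokens).
prepare_cure_input('buggy_lines.txt', 'test_dir/')

# Step 2: tokenize into subwords with the same BPE codes used for training.
for src, dst in [('test_dir/input.txt', 'test_dir/input_bpe.txt'),
                 ('test_dir/identifier.tokens', 'test_dir/identifier_bpe.tokens')]:
    subprocess.run(f'subword-nmt apply-bpe -c subword.txt < {src} > {dst}',
                   shell=True, check=True)

# Step 3: finalize the BPE files into the format generator.py expects.
clean_testing_bpe('test_dir/input_bpe.txt', 'test_dir/identifier_bpe.tokens')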

@msintaha
Author

I have actually run those as well, using the generated subword.txt; this is mentioned at the end of the prepare_cure_input script.

@msintaha
Author

First I generated the vocab using
subword-nmt learn-joint-bpe-and-vocab --input training_tokenize.txt -s 50000 -o subword.txt --write-vocabulary vocabulary.txt

Then I ran:

subword-nmt apply-bpe -c subword.txt < training_tokenize.txt > training_bpe.txt
subword-nmt apply-bpe -c subword.txt < input.txt > input_bpe.txt
subword-nmt apply-bpe -c subword.txt < validation_tokenize.txt > validation_bpe.txt
subword-nmt apply-bpe -c subword.txt < identifier.tokens > identifier_bpe.tokens

@jiang719
Collaborator

> First I generated the vocab using subword-nmt learn-joint-bpe-and-vocab --input training_tokenize.txt -s 50000 -o subword.txt --write-vocabulary vocabulary.txt
>
> Then I ran:
>
> subword-nmt apply-bpe -c subword.txt < training_tokenize.txt > training_bpe.txt
> subword-nmt apply-bpe -c subword.txt < input.txt > input_bpe.txt
> subword-nmt apply-bpe -c subword.txt < validation_tokenize.txt > validation_bpe.txt
> subword-nmt apply-bpe -c subword.txt < identifier.tokens > identifier_bpe.tokens

Then you should have a file called identifier_bpe.tokens, which should not contain <SEP>; that is the input to generator.py.

But now I assume the problem is the vocabulary: since you trained your own subword-nmt model, the vocabulary file also changed. How many unique lines do you have in your own vocabulary.txt?

If you change the vocabulary file, you will need to re-train the GPT first (i.e., re-train a new Hugging Face GPT model), since the one I shared only recognizes the 50,057-entry vocabulary in data/vocabulary/vocabulary.txt. If your new vocabulary file contains more entries, that will cause the index-out-of-range error.
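
One way to sanity-check this on the data side is a small sketch like the one below. It assumes the token is the first field on each line of vocabulary.txt (subword-nmt writes "token count" per line) and that the BPE test files sit in the current directory; file names are the ones from the commands above:

# Print the test-time vocabulary size (it must match the embedding table the
# GPT checkpoint was trained with) and flag any test subwords outside it.
with open('vocabulary.txt') as f:
    vocab = [line.split()[0] for line in f if line.strip()]
print('vocabulary size:', len(vocab))   # the shared GPT expects 50,057 entries

known = set(vocab)
for name in ('input_bpe.txt', 'identifier_bpe.tokens'):
    with open(name) as f:
        for lineno, line in enumerate(f, 1):
            missing = [tok for tok in line.split() if tok not in known]
            if missing:
                print(f'{name}:{lineno} tokens not in vocabulary:', missing[:10])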

@msintaha
Author

msintaha commented Oct 12, 2022

We have 46,247 lines in the vocabulary.txt. And yes, the generated identifier_bpe.tokens file does not contain <SEP>.

@jiang719
Collaborator

That looks reasonable. Could you wrap the call to generate_gpt_conut in a try/except and see if it crashes for every input or just some?

Another possibility is that the input exceeds the maximum length (1024 tokens) set for the GPT model, but that would only make the long inputs crash.
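
For reference, a minimal sketch of such a wrapper around the beam-search call from generator.py; generate_safely is a hypothetical helper, and beamsearch stands in for the BeamSearch object generator.py already builds:

def generate_safely(beamsearch, samples):
    # Run beam search per sample, logging failures instead of aborting the run,
    # so you can see whether every input fails or only some (e.g. over-long ones).
    hypotheses = []
    for i, sample in enumerate(samples):
        try:
            hypotheses.append(beamsearch.generate_gpt_conut(sample))
        except IndexError as e:
            print(f'sample {i} failed: {e}')
    return hypotheses
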

@msintaha
Author

The maximum input length is within 1022 tokens. We have wrapped it in a try/except block, and it crashes on all the inputs.

@ozzydong

Hi @lin-tan, I just cloned your code and tried to run it with the pre-trained model you shared, but it always fails with the following error.
D:\Python\python.exe E:/cure/CURE/src/tester/generator.py
50061
Traceback (most recent call last):
  File "E:/cure/CURE/src/tester/generator.py", line 135, in <module>
    generate_gpt_conut(vocab_file, model_file, input_file, identifier_txt_file, identifier_token_file, output_file, beam_size)
  File "E:/cure/CURE/src/tester/generator.py", line 63, in generate_gpt_conut
    model_file, map_location='cpu'
  File "D:\Python\lib\site-packages\torch\serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "D:\Python\lib\site-packages\torch\serialization.py", line 930, in _legacy_load
    result = unpickler.load()
  File "D:\Python\lib\site-packages\torch\serialization.py", line 746, in find_class
    return super().find_class(mod_name, name)
ModuleNotFoundError: No module named 'transformers.configuration_openai'
I would really appreciate your insight into this error. :)
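
One likely cause, as an assumption not confirmed in this thread: the checkpoint was pickled under a transformers release older than 4.0, where transformers.configuration_openai and transformers.modeling_openai existed at the top level; in transformers >= 4.0 those modules moved under transformers.models.openai, so unpickling fails. Either pin an older transformers release, or alias the old module paths before the torch.load call in generator.py, roughly like this:

import sys
import transformers.models.openai.configuration_openai as configuration_openai
import transformers.models.openai.modeling_openai as modeling_openai

# Make the pre-4.0 module paths resolvable so the legacy pickle can find them,
# then load the checkpoint as generator.py already does.
sys.modules['transformers.configuration_openai'] = configuration_openai
sys.modules['transformers.modeling_openai'] = modeling_openai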

@studypython33

@ozzydong I also met the same problem. Have you solved it? Thank you.

@BaiGeiQiShi

@studypython33 I met the same problem too. Have you solved it? Thanks in advance.

