
Pickle empty file error #197

Open
oliver-lloyd opened this issue Apr 21, 2021 · 4 comments

@oliver-lloyd

oliver-lloyd commented Apr 21, 2021

I have received this same error message from several Ax search jobs spanning multiple models and datasets:

concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/fu19841/miniconda3/envs/libkge/lib/python3.7/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/fu19841/LPComparison/scripts/kge/kge/job/search.py", line 232, in _run_train_job
    raise e
  File "/home/fu19841/LPComparison/scripts/kge/kge/job/search.py", line 131, in _run_train_job
    checkpoint_file, train_job_config.get("job.device")
  File "/home/fu19841/LPComparison/scripts/kge/kge/util/io.py", line 41, in load_checkpoint
    checkpoint = torch.load(checkpoint_file, map_location="cpu")
  File "/home/fu19841/miniconda3/envs/libkge/lib/python3.7/site-packages/torch/serialization.py", line 595, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/fu19841/miniconda3/envs/libkge/lib/python3.7/site-packages/torch/serialization.py", line 764, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/fu19841/miniconda3/envs/libkge/bin/kge", line 33, in <module>
    sys.exit(load_entry_point('libkge', 'console_scripts', 'kge')())
  File "/home/fu19841/LPComparison/scripts/kge/kge/cli.py", line 285, in main
    job.run()
  File "/home/fu19841/LPComparison/scripts/kge/kge/job/job.py", line 159, in run
    result = self._run()
  File "/home/fu19841/LPComparison/scripts/kge/kge/job/search_auto.py", line 162, in _run
    (self, trial_no, config, self.num_trials, list(parameters.keys())),
  File "/home/fu19841/LPComparison/scripts/kge/kge/job/search.py", line 75, in submit_task
    self.wait_task()
  File "/home/fu19841/LPComparison/scripts/kge/kge/job/search.py", line 97, in wait_task
    self.ready_task_results.append(task.result())
  File "/home/fu19841/miniconda3/envs/libkge/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/home/fu19841/miniconda3/envs/libkge/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
EOFError: Ran out of input
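For reference, `EOFError: Ran out of input` is what `pickle` raises when asked to read from a zero-byte stream, which suggests the checkpoint file was created but never written (e.g., a save that was interrupted mid-run). Since `torch.load` begins by delegating a magic-number read to the pickle module, the failure can be reproduced with the stdlib alone; this is a minimal sketch, not LibKGE code:

```python
import pickle
import tempfile

# Create an empty file, like the one an interrupted checkpoint write
# would leave behind.
with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
    empty_path = f.name  # zero bytes written

# torch.load() first calls pickle_module.load(f) to read a magic number;
# on an empty file that read fails exactly as in the traceback above.
try:
    with open(empty_path, "rb") as f:
        pickle.load(f)
except EOFError as e:
    print(f"EOFError: {e}")  # EOFError: Ran out of input
```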

The following is the configuration file for one of the jobs throwing this error:

# fb15k-237-cp-KvsAll-bce
job.type: search
search.type: ax
search.num_workers: 4
dataset.name: fb15k-237

# training settings (fixed)
train:
  max_epochs: 400
  auto_correct: True

# this is faster for smaller datasets, but does not work for some models (e.g.,
# TransE due to a pytorch issue) or for larger datasets. Change to spo in such
# cases (either here or in ax section of model config), results will not be
# affected.
negative_sampling.implementation: sp_po

# validation/evaluation settings (fixed)
valid:
  every: 5
  metric: mean_reciprocal_rank_filtered_with_test
  filter_with_test: True
  early_stopping:
    patience: 10
    min_threshold.epochs: 50
    min_threshold.metric_value: 0.05

eval:
  batch_size: 256
  metrics_per.relation_type: True

# settings for reciprocal relations (if used)
import: [cp, reciprocal_relations_model]
reciprocal_relations_model.base_model.type: cp

# ax settings: hyperparameter search space
ax_search:
  num_trials: 100
  num_sobol_trials: 100
  parameters:
      # model
    - name: model
      type: choice
      values: [cp, reciprocal_relations_model]

    # training hyperparameters
    - name: train.batch_size
      type: choice   
      values: [128, 256, 512, 1024]
      is_ordered: True
    - name: train.type
      type: fixed
      value: KvsAll
    - name: train.optimizer
      type: choice
      values: [Adam, Adagrad]
    - name: train.loss
      type: fixed
      value: bce
    - name: train.optimizer_args.lr     
      type: range
      bounds: [0.0003, 1.0]
      log_scale: True
    - name: train.lr_scheduler
      type: fixed
      value: ReduceLROnPlateau
    - name: train.lr_scheduler_args.mode
      type: fixed
      value: max  
    - name: train.lr_scheduler_args.factor
      type: fixed
      value: 0.95  
    - name: train.lr_scheduler_args.threshold
      type: fixed
      value: 0.0001  
    - name: train.lr_scheduler_args.patience
      type: range
      bounds: [0, 10]  

    # embedding dimension
    - name: lookup_embedder.dim
      type: choice 
      values: [16, 32, 64, 128, 256, 512]
      is_ordered: True

    # embedding initialization
    - name: lookup_embedder.initialize
      type: choice
      values: [xavier_normal_, xavier_uniform_, normal_, uniform_]  
    - name: lookup_embedder.initialize_args.normal_.mean
      type: fixed
      value: 0.0
    - name: lookup_embedder.initialize_args.normal_.std
      type: range
      bounds: [0.00001, 1.0]
      log_scale: True
    - name: lookup_embedder.initialize_args.uniform_.a
      type: range
      bounds: [-1.0, -0.00001]
    - name: lookup_embedder.initialize_args.xavier_uniform_.gain
      type: fixed
      value: 1.0
    - name: lookup_embedder.initialize_args.xavier_normal_.gain
      type: fixed
      value: 1.0

    # embedding regularization
    - name: lookup_embedder.regularize
      type: choice
      values: ['', 'l3', 'l2', 'l1']
      is_ordered: True
    - name: lookup_embedder.regularize_args.weighted
      type: choice
      values: [True, False]
    - name: cp.entity_embedder.regularize_weight
      type: range
      bounds: [1.0e-20, 1.0e-01]
      log_scale: True
    - name: cp.relation_embedder.regularize_weight
      type: range
      bounds: [1.0e-20, 1.0e-01]
      log_scale: True

    # embedding dropout
    - name: cp.entity_embedder.dropout
      type: range
      bounds: [-0.5, 0.5]
    - name: cp.relation_embedder.dropout
      type: range
      bounds: [-0.5, 0.5]

    # training-type specific hyperparameters
    - name: KvsAll.label_smoothing            #train_type: KvsAll
      type: range                             #train_type: KvsAll
      bounds: [-0.3, 0.3]                     #train_type: KvsAll
    # model-specific entries
@rgemulla
Member

This may be some inter-process communication issue. Which operating system is this on? Does the error also arise when you set search.num_workers to 1? Does it always happen (i.e., also on toy data with a much smaller config)?
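As a diagnostic, one could also guard the load with a file-size check so that an empty checkpoint fails with an actionable message instead of the opaque `EOFError`. This is a hypothetical sketch, not LibKGE's actual `load_checkpoint` in `kge/util/io.py`, and plain `pickle` stands in for `torch.load` here:

```python
import os
import pickle
import tempfile

def load_checkpoint_safely(checkpoint_file):
    """Hypothetical guard: raise a clear error instead of
    'EOFError: Ran out of input' when the checkpoint file is empty."""
    if os.path.getsize(checkpoint_file) == 0:
        raise IOError(
            f"Checkpoint {checkpoint_file!r} is empty; "
            "the previous save was probably interrupted."
        )
    with open(checkpoint_file, "rb") as f:
        # LibKGE itself calls torch.load(checkpoint_file, map_location="cpu");
        # pickle.load is used here only to keep the sketch self-contained.
        return pickle.load(f)

# An intact checkpoint loads normally...
with tempfile.NamedTemporaryFile(delete=False) as f:
    pickle.dump({"epoch": 5}, f)
    good = f.name
print(load_checkpoint_safely(good))  # {'epoch': 5}

# ...while an empty one fails fast with a clear message.
with tempfile.NamedTemporaryFile(delete=False) as f:
    bad = f.name
try:
    load_checkpoint_safely(bad)
except IOError as e:
    print(e)
```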

@oliver-lloyd
Author

> This may be some inter-process communication issue. Which operating system is this on? Does the error also arise when you set search.num_workers to 1? Does it always happen (i.e., also on toy data with a much smaller config)?

OS is CentOS 7.

I have submitted jobs to test the other two questions, but they are stuck in a long queue. I can say, however, that multiple other searches have previously completed with num_workers set to 4 and with the same search configuration.

@rgemulla
Member

> multiple other searches have previously completed with num_workers at 4 and with the same search configuration.

In that case, it may be hard for us to figure out what's going on, as we've never seen this issue ourselves. And if things used to work for you but then started failing, the problem may well lie outside of LibKGE. Otherwise, we'd need some way to reproduce the problem in order to investigate.

@oliver-lloyd
Author

That's understandable; I'll report back if I find a way to consistently reproduce the issue.
