
Pickle empty file error #197

Open
oliver-lloyd opened this issue Apr 21, 2021 · 4 comments

@oliver-lloyd

oliver-lloyd commented Apr 21, 2021

I have received this same error message from several Ax search jobs spanning multiple models and datasets:

concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/fu19841/miniconda3/envs/libkge/lib/python3.7/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/fu19841/LPComparison/scripts/kge/kge/job/search.py", line 232, in _run_train_job
    raise e
  File "/home/fu19841/LPComparison/scripts/kge/kge/job/search.py", line 131, in _run_train_job
    checkpoint_file, train_job_config.get("job.device")
  File "/home/fu19841/LPComparison/scripts/kge/kge/util/io.py", line 41, in load_checkpoint
    checkpoint = torch.load(checkpoint_file, map_location="cpu")
  File "/home/fu19841/miniconda3/envs/libkge/lib/python3.7/site-packages/torch/serialization.py", line 595, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/fu19841/miniconda3/envs/libkge/lib/python3.7/site-packages/torch/serialization.py", line 764, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/fu19841/miniconda3/envs/libkge/bin/kge", line 33, in <module>
    sys.exit(load_entry_point('libkge', 'console_scripts', 'kge')())
  File "/home/fu19841/LPComparison/scripts/kge/kge/cli.py", line 285, in main
    job.run()
  File "/home/fu19841/LPComparison/scripts/kge/kge/job/job.py", line 159, in run
    result = self._run()
  File "/home/fu19841/LPComparison/scripts/kge/kge/job/search_auto.py", line 162, in _run
    (self, trial_no, config, self.num_trials, list(parameters.keys())),
  File "/home/fu19841/LPComparison/scripts/kge/kge/job/search.py", line 75, in submit_task
    self.wait_task()
  File "/home/fu19841/LPComparison/scripts/kge/kge/job/search.py", line 97, in wait_task
    self.ready_task_results.append(task.result())
  File "/home/fu19841/miniconda3/envs/libkge/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/home/fu19841/miniconda3/envs/libkge/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
EOFError: Ran out of input
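For reference, `EOFError: Ran out of input` is what `pickle` raises when asked to read from a zero-byte stream, which suggests the checkpoint file was created but never written (e.g., a save that was interrupted mid-run). Since `torch.load` begins by delegating a magic-number read to the pickle module, the failure can be reproduced with the stdlib alone; this is a minimal sketch, not LibKGE code:

```python
import pickle
import tempfile

# Create an empty file, like the one an interrupted checkpoint write
# would leave behind.
with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
    empty_path = f.name  # zero bytes written

# torch.load() first calls pickle_module.load(f) to read a magic number;
# on an empty file that read fails exactly as in the traceback above.
try:
    with open(empty_path, "rb") as f:
        pickle.load(f)
except EOFError as e:
    print(f"EOFError: {e}")  # EOFError: Ran out of input
```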

The following is the configuration file for one of the jobs throwing this error:

# fb15k-237-cp-KvsAll-bce
job.type: search
search.type: ax
search.num_workers: 4
dataset.name: fb15k-237

# training settings (fixed)
train:
  max_epochs: 400
  auto_correct: True

# this is faster for smaller datasets, but does not work for some models (e.g.,
# TransE due to a pytorch issue) or for larger datasets. Change to spo in such
# cases (either here or in ax section of model config), results will not be
# affected.
negative_sampling.implementation: sp_po

# validation/evaluation settings (fixed)
valid:
  every: 5
  metric: mean_reciprocal_rank_filtered_with_test
  filter_with_test: True
  early_stopping:
    patience: 10
    min_threshold.epochs: 50
    min_threshold.metric_value: 0.05

eval:
  batch_size: 256
  metrics_per.relation_type: True

# settings for reciprocal relations (if used)
import: [cp, reciprocal_relations_model]
reciprocal_relations_model.base_model.type: cp

# ax settings: hyperparameter search space
ax_search:
  num_trials: 100
  num_sobol_trials: 100
  parameters:
      # model
    - name: model
      type: choice
      values: [cp, reciprocal_relations_model]

    # training hyperparameters
    - name: train.batch_size
      type: choice   
      values: [128, 256, 512, 1024]
      is_ordered: True
    - name: train.type
      type: fixed
      value: KvsAll
    - name: train.optimizer
      type: choice
      values: [Adam, Adagrad]
    - name: train.loss
      type: fixed
      value: bce
    - name: train.optimizer_args.lr     
      type: range
      bounds: [0.0003, 1.0]
      log_scale: True
    - name: train.lr_scheduler
      type: fixed
      value: ReduceLROnPlateau
    - name: train.lr_scheduler_args.mode
      type: fixed
      value: max  
    - name: train.lr_scheduler_args.factor
      type: fixed
      value: 0.95  
    - name: train.lr_scheduler_args.threshold
      type: fixed
      value: 0.0001  
    - name: train.lr_scheduler_args.patience
      type: range
      bounds: [0, 10]  

    # embedding dimension
    - name: lookup_embedder.dim
      type: choice 
      values: [16, 32, 64, 128, 256, 512]
      is_ordered: True

    # embedding initialization
    - name: lookup_embedder.initialize
      type: choice
      values: [xavier_normal_, xavier_uniform_, normal_, uniform_]  
    - name: lookup_embedder.initialize_args.normal_.mean
      type: fixed
      value: 0.0
    - name: lookup_embedder.initialize_args.normal_.std
      type: range
      bounds: [0.00001, 1.0]
      log_scale: True
    - name: lookup_embedder.initialize_args.uniform_.a
      type: range
      bounds: [-1.0, -0.00001]
    - name: lookup_embedder.initialize_args.xavier_uniform_.gain
      type: fixed
      value: 1.0
    - name: lookup_embedder.initialize_args.xavier_normal_.gain
      type: fixed
      value: 1.0

    # embedding regularization
    - name: lookup_embedder.regularize
      type: choice
      values: ['', 'l3', 'l2', 'l1']
      is_ordered: True
    - name: lookup_embedder.regularize_args.weighted
      type: choice
      values: [True, False]
    - name: cp.entity_embedder.regularize_weight
      type: range
      bounds: [1.0e-20, 1.0e-01]
      log_scale: True
    - name: cp.relation_embedder.regularize_weight
      type: range
      bounds: [1.0e-20, 1.0e-01]
      log_scale: True

    # embedding dropout
    - name: cp.entity_embedder.dropout
      type: range
      bounds: [-0.5, 0.5]
    - name: cp.relation_embedder.dropout
      type: range
      bounds: [-0.5, 0.5]

    # training-type specific hyperparameters
    - name: KvsAll.label_smoothing            #train_type: KvsAll
      type: range                             #train_type: KvsAll
      bounds: [-0.3, 0.3]                     #train_type: KvsAll
    # model-specific entries
@rgemulla
Member

This may be some inter-process communication issue. Which operating system is this on? Does the error also arise when you set search.num_workers to 1? Does it always happen (i.e., also on toy data with a much smaller config)?
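As a diagnostic, one could also guard the load with a file-size check so that an empty checkpoint fails with an actionable message instead of the opaque `EOFError`. This is a hypothetical sketch, not LibKGE's actual `load_checkpoint` in `kge/util/io.py`, and plain `pickle` stands in for `torch.load` here:

```python
import os
import pickle
import tempfile

def load_checkpoint_safely(checkpoint_file):
    """Hypothetical guard: raise a clear error instead of
    'EOFError: Ran out of input' when the checkpoint file is empty."""
    if os.path.getsize(checkpoint_file) == 0:
        raise IOError(
            f"Checkpoint {checkpoint_file!r} is empty; "
            "the previous save was probably interrupted."
        )
    with open(checkpoint_file, "rb") as f:
        # LibKGE itself calls torch.load(checkpoint_file, map_location="cpu");
        # pickle.load is used here only to keep the sketch self-contained.
        return pickle.load(f)

# An intact checkpoint loads normally...
with tempfile.NamedTemporaryFile(delete=False) as f:
    pickle.dump({"epoch": 5}, f)
    good = f.name
print(load_checkpoint_safely(good))  # {'epoch': 5}

# ...while an empty one fails fast with a clear message.
with tempfile.NamedTemporaryFile(delete=False) as f:
    bad = f.name
try:
    load_checkpoint_safely(bad)
except IOError as e:
    print(e)
```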

@oliver-lloyd
Author

> This may be some inter-process communication issue. Which operating system is this on? Does the error also arise when you set search.num_workers to 1? Does it always happen (i.e., also on toy data with a much smaller config)?

OS is CentOS 7.

I have submitted jobs to test the other two questions, but they are stuck in a long queue. I can say, however, that multiple other searches have previously completed with num_workers set to 4 and with the same search configuration.

@rgemulla
Member

> multiple other searches have previously completed with num_workers at 4 and with the same search configuration.

In that case, it may be hard for us to figure out what's going on, as we've never seen this issue ourselves. And if things used to work for you but then started failing, the problem may well lie outside of LibKGE. Otherwise, we'd need some way to reproduce the problem in order to investigate.

@oliver-lloyd
Author

That's understandable; I'll report back if I find a way to consistently reproduce the issue.
