
How to see the training plots? #6

Open
AdamStelmaszczyk opened this issue Jan 30, 2018 · 13 comments

@AdamStelmaszczyk
Contributor

AdamStelmaszczyk commented Jan 30, 2018

Now the log for python run_job.py -n 5 -g 60 -c 12 --simulator_procs 10 --use_sync --name breakout --short looks better. But I don't know how to view the plots.

I noticed that in the experiments/breakout_1517329773.87/storage/atari_trainlog there are 4 directories corresponding to workers and each of them has a TensorBoard events file, e.g. events.out.tfevents.1517329790.p0112.

I downloaded the whole experiments directory with scp and ran tensorboard --logdir=experiments to view it in the browser:

[screenshot: TensorBoard view of the downloaded events files]

After 40 minutes of training on 4 workers I would expect to see the full mean_score plot, not just one data point with value 1.8 at step 100, whereas the log shows that each worker made about 1800 steps.

How to see the training plots?

@tgrel
Contributor

tgrel commented Jan 31, 2018

We inherited the tensorboard code from the original tensorpack implementation of BA3C but never used it. Making it work correctly will require some work.

We never used tensorboard for viewing these plots. We have our own in-house tool for this (https://neptune.ml/). We handle the distributed setup by aggregating all the datapoints from all the workers in a single worker using sockets. The implementation can be found here.

About the mean score plot: in the original TP implementation there was only one worker, which periodically stopped the training to perform evaluation. Obviously, this doesn't scale well in distributed setups, so we changed it: a single worker performs an evaluation for every saved model checkpoint, and evaluations are shut down in all other workers to make them faster. The tensorboard files you're seeing are leftovers of these changes.

Additionally, we had 'online score' plots that showed the scores achieved in the training games (which are played a bit differently from the evaluation games, so the scores aren't exactly the same).

I guess you could try to modify this file so that it sends the results to tensorboard instead of neptune and then you'd have all the results from all workers in a single tensorboard file.
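A minimal sketch of what that could look like, assuming TensorFlow 1.x (the helper name and the log directory below are illustrative, not taken from the repo):

```python
import tensorflow as tf

# One shared writer on the aggregating worker; datapoints from all workers end up here.
writer = tf.summary.FileWriter("experiments/aggregated_trainlog")

def send_to_tensorboard(channel_name, x, y):
    # channel_name mirrors the Neptune channel (e.g. 'online_score'),
    # x is the step (must be cast to an integer), y is the value.
    summary = tf.Summary(value=[tf.Summary.Value(tag=channel_name, simple_value=float(y))])
    writer.add_summary(summary, global_step=int(x))
    writer.flush()
```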

@AdamStelmaszczyk
Contributor Author

I see, thank you.

Additionally, we had 'online score' plots that showed the scores achieved in the training games (which are played a bit differently from the evaluation games, so the scores aren't exactly the same).

Why are they played a bit differently? What are the differences? Isn't that an issue: to train on game A, but evaluate on a slightly different game A', without matching scores?

@tgrel
Contributor

tgrel commented Jan 31, 2018

The policy network assigns a probability to each action that the agent can take. During evaluation you aim to achieve the highest possible score so you just pick the action with the highest probability. In training you want to have exploration, so instead of choosing max, you sample from this distribution in order to take a different action from time to time and thus explore a different trajectory. In most games this yields worse scores (there are exceptions to this rule though) but speeds up the training.
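A minimal sketch of the two modes, assuming the policy network outputs a probability vector over the discrete actions (the function name is illustrative, not from the repo):

```python
import numpy as np

def choose_action(policy_probs, training):
    """Pick an action from the policy output (one probability per action)."""
    policy_probs = np.asarray(policy_probs, dtype=np.float64)
    if training:
        # Training: sample from the distribution so the agent keeps exploring
        # different trajectories from time to time.
        return int(np.random.choice(len(policy_probs), p=policy_probs))
    # Evaluation: act greedily, i.e. take the most probable action.
    return int(np.argmax(policy_probs))
```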

@AdamStelmaszczyk
Contributor Author

That sounds good, thanks.

@AdamStelmaszczyk
Contributor Author

I tried to save stats to TensorBoard event files, but adding that requires more effort, so I switched to the idea of using the Neptune UI.

We never used tensorboard for viewing these plots. We have our own in-house tool for this (https://neptune.ml/). We handle the distributed setup by aggregating all the datapoints from all the workers in a single worker using sockets. The implementation can be found here.

Ok, the code is running a ZeroMQ server that aggregates all the data from the workers. It then logs ##### Sending to neptune: online_score : 0.241437337531 , 1.4 ##### and saves stats to 16 CSV files. How can I view the plots in the Neptune UI? I have a Neptune account.
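For reference, a minimal sketch of that aggregation pattern (the port, message format, and file layout are assumptions, not the repo's actual ones):

```python
import csv
import zmq

context = zmq.Context()
receiver = context.socket(zmq.PULL)
receiver.bind("tcp://*:5555")  # workers connect with PUSH sockets and send datapoints

while True:
    channel, x, y = receiver.recv_pyobj()  # e.g. ('online_score', 0.241437337531, 1.4)
    print("##### Sending to neptune: %s : %s , %s #####" % (channel, x, y))
    with open("%s.csv" % channel, "a") as f:
        csv.writer(f).writerow([x, y])
```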

@tgrel
Contributor

tgrel commented Feb 5, 2018

This project was completed using neptune version 1.4, which is no longer supported. I guess you could port it all to neptune 2.0 (which should not be that hard since the API hasn't changed substantially).

Otherwise you could just plot the CSV files using some other tool like matplotlib.
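For example, a minimal sketch for one of the CSV files (assuming two columns per row, x and y, without a header):

```python
import csv
import matplotlib.pyplot as plt

xs, ys = [], []
with open("mean_score.csv") as f:
    for x, y in csv.reader(f):
        xs.append(float(x))
        ys.append(float(y))

plt.plot(xs, ys)
plt.xlabel("time")
plt.ylabel("mean_score")
plt.show()
```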

@AdamStelmaszczyk
Contributor Author

I added saving the online score to TensorBoard events files: #7.

With this, I tried to reproduce the results on Breakout-v0 and Seaquest-v0 from your research paper:

[figure: baseline plots from the paper]

The plots show only max scores, right? In further comparisons I will assume so. (Because near Table 3 it's written "Best stable score and time (in hours) to achieve it are given".)

Breakout, after 6 hours and ~50k global steps:

[figure: Breakout training plot]

After 22 minutes I got 320 max_score. This matches your plot.

Seaquest, also after 6 hours and ~50k global steps:

[figure: Seaquest training plot]

After 43 minutes I got 1940 and later 1840. You achieved 1840 in 20 minutes. Maybe my training run was unlucky.

@tgrel
Contributor

tgrel commented Feb 6, 2018

All evaluation scores in the paper are mean scores from 50 consecutive games. Please DO NOT compare online scores with evaluation scores since they are achieved using a different algorithm and will be different. If you want to reproduce the results from the paper you'll have to get the evaluation score by using the evaluation node and saving the data from it.

Moreover, this setup requires extensive tuning to run efficiently. Especially important hyperparameters:

  • learning rate: 0.001
  • number of workers: 64
  • number of parameter servers: 4
  • synchronous training
  • epsilon for Adam optimizer: 1e-8
  • local batch size: 32

@tgrel
Contributor

tgrel commented Feb 6, 2018

Also please make sure you're using a TensorFlow version that supports SIMD instructions on the CPU, especially AVX2. We were using the optimized version provided by Intel.
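A quick, Linux-only sketch to check whether the CPU itself advertises AVX2 (whether your TensorFlow binary was actually built to use it is a separate question):

```python
# Assumption: Linux host with /proc/cpuinfo available.
with open("/proc/cpuinfo") as f:
    cpuinfo = f.read()
print("CPU supports AVX2:", "avx2" in cpuinfo)
```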

@tgrel
Contributor

tgrel commented Feb 6, 2018

One more thing -- there's a possibility that due to some race conditions not all of the workers started the computations correctly. Unfortunately the setup is not as stable as I'd like it to be.

There's a data channel called 'active_workers' somewhere; it says how many workers are currently participating in the computations. After a few minutes of training all workers should be working correctly; if not, you can try restarting the training from scratch.

If you'd like to debug the situation to make the training more stable, you can count on my help :)

@AdamStelmaszczyk
Contributor Author

AdamStelmaszczyk commented Feb 6, 2018

All evaluation scores in the paper are mean scores from 50 consecutive games.

This is the mean_score (saved to mean_score.csv and now mean_score in tensorboard). Please correct me if I'm wrong.

Please DO NOT compare online scores with evaluation scores since they are achieved using a different algorithm and will be different.

Sure, I understand the difference, I didn't compare those.

If you want to reproduce the results from the paper you'll have to get the evaluation score by using the evaluation node and saving the data from it.

This is mean_score, as this line is executed. args.eval_node is False; we can't pass --eval_node, otherwise an InvalidArgumentError is raised, see #2.

Especially important hyperparameters:

Oh, in my previous post I forgot to include the full command I used:

python run_job.py -n 71 -g 60 -c 12 -o adam --use_sync -l 0.001 -b 32 --fc_neurons 128 --simulator_procs 10 --ps 4 --fc_init uniform --conv_init normal --fc_splits 4 --epsilon 1e-8 --beta1 0.8 --beta2 0.75 --save_every 1000 -e Seaquest-v0 --name seaquest

So the hyperparams are exactly the same; I took them from the README.

Also please make sure you're using a TensorFlow version that supports SIMD instructions for CPU, especially AVX-2. We were using the optimized version provided by Intel.

Ah, I'm sorry, I missed this completely. With this it will probably be much faster; I'll try it. Thanks!

One more thing -- there's a possibility that due to some race conditions not all of the workers started the computations correctly.

I think I saw this once. One line was repeated over and over in the log and the workers didn't start. After restarting, it was fine.

Should you like to debug the situation to make the training more stable you can count on my help :)

So far it has happened to me once in ~30 runs, so it's not biting me hard enough; I'd rather spend time on other things. I checked active_workers.csv, and it was always 66 or 65.

If you or anybody else would like to debug it, I can offer some guidance here. A good first step would be:

Create a small script that starts about 50 run_job.py instances with 4-10 workers and 1 ps each, pipe their outputs to {1..50} log files, and tail -f all of them to find the faulty one. Paste that full output log here.
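A minimal sketch of such a launcher (the flag values below are illustrative; adjust them to whatever reproduces the hang most often):

```python
import subprocess

procs = []
for i in range(1, 51):
    log = open("run_%02d.log" % i, "w")
    cmd = ["python", "run_job.py", "-n", "4", "--ps", "1",
           "--simulator_procs", "10", "--name", "debug_%02d" % i]
    # Each run writes to its own log; `tail -f run_*.log` then shows which one stalls.
    procs.append(subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT))

for p in procs:
    p.wait()
```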

If I see this error again, I will paste it.

@AdamStelmaszczyk
Contributor Author

With Intel's TensorFlow, same command as before, after 6 hours both runs reached ~150k global steps (3x more than before):

Breakout: mean_score 292 after 28 minutes, quite close to ~330 from the paper, but with a huge drop afterwards:

[figure: Breakout training plot with Intel's TensorFlow]

Seaquest: mean_score 1690 after 23 minutes, also close to the paper's 1840, and the plot shape is similar:

[figure: Seaquest training plot with Intel's TensorFlow]

The differences could perhaps be due to variance between runs, i.e. if I ran the Breakout training more times, maybe one of the plots would be very close to the one from the paper.

@tgrel
Contributor

tgrel commented Feb 8, 2018

The Seaquest results look good to me. The setup for Breakout might be somewhat unstable in the sense that 'catastrophic forgetting' is quite common for this game and learning rate. Also, the training should be a bit faster (our mean time to a mean score of 300, averaged over 10 experiments, was 21 minutes). I suggest running more experiments and seeing if these issues persist.

Please bear in mind that our experiments from the paper did not use any learning rate scheduling. If you're really after getting very high scores for these games, or very stable learning without catastrophes, I suggest you play with the 'schedule_hyper' parameter. The schedule for annealing the learning rate and exploration factor can be found here; by using it you could start with a high learning rate of 0.001 and then drop it after ~15 minutes of training to avoid catastrophes and get much higher scores in some games (especially Breakout).
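As an illustration only (the repo's actual annealing code behind 'schedule_hyper' may look different, and the dropped learning rate value below is an assumption), a wall-clock-based drop could be as simple as:

```python
import time

TRAINING_START = time.time()

def scheduled_learning_rate(high_lr=0.001, low_lr=0.0001, drop_after_minutes=15):
    # Start with a high learning rate, then drop it after ~15 minutes of training
    # to reduce the risk of catastrophic forgetting.
    elapsed_minutes = (time.time() - TRAINING_START) / 60.0
    return high_lr if elapsed_minutes < drop_after_minutes else low_lr
```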
