Suggestion - Redesign RayEnvWorker for Improved Performance #1172
Comments
Thank you @destin-v! I'll look into it on the weekend and will run some performance tests.
@MischaPanch, in order to get optimal results from using Ray I recommend you provide a vectorized environment to each of the Ray remote workers. When running the vectorized environments, issue the remote `step`/`reset` calls without blocking, then gather all of the Ray reference objects (sometimes called futures) into a list and perform a single `ray.get` on that list. If done properly, you will see a speedup because the number of serializations/deserializations will decrease and your utilization of the CPU will increase. Let me know if you need any help.
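For illustration, a minimal sketch of this pattern (illustrative only; `VecEnvWorker`, the environment choice, and the worker/env counts are hypothetical, not part of Tianshou's API):

```python
import gymnasium as gym
import ray

ray.init()

@ray.remote
class VecEnvWorker:
    """One Ray actor that owns a whole batch of environments."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        # Step every local env and return the whole batch at once, so the
        # head node pays one round-trip per worker instead of per env.
        return [env.step(a) for env, a in zip(self.envs, actions)]

workers = [
    VecEnvWorker.remote([lambda: gym.make("CartPole-v1")] * 4)
    for _ in range(2)
]
ray.get([w.reset.remote() for w in workers])

# Gather all futures into a list, then dereference them with a single ray.get.
futures = [w.step.remote([0, 0, 0, 0]) for w in workers]
results = ray.get(futures)
```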
Ok, I looked into this in more detail. You are right, the design is suboptimal. There is, however, the overall question of whether the Ray worker is important for Tianshou at all. The only reason I see for using it is if one wants to start training on a multi-node cluster. This is currently not the main focus of Tianshou; there are many more important problems to fix before that. Several practical algorithms like SAC cannot be arbitrarily parallelized anyway. There's also RLlib for a tight integration with Ray (I had extremely bad experiences with it, but still). Solving this issue would require quite some work. Is this feature important for you? Are you using multi-node training, and if not, is the current subproc vectorized env not sufficient for you?
Related to #1133
I am interested in Tianshou's RayVecEnv because I have access to a multi-node computing infrastructure, and I would be willing to contribute a new design for RayVecEnv. Due to multiple priorities I may not get to it in the near term. But looking at the code for RayVecEnv, I think it will require a complete rewrite to get a double-buffering solution. RLlib is heavily abstracted and inherits from many classes that are not needed for RL (e.g., Tune). This has led to coupling issues where bugs have propagated across RLlib, making it difficult to fix or debug. So even though RLlib provides multi-node support, it comes with its own unique problems.
@destin-v fully agree; in fact, RLlib's large number of problems is the main reason I am investing significant effort into Tianshou. I feel that Tianshou might strike the right balance: useful abstractions that are still not overwhelming for researchers, without being as brittle and bug-prone as RLlib. I also agree that RayVecEnv needs to be redesigned from the ground up, which is precisely why I was reluctant to do it myself now. If you want to collaborate on this, I'm happy to discuss, review, and participate to some extent. Let me know when you have time and let's come back to this issue then.
Description

A set of benchmarks for the vectorized environment workers. (The experiment setup, hypothesis, and results tables are not reproduced here.)

Discussion

These results show that environments with slow steps (>1 sec) will greatly benefit from a distributed `RayEnvWorker`. Environments like Cartpole step very fast (<0.0001 sec), meaning there is likely no benefit to retrieving a single step of Cartpole from a distributed core under the current design.
@destin-v thank you for this very thorough and clear evaluation! I will include these results in the docs for the next release, if you don't mind. It is unfortunate that the overhead is so large. Envs with a single step per second are kind of doomed: even with simple envs, agents need roughly 1e6 steps to converge to anything, and highly parallelizable algos will need even more. So having such a slow env in the first place is pretty much a no-go. Some very slow envs might be around 0.1 seconds per step, but at least for research purposes even that is way too slow. Your results just confirm to me that optimizing for multi-node scenarios should not be a focus of Tianshou for now, do you agree? I'm not saying it's generally irrelevant, just that in most situations a single node seems the better way to go. Clouds offer VMs with 96 cores, and with AMD CPUs one can get even more.
Or do you think that with the redesign allowing aggregation we would get significant benefits from multi-node? Most off-policy algos don't aggregate all that much before performing updates in the main process, but some high-throughput algorithms do (IMPALA, Ape-X). We could consider implementing them together with a redesigned Ray worker. In this scenario, multi-node might become useful, and better supporting it would make some sense. EDIT: after looking at the whole conversation again, I am even more convinced that this is not purely a worker-redesign issue but rather tightly coupled to algorithms and their parametrizations. Not all parametrizations of all algos would be able to make use of either of the options that you presented. In fact, option one is very close to the IMPALA algo, if I'm not mistaken.
I agree that any redesign of `RayVecEnv` will require quite some work. I think scaling to multiple nodes gives users the ability to push the boundaries on state-of-the-art performance. At the moment Tianshou is designed for vertical scaling (scaling resources on a single node), but horizontal scaling (scaling resources across multiple nodes) will likely provide the best opportunities for improving performance. I am still thinking through the best way to achieve a good solution.

Note

One upgrade that may provide a speed boost to all algorithms is double-buffered sampling. This simply has the worker perform a batch of steps on the environment while waiting for the next inference; a sketch is shown below.
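A minimal sketch of double-buffered sampling with Ray (illustrative only; `BufferedWorker`, the random-action stand-in for a local policy, and the batch sizes are all hypothetical):

```python
import gymnasium as gym
import ray

ray.init()

@ray.remote
class BufferedWorker:
    """Steps its environment in batches so the head node can overlap work."""

    def __init__(self, env_fn, batch_size):
        self.env = env_fn()
        self.batch_size = batch_size
        self.obs, _ = self.env.reset()

    def rollout(self):
        # Perform a whole batch of steps locally; a real worker would act
        # with a local policy copy instead of random actions.
        transitions = []
        for _ in range(self.batch_size):
            action = self.env.action_space.sample()
            obs, rew, term, trunc, _ = self.env.step(action)
            transitions.append((self.obs, action, rew, obs, term or trunc))
            self.obs = self.env.reset()[0] if (term or trunc) else obs
        return transitions

worker = BufferedWorker.remote(lambda: gym.make("CartPole-v1"), 32)

# Double buffering: keep one rollout in flight while the head node consumes
# (e.g. runs inference or updates on) the previously finished one.
in_flight = worker.rollout.remote()
for _ in range(10):
    next_batch = worker.rollout.remote()  # launch the next batch immediately
    batch = ray.get(in_flight)            # consume the finished batch here
    in_flight = next_batch
```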
I generally agree. We can start looking into this and implementing some multi-node things. However, it's not going to be as simple as changing the workers. The Collector right now is not able to deal with receiving a batch of steps. Before attempting a suitable Collector, it will likely be necessary to separate an EpisodeCollector out of the current logic (which can collect both steps and episodes) in order to deal with batched workers, since special care needs to be taken when an episode is finished. Moreover, we currently don't even have a proper Collector interface... I propose the following:
Independently of that, the Tianshou core team is working on including automated benchmarking for algos, which will help evaluate the results of step 3. Once those are firmly established, we can move it out of experimental. Wdyt @destin-v?
Sounds great, I'll take a look at the double-buffered sampling and send a pull request when it is ready.
I have been busy with papers and a conference, but I have not forgotten about this topic. Once I get through my presentations I will have more time to devote to this. My initial experiments have confirmed that buffering improves the steps per second (SPS).
I checked with my university employer, and it appears that there are restrictions that prevent me from directly contributing to an open-source repository. However, I am able to publish papers and open-source code as a researcher for my university. I am working on a white paper describing ways to improve throughput in multi-node settings. When the paper is ready for release, I'll share more details.
+ [ ] exception-raising bug
+ [ ] RL algorithm bug
+ [ ] documentation request (i.e. "X is missing from the documentation.")
+ [x] new feature request
+ [x] design request (i.e. "X should be changed to Y.")
Description
The issue with the `RayEnvWorker` is that it is almost guaranteed to be slower than the other vectorized environments because of how it is designed. Tianshou's `RayEnvWorker` creates an environment on each Ray worker and calls `send()` and `recv()` on the environments to `step` or `reset`.

When a Ray worker receives a `step` or `reset` command, it generates an object reference which is sent back to the head node. At this point `RayEnvWorker` calls `ray.get` to dereference the object. This process involves a serialization every time you send an object and a deserialization every time you dereference an object. Hence, the communication cost is high, but the actual processing done on the worker is trivial (recall that we are just issuing a `step` or `reset` command).

The bottom line is that `RayEnvWorker` should be rewritten to heavily utilize the CPU and minimize communication costs.
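A simplified sketch of this design (illustrative only, not the actual Tianshou source): one Ray actor per environment, with a blocking `ray.get` round-trip for every single step.

```python
import gymnasium as gym
import ray

ray.init()

@ray.remote
class EnvActor:
    """One Ray actor per environment, mirroring the per-step design."""

    def __init__(self, env_fn):
        self.env = env_fn()

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, rew, term, trunc, info = self.env.step(action)
        if term or trunc:
            obs, info = self.env.reset()  # auto-reset to keep the sketch simple
        return obs, rew, term, trunc, info

actors = [EnvActor.remote(lambda: gym.make("CartPole-v1")) for _ in range(4)]
ray.get([a.reset.remote() for a in actors])

# Every iteration serializes one action per env and deserializes one result
# per env, while each worker does only a trivial amount of computation.
for _ in range(100):
    results = ray.get([a.step.remote(0) for a in actors])
```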
Options

1. Give each `RayEnvWorker` its own copy of the neural network so it can take actions without communicating with the head node. Once you have enough samples, use `ray.get` to pull the data over the network.
2. Provide a vectorized environment to each Ray worker and batch the `step` or `reset` functions. This would make more use of the CPUs and reduce the number of communication calls.

Note
Of the two approaches, the second one is preferred because you only need CPUs to build vectorized environments. The first approach requires that you have both CPUs and GPUs available to the `RayEnvWorker`, which is more costly from a resource standpoint.
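For contrast, a hedged sketch of the first option (all names hypothetical; a random policy stands in for the neural network copy): each worker only talks to the head node to receive weights and to hand back whole sample batches.

```python
import gymnasium as gym
import ray

ray.init()

@ray.remote
class PolicyWorker:
    """Holds a local policy copy and collects whole sample batches."""

    def __init__(self, env_fn):
        self.env = env_fn()
        self.weights = None

    def set_weights(self, weights):
        # Receiving weights is the only head-to-worker traffic.
        self.weights = weights

    def collect(self, n_steps):
        obs, _ = self.env.reset()
        samples = []
        for _ in range(n_steps):
            # A real worker would act with its local network copy here.
            action = self.env.action_space.sample()
            next_obs, rew, term, trunc, _ = self.env.step(action)
            samples.append((obs, action, rew, next_obs, term or trunc))
            obs = self.env.reset()[0] if (term or trunc) else next_obs
        return samples

workers = [PolicyWorker.remote(lambda: gym.make("CartPole-v1")) for _ in range(4)]
ray.get([w.set_weights.remote(None) for w in workers])

# Pull many samples per ray.get instead of one step per call.
batches = ray.get([w.collect.remote(256) for w in workers])
```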