Thanks for letting us know. The web server interface isn't very commonly used, so I'm not too surprised there are, shall we say "inefficiencies", in that code. We'll look into it and see if we can assess the problem and improve the performance.
Background Context
We recently encountered an issue where, seemingly, multiple connections to the Trick Web Server, specifically to the WebSocket-based Variable Server API, caused our entire simulation, which had previously been running at real time, to run extremely slowly (~1/600th of real time). This continued for some time, with short jumps in performance every 30 seconds before it returned to running slowly. From profiling the code during this time, we identified that the Trick sim process was barely using any CPU and appeared to be blocked. Eventually (after closing connections and waiting), the sim began to catch up and eventually resumed real-time operation (note the graph below shows two such instances of this failure happening in a long-duration sim run):
While we were able to reproduce this issue occasionally, we have been unable to reproduce it reliably, even with dedicated scripts, so the above description is more for context than anything else. If anyone has ideas about this failure I would be interested in hearing them, but we have gone in a different direction for now.
Overview
While investigating the above failure we were able to narrow it down to connections to the Trick Web Server. Digging into the code, I noticed that while almost all of the functions in MyCivetServer.cpp lock a mutex before accessing the connection map, the two functions responsible for handling messages do not:
https://github.com/nasa/trick/blob/master/trick_source/web/CivetServer/src/MyCivetServer.cpp#L374-L394
In certain cases, I believe this could cause issues if a connection were removed (logic that is wrapped in a critical section) while the code was simultaneously attempting to process a message from that connection.
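To make the concern concrete, here is a minimal sketch of one way a message handler can look up a connection safely while another thread removes it. This is not the code in MyCivetServer.cpp: `Session`, `ConnectionRegistry`, `dispatch`, and `raceDispatchAgainstRemove` are hypothetical names invented for illustration, and the sketch assumes only one thread delivers messages for a given connection at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-connection session object.
struct Session {
    std::vector<std::string> received;
    void handleMessage(const std::string& msg) { received.push_back(msg); }
};

// Hypothetical connection map; every access to the map takes the same mutex.
class ConnectionRegistry {
public:
    void add(int id) {
        std::lock_guard<std::mutex> lock(mtx_);
        sessions_[id] = std::make_shared<Session>();
    }

    void remove(int id) {
        std::lock_guard<std::mutex> lock(mtx_);
        sessions_.erase(id);
    }

    // Message path: look the session up under the lock, then release the lock
    // before doing the (possibly slow) handling. The shared_ptr copy keeps the
    // session alive even if remove() erases the map entry in the meantime.
    // Returns false when the connection is no longer registered.
    bool dispatch(int id, const std::string& msg) {
        std::shared_ptr<Session> session;
        {
            std::lock_guard<std::mutex> lock(mtx_);
            auto it = sessions_.find(id);
            if (it == sessions_.end()) return false;
            session = it->second;
        }
        // Assumption: a single thread dispatches for any given connection,
        // so Session itself needs no additional locking here.
        session->handleMessage(msg);
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mtx_);
        return sessions_.size();
    }

private:
    mutable std::mutex mtx_;
    std::map<int, std::shared_ptr<Session>> sessions_;
};

// Races message delivery against removal of the same connection and returns
// how many messages were delivered before the connection disappeared.
std::size_t raceDispatchAgainstRemove(int messages) {
    ConnectionRegistry registry;
    registry.add(1);
    std::size_t delivered = 0;  // written only by the reader thread
    std::thread reader([&] {
        for (int i = 0; i < messages; ++i) {
            if (registry.dispatch(1, "msg")) ++delivered;
        }
    });
    std::thread closer([&] { registry.remove(1); });
    reader.join();
    closer.join();
    return delivered;
}
```

Calling `raceDispatchAgainstRemove(1000)` returns some count between 0 and 1000 depending on thread timing, but the lookup never touches a map that is being modified. Holding the mutex for the whole handler would also be correct and simpler, at the cost of blocking connection setup and teardown while a message is processed.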
I cannot claim that this sort of race condition would cause the behavior that originally led us to look at this code, but I figured it would be good to report it anyway.