-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preallocate p.stacks() before calling runtime.GoroutineProfile() #11
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sumerc thank you so much for submitting this and sorry for the late review.
PTAL at my comment & suggestion!
fgprof.go
Outdated
nGoroutines := runtime.NumGoroutine() | ||
if len(p.stacks) < nGoroutines { | ||
p.stacks = make([]runtime.StackRecord, int(float64(nGoroutines)*1.1)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I was thinking about doing this outside the GoroutineProfile()
func, otherwise it's costing us extra cycles even after we've settled on a suitably large stacks
size. (I wasn't able to see a significant effect from this in the benchmark, but I think it's theoretically clear why this is higher overhead)
Maybe something like this? And then changing all the places where we initialize a &profiler{} struct right now?
// newProfiler returns a new profiler with a pre-allocated stacks field. This
// is slightly faster than discovering its size during the first
// GoroutineProfile() call.
func newProfiler() *profiler {
n := runtime.NumGoroutine()
return &profiler{
stacks: make([]runtime.StackRecord, int(float64(n)*1.1)),
}
}
Let me know what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! This is the correct way.
I was trying to avoid make()
by if len(p.stacks) < nGoroutines {
but as you indicated runtime.NumGoroutine()
will be redundantly called most of the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Do you have interest/time in modifying your PR as suggested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I have done some further tests on this.
The issue is: as runtime.NumGoroutine
is faster than calling GoroutineProfile
I thought it would make sense to optimize for this as I opened the case.But I did not know p.stacks
is reused. My previous implementation was still not good enough as it still adds more cycles where goroutine counts do not change between calls to GoroutineProfile
.
I pushed a sample implementation for the above one, but it seems like it does not make too much difference because the newProfile
is not on the hot path anymore only initialized once. It adds more complexity with too little gain. So, it is up to you:
- We can close this issue as is and focus on maybe more important things in the hot path,
- We can merge,
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I have done some further tests on this.
The issue is: as
runtime.NumGoroutine
is faster than callingGoroutineProfile
I thought it would make sense to optimize for this as I opened the case.But I did not knowp.stacks
is reused. My previous implementation was still not good enough as it still adds more cycles where goroutine counts do not change between calls toGoroutineProfile
.I pushed a sample implementation for the above one, but it seems like it does not make too much difference because the
newProfile
is not on the hot path anymore only initialized once. It adds more complexity with too little gain.
Yeah, I agree that it won't have a big impact. So if you feel like this is a bad tradeoff, then I'd also prefer to not merge the patch
So, it is up to you:
- We can close this issue as is and focus on maybe more important things in the hot path,
- We can merge,
Ok, I think we should close this and look for other opportunities.
I think the main issue with fgprof right now is that it doesn't scale to programs that use tons of goroutines. I'm starting a new job at Datadog's Profiler team in January that might give me the opportunity to submit proposals and patches to the Go core for it.
But meanwhile it might be nice for the profiler to detect if it's too slow (and causing issues for the profiled program). When that happens we could either dynamically adjust the sampling rate, or fail with an error.
Let me know if you're interested in playing with this idea.
Also, if you change your mind and want to merge the PR, I'd be okay with that too. It's not a lot of complexity to add IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mental model is this: Calling runtime.GoroutineProfile()
calls stopTheWorld()
so any time we spent inside of it, we assume the users application is totally stopped, and we're "stealing" its time for profiling. The goal is to not steal more than e.g 1% of the programs wall clock time.
We don't really know how stuff the program is doing when not profiling, but I don't think it's relevant for the goal of the profiler to minimize interfering with the program.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I understand. I forgot about the STW in goroutineprofile
. I am still in learning phase for the runtime. Now it makes sense also looking at the wall time as well (instead of CPU time).
I will do some experiments myself and think a bit more...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was too busy lately and could not find time to look/think into this too much.
But I thought about the problem a bit more and I agree that your approach of trying to bound profiler wall time into %1 seems both simple to implement and powerful enough like you suggested.
Couple of issues that came to my mind, a bit overthinking maybe for the current phase, but curious about your thoughts on these:
-
While I have not tested it yet: I assume when STW occurs, it might have some adverse effects on the system as it will abuse cache locality of some processors. One test that we could either do manually or automatically is to have a large number of CPU intensive goroutines executing and do some work with them. THen we calculate these both with profiler and without profiler and see if the output is also changed like %1 or so. It would be nice to test this as there might be other kind of adverse effects as well.
-
I think this is more major problem: what happens if
GoroutineProfile
itself is actually taking more than %1? Yeah, we could simply skip some samples on the next iterations but that might actually hurt accuracy? What if it took 30ms? 50ms? 100ms? Those numbers are just (maybe?) too big for the application? Well again: I have not yet tested it myself but I think this is not too uncommon if you have many goroutines: runtime: GoroutineProfile causes latency spikes in high-goroutine-count application golang/go#33250. Well one idea might be to make an estimation on how long aGoroutineProfile
might take based on previous data, but again: I don't know what we will do even if that is the case. Another idea: get some help from the runtime: like what ifGoroutineProfile
returns some data even iflen(p)<=n
. Even when we don't have all the goroutine data, at least we might have something(maybe some kind of randomness can be applied in the runtime for returning different GOROs everytime?). So, instead of returning nothing, we will have at least something to profile this way. Otherwise it is completely binary, right? Any idea on this?
Ah one last thing:
While I was re-reading your previous comments, I could not understand this one:
Of course there is plenty of room to get fancy. E.g. using exponentially weighted averages for tracking the overhead or PID controllers for adapting the frequency.
Can you elaborate a bit what this means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
Fixes #10