org.hbase.async.RemoteException with CDH 5.7 #783
Comments
Hmm, HBase has an RPC queue that it throttles on to prevent tons of calls backing up and OOMing or GCing the region server. Take a look at the queue size setting; it may be that you need to bump it up a bit or drop the write rate from your TSDs (we've faced a similar issue around sending region servers into GC with too many requests). I'm not exactly sure what the setting is; we'll have to look it up. As for the 0.0.0.0, that's just the region server saying it's listening on all IP addresses on port 60020 and it's busy (some folks run multiple region servers on the same host over different ports).
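To be concrete about where that queue limit lives: I believe the relevant property is hbase.ipc.server.max.callqueue.length in hbase-site.xml, but verify the exact name and default against your HBase/CDH version. A rough sketch:

```xml
<!-- hbase-site.xml (sketch; verify the property name and default for your HBase/CDH version).
     The call queue cap is tied to the handler count by default; raising it gives bursty
     TSD writes more room before "Call queue is full" is thrown. The value is illustrative. -->
<property>
  <name>hbase.ipc.server.max.callqueue.length</name>
  <value>300</value>
</property>
```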
Strange, though: I changed no configs, and it worked fine before the update. I did update to CDH 5.7, and between this and the Solr issues I'm having... I am not liking it.
If your heap size is decent and you have a modern number of CPU cores, try setting the handler count higher
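For reference, that handler setting is hbase.regionserver.handler.count in hbase-site.xml (or the matching field in Cloudera Manager). A minimal sketch; the value is only illustrative, not a recommendation:

```xml
<!-- hbase-site.xml: RPC handler threads per region server (value is illustrative only;
     size it against your heap and core count, then restart the region servers). -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>100</value>
</property>
```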
It's 30 now. But I wonder (and it seems to be a CDH 5.7 issue, not an OpenTSDB issue, so this can probably be closed) why it cropped up when upgrading. Sigh.
Does increasing that value improve things? We don't mind helping you.
Not sure what CDH 5.7 looks like inside, but there was a pretty bad bug in HBase 1.2.0 in the Procedure system that caused a ton of CPU spin and garbage creation: HBASE-15422.
That's listed in their changelog as applied to CDH 5.7.
So I can confirm: upping the region server handler count resolved the issue. OpenTSDB is again quick to respond. Since it was fine before updating to CDH 5.7, I'm going to mark it down as an HBase 1.2/CDH 5.7 issue, but this minimizes it.
Hmm... so I have hbase.regionserver.handler.count = 150 in my 40-node cluster and am still seeing two issues: Bad Request on /api/query (Call queue is full on /0.0.0.0:60020, too many items queued ?), and /api/query/last is not returning the last data points. Although this all appeared to start with CDH 5.7, I am wondering if my opentsdb.conf could also be set up wrong for what I want to do.
There is OpenTSDB/asynchbase#135 on this issue.
The call queue full error doesn't explain why last isn't working, though.
I get an empty array for everything.
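In case it helps with the empty arrays: as I understand it, /api/query/last relies on tsdb-meta counters by default, so if meta isn't being populated you may need the back_scan parameter to force a scan of recent data. A hedged example (the metric name and tag are made up):

```sh
# Hypothetical metric and tag, just for illustration. back_scan=24 asks the TSD to scan
# the last 24 hours of data instead of relying on tsdb-meta counters (verify the
# parameter names against your OpenTSDB version's /api/query/last docs).
curl 'http://localhost:4242/api/query/last?timeseries=sys.cpu.user{host=web01}&back_scan=24&resolve_names=true'
```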
We increased handler.count to 200 and have been watching stats for queue usage over time. There are occasional spikes that lead to this exception, and I'm fairly confident the spikes come from requests to the tsdb-meta table. We only have OpenTSDB on this cluster and tsdb is spread evenly across all region servers, but tsdb-meta only has 4 regions, and there is an exact correlation between queue spikes and the region servers that host tsdb-meta regions. Any thoughts on what would cause spikes in tsdb-meta RPC volume? We do have tsd.core.meta.enable_realtime_ts=true, but all other .meta. settings are left at their defaults.
So yeah, turning off tsd.core.meta.enable_realtime_ts lowered our ipc.numActiveHandler usage from spikes of 100-200 down to 0-1, and this problem is gone.
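For anyone else hitting this, the change boils down to one line in opentsdb.conf (restart the TSDs afterwards); a minimal sketch:

```properties
# opentsdb.conf: stop updating timeseries meta in real time on every new series,
# which was the source of the tsdb-meta RPC spikes described above.
tsd.core.meta.enable_realtime_ts = false
```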
Also, splitting tsdb-meta into more regions was a big win.
@devaudio mind if I ask how many tsdb-meta regions you ended up with? I suspected this was necessary, so we went from 1 to 2 to 4, but that only helped marginally.
I have 174 regions for tsdb-meta, with ~3,800 metrics and ~155 billion rows.
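For reference, a rough sketch of forcing splits from the hbase shell; real deployments would pick pre-split points based on their own key distribution, so treat this as illustrative only:

```sh
# Sketch: ask HBase to split tsdb-meta (each region splits roughly at its midpoint),
# then check the region count. Repeat until load is spread across enough region servers.
hbase shell <<'EOF'
split 'tsdb-meta'
list_regions 'tsdb-meta'   # newer shells only; on older versions check the Master UI instead
EOF
```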
Similar to the last comment in #772, since upgrading to CDH 5.7, I am getting this constantly in my logs:
org.hbase.async.RemoteException: Call queue is full on /0.0.0.0:60020, too many items queued ?
org.hbase.async.RemoteException: Call queue is full on /0.0.0.0:60020, too many items queued ?
Caused by: org.hbase.async.RemoteException: Call queue is full on /0.0.0.0:60020, too many items queued ?
I ran it with DEBUG, and most of the time, especially for writes, I see it selecting the correct region servers:
[id: 0x231245fe, /10.10.7.101:33714 => /10.10.7.36:60020] Sending RPC #26
But I can't seem to figure out why it's trying to go to 0.0.0.0... and this just cropped up when I upgraded to CDH 5.7 :/
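For anyone wanting to reproduce that DEBUG output: OpenTSDB logs through logback, so something like the following in the TSD's logback.xml (exact file location depends on the install) should surface the asynchbase RPC routing; a sketch, not verified against every version:

```xml
<!-- Sketch: raise asynchbase logging to DEBUG so each outgoing RPC and its
     target region server are printed (file location varies by package/install). -->
<logger name="org.hbase.async" level="DEBUG"/>
```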