improve health checking #48

Open
mroark1m opened this issue Feb 18, 2019 · 2 comments

mroark1m commented Feb 18, 2019

We had an incident where the waggledance tasks stopped working properly.

We had a custom ELB set up in front of those tasks, and its health check flagged the tasks as unhealthy. However, the tasks were neither stopped nor replaced.

@mroark1m

After pulling in a recent change to how Xmx is calculated, the service now fails with an OutOfMemoryError (Java heap space):

2019-02-19T17:22:55,356 DEBUG com.hotels.bdp.waggledance.server.TTransportMonitor:66 - Releasing disconnected sessions
Exception in thread "SimplePauseDetectorThread_0" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-3-thread-199" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-3-thread-1084" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-3-thread-1085" java.lang.OutOfMemoryError: Java heap space
2019-02-19T17:22:55,357 ERROR com.hotels.bdp.waggledance.server.MetaStoreProxyServer:133 - WaggleDance Thrift Server threw an exception...

We again observed that the tasks stay up while the clients (Qubole clusters) stop working:

Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:226)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:383)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:191)

I note we don't have a built-in dashboard for apiary-federation like we do for apiary-data-lake.


mroark1m commented Feb 19, 2019

After the OOM:

nc -vz <host> 48869 gives...

nc: connect to <host> port 48869 (tcp) failed: Connection refused
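
The nc check above could be scripted as a health probe that exits non-zero when the Thrift port stops accepting connections. The following is only a minimal sketch of that idea in Python, not something that exists in this project: the function name port_is_open and the 3-second timeout are assumptions made for illustration, and only the port number 48869 comes from the output above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: roughly what `nc -vz <host> <port>` verifies.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect;
        # the socket is closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and name resolution errors.
        return False
```

A wrapper script could call port_is_open("<host>", 48869) and exit with status 1 on False, so that whatever runs the check can mark the task as unhealthy. Note that this only shows whether the port accepts connections; it does not show that the metastore behind it answers requests correctly.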
