Unresponsive Garnet server upon multiple cluster meets #900

Open
harshika-kashyap opened this issue Jan 2, 2025 · 1 comment

Describe the bug

The Garnet server becomes unresponsive (apparent deadlock) after multiple cluster meet operations when it is started with very few threads (min and max threads both set to 4).

Steps to reproduce the bug

  • Start 6 Garnet servers (v1.0.47) with both min and max threads set to 4.
./GarnetServer --port 6000 --checkpointdir /tmp/checkpoints/6000 --cluster True --minthreads 4 --maxthreads 4
./GarnetServer --port 6001 --checkpointdir /tmp/checkpoints/6001 --cluster True --minthreads 4 --maxthreads 4
./GarnetServer --port 6002 --checkpointdir /tmp/checkpoints/6002 --cluster True --minthreads 4 --maxthreads 4
./GarnetServer --port 6003 --checkpointdir /tmp/checkpoints/6003 --cluster True --minthreads 4 --maxthreads 4
./GarnetServer --port 6004 --checkpointdir /tmp/checkpoints/6004 --cluster True --minthreads 4 --maxthreads 4
./GarnetServer --port 6005 --checkpointdir /tmp/checkpoints/6005 --cluster True --minthreads 4 --maxthreads 4
  • Connect to one of the servers and execute cluster meet commands.
redis-cli -p 6000
127.0.0.1:6000 > cluster meet 127.0.0.1 6001
OK
127.0.0.1:6000 > cluster meet 127.0.0.1 6002
OK
127.0.0.1:6000 > cluster meet 127.0.0.1 6003
OK
127.0.0.1:6000 > cluster meet 127.0.0.1 6004
OK
127.0.0.1:6000 > cluster meet 127.0.0.1 6005 <-- This one got stuck in my experiment.
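
The steps above can be scripted. The following is a minimal Python sketch that only builds the command lines from this report; the helper names (`server_command`, `meet_command`, `repro_plan`) are hypothetical, and it assumes `./GarnetServer` and `redis-cli` are available on the machine. It does not start anything by itself.

```python
from __future__ import annotations

import shlex


def server_command(port: int, threads: int = 4) -> list[str]:
    """Command line for one Garnet server, using only the flags from this report."""
    return [
        "./GarnetServer",
        "--port", str(port),
        "--checkpointdir", f"/tmp/checkpoints/{port}",
        "--cluster", "True",
        "--minthreads", str(threads),
        "--maxthreads", str(threads),
    ]


def meet_command(via_port: int, target_port: int, host: str = "127.0.0.1") -> list[str]:
    """redis-cli invocation that asks the node on via_port to meet target_port."""
    return ["redis-cli", "-p", str(via_port), "cluster", "meet", host, str(target_port)]


def repro_plan(first_port: int = 6000, count: int = 6) -> tuple[list[list[str]], list[list[str]]]:
    """Return (server commands, meet commands) for `count` consecutive ports."""
    ports = range(first_port, first_port + count)
    servers = [server_command(p) for p in ports]
    # All meets are issued through the first node, as in the steps above.
    meets = [meet_command(first_port, p) for p in ports[1:]]
    return servers, meets


if __name__ == "__main__":
    servers, meets = repro_plan()
    for cmd in servers + meets:
        print(shlex.join(cmd))
```

Each server command could be passed to `subprocess.Popen` and each meet command to `subprocess.run`, in that order.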

Expected behavior

The server should remain responsive.

Release version

v1.0.47

IDE

No response

OS version

No response

Additional context

I took a thread snapshot of the stuck Garnet server.

Thread (0x4B33) has acquired the lock in ClusterManager and is stuck in ClusterUtils.WriteInto, blocked in SemaphoreSlim.Wait() while writing to the file (nodes.conf).

The other thread pool worker threads (0x424C, 0x45FD, 0x4B3B) are stuck in ClusterManager.FlushConfig(), waiting for the ClusterManager lock itself.

Thread (0x4231):
  [Native Frames]
  System.Private.CoreLib!System.Threading.Thread.Sleep(int32)
  GarnetServer!Garnet.Program.Main(class System.String[])

Thread (0x4248):
  [Native Frames]
  System.Private.CoreLib!System.Threading.Monitor.Wait(class System.Object,int32)
  Microsoft.Extensions.Logging.Console!Microsoft.Extensions.Logging.Console.ConsoleLoggerProcessor.TryDequeue(value class Microsoft.Extensions.Logging.Console.LogMessageEntry&)
  Microsoft.Extensions.Logging.Console!Microsoft.Extensions.Logging.Console.ConsoleLoggerProcessor.ProcessLogQueue()

Thread (0x424C):
  [Native Frames]
  Garnet.cluster!Garnet.cluster.ClusterManager.FlushConfig()
  Garnet.cluster!Garnet.cluster.ClusterManager.TryMerge(class Garnet.cluster.ClusterConfig,bool)
  Garnet.cluster!Garnet.cluster.GarnetServerNode.<Gossip>b__26_0(class System.Threading.Tasks.Task`1<value class Garnet.common.MemoryResult`1<unsigned int8>>)
  System.Private.CoreLib!System.Threading.Tasks.ContinuationTaskFromResultTask`1[Garnet.common.MemoryResult`1[System.Byte]].InnerInvoke()
  System.Private.CoreLib!System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(class System.Threading.Thread,class System.Threading.ExecutionContext,class System.Threading.ContextCallback,class System.Object)
  System.Private.CoreLib!System.Threading.Tasks.Task.ExecuteWithThreadLocal(class System.Threading.Tasks.Task&,class System.Threading.Thread)
  System.Private.CoreLib!System.Threading.ThreadPoolWorkQueue.Dispatch()
  System.Private.CoreLib!System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()

Thread (0x424D):
  [Native Frames]
  System.Private.CoreLib!System.Threading.WaitHandle.WaitOneNoCheck(int32)
  System.Private.CoreLib!System.Threading.PortableThreadPool+GateThread.GateThreadStart()

Thread (0x424F):
  [Native Frames]
  ?!?
  System.Net.Sockets!System.Net.Sockets.SocketAsyncEngine.EventLoop()

Thread (0x4252):
  [Native Frames]
  System.Private.CoreLib!System.Threading.WaitHandle.WaitOneNoCheck(int32)
  System.Private.CoreLib!System.Threading.TimerQueue.TimerThread()

Thread (0x45FD):
  [Native Frames]
  Garnet.cluster!Garnet.cluster.ClusterManager.FlushConfig()
  Garnet.cluster!Garnet.cluster.ClusterManager.TryMerge(class Garnet.cluster.ClusterConfig,bool)
  Garnet.cluster!Garnet.cluster.GarnetServerNode.<Gossip>b__26_0(class System.Threading.Tasks.Task`1<value class Garnet.common.MemoryResult`1<unsigned int8>>)
  System.Private.CoreLib!System.Threading.Tasks.ContinuationTaskFromResultTask`1[Garnet.common.MemoryResult`1[System.Byte]].InnerInvoke()
  System.Private.CoreLib!System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(class System.Threading.Thread,class System.Threading.ExecutionContext,class System.Threading.ContextCallback,class System.Object)
  System.Private.CoreLib!System.Threading.Tasks.Task.ExecuteWithThreadLocal(class System.Threading.Tasks.Task&,class System.Threading.Thread)
  System.Private.CoreLib!System.Threading.ThreadPoolWorkQueue.Dispatch()
  System.Private.CoreLib!System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()

Thread (0x4B33):
  [Native Frames]
  System.Private.CoreLib!System.Threading.Monitor.Wait(class System.Object,int32)
  System.Private.CoreLib!System.Threading.SemaphoreSlim.WaitUntilCountOrTimeout(int32,unsigned int32,value class System.Threading.CancellationToken)
  System.Private.CoreLib!System.Threading.SemaphoreSlim.Wait(int32,value class System.Threading.CancellationToken)
  System.Private.CoreLib!System.Threading.SemaphoreSlim.Wait()
  Garnet.cluster!Garnet.cluster.ClusterUtils.WriteInto(class Tsavorite.core.IDevice,class Tsavorite.core.SectorAlignedBufferPool,unsigned int64,unsigned int8[],int32,class Microsoft.Extensions.Logging.ILogger)
  Garnet.cluster!Garnet.cluster.ClusterManager.FlushConfig()
  Garnet.cluster!Garnet.cluster.ClusterManager.TryMerge(class Garnet.cluster.ClusterConfig,bool)
  Garnet.cluster!Garnet.cluster.GarnetServerNode.<Gossip>b__26_0(class System.Threading.Tasks.Task`1<value class Garnet.common.MemoryResult`1<unsigned int8>>)
  System.Private.CoreLib!System.Threading.Tasks.ContinuationTaskFromResultTask`1[Garnet.common.MemoryResult`1[System.Byte]].InnerInvoke()
  System.Private.CoreLib!System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(class System.Threading.Thread,class System.Threading.ExecutionContext,class System.Threading.ContextCallback,class System.Object)
  System.Private.CoreLib!System.Threading.Tasks.Task.ExecuteWithThreadLocal(class System.Threading.Tasks.Task&,class System.Threading.Thread)
  System.Private.CoreLib!System.Threading.ThreadPoolWorkQueue.Dispatch()
  System.Private.CoreLib!System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()

Thread (0x4B3B):
  [Native Frames]
  Garnet.cluster!Garnet.cluster.ClusterManager.FlushConfig()
  Garnet.cluster!Garnet.cluster.ClusterManager.TryMerge(class Garnet.cluster.ClusterConfig,bool)
  Garnet.cluster!Garnet.cluster.GarnetServerNode.<Gossip>b__26_0(class System.Threading.Tasks.Task`1<value class Garnet.common.MemoryResult`1<unsigned int8>>)
  System.Private.CoreLib!System.Threading.Tasks.ContinuationTaskFromResultTask`1[Garnet.common.MemoryResult`1[System.Byte]].InnerInvoke()
  System.Private.CoreLib!System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(class System.Threading.Thread,class System.Threading.ExecutionContext,class System.Threading.ContextCallback,class System.Object)
  System.Private.CoreLib!System.Threading.Tasks.Task.ExecuteWithThreadLocal(class System.Threading.Tasks.Task&,class System.Threading.Thread)
  System.Private.CoreLib!System.Threading.ThreadPoolWorkQueue.Dispatch()
  System.Private.CoreLib!System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()
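
One possible reading of this snapshot, which is an assumption and not something confirmed here, is thread pool starvation: the lock holder blocks waiting for a completion that itself needs a pool thread, while every other pool thread is blocked on the lock. The sketch below is not Garnet code; it is a minimal Python model of that pattern with hypothetical names (`flush_completes`, `config_lock`), using a timeout so that it terminates instead of hanging.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def flush_completes(pool_size: int, waiters: int, timeout_s: float = 0.5) -> bool:
    """Return True if the lock holder's completion ran, False if it was starved."""
    pool = ThreadPoolExecutor(max_workers=pool_size)
    config_lock = threading.Lock()      # stands in for the ClusterManager lock
    lock_held = threading.Event()
    arrived = threading.Semaphore(0)    # counts waiters that occupy a worker
    expected = min(waiters, pool_size - 1)

    def holder() -> bool:
        with config_lock:
            lock_held.set()
            # Wait until the other workers are tied up behind the lock.
            for _ in range(expected):
                arrived.acquire()
            # The "write completion" needs a pool worker to run (assumption).
            completion = pool.submit(lambda: None)
            try:
                completion.result(timeout=timeout_s)
                return True
            except FutureTimeout:
                completion.cancel()
                return False

    def waiter() -> None:
        arrived.release()
        with config_lock:               # blocks until the holder releases
            pass

    outcome = pool.submit(holder)
    lock_held.wait()                    # make sure the holder owns the lock first
    for _ in range(waiters):
        pool.submit(waiter)
    result = outcome.result()
    pool.shutdown(wait=True)
    return result


if __name__ == "__main__":
    print(flush_completes(pool_size=4, waiters=3))  # all 4 workers blocked
    print(flush_completes(pool_size=5, waiters=3))  # one worker left free
```

With 4 workers and 3 waiters the completion never gets a thread and the call returns False; with one spare worker it returns True. In the real server there is no timeout, so the equivalent situation would hang.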

@harshika-kashyap (Author) commented:

I also tried starting the Garnet servers with these flags: --clean-cluster-config, --aof-null-device, --no-obj. The same issue still occurs.
