[BUG] Seeing some backlog pending on Pulsar UI even if Spark consumer has consumed all data. #176
Comments
Thanks for responding.
@akshay-habbu Just FYI, during Spark job execution the connector spawns a new consumer/reader to consume messages from the last committed position. That's why you may observe some backlog. Do you see your job proceeding, and does the backlog change after each micro-batch?
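One way to check this (a sketch using the stock pulsar-admin CLI; the topic name is a placeholder) is to re-run topic stats between micro-batches and watch whether the backlog moves:

```sh
# Per-subscription stats for the topic, including msgBacklog.
# Re-run between micro-batches to see whether the backlog changes.
pulsar-admin topics stats persistent://public/default/my-topic
```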
@nlu90
@akshay-habbu We haven't heard any reports of this issue from other users so far. One possibility is that these backlogged subscriptions are not the ones being actively used, and are probably left-over subscriptions from your previous round of tests.
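If left-over subscriptions are the cause, they can be spotted and removed with pulsar-admin (a sketch; topic and subscription names are placeholders):

```sh
# List all subscriptions on the topic to spot stale ones.
pulsar-admin topics subscriptions persistent://public/default/my-topic

# Drop a subscription that is no longer used; its backlog goes with it.
pulsar-admin topics unsubscribe --subscription old-test-sub \
  persistent://public/default/my-topic
```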
I have tried with multiple names and different topics; the same behaviour is observed.
@akshay-habbu Hello, have you ever been able to figure out how to reduce the backlog? I am seeing exactly the same issue on my end. Also, in your experience, does using "predefinedSubscription" vs. auto-creation have any impact on the backlog?
I am also facing the same issue after completing the pipeline; there is still a message in the backlog.
I have been able to reproduce at least a similar behaviour, with Pulsar version 4.0.1 and the latest Databricks runtime. After reading all the messages, the backlog displays the total number of messages. Correct me if I am wrong, but this connector seems to be using the reader interface instead of the consumer one (relevant PR). According to the Pulsar docs on the reader interface (here and here), it is up to the client to manage the cursor and there is no need to acknowledge messages, which would explain why the backlog displays the total number of messages to be consumed: Spark manages progress with its own checkpoints and never acknowledges messages. However, I have also noticed the subscription mode is set to `Durable`.
So if this connector is using the reader interface, how is a `Durable` subscription being created? Furthermore, I tested creating a reader with the native Python client: a `NonDurable` subscription was created, as expected. So I wonder if this is related: if a durable subscription is being created, perhaps Pulsar is expecting the client to acknowledge the messages, which is not happening. But to end with the main question: why is a `Durable` subscription being created by this connector?
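For reference, the native-client test mentioned above looks roughly like this (a minimal sketch with the pulsar-client Python package; broker URL and topic are placeholders). A plain reader like this shows up in topic stats with a `NonDurable` subscription:

```python
import pulsar

# Connect to a broker (placeholder URL) and open a plain reader.
client = pulsar.Client("pulsar://localhost:6650")
reader = client.create_reader(
    "persistent://public/default/my-topic",
    start_message_id=pulsar.MessageId.earliest,
)

# Drain the topic without acknowledging anything; the reader interface
# leaves cursor management entirely to the client.
while reader.has_message_available():
    msg = reader.read_next()
    print(msg.message_id())

client.close()
```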
[ Disclaimer - I am fairly new to Pulsar, so I might not understand all the Pulsar details, but I have been using Spark for a while now. ]
I am using an Apache Spark consumer to consume data from Pulsar on AWS EMR, via the StreamNative pulsar-spark connector.
My version stack looks like this:
Spark version: 3.4.1
Pulsar version: 2.10.0.7
StreamNative connector: pulsar-spark-connector_2.12-3.4.0.3.jar
I have created a new Pulsar topic and started a fresh Spark consumer on that topic. The consumer is able to connect to the topic and consume messages correctly; the only issue I have is with the backlog numbers displayed in the Pulsar admin UI.
To Reproduce
Steps to reproduce the behavior:
Create a Spark consumer using the following code:
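A minimal sketch of such a consumer in PySpark (service/admin URLs, topic, and checkpoint path are placeholders; the actual job may set additional options):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pulsar-backlog-repro").getOrCreate()

# Read the topic as a stream through the pulsar-spark connector.
df = (
    spark.readStream
    .format("pulsar")
    .option("service.url", "pulsar://localhost:6650")
    .option("admin.url", "http://localhost:8080")
    .option("topic", "persistent://public/default/my-topic")
    .load()
)

# Write to the console; Spark tracks progress via its checkpoint,
# not via Pulsar acknowledgements.
query = (
    df.selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/pulsar-checkpoint")
    .start()
)
query.awaitTermination()
```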
Also, there is a side problem, not very important: it seems Spark does not create a new subscription on its own, and the job keeps failing. The only way I can make it work is by creating a subscription manually on the Pulsar end and using the `predefinedSubscription` option in Spark to latch on to that subscription. I tried passing `pulsar.reader.subscriptionName`, `pulsar.consumer.subscriptionName`, and `subscriptionName` while running the job, but it failed with the same error. The workaround I am using is sketched below.
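A minimal sketch of that workaround (assuming the subscription already exists on the Pulsar side; URLs, topic, and subscription name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pulsar-predefined-sub").getOrCreate()

# Latch onto a manually created subscription via the connector's
# predefinedSubscription option (subscription name is a placeholder).
df = (
    spark.readStream
    .format("pulsar")
    .option("service.url", "pulsar://localhost:6650")
    .option("admin.url", "http://localhost:8080")
    .option("topic", "persistent://public/default/my-topic")
    .option("predefinedSubscription", "spark-sub")
    .load()
)
```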
Any help would be much appreciated.