Collators stuck producing same block #5253
Comments
Hi, I investigated this issue a bit today. From the collator side everything seems to be okay. However, for some reason the submitted blocks fail to validate on the Rococo validators. I extracted this from one validator's log:
I am not sure what exactly the reason for this validation failure is. Maybe something went wrong with the runtime upgrade? When was it deployed? |
The runtime was deployed on 2024-08-05 at 10:25:12 (UTC). This is the block where the runtime upgrade was applied: https://neuroweb-testnet.subscan.io/block/4238101 You can see that about 30 more blocks were created after the runtime upgrade was applied, up to https://neuroweb-testnet.subscan.io/block/4238136, so the upgrade itself had no issue, as block production continued after it. We are considering changing the runtime back to the previous version using sudo on Rococo to unblock the network. Do you think this would help us? |
This seems to be the reason the parachain block fails to validate:
Maybe this helps you figure out why the blocks are invalid. |
As far as I can tell, everything was fine until https://neuroweb-testnet.subscan.io/block/4238137, which appears unfinalized in Subscan. The latest para head on the Rococo relay chain is that of block https://neuroweb-testnet.subscan.io/block/4238136: 0xf1f71e7d498dbecd671c4a985199398f9d87fb380944184aa594251094b516cb It seems to have worked fine as long as all the para blocks after the upgrade were empty, until 4238137, which contains a bunch of
Yeah setting the old code and head should work. |
One more thing to try is to make sure that you are not using the native executor to build your blocks. Recently we had reports of nodes stalling because there was some encoding difference between wasm and native: #4808 The solution in that case was to switch to full wasm execution, which you should also do. If that does not help, the next step would be to reproduce this error locally. Recently, mechanisms were introduced to write PoVs to disk and then execute them locally. This has two parts:
The debugging flow would be the following:
|
Thank you all for your swift replies! @skunert to your point, I believe the testnet version is still using the native executor. I can also point to this PR on HydraDX that changed that as well. Still, given that the fix will take a bit of time and the team needs to get their parachain back into production swiftly, I'd suggest we go with setting the old code, which @sandreim said should work. Given that the chain managed to produce ~30 blocks with the new code, the questions that I have are:
|
@SBalaguer You should be able to just pass |
Ah missed your questions the first time. I have only limited experience in resetting chains. But from my mental model:
|
When setting the parameter we get
So I guess the only option is to change the code. |
When you do this, all the nodes will need to resync, because the blocks produced after the runtime upgrade are already finalized. We don't support reverting the finalized chain. Without any more details, I also assume that there is a mismatch between the native runtime and the runtime registered on the relay chain. Just remove the native executor to test this theory. It should be the fastest option. If that doesn't work, you can still reset. |
There should also be a |
Just to confirm: we should remove the native executor in the same way as it is done in the HydraDX PR, then switch the collators to the new binaries and check whether that resolves the issue. |
That is correct. After you switch to the new binaries, the collators should start building blocks again. This time, however, the blocks are built with WASM, so there should be no difference from the validation code the validators are running. Blocks should be backed by the relay chain again and your chain will progress. |
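To illustrate the reasoning above, here is a minimal toy sketch in Rust. It is not Polkadot SDK code: `StateRoot`, `Executor`, `Candidate`, `execute_wasm`, `execute_native`, `build_candidate`, and `validate` are all hypothetical names, and the "divergence" in the native path is invented purely for demonstration. The only point it models is that validators re-execute a candidate with the WASM validation code, so a block built by a diverging native runtime will not validate, while a block built with WASM will.

```rust
// Toy model only: nothing here is a Polkadot SDK API. All names are hypothetical.

/// Stand-in for the outcome of executing a block (for example, a state root).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StateRoot(pub u64);

/// Which executor the collator used to build the block.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Executor {
    Native,
    Wasm,
}

/// A block candidate as submitted by a collator.
#[derive(Debug, Clone)]
pub struct Candidate {
    pub parent: StateRoot,
    pub extrinsics: Vec<u32>,
    pub claimed_root: StateRoot,
}

/// Reference execution: models the WASM validation code registered on the relay chain.
pub fn execute_wasm(parent: StateRoot, extrinsics: &[u32]) -> StateRoot {
    let mut acc = parent.0;
    for x in extrinsics {
        acc = acc.wrapping_mul(31).wrapping_add(u64::from(*x));
    }
    StateRoot(acc)
}

/// Models a native runtime that has drifted from the WASM code.
/// The "+ 1" is an invented divergence that only shows up when extrinsics are present.
pub fn execute_native(parent: StateRoot, extrinsics: &[u32]) -> StateRoot {
    let mut acc = parent.0;
    for x in extrinsics {
        acc = acc.wrapping_mul(31).wrapping_add(u64::from(*x) + 1);
    }
    StateRoot(acc)
}

/// Collator side: build a candidate with the chosen executor.
pub fn build_candidate(executor: Executor, parent: StateRoot, extrinsics: Vec<u32>) -> Candidate {
    let claimed_root = match executor {
        Executor::Native => execute_native(parent, &extrinsics),
        Executor::Wasm => execute_wasm(parent, &extrinsics),
    };
    Candidate { parent, extrinsics, claimed_root }
}

/// Validator side: always re-execute with the WASM code and compare the results.
pub fn validate(candidate: &Candidate) -> bool {
    execute_wasm(candidate.parent, &candidate.extrinsics) == candidate.claimed_root
}
```

In this toy model a candidate built with `Executor::Wasm` always passes `validate`, and a candidate built with `Executor::Native` fails as soon as it contains extrinsics. An empty native-built block still passes, which loosely mirrors the observation earlier in the thread that the empty blocks after the upgrade were fine, although the thread does not confirm the exact cause of the divergence.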
Thanks for the help guys, this was indeed the solution and collators are now producing blocks on NeuroWeb Testnet! |
That is great to hear! Will close here, if they get stuck again (which I don't expect) feel free to reopen. |
Description of bug
At NeuroWeb we are experiencing an issue with NeuroWeb Testnet connected to the Rococo relay chain.
As per @skunert's request, we are opening a new issue. Initially, we reported this in issue #1202.
After a runtime upgrade in which we changed dependencies from v0.9.40 to v1.9.0 (OriginTrail/neuroweb#86), the upgrade was successful, but after 30+ blocks the chain could no longer produce a block.
All collators have block 4238138 as their best block and 4238137 as finalized, but they keep trying to create block 4238138 again and are stuck in a loop. We see the following logs with aura::cumulus=trace,parachain=debug,txpool=debug:
We assume that this is caused by forks on the relay chain (Rococo) that are not handled properly by the collators on NeuroWeb Testnet. We would like to get confirmation and a better understanding of this issue, so that we know how to avoid it on NeuroWeb Mainnet, and we also need to find a way to unblock NeuroWeb Testnet.
Steps to reproduce
No response