Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If a packet is heard multiple times, rebroadcast using the highest hop limit #5534

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

erayd
Copy link
Contributor

@erayd erayd commented Dec 9, 2024

Sometimes a packet will be in the TX queue waiting to be transmitted, when it is overheard being rebroadcast by another node with a higher remaining hop limit. When this occurs, modify the pending packet in the TX queue to avoid unnecessarily wasting hops.

This is intended to help solve the problem of a router node with excellent coverage (A) hearing a packet bounce around between several poorly-placed nodes while waiting for the opportunity to rebroadcast. On a busy mesh, this can mean that the first copy of the packet that (A) hears may have had its hop limit prematurely reduced, and a better version is subsequently received.

This is more relevant to the late-rebroadcast scenario implemented in #5528, but is still applicable even without that mode, particularly when having to contend with poorly-placed rooftop routers.

For example:

  1. (E) transmits a packet with hop limit 3
  2. (B) & (D) both overhear it. (D) starts rebroadcasting it first with hop limit 2, (B) overhears (D) and waits.
  3. (B) and (C) are both ready to transmit. (C) happens to go first, rebroadcasting with hop limit 1.
  4. (A) hears the packet from (C), and is about to rebroadcast with hop limit 0, when it hears (B) starting to send.
  5. (B) rebroadcasts the packet with hop limit 2.
  6. (A) hears the packet again from (B), this time with hop limit 2.
  7. (A) adjusts the hop limit on the packet in its TX queue from 0 to 1, and then transmits the packet.

example

@fifieldt
Copy link
Contributor

fifieldt commented Dec 9, 2024

This makes sense to me

@thebentern thebentern requested a review from GUVWAF December 9, 2024 11:40
@GUVWAF
Copy link
Member

GUVWAF commented Dec 9, 2024

Implementation looks good, but I'm not entirely sure about the idea behind it, as it effectively means you're artificially increasing the hop limit for certain branches of the mesh. Not sure this is necessarily a bad thing, as it wouldn't happen if the optimal route was chosen in the first place, but it does go against the idea of ​​limiting the amount of rebroadcasts by setting a proper hop limit.

Also, a small issue with this is that a traceroute result will show a route with more hops than the hops_away accompanying the packet, which is also used to add “unknown” nodes to the route.

@erayd
Copy link
Contributor Author

erayd commented Dec 9, 2024

I'm not entirely sure about the idea behind it, as it effectively means you're artificially increasing the hop limit for certain branches of the mesh. Not sure this is necessarily a bad thing, as it wouldn't happen if the optimal route was chosen in the first place, but it does go against the idea of ​​limiting the amount of rebroadcasts by setting a proper hop limit.

The way I see it, is that it is correcting an unintended (and suboptimal) lowering of hop count for those sections. Two different ways of looking at the same situation perhaps? This wouldn't change the number of rebroadcasts in that area, but it would allow the packet to ultimately travel further, rather than getting lost between high sites due to routing shenanigans at the origin area.

Also, a small issue with this is that a traceroute result will show a route with more hops than the hops_away accompanying the packet, which is also used to add “unknown” nodes to the route.

Thanks for flagging this. I do not yet understand how traceroute is implemented (have yet to investigate that part of the source), so will dig into that to ensure it is treated properly.

@erayd erayd marked this pull request as draft December 9, 2024 13:24
@GUVWAF
Copy link
Member

GUVWAF commented Dec 9, 2024

This wouldn't change the number of rebroadcasts in that area

For this specific packet, it does (otherwise this won't have any effect).

Two different ways of looking at the same situation perhaps?

I guess it depends on whether users are already accounting for this or not. Even with this solution, consuming hops too early might still happen if the packet already left the Tx queue, or if the first packet arrived with a hop limit of 0. So, if users set their hop limit based on the maximum they have seen, this only leads to more unnecessary rebroadcasts. Only if they set it exactly to the one needed for the optimal route, then this would indeed higher the success rate without any additional unnecessary rebroadcasts.

@GUVWAF
Copy link
Member

GUVWAF commented Dec 9, 2024

I do not yet understand how traceroute is implemented

It will be called upon as a module, and then modifies the payload, before adding it to the Tx queue again. When you see a duplicate packet, it will be filtered out and thus the modules are not called for it, so the original traceroute result will be in the Tx queue.
I've struggled with this before, since you need to be careful with handling duplicate packets multiple times, for example because it shouldn't be sent to the phone and MQTT twice.

@erayd
Copy link
Contributor Author

erayd commented Dec 9, 2024

For this specific packet, it does (otherwise this won't have any effect).

Can you clarify? I'm not seeing where the extra rebroadcast is in the local area. The only additional rebroadcast should be out of the problem area, by whichever relaying device would otherwise have seen a hop limit of 0 (in this case, A).

The fundamental question is, do you see this behaviour as desirable or undesirable in the general case? If the latter, would you be more comfortable if I implemented this as part of #5528, to apply only to ROUTER_LATE?

I think it's definitely warranted in the ROUTER_LATE case, because those nodes are intended to be serving isolated pockets of nodes that do a poor job amongst themselves first. But with other roles, the usefulness is much more marginal.

@GUVWAF
Copy link
Member

GUVWAF commented Dec 10, 2024

It might just be we’re not agreeing on what the “(local/problem) area” is. In my opinion it doesn’t really matter where the additional rebroadcasts happen, if they are unnecessary then it should be avoided. It is unnecessary when an end-node rebroadcasts even though there is no new receiver, but it is especially relevant for direct messages, where most nodes do not need to receive the packet at all. For example here, if node 0 sends a DM to node 1 with a hop limit of 2, it also goes to node 2 and 3, and ends up with a hop limit of 0 at node 4. However, when it arrives later via 5, with this PR node 4 would still rebroadcast.

image

do you see this behaviour as desirable or undesirable in the general case?

If users are conservative in setting the hop limit, then it would be desirable, but in my experience this is mostly not the case.

I think it's definitely warranted in the ROUTER_LATE case, because those nodes are intended to be serving isolated pockets of nodes that do a poor job amongst themselves first.

The ROUTER_LATE should serve isolated pocket of nodes that did not hear the packet yet. The first packet that arrives at the ROUTER_LATE should not be coming from those isolated pockets of nodes that consume too many hops.

But with other roles, the usefulness is much more marginal.

I agree this will happen more frequently if there are ROUTER_LATE roles in the mesh. However, it will then also affect ROUTERs and REPEATERs, because even when the route via a ROUTER_LATE uses less hops, its packet likely arrives later than a suboptimal route.

Maybe we can come to a compromise and only introduce this after ROUTER_LATE and the NextHopRouter (#2856) are introduced? With the NextHopRouter unnecessary relaying for DMs will already be limited, while using this PR, for broadcasts the reachability might in some cases be improved.

@erayd
Copy link
Contributor Author

erayd commented Dec 11, 2024

It might just be we’re not agreeing on what the “(local/problem) area” is.

I think that may well be the case. I'm looking at this from the principle of "The packet should be spread as widely as possible across the mesh with the fewest number of hops". In that context, what I consider to be unnecessary rebroadcasts are those generated by clients who missed seeing a prior nearby repetition of the packet (e.g. due to terrain obstructions). Whereas you seem to be coming at it from the perspective of "Users are probably trying to reach only people very close to them, and packets should be discouraged from travelling longer distances." (unless I have badly misunderstood your example scenario in the above post).

In my opinion it doesn’t really matter where the additional rebroadcasts happen, if they are unnecessary then it should be avoided.

Agreed. However, currently we do not have a way to divine the user's intent - they might be trying to reach node 1, but they may also be wanting to reach node 6, or perhaps even something further away than that which would require relaying via (6). This is particularly the case with broadcast packets (e.g. if (6) and a closer node share a common channel, the user may desire them both to receive the packet). In the common channel case, the user may not even know how far away (6) is, or whether they're a member of that channel.

With DMs, it's more straightforward - the destination node ID is in the packet header. Albeit that we don't (yet) have a mechanism to determine where on the network that node might be located, and which hops may be necessary to reach it. Looking forward to #2856 for that!

do you see this behaviour as desirable or undesirable in the general case?

If users are conservative in setting the hop limit, then it would be desirable, but in my experience this is mostly not the case.

Is it not that users bump the hop limit up because they are struggling to reach their target otherwise? I know that is certainly the case at the edges of our local mesh here... distant users increase their hopcount, because otherwise their packets don't reliably make it through the mess of mountains & hills to the person 400km away that they're trying to communicate with.

I think it's definitely warranted in the ROUTER_LATE case, because those nodes are intended to be serving isolated pockets of nodes that do a poor job amongst themselves first.

The ROUTER_LATE should serve isolated pocket of nodes that did not hear the packet yet. The first packet that arrives at the ROUTER_LATE should not be coming from those isolated pockets of nodes that consume too many hops.

This makes no sense. ROUTER_LATE does not exist as a one-way thing to only deliver packets to those pockets of users. For it to properly serve those users, also exists to receive packets from them and relay those out to the rest of the mesh. In many cases, due to the topography, that node will be the only path available for such traffic to take.

But with other roles, the usefulness is much more marginal.

I agree this will happen more frequently if there are ROUTER_LATE roles in the mesh. However, it will then also affect ROUTERs and REPEATERs, because even when the route via a ROUTER_LATE uses less hops, its packet likely arrives later than a suboptimal route.

By the time the ROUTER_LATE version of a packet arrives, anything with a ROUTER or REPEATER role would have already rebroadcast that packet (and will therefore completely ignore the ROUTER_LATE version) unless the one from ROUTER_LATE is the first time it was received. This optimisation is only relevant to packets that are already sitting in the TX queue - it doesn't retain them there to wait for better options, and it doesn't rebroadcast it a second time. This behaviour is purely opportunistic. Hence why I say the benefits are more marginal for other roles, because with other roles things get flushed out of the TX queue much more rapidly. The sooner they move, the less this helps.

Maybe we can come to a compromise and only introduce this after ROUTER_LATE and the NextHopRouter (#2856) are introduced? With the NextHopRouter unnecessary relaying for DMs will already be limited, while using this PR, for broadcasts the reachability might in some cases be improved.

I'd be OK with that 🙂. It's a feature I'm keen to have available on our local mesh sooner than that - because it's clear from experience that we definitely do need it - but we can always just use custom binaries on our sites where that matters, so waiting until after #2856 and #5528 are merged won't cause us any problems.

@GUVWAF
Copy link
Member

GUVWAF commented Dec 11, 2024

"Users are probably trying to reach only people very close to them, and packets should be discouraged from travelling longer distances." (unless I have badly misunderstood your example scenario in the above post).

My point is that the hop limit is a global setting, it applies to every packet, so you have to set it to the maximum you expect you need. However, it is also used for the case where you send a direct message to a node close to you.

ROUTER_LATE does not exist as a one-way thing to only deliver packets to those pockets of users. For it to properly serve those users, also exists to receive packets from them and relay those out to the rest of the mesh.

Ah, I was thinking about the use-case for a ROUTER_LATE on a rooftop that serve CLIENT_MUTE devices in the house, but yes, when they are normal CLIENTs and the packets go the other way around, you’re right.

a ROUTER or REPEATER role would have already rebroadcast that packet

Yes, if it is in front of the Tx queue, and there will not be any new packet received in between from another ROUTER/REPEATER or a node that didn’t hear the first packet (hidden node problem).

Anyway, let's merge this after ROUTER_LATE and the NextHopRouter (and hopefully after fixing the traceroute issue).

@erayd
Copy link
Contributor Author

erayd commented Dec 11, 2024

Anyway, let's merge this after ROUTER_LATE and the NextHopRouter (and hopefully after fixing the traceroute issue).

Sounds good 🙂. And yes, I agree that the traceroute issue you've identified is something that I should resolve before this is merged.

/**
* Clamp the hop limit to the greater of the hop count provided, or the hop count in the queue
*/
void RadioLibInterface::clampHopsToMax(NodeNum from, PacketId id, uint32_t hop_limit)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add this function to the SimRadio also?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the SimRadio part of the codebase yet (or at least, I don't recognise the name), but assuming there are no show-stoppers, yes 🙂.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used with https://github.com/meshtastic/Meshtasticator?tab=readme-ov-file#interactive-simulator to simulate multiple nodes on a Linux system (or Docker container).

It should just work if you simply copy this function :)

Copy link
Contributor Author

@erayd erayd Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, maybe better to implement it in RadioInterface directly.

Does RadioInterface have access to the TX queue? I didn't think it did... that's why I put this in RadioLibInterface in the first place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry, it doesn't. Indeed it has to be in RadioLibInterface and then unfortunately needs to be copied to SimRadio.

@caveman99 caveman99 force-pushed the overheard-hoptimisation branch from e520ba6 to 6ea0fed Compare December 12, 2024 15:46
@erayd erayd force-pushed the overheard-hoptimisation branch from 6ea0fed to ea42503 Compare December 12, 2024 17:11
@erayd
Copy link
Contributor Author

erayd commented Dec 12, 2024

@caveman99 Please don't force-push my branches. No issue with you keeping it up to date if you want to, but unnecessarily breaking sync is irritating. Merge commits work just fine.

@garthvh
Copy link
Member

garthvh commented Dec 12, 2024

@caveman99 Please don't force-push my branches. No issue with you keeping it up to date if you want to, but unnecessarily breaking sync is irritating. Merge commits work just fine.

This is our process, having one branch managed differently than the others is not really going to work.

@erayd
Copy link
Contributor Author

erayd commented Dec 12, 2024

@garthvh

This is our process, having one branch managed differently than the others is not really going to work.

#5528 indicates otherwise, as you folks have been using merge commits there without issue. And on several other PR branches, including some of them by @caveman99. So using merge commits is clearly accepted practice here.

Why is doing things in a way that doesn't break sync OK for those branches, but not this one?

@caveman99
Copy link
Member

Preferred is rebase, sometimes you can't rebase, then we use merge commits.

@erayd
Copy link
Contributor Author

erayd commented Dec 12, 2024

@caveman99

Preferred is rebase, sometimes you can't rebase, then we use merge commits.

Roger. Looking at the PR list though, I'm still seeing a fair bit of merge activity in cases where a rebase would have worked fine (including my initial example of #5528) - so it doesn't seem that this preference is being adhered to in practice across the board. If that is something you want to insist on though, then I will just put up with the annoyance as the price of engaging with this project.

Rebase-first-by-maintainers for regularly keeping every PR updated on a repo this busy does seem an odd way of doing things IMO. Given the frequency of updates, you'd be breaking things on a near-constant basis. If this preference is something you want to insist on, might I suggest that you mention it in the default notes when somebody creates a PR? Currently it just requests that editing by maintainers be enabled, but makes no mention that one's work might get repeatedly rebased without warning. That is a useful thing to know upfront, especially for PRs that may involve a notable number of distinct commits between pushes.

@erayd erayd force-pushed the overheard-hoptimisation branch from 8f75b28 to a4cfa57 Compare December 21, 2024 10:44
…p limit

Sometimes a packet will be in the TX queue waiting to be transmitted,
when it is overheard being rebroadcast by another node, with a higher
hop limit remaining. When this occurs, modify the pending packet in
the TX queue to avoid unnecessarily wasting hops.
@erayd erayd force-pushed the overheard-hoptimisation branch from a4cfa57 to 97751df Compare December 28, 2024 03:15
In order to ensure that the traceroute module works correctly, rather
than modifying the hop limnit of the existing queued version of the
packet, simply drop it ifrom the queue and reprocess the version of the
packet with the superior hop limit.
@erayd
Copy link
Contributor Author

erayd commented Dec 28, 2024

@GUVWAF What do you think of the new approach as a way of resolving the traceroute issue?

It seemed a bit dirty IMO to put exceptions in there for traceroute, so now it simply reprocesses the packet if it arrives again with a better hop limit while still in the TX queue (instead of modifying the hop limit of the existing queued version).

} else if (iface && p->hop_limit > 0) {
// If we overhear a duplicate copy of the packet with more hops left than the one we are waiting to
// rebroadcast, then remove the packet currently sitting in the TX queue and use this one instead.
if (iface->removePendingTXPacket(getFrom(p), p->id, p->hop_limit - 1)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: need to also catch packets that arrived initially with hop_limit of 0, and then subsequently with something higher.

As currently implemented, such packets will never hit the TX queue (because they weren't eligible for rebroadcast the first time around), and therefore will never be noticed as eligible for reprocessing.

@GUVWAF
Copy link
Member

GUVWAF commented Dec 28, 2024

@erayd Unfortunately this won't work as it will lead to duplicate packets in the apps and on MQTT as I mentioned above. Even if it's still in the Tx queue, it has already been sent to the phone here:

service->handleFromRadio(&mp);

and to MQTT here:
mqtt->onSend(*p, *p_decoded, chIndex);

@erayd
Copy link
Contributor Author

erayd commented Dec 28, 2024

@GUVWAF Roger, and thanks. Will keep thinking...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants