Skip to content

feat(transport): add bandwidth tracking (send only) by space (#157) #205

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

anchalshivank
Copy link
Contributor

Implemented tracking of bytes sent per space in the transport layer by wrapping implementations of the Transport trait. This enables basic observability of outbound traffic for each space.

Currently, only sent data is tracked; receiving metrics can be added in a follow-up.

Related to: #157

…#157)

Introduced BandwidthStats and BandwidthTracker to record bytes sent and received per space and optionally per module. This lays the groundwork for future bandwidth limiting capabilities. Currently, the tracking is focused on outgoing and incoming traffic and supports querying stats via the tracker interface.

This addresses the reporting aspect of bandwidth limiting discussed in holochain#157.

feat(bandwidth): track bandwidth usage by space and module (holochain#157)

Introduced BandwidthStats and BandwidthTracker to record bytes sent and received per space and optionally per module. This lays the groundwork for future bandwidth limiting capabilities. Currently, the tracking is focused on outgoing and incoming traffic and supports querying stats via the tracker interface.

This addresses the reporting aspect of bandwidth limiting discussed in holochain#157.
@anchalshivank anchalshivank force-pushed the 157-bandwidth-limiting branch from 22bfb79 to a7c2f63 Compare May 6, 2025 17:04
Copy link
Member

@ThetaSinner ThetaSinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I really appreciate the quality documentation on the new types. We've made an effort to try to keep up with documentation on this project and it's great if we can keep that going :)

A couple of really minor comments. It'd also be good to take the .idea files back out of Git please. I'm a RustRover user myself, but we try to keep IDE files out of Git because we all use different tools on this project :)

@ThetaSinner ThetaSinner requested a review from a team May 6, 2025 17:19
anchalshivank and others added 4 commits May 6, 2025 22:52
@anchalshivank anchalshivank force-pushed the 157-bandwidth-limiting branch from 364225f to 1e73276 Compare May 6, 2025 17:50
Copy link
Contributor

@jost-s jost-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code reads very clear. Nevertheless please add a test for this feature. If you want guidance how to do that, give us a shout. There are unit tests for most features in this repo, so you can have a look at them for inspiration.

Copy link
Contributor

@cdunster cdunster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Thanks!

I particularly like the detailed and clear comments. I added a couple of suggestions.

@anchalshivank
Copy link
Contributor Author

I haven't added tests for this module because the behavior is often non-deterministic. For example, if I wait for 5 seconds, the node may gossip during that time, leading to inconsistent test results. See the relevant code here: https://github.com/anchalshivank/kitsune2/blob/bc941a0f3f15b42c478fa052ca11cf4484c17b2e/crates/kitsune2/src/lib.rs#L263. Suggestions for handling this kind of testing would be appreciated.

@jost-s
Copy link
Contributor

jost-s commented May 7, 2025

I haven't added tests for this module because the behavior is often non-deterministic. For example, if I wait for 5 seconds, the node may gossip during that time, leading to inconsistent test results. See the relevant code here: https://github.com/anchalshivank/kitsune2/blob/bc941a0f3f15b42c478fa052ca11cf4484c17b2e/crates/kitsune2/src/lib.rs#L263. Suggestions for handling this kind of testing would be appreciated.

Sure I'll make suggestions. This kind of thing is inherent to this system, where you can't guarantee that something has happened. Our go to tool for this circumstance is to set a timeout and inside of the timeout repeatedly query what should be asserted until it succeeds. If time runs out, the timeout must be longer, until it is stable on CI. Or something can be improved in the test logic. Or it can also happen that a field must be exposed just for testing purposes.

There's a macro for this kind of repeated check https://github.com/holochain/kitsune2/blob/main/crates/test_utils/src/lib.rs#L42

Here's an example of this macro in use https://github.com/holochain/kitsune2/blob/main/crates/core/src/factories/core_fetch/test/incoming_request_queue.rs#L134

Instead of the sleeps (e. g. https://github.com/anchalshivank/kitsune2/blob/bc941a0f3f15b42c478fa052ca11cf4484c17b2e/crates/kitsune2/src/lib.rs#L316), use the iter_check macro, and that should make the test predicable.

@ThetaSinner
Copy link
Member

I like Jost's suggestion for this, and to prove it's working, it's not necessary to assert a specific value I think. An iter_check! that asserts "sent > 500" and "received > 500" does demonstrate that the value is increasing as expected when nodes talk to each other.

I've chosen the value 500 at random there, not necessarily useful. It's just about something that is "> 0" really.

- Updated `get_space_stats` to accept `Option<SpaceId>`:
  * Returns per-space stats if `Some(SpaceId)`
  * Returns aggregated stats if `None`
- Added module-level tests:
  * `track_sent_bytes_per_module_two_peer`
  * `track_sent_bytes_per_module_no_peer`
@anchalshivank
Copy link
Contributor Author

Made minor changes to some of the previous test cases that were failing due to timeout issues. The application code remains unchanged. I investigated the failures but couldn't identify the exact cause, so I increased the timeout duration in the tests as a workaround.

- Merged fields from upstream Kistune p2p source (f65c4ec) and existing attributes
- Updated `Default` implementation accordingly
- Add fields from upstream Kistune p2p source (f65c4ec) to `BandwidthConfig`
@anchalshivank
Copy link
Contributor Author

@ThetaSinner @jost-s @cdunster I’ve added a code snippet for the Bandwidth module. I believe the way I’ve integrated it into the existing code is appropriate. Once everyone agrees on this approach, I’ll proceed with implementing the logic for configuring bandwidth.

Now that the transport layer includes both the tracker and config parameters, monitoring and control can be handled more effectively at this level.

Initially, I considered integrating the bandwidth parameters into the k2GossipModule, but I decided it would be cleaner to create a separate struct instead, without modifying the existing code.

@@ -101,7 +106,8 @@ impl TxImpHnd {
data,
..
} = data;

let len = data.len() as u64;
println!("Bandwidth config {:?}", self.config);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these println!'s just for debugging and to be removed later? We typically use the tracing crate for printing/logging messages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added them for debugging purpose. As this is not the final code

@@ -118,7 +118,7 @@ impl K2GossipFunctionalTestHarness {

/// Wait for the given ops to be in our op store.
pub async fn wait_for_ops(&self, op_ids: Vec<OpId>) -> Vec<MemoryOp> {
tokio::time::timeout(Duration::from_millis(500), {
tokio::time::timeout(Duration::from_millis(5000), {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this a deliberate change or just for testing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test cases for the 500 status code were failing, so I increased the timeout. However, I'm unable to debug the root cause, as I haven't made any changes to the relevant part of the code.

Comment on lines +184 to +185
//If I comment it test passes else fails
// enable_tracing();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the CI fail or when running it locally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yups

@jost-s
Copy link
Contributor

jost-s commented May 14, 2025

Is this ready for review again, @anchalshivank?

@anchalshivank
Copy link
Contributor Author

Is this ready for review again, @anchalshivank?

Nope, I will make some changes by the end of the day.

@jost-s
Copy link
Contributor

jost-s commented May 28, 2025

@anchalshivank Are you willing to finish this PR?

@anchalshivank
Copy link
Contributor Author

@anchalshivank Are you willing to finish this PR?

Yes I will finish it by tomorrow, I have been very busy these days

@jost-s
Copy link
Contributor

jost-s commented Jun 2, 2025

@anchalshivank I see you haven't finished this yet. This is just an observation, you don't need to explain yourself or make a promise. I'm sure there is always enough to do for you. =)

Are you okay with us completing this PR?

@anchalshivank
Copy link
Contributor Author

I need more time . If anyone else wants to complete this pr, you can assign it to them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants