Possible inaccuracies on (linux) bond interfaces #136

Open
imsnif opened this issue Jan 18, 2020 · 36 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed), needs reproduction

Comments

@imsnif
Owner

imsnif commented Jan 18, 2020

Someone reported this on twitter: https://twitter.com/LinuxReviews/status/1218547448928444418. I don't have a lot more details unfortunately.

imsnif added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Jan 18, 2020
@zhangxp1998
Collaborator

Not sure if this is due to averaging? If the traffic is bursty, our averaging calculation will make the numbers swing wildly. Looking at these screenshots, it seems we are always reporting a number lower than other tools, which is consistent with my hypothesis...
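
To put a number on that, here is a toy Rust sketch (the traffic figures are made up, and this is not bandwhich's actual code): a burst that arrives within a single second looks large to a per-second sampler, but small once it is averaged over a longer window.

// Toy illustration of the averaging hypothesis: one bursty second followed
// by four idle ones. All numbers are invented.
fn main() {
    let bytes_per_second = [5_000_000u64, 0, 0, 0, 0];
    let peak = *bytes_per_second.iter().max().unwrap();
    let average = bytes_per_second.iter().sum::<u64>() / bytes_per_second.len() as u64;
    // A per-second view shows the 5 MBps peak; averaging over the whole
    // 5-second window reports only 1 MBps.
    println!("peak: {} Bps, windowed average: {} Bps", peak, average);
}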

@imsnif
Owner Author

imsnif commented Jan 18, 2020

That was my guess too. The person on twitter wasn't willing to participate in an issue here, so I'll just leave this open in case someone else encounters this issue and is willing to help.

@amyspark

amyspark commented Mar 1, 2020

👋 This is on Manjaro (kernel: Linux Dianakko 5.4.22-1-MANJARO #1 SMP PREEMPT Mon Feb 24 09:01:11 UTC 2020 x86_64 GNU/Linux).

(screenshot: bmon and bandwhich running simultaneously in tmux)

@imsnif
Owner Author

imsnif commented Mar 1, 2020

Hey @amyspark, thanks for bringing this up. Would you be okay with helping us debug this a little?

To start, a few questions to help identify a possible cause for this:

  1. Does this happen all the time, or only in specific situations (e.g. with specific sorts of traffic)?
  2. Does it happen for sustained traffic (e.g. leaving the app running for 1-2 minutes)?
  3. As far as you can tell, are the differences in traffic more or less consistent, or are they all over the place?

@amyspark

amyspark commented Mar 5, 2020

@imsnif ,

  1. All the time, it always underestimates traffic, even though the connections are properly listed.
  2. As far as I've used it, yes.
  3. No; it always shows ~3KB/s for a usage of ~300KB/s, about two orders of magnitude less.

@imsnif
Owner Author

imsnif commented Mar 5, 2020

Thanks @amyspark. I want to troubleshoot this to find out which part of the app is misbehaving. If I give you something in the next few days (either a branch to compile or a compiled binary, whichever is more comfortable for you), would you be willing to run it? I essentially just want to cut out all parts of the app except the traffic sniffer and see if it reports the total traffic correctly on your system.

@amyspark

amyspark commented Mar 5, 2020

Sure @imsnif! Let me know which branch and I'll test it right away.

@imsnif
Owner Author

imsnif commented Mar 29, 2020

So, this came up again in #155, and I'd like to get back to finding the root cause. I still cannot reproduce this locally, but @TheLostLambda encountered this issue with large volumes of data.

My first guess is that this is somehow related to an issue with the IP payload length reporting in libpnet. I made a branch that measures the size of the Ethernet frame rather than the IP packet, to see if this direction might be promising.

@amyspark - I know it's been a while, my apologies for not getting to this, but if you'd be willing to check out the debug/inaccuracies branch of this repo and see if the problem still happens there, that would be great.

@TheLostLambda - if you'd like to give it a go as well, that would also be great.

In case anyone else would like to try as well, this is how I run bandwhich locally after I checked out the debug/inaccuracies branch:
cargo build && sudo ./target/debug/bandwhich

There will probably be a little more back and forth, as if this is not the issue, I have some other guesses and things I'd like to look into. Thanks in advance for bearing with me. :)
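
For anyone curious what the two measurements look like side by side, here is a rough sketch using the pnet datalink API (my own illustration, not the actual bandwhich source; needs root):

use pnet::datalink::{self, Channel};
use pnet::packet::ethernet::{EtherTypes, EthernetPacket};
use pnet::packet::ipv4::Ipv4Packet;
use pnet::packet::Packet;

fn main() {
    // Pick any non-loopback interface that is up.
    let interface = datalink::interfaces()
        .into_iter()
        .find(|iface| iface.is_up() && !iface.is_loopback())
        .expect("no suitable interface found");

    let (_tx, mut rx) = match datalink::channel(&interface, Default::default()) {
        Ok(Channel::Ethernet(tx, rx)) => (tx, rx),
        _ => panic!("unsupported channel type"),
    };

    loop {
        if let Ok(frame) = rx.next() {
            // What the debug branch counts: the raw Ethernet frame length.
            let frame_len = frame.len();
            if let Some(eth) = EthernetPacket::new(frame) {
                if eth.get_ethertype() == EtherTypes::Ipv4 {
                    if let Some(ip) = Ipv4Packet::new(eth.payload()) {
                        // The length the IP header itself claims to carry.
                        println!(
                            "frame: {} bytes, ip total_length: {} bytes",
                            frame_len,
                            ip.get_total_length()
                        );
                    }
                }
            }
        }
    }
}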

@TheLostLambda
Collaborator

Hi!

I've given this a go and, unfortunately, the same issue persists: bandwhich reports ~4MBps when I'm pulling ~25MBps.

Thanks so much for helping with the debugging!

@imsnif
Owner Author

imsnif commented Mar 29, 2020

Hrm... How about if you start bandwhich (in this branch or normally) with the -i <your interface> flag?

If that still happens, I'll update the debug branch to remove everything except the packet sniffer and some text that updates the total bandwidth on screen, peeling away as much as possible.

@TheLostLambda
Collaborator

No luck unfortunately, with either my wifi interface or a virtual Docker interface.

@imsnif
Owner Author

imsnif commented Mar 29, 2020

Hum... really more of a hunch than anything substantial, but how about if you kill the docker daemon, restart the interface and then try?

@TheLostLambda
Collaborator

A good call, but killing Docker and deleting the virtual interface didn't resolve the issue. The error does seem speed-dependent though: there is less error when the connection is slower, even for the same file size.

@imsnif
Owner Author

imsnif commented Mar 29, 2020

Ah well, we had to try. :)
I understand regarding the connection. In my mind I'm leaning toward some issue with acquiring the Arc lock. I'll have that minimal total-bandwidth implementation I mentioned above ready in the next few days; if the issue still happens there, it'll be considerably easier to isolate. Either way, we'll know more.
Thanks for the help! Will ping when I have something.
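
(To make the Arc-lock hunch concrete: the worry is roughly the situation in the sketch below, which is entirely hypothetical and not bandwhich's real structure. If a display thread holds a shared lock for long stretches, the sniffer thread blocks on lock() and can fall behind the traffic.)

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let total_bytes = Arc::new(Mutex::new(0u64));

    // "Display" thread: holds the lock while pretending to render the UI.
    let display = {
        let total_bytes = Arc::clone(&total_bytes);
        thread::spawn(move || {
            for _ in 0..5 {
                let guard = total_bytes.lock().unwrap();
                println!("total so far: {} bytes", *guard);
                thread::sleep(Duration::from_millis(200)); // lock stays held here
            }
        })
    };

    // "Sniffer" thread: every increment waits for the lock; in a real sniffer,
    // frames queued by the kernel during that wait could be dropped.
    let sniffer = {
        let total_bytes = Arc::clone(&total_bytes);
        thread::spawn(move || {
            for _ in 0..1_000 {
                let frame_len = 1_500u64; // stand-in for "read one frame off the wire"
                *total_bytes.lock().unwrap() += frame_len;
            }
        })
    };

    sniffer.join().unwrap();
    display.join().unwrap();
}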

@TheLostLambda
Collaborator

Sounds excellent! Thank you again for all of your hard work!

@amyspark

@imsnif in my testing, upload now matches bmon's output. Download is constantly 0 (which is obviously wrong).

@imsnif
Owner Author

imsnif commented Mar 30, 2020

Hum @amyspark, thanks for sticking around - that's quite odd!

Alright, I updated the debug branch with a commit that comments out most of the app, leaving just the network sniffer threads and the keyboard input (so that quitting with ctrl-c or q would be possible).

This now shows nothing on the screen and, upon quitting, dumps the totals and bandwidth to the terminal. The bandwidth it shows is an average per second over everything that happened since the app was started.

Could you two please give this a try and see if the reporting is more on the mark for you now?
Thanks!!
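
(For reference, the summary printed on exit amounts to the sketch below: totals accumulated since startup divided by the elapsed wall-clock time. The function and numbers here are illustrative, not the actual code.)

use std::time::Instant;

fn report(interface: &str, total_down: u64, total_up: u64, started: Instant) {
    // Avoid dividing by ~0 right after startup.
    let secs = started.elapsed().as_secs_f64().max(1.0);
    println!("{} total downloaded {} bytes", interface, total_down);
    println!("{} total uploaded {} bytes", interface, total_up);
    println!("{}: average download per second {:.2}KBps", interface, total_down as f64 / secs / 1024.0);
    println!("{}: average upload per second {:.2}KBps", interface, total_up as f64 / secs / 1024.0);
}

fn main() {
    let started = Instant::now();
    // ...sniff for a while, accumulating totals...
    report("wlo1", 10_000_000, 250_000, started);
}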

@TheLostLambda
Collaborator

Unfortunately it's still acting up... I've been downloading a 1GB file with this command:

wget -O /dev/null http://ipv4.download.thinkbroadband.com/1GB.zip

And wget reports around 20MBps, but this is what I got from bandwhich:

wlo1 total downloaded 203327143 bytes
wlo1 total uploaded 2943626 bytes
wlo1: average download per second 2.99MBps
wlo1: average upload per second 43.29KBps
lo total downloaded 0 bytes
lo total uploaded 344 bytes
lo: average download per second 0Bps
lo: average upload per second 5Bps
docker0 total downloaded 0 bytes
docker0 total uploaded 0 bytes
docker0: average download per second 0Bps
docker0: average upload per second 0Bps
br-9a93123f3720 total downloaded 0 bytes
br-9a93123f3720 total uploaded 0 bytes
br-9a93123f3720: average download per second 0Bps
br-9a93123f3720: average upload per second 0Bps

So the rate is still a bit slow and the total is unfortunately 800MB off :(

Let me know if there is any more testing I can help with!

@imsnif
Owner Author

imsnif commented Mar 30, 2020

Aha! This is actually good news, because it rules out most of the app. :) The problem is probably somewhere in the network sniffer.

So, I updated the branch with a version that doesn't even parse the packets, but just counts the raw Ethernet frames on the wire. This mixes upload and download together, so you'll only see download (which is the sum of both), but since we're so far off in your cases, I think we'd still be able to tell whether this made a difference.

Could you please check? (also, preferably with the -i <your interface> option - just to rule everything out)
Thanks again for helping out!

@TheLostLambda
Collaborator

Thanks for all of the continued help!

I've given things a go with the same wget command as last time (with around the same speed from wget) and get this back from bandwhich -i wlo1:

wlo1 total downloaded 208174418 bytes
wlo1 total uploaded 0 bytes
wlo1: average download per second 3.41MBps
wlo1: average upload per second 0Bps

Still off unfortunately, but it's good that we are narrowing things down a little!

@amyspark

@imsnif Tested it with Netflix's fast.com and it now matches my expected bandwidth (~400KB/s).

@TheLostLambda
Collaborator

For completeness, I've run my test again with fast.com: it reports 180Mbps (so 22.5MBps), while I'm only getting 4.31MBps from bandwhich.

Limiting my speed to 400KBps, I still only get 113KBps from bandwhich

@imsnif
Owner Author

imsnif commented Mar 30, 2020

Alright, we're getting somewhere. Seems like we have two different issues here.

@amyspark - I created a new branch called debug/inaccuracies2; it includes the first iteration of this test, which I was too quick to assume would not work in your case. Could you give it a shot?

@TheLostLambda - I made another attempt in debug/inaccuracies: it removes the read_timeout configuration from the datalink channel. Quitting the app might be a little slower :) Could you try it?

@TheLostLambda
Collaborator

Mostly the same sort of results in my branch unfortunately.

@imsnif
Owner Author

imsnif commented Mar 30, 2020

Thanks @TheLostLambda - I have some more ideas; I'll shoot some changes your way in the next few days, if I haven't tired you out already. :)

@TheLostLambda
Collaborator

No worries! I'm grateful for all of the help!

@TheLostLambda
Collaborator

I got led on a wild ride, but I think I squashed this bug in #157 (more details in that PR).

Please let me know if the master_migrate_to_pcap branch of this fork fixes the issue for you as well :)

Additionally, out of curiosity, could both @imsnif and @amyspark post the output of this command:

ethtool -k <interface> | grep offload

where <interface> is your network interface, e.g. wlo1 or eth0.

@amyspark

@TheLostLambda -- your branch is still wildly off my current bandwidth usage.
As for the ethtool output:

> ethtool -k wlp4s0 | grep offload
tcp-segmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]

@TheLostLambda
Collaborator

Hmm, you're definitely running the pcap branch?

Could you try running (as root):

# ethtool -K wlp4s0 tx off rx off gso off tso off gro off lro off

And checking if the usage is correct then?

It would also be good if you could send the output of ethtool -k wlp4s0 | grep offload again after running the first command.

@imsnif
Owner Author

imsnif commented Mar 31, 2020

Hah! That's really interesting. Great work tracking this down @TheLostLambda!!
This is my output:

tcp-segmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]

I suspect @amyspark's issue comes from somewhere else, given that counting the raw bytes was accurate for them. I would love to check it further.

Tbh, I would be okay with keeping the backend as-is and documenting disabling offloading as a troubleshooting step for bandwhich. I suspect most people would not mind this as a solution, and if someone comes along who does, we can consider shipping this fix. I'm just a little wary of the bugs such a deep infrastructure change can introduce. What do you think?

@TheLostLambda
Collaborator

Hmm, I don't know. I really think it would be better for Bandwhich to handle offloading correctly out of the box, as many other applications seem to do. Additionally, the offloading features are there to increase network performance, so toggling them off does have an adverse effect on high-speed networking.

Personally, I trust pcap to be reliable, as it is the backend for Wireshark, tcpdump, and other very widespread programs. I can certainly understand the hesitancy to swap out the backend, but it does solve my problem on every one of the machines I've tested so far (currently three), and I suspect the number of people running into this problem will only increase as NICs become more advanced.

I'd still like to see the swap to pcap make it into master, but I suppose I can always maintain a fork if need be.
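
For what it's worth, one property of libpcap that helps here is that it reports both the captured length and the original on-wire length of every packet, so oversized frames produced by GRO/GSO can still be counted in full. A rough sketch with the Rust pcap crate (the interface name is a placeholder, the API is from a recent crate version, and this is an illustration rather than necessarily what #157 does):

use pcap::Capture;

fn main() -> Result<(), pcap::Error> {
    // "wlo1" is a placeholder interface name; run as root or with CAP_NET_RAW.
    let mut cap = Capture::from_device("wlo1")?
        .snaplen(65_535)
        .immediate_mode(true)
        .open()?;

    let mut total: u64 = 0;
    while let Ok(packet) = cap.next_packet() {
        // header.caplen = bytes actually captured (possibly truncated),
        // header.len   = bytes that were on the wire.
        total += u64::from(packet.header.len);
        println!(
            "captured {} of {} bytes (running total: {})",
            packet.header.caplen, packet.header.len, total
        );
    }
    Ok(())
}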

@TheLostLambda
Collaborator

I've got a new, less disruptive fix in #158 :)

@imsnif
Owner Author

imsnif commented Mar 31, 2020

Hey @amyspark - I would be curious to see if @TheLostLambda's fix addresses your issue as well. I just merged it to master (haven't released it yet). If you have the time to check, that would be great. Otherwise I'd be happy to get to the bottom of the issue you're experiencing as well. Thanks!

@amyspark

@TheLostLambda about the pcap issue, upload rate looks OK with the offload disabled. This is the ethtool output:

bandwhich on  master is 📦 v0.12.0 via 🦀 v1.43.0-beta.3 took 47s 
❯ sudo ethtool -K wlp4s0 tx off rx off gso off tso off gro off lro off
Cannot change rx-checksumming
Cannot change tx-checksumming
Cannot change tcp-segmentation-offload
Cannot change large-receive-offload

bandwhich on  master is 📦 v0.12.0 via 🦀 v1.43.0-beta.3 
❯ ethtool -k wlp4s0 | grep offload
tcp-segmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]

@imsnif -- your branch looks OK with @TheLostLambda's ethtool fix. Download rate is still 0 here, but upload is correct now.

@imsnif
Owner Author

imsnif commented Apr 5, 2020

@amyspark - we just released a new version (0.13.0) - could you try with it and see if it works for you?

@amyspark

amyspark commented Apr 5, 2020

@imsnif - no, 0.13.0 consistently underestimates current traffic (by half or more). Download rate is still locked at 0.
