
network utilization #58

Open
zhengpeirong opened this issue May 18, 2024 · 3 comments

Comments

@zhengpeirong

zhengpeirong commented May 18, 2024

Let's calculate the transfer time theoretically.

llama3 8B

The original experiment data is here.
Since the transfer is full-duplex, there's no interference between uplink and downlink.
So, we can take the larger of the two, 510 kB, as the transfer data volume to calculate the transfer time.

$$\frac{510{,}000 \times 8\ \text{bit}}{1\ \text{Gbps}} = 4.08\ \text{ms}$$ $$\frac{4.08\ \text{ms}}{199.60\ \text{ms}} \approx 2\%$$

So, the average transfer time should be 4.08 ms. However, your measured result is 199.60 ms, roughly 50 times higher.
So, the network utilization ratio is merely 2%.
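
For reference, a minimal Python sketch of the same calculation (the helper name `utilization` is mine; the numbers are the ones quoted above):

```python
def utilization(payload_bytes, bandwidth_bps, measured_ms):
    """Ideal wire time for one token's payload vs. the measured transfer time."""
    theoretical_ms = payload_bytes * 8 / bandwidth_bps * 1000
    return theoretical_ms, theoretical_ms / measured_ms

# Llama 3 8B, 2 devices: 510 kB per token over 1 Gbps, 199.60 ms measured
t, u = utilization(510_000, 1e9, 199.60)
print(f"theoretical {t:.2f} ms, utilization {u:.1%}")  # ~4.08 ms, ~2%
```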

llama2 7B

For comparison, I summarize the results for a similar model (llama2 7B) on different devices:
[image]

VMs

In this discussion, the network bandwidth is 20 Gbps (reference here).
[image]

$$\frac{590{,}000 \times 8\ \text{bit}}{20\ \text{Gbps}} = 0.236\ \text{ms}$$ $$\frac{0.236\ \text{ms}}{7.62\ \text{ms}} \approx 3\%$$

So, the network utilization ratio is merely 3%.
Similarly, we can calculate the result of 4 VMs to be 6%.
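
Plugging the VM numbers into the same hypothetical helper from the sketch above:

```python
# 2 VMs: 590 kB per token over 20 Gbps, 7.62 ms measured transfer time
t, u = utilization(590_000, 20e9, 7.62)
print(f"theoretical {t:.3f} ms, utilization {u:.1%}")  # ~0.236 ms, ~3%
```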

Raspberry Pi

[image]

Also, the results for the Raspberry Pi cluster work out to 9.0%, 48.0%, and 14.1% for 2, 4, and 8 devices respectively.

  • llama2 13B
    23.9%, 25.75%, 9.8%
  • llama2 70B
    8.5%

Summary

I think the network utilization, averaging around 11% and ranging from 2% to 48%, is under-optimized.
Further development of the code could ensure stable and high network utilization.

Originally posted by @zhengpeirong in #41 (reply in thread)

@b4rtaz
Owner

b4rtaz commented May 18, 2024

The test that you are referring to is very unlucky. Two Raspberry Pi 5 devices + a cheap switch achieve 80.11 ms / token. I think you underestimate the impact of the setup quality on the result. In Google Cloud I achieved 8.56 ms / token (Llama 7B Q40), but the network there is probably faster than Gigabit Ethernet.

The next thing is the transfer characteristics: a node calculates the result of its own slice, then synchronizes the result. So it looks like this:

[image]

So most of the time there is no transfer at all, and then suddenly some data to transfer appears. In this case the latency has a huge impact on the final transfer time.
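
To illustrate that point, here is a rough back-of-the-envelope model (mine, not the project's code; the 0.5 ms round-trip latency is an assumption for a cheap switch plus the kernel network stack): when the payload per synchronization is small, the fixed latency dominates the total transfer time.

```python
# Toy model: per-synchronization time = fixed latency + payload / bandwidth.
# latency_ms = 0.5 is an assumed value, not a measured one.
def sync_time_ms(payload_bytes, bandwidth_bps=1e9, latency_ms=0.5):
    wire_ms = payload_bytes * 8 / bandwidth_bps * 1000
    return latency_ms + wire_ms

# A 16 kB burst: ~0.13 ms on the wire, but ~0.63 ms in total -> ~20% utilization
print(sync_time_ms(16_000))
```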

@zhengpeirong
Author

zhengpeirong commented May 19, 2024

> The test that you are referring to is very unlucky. Two Raspberry Pi 5 devices + a cheap switch achieve 80.11 ms / token.

80.11 ms means 15% network utilization, which is not high.

> I think you underestimate the impact of the setup quality on the result. In Google Cloud I achieved 8.56 ms / token (Llama 7B Q40), but the network there is probably faster than Gigabit Ethernet.

I have already included this factor: the network bandwidth is 20 Gbps, and the result is calculated in the issue content above.

> The next thing is the transfer characteristics: a node calculates the result of its own slice, then synchronizes the result. So it looks like this:
> [image]
> So most of the time there is no transfer at all, and then suddenly some data to transfer appears. In this case the latency has a huge impact on the final transfer time.

This seems simplified, because the transfer process is actually composed of two steps: the transfer time should begin when the source node starts sending, and it should end when the destination node finishes receiving. So the statistics, collected only on the root node, deviate from the actual value.
[image]

Even if we build the analysis on these statistics, the network bandwidth should still be enough: 1 Gbps >> hidden_dim * 8 bits!

@zhengpeirong
Author

I am using `tcpdump` to capture the actual packet rate and analyze the real bandwidth utilization. However, I only have 4 Raspberry Pis, so I may need your help to run the 8-device experiment, which usually shows the highest latency, and to share the logs. Thank you.
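
A minimal sketch of how such a capture could be post-processed (assuming scapy is installed; the capture file name is a placeholder):

```python
from scapy.all import rdpcap  # pip install scapy

packets = rdpcap("dllama.pcap")  # placeholder: a tcpdump capture of the worker traffic
if packets:
    total_bytes = sum(len(p) for p in packets)
    duration_s = float(packets[-1].time - packets[0].time)
    if duration_s > 0:
        print(f"average throughput: {total_bytes * 8 / duration_s / 1e6:.1f} Mbps")
```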
