FD2 perf: pack unique runtime args #8602

Open
pgkeller opened this issue May 17, 2024 · 0 comments

@pgkeller
Contributor

pgkeller commented May 17, 2024

Presently, sending unique runtime args per core/processor requires the dispatcher to iterate over cores and processors, resulting in 64 cores * 3 processors = 192 NOC transactions. This dispatch is RISC-V limited, not data limited, and performs below expectations. To fix this:

  1. Reduce the max number of unique args from 256 to…32?
  2. Pack args on a row. That is, processors on a row will receive all the unique runtime args for all processors on that row. This should give a ~8x gain (one transaction per row of 8 cores rather than one per core). For 32 unique args, we'd send 8 * 32 * 4 = 1K bytes (8 cores * 32 args * 4 bytes), which is likely still RISC-V dominated (it takes just 32 NOC cycles).
    2a) We can experiment with packing by just 4x (half row), masking off the upper bit of noc_x to reduce data replication. This could be chosen dynamically based on the number of RTAs.
  3. Use a dynamic memory map for RTAs (this can be done now, before the ring buffer). This allows the B/N/T (BRISC/NCRISC/TRISC) args to be packed tightly and sent together as one NOC transaction, another ~3x gain for ~24x total.
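A back-of-the-envelope model of the numbers above might look like the sketch below. It assumes an 8x8 worker grid (8 cores per row, 64 total), three processors per core, 4-byte args and a 32-byte NOC word; the constants and function names are illustrative only, not taken from the dispatcher code.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants (assumptions, not from the dispatcher source).
constexpr uint32_t kRows = 8;
constexpr uint32_t kCoresPerRow = 8;
constexpr uint32_t kProcessors = 3;    // B/N/T
constexpr uint32_t kArgBytes = 4;
constexpr uint32_t kNocWordBytes = 32;

struct DispatchCost {
    uint32_t noc_transactions;       // writes the dispatcher must issue
    uint32_t bytes_per_transaction;  // payload of each write
};

// Today: one write per (core, processor) pair, each carrying only that
// processor's unique args.
constexpr DispatchCost per_core(uint32_t num_args) {
    return {kRows * kCoresPerRow * kProcessors, num_args * kArgBytes};
}

// Packed: every core in a group of `group_size` cores on a row receives the
// args of the whole group in one write. With pack_bnt, the three processors'
// args also travel together, so there is one write per group instead of three.
constexpr DispatchCost packed(uint32_t num_args, uint32_t group_size, bool pack_bnt) {
    const uint32_t groups = kRows * (kCoresPerRow / group_size);
    const uint32_t procs_per_write = pack_bnt ? kProcessors : 1;
    return {groups * (kProcessors / procs_per_write),
            group_size * procs_per_write * num_args * kArgBytes};
}

// NOC cycles to move one payload, rounding up to whole NOC words.
constexpr uint32_t noc_cycles(uint32_t bytes) {
    return (bytes + kNocWordBytes - 1) / kNocWordBytes;
}
```

For 32 args, `per_core(32)` gives 192 writes; `packed(32, 8, false)` gives 24 writes of 1024 bytes (32 NOC cycles each), the ~8x case; `packed(32, 4, false)` gives 48 writes of 512 bytes for the half-row variant; and `packed(32, 8, true)` gives 8 writes, the ~24x case.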

To do this, we need to:

  1. Move RTAs to a dynamic memory map, with the origin of the args specified in the launch message
  2. Change RTA indexing in the kernel (get_runtime_arg) to use the NOC X index to retrieve the right arg
  3. Change host-side fast dispatch to pack the args accordingly. The device side already has the capabilities to handle this
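The host packing and kernel-side lookup described in the steps above could be sketched as follows. All names here are hypothetical stand-ins (not the real `get_runtime_arg` signature or dispatch code); the sketch assumes every core in a group has the same arg count and that `x_index` is a zero-based logical column index, since physical NOC X coordinates may need translating first.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Host side (hypothetical helper): concatenate the unique RTAs of every core
// in one packed group, in NOC X order, into a single buffer that can be sent
// to the whole group as one write.
std::vector<uint32_t> pack_group_args(const std::vector<std::vector<uint32_t>>& args_per_core) {
    std::vector<uint32_t> packed;
    for (const auto& core_args : args_per_core) {
        packed.insert(packed.end(), core_args.begin(), core_args.end());
    }
    return packed;
}

// Device side (hypothetical stand-in for get_runtime_arg): rta_base is the
// origin of the args, which the issue proposes passing in the launch message;
// x_index is this core's slot within its packed group.
uint32_t get_packed_rta(const uint32_t* rta_base, uint32_t x_index,
                        uint32_t num_args_per_core, uint32_t arg_idx) {
    return rta_base[x_index * num_args_per_core + arg_idx];
}

// Half-row variant (2a): with a 3-bit logical column index (0..7), masking off
// the upper bit maps columns 0..3 and 4..7 onto the same 4 slots.
constexpr uint32_t half_row_slot(uint32_t x_index) {
    return x_index & 0x3u;
}
```

For a group packed from `{{10, 11}, {20, 21}, {30, 31}}`, `get_packed_rta(buf.data(), 2, 2, 1)` reads `31`, and `half_row_slot(5)` is `1`. Keeping the per-core stride fixed means the kernel needs only the base address and its own X slot to find its args.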