Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Having DPRINTs in kernels affects perf even when not enabled #8575

Closed
tt-aho opened this issue May 16, 2024 · 13 comments
Closed

Having DPRINTs in kernels affects perf even when not enabled #8575

tt-aho opened this issue May 16, 2024 · 13 comments
Assignees
Labels
bug Something isn't working P1_critical perf for issues tracking performance problems/improvements

Comments

@tt-aho
Copy link
Contributor

tt-aho commented May 16, 2024

Ex.

Running this test ./build/test/tt_metal/perf_microbenchmark/dispatch/test_pgm_dispatch -w 5000 -s 256 -x 7 -y 6 -a 128 I was seeing perf of ~22.8us without enabling dprint, but after commenting out the actual dprint statements in cq_dispatch.cpp, I get perf of ~21.57us.

This implies that even without enabling dprints, it negatively impacts perf to have dprint statements in kernels, when ideally dprints should be a nop if not enabled and not impact perf.

@tt-aho tt-aho added bug Something isn't working perf for issues tracking performance problems/improvements labels May 16, 2024
@pgkeller
Copy link
Contributor

@tt-dma I think we looked at this before, I thought there were some remnants when dprint was disabled but didn't think they'd affect perf. Do you recall? We need ot make dprint go away 100% when disabled

@tt-dma
Copy link
Contributor

tt-dma commented May 16, 2024

Due to the syntax it's not trivial to #define out like watcher, I think we're currently relying on the compiler to optimize the empty object/calls away, but maybe it's not doing that. I'll investigate.

@tt-dma
Copy link
Contributor

tt-dma commented May 16, 2024

@tt-aho I played around with the test you linked, strangely enough I get worse perf when I commented dprint statements out of cq_dispatch.cpp on my BM. However, I did dump the elf file and saw some references to DebutPrinter that were still in there even though DPRINT is not enabled for the test. Can you try this patch and let me know if that helps? It removes all references to the dprint struct for me in my testing.

diff --git a/tt_metal/hw/inc/debug/dprint.h b/tt_metal/hw/inc/debug/dprint.h
index b30043de6e..e04824269e 100644
--- a/tt_metal/hw/inc/debug/dprint.h
+++ b/tt_metal/hw/inc/debug/dprint.h
@@ -259,7 +259,10 @@ void debug_print(DebugPrinter &dp, DebugPrintData data) {
 }

 template<typename T>
+#if defined(DEBUG_PRINT_ENABLED) && !defined(PROFILE_KERNEL)
+// To get this function optimized away, let it be inlined when dprint disabled.
 __attribute__((__noinline__))
+#endif // ENABLE_DEBUG_PRINT && !PROFILE_KERNEL
 DebugPrinter operator <<(DebugPrinter dp, T val) {
 #if defined(DEBUG_PRINT_ENABLED) && !defined(PROFILE_KERNEL)
     DebugPrintData data{

@tt-aho
Copy link
Contributor Author

tt-aho commented May 16, 2024

@tt-aho I played around with the test you linked, strangely enough I get worse perf when I commented dprint statements out of cq_dispatch.cpp on my BM. However, I did dump the elf file and saw some references to DebutPrinter that were still in there even though DPRINT is not enabled for the test. Can you try this patch and let me know if that helps? It removes all references to the dprint struct for me in my testing.

diff --git a/tt_metal/hw/inc/debug/dprint.h b/tt_metal/hw/inc/debug/dprint.h
index b30043de6e..e04824269e 100644
--- a/tt_metal/hw/inc/debug/dprint.h
+++ b/tt_metal/hw/inc/debug/dprint.h
@@ -259,7 +259,10 @@ void debug_print(DebugPrinter &dp, DebugPrintData data) {
 }

 template<typename T>
+#if defined(DEBUG_PRINT_ENABLED) && !defined(PROFILE_KERNEL)
+// To get this function optimized away, let it be inlined when dprint disabled.
 __attribute__((__noinline__))
+#endif // ENABLE_DEBUG_PRINT && !PROFILE_KERNEL
 DebugPrinter operator <<(DebugPrinter dp, T val) {
 #if defined(DEBUG_PRINT_ENABLED) && !defined(PROFILE_KERNEL)
     DebugPrintData data{

This seems to make perf worse for me. I get ~23.23us after this change.
My results are on a T3K machine.

I'm confused about the comment in the diff, isn't this change removing the inline when dprint is disabled? Misread and thought it was inline

@tt-dma
Copy link
Contributor

tt-dma commented May 17, 2024

I saw a bunch of empty DebugPrinter operator<< functions in the dump that just returned, thought the noinline might be stopping them from getting optimized out, so the above patch removes it when dprints are disabled. If that doesn't get you the same result as when you commented out the dprints manually then it looks like the issue is with the empty DebugPrinter struct getting constructed and not optimized out.

Let me look into it some more to see if there's a good way to get it out of there without blowing up the syntax

@pgkeller
Copy link
Contributor

Wonder if we should change our dprint implementation to use variadic macros calling variadic templated fns, eg:
dprint(x, y, z);

@tt-dma
Copy link
Contributor

tt-dma commented May 17, 2024

@tt-aho Can you try this patch? It gets the same objdump/perf on my machine compared to commenting out the prints:

diff --git a/tt_metal/hw/inc/debug/dprint.h b/tt_metal/hw/inc/debug/dprint.h
index b30043de6e..c9a3538b34 100644
--- a/tt_metal/hw/inc/debug/dprint.h
+++ b/tt_metal/hw/inc/debug/dprint.h
@@ -36,7 +36,11 @@
 #include "dprint_buffer.h"
 #include "status.h"

+#if defined(DEBUG_PRINT_ENABLED)
 #define DPRINT DebugPrinter()
+#else
+#define DPRINT if(0) DebugPrinter()
+#endif

 #ifdef UCK_CHLKC_UNPACK
 #define DPRINT_UNPACK(x) x

Note that when I say commenting out the prints, I also mean the prints in tt_metal/hw/inc/debug/sanitize_noc.h:19.

@tt-dma
Copy link
Contributor

tt-dma commented May 17, 2024

@pgkeller Thoughts on that fix if it works? Looks a bit hacky but I do feel there's value to keeping the printing exactly the same as C++ ostreams.

@tt-aho
Copy link
Contributor Author

tt-aho commented May 18, 2024

I do see perf improvement from this change but not as much as commenting it out originally. Thoug after this change, then commenting out dprints in cq_dispatch doesnt make a difference?

Before change:
With DPRINTs: 22.8us
Commented out: 21.5us

After change:
With DPRINTS: 22.3us
commented out: 22.3us

@tt-dma
Copy link
Contributor

tt-dma commented May 20, 2024

Hey @tt-aho , I did some additional testing, and I think the additional perturbation we're seeing with my change is due to the dprints in cq_prefetch being compiled out now as well. On my machine (N300 BM):

Before change:
With DPRINTS: 28.40us
DPRINTS in cq_prefetch, cq_dispatch, noc_sanitize commented out: 28.60us
DPRINTS in just cq_dispatch, noc_sanitize commented out: 28.56us

After change:
With DPRINTS: 28.60us
DPRINTS in cq_prefetch, cq_dispatch, noc_sanitize commented out: 28.60us
DPRINTS in just cq_dispatch, noc_sanitize commented out: 28.60us

On my machine, compiling out the dprints actually slows things down? But the change makes things consistent, and I don't see any differences in the objdump between commenting out and disabling, so I think this is the way to go. Let me know if you see the same thing when commenting out DPRINTS in cq_prefetch, cq_dispatch, noc_sanitize without the change.

@tt-dma
Copy link
Contributor

tt-dma commented May 20, 2024

Going to merge the PR I have with the fix, let me know if we need to revert for any reason

@tt-aho
Copy link
Contributor Author

tt-aho commented May 20, 2024

I know I tried commenting out all the dprints from the fd2 kernels before (not in noc_sanitize) and didn't sees any difference compared to just cq_dispatch. I can retry with commenting out from noc_sanitize as well but I think with the merge this issue is probably resolved since it makes sure dprint does not perturb perf when not enabled. Thanks!

@tt-dma
Copy link
Contributor

tt-dma commented May 20, 2024

Ok sounds good, I'll close this - feel free to re-open if you see any differences going forwards

@tt-dma tt-dma closed this as completed May 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P1_critical perf for issues tracking performance problems/improvements
Projects
None yet
Development

No branches or pull requests

3 participants