Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maximally parallelize dbt clone #10129

Merged
merged 6 commits into from
May 23, 2024
Merged

Conversation

MichelleArk
Copy link
Contributor

@MichelleArk MichelleArk commented May 10, 2024

resolves #7914

Problem

For GraphRunnableTasks such as the CloneTask, the run queue always enforced topological order of dependencies during execution, even when the task does not strictly require it.

Solution

Make it possible for GraphRunnableTask subclasses to provide a PRESERVE_EDGES class attribute that directs the graph queue generation to remove edges from the graph prior to constructing the priority queue. The reason a class attribute was chosen was because this does not need to be user-configurable and should be a constant queue mechanism for all invocations of the clone command.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX
  • This PR includes type annotations for new and modified functions

@cla-bot cla-bot bot added the cla:yes label May 10, 2024
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

Copy link

codecov bot commented May 10, 2024

Codecov Report

Attention: Patch coverage is 94.44444% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 88.68%. Comparing base (ecf9436) to head (8ed1a42).
Report is 15 commits behind head on main.

Files Patch % Lines
core/dbt/task/runnable.py 90.90% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10129      +/-   ##
==========================================
+ Coverage   88.19%   88.68%   +0.49%     
==========================================
  Files         181      180       -1     
  Lines       22786    22446     -340     
==========================================
- Hits        20096    19907     -189     
+ Misses       2690     2539     -151     
Flag Coverage Δ
integration 85.96% <94.44%> (+0.50%) ⬆️
unit 63.35% <66.66%> (+0.67%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@MichelleArk
Copy link
Contributor Author

MichelleArk commented May 22, 2024

I've done some local 🎩 benchmarking of dbt clone against bigquery (since it supports zero-copy cloning) to compare performance on this branch vs main.

🎩 Setup:

dbt run --target prod
mkdir state
mv target/manifest.json state/

On a project with a chain of 5 models (i.e. 1 -> 2 -> 3 -> 4 -> 5):

Recording average execution time of dbt clone --state state --full-refresh --threads 5 across 5 runs:

  • main: 15.38s, 34% cpu utilization
  • this branch: 10.17s, 48% cpu utilization

On a project with a chain of 10 models:

Recording average execution time of dbt clone --state state --full-refresh --threads 10 across 5 runs:

  • main: 20.41s, 20% cpu utilization
  • this branch: 9.02s, 57% cpu utilization

@MichelleArk MichelleArk marked this pull request as ready for review May 22, 2024 22:21
@MichelleArk MichelleArk requested a review from a team as a code owner May 22, 2024 22:21
Copy link
Contributor

@gshank gshank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice that this was fairly simple. Would like just an additional comment as noted on the line.

core/dbt/graph/queue.py Show resolved Hide resolved
@MichelleArk MichelleArk merged commit fb10bb4 into main May 23, 2024
63 checks passed
@MichelleArk MichelleArk deleted the graph-runnable-task-no-edges branch May 23, 2024 15:06
Copy link
Contributor

@ChenyuLInx ChenyuLInx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Late to the party and sorry about being nitpicking here.
Let's discuss a bit as this is probably going to be used as example for next set of tests

self.forced_exception_class = exception_class
self.did_cancel: bool = False
super().__init__(args=MockArgs(), config=MockConfig(), manifest=None)
self.manifest = make_manifest(nodes=nodes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like we are reimplementing the task logic in the test again. I view unit test as documenting behavior in this case, so the only thing we need to test at the runnable task is: are we using the correct arguments used when calling get_graph_queue?

@@ -40,13 +63,25 @@ def _cancel_connections(self, pool):

def get_node_selector(self):
"""This is an `abstract_method` on `GraphRunnableTask`, thus we must implement it"""
return None
selector = ResourceTypeSelector(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto


def defer_to_manifest(self, adapter, selected_uids: AbstractSet[str]):
"""This is an `abstract_method` on `GraphRunnableTask`, thus we must implement it"""
return None


class MockRunnableTaskIndependent(MockRunnableTask):
def get_run_mode(self) -> GraphRunnableMode:
return GraphRunnableMode.Independent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are using inheritance where we should be using patch

@@ -387,3 +388,17 @@ def replace_config(n, **kwargs):
config=n.config.replace(**kwargs),
unrendered_config=dict_replace(n.unrendered_config, **kwargs),
)


def make_manifest(nodes=[], sources=[], macros=[], docs=[]) -> Manifest:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is duplicating

here, thoughts on how can we make that one easier to use/more visible?

@ChenyuLInx ChenyuLInx mentioned this pull request Jun 4, 2024
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-2723] [spike+] Maximally parallelize dbt clone operations, a different mechanism for processing a queue
3 participants