Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Break apart session processor and the running of each session into se… #6382

Merged
merged 34 commits into from
May 24, 2024

Conversation

brandonrising
Copy link
Collaborator

…parate classes

Summary

Breaks up the currently scary to update session processor into separate components and provides callback function passthroughs which allow you to define tasks be run before/after sessions, as well as before/after each node.

QA Instructions

Run invocations of various types and batch sizes. Compare output images of same inputs against those generated on current main. Validate logging/stats are formatted correctly.
Intentionally run malformed graphs and nodes which will error out to ensure error reporting is still correct.

Merge Plan

I will merge once the PR has been thoroughly reviewed, tested, and approved on multiple OS/environments.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)

@github-actions github-actions bot added python PRs that change python files services PRs that change app services labels May 16, 2024
@lstein
Copy link
Collaborator

lstein commented May 18, 2024

@brandonrising Do you mind taking a look at the multi-GPU support PR #5997 to estimate how difficult it would be to integrate that into what is done here? That PR has done a small amount of refactoring to break the big scary loop into smaller pieces.

@psychedelicious psychedelicious force-pushed the separate-session-runner-into-separate-class branch from 4fac432 to 17a3a04 Compare May 22, 2024 08:29
@github-actions github-actions bot added the api label May 22, 2024
@psychedelicious
Copy link
Collaborator

psychedelicious commented May 22, 2024

Notes from my changes:

  • Rearranged the session callbacks to be in the session runner. IMO this makes sense, all the session and “below” logic is in the session runner. All session callbacks are provided on init to the runner.
  • Added a node error callback to session runner.
  • Added a nonfatal processor error callback, provided on init to the processor.
  • Consolidated some of the callback logic. The main loops are now very clean.
  • Made the callbacks lists and cleaned up a some naming. This lets us better separate concerns when we want to do multiple things during the queue item lifecycle - we don’t have to smoosh all logic in a single callback per injection point.
  • Fixed an issue that would prevent the profiler from ever profiling anything.
  • Created protocols for each callback. This lets them take kwargs instead of positional args, which is a bit safer.
  • Some test callbacks are in dependencies.py for smoke testing

I resolved a longstanding issue where pydantic validation errors for nodes were handled as processor-level errors, instead of node-level errors.

@brandonrising had almost fixed it in the first iteration of this PR, but errors were being set on the previous node, when they should be on the current node.

To resolve this, some error handling was added to the session.next() method that prepares the nodes. This allows us to get the node in its state just before it failed validation and mark the current node as failed. An new NodeInputError represents this failure.

This makes our error reporting more accurate. It also is a better UX - the error messages are very clear.

@psychedelicious
Copy link
Collaborator

Here's how I smoke tested:

  • Do normal generations, single and multiple iterations
  • Cancel individual generations and the whole queue
  • Create a workflow with a single divide integer node, leave it at defaults and invoke. Confirm the error is handled as expected.
  • Create a workflow with a single integer primitive @ 0, connected to the width of a resize image node. This creates a NodeInputError. It should be handled as expected.
  • Set profile_graphs: true in the config. It should write out the profiles.

brandonrising and others added 15 commits May 24, 2024 09:17
- Add `OnNodeError` and `OnNonFatalProcessorError` callbacks
- Move all session/node callbacks to `SessionRunner` - this ensures we dump perf stats before resetting them and generally makes sense to me
- Remove `complete` event from `SessionRunner`, it's essentially the same as `OnAfterRunSession`
- Remove extraneous `next_invocation` block, which would treat a processor error as a node error
- Simplify loops
- Add some callbacks for testing, to be removed before merge
- Use protocol to define callbacks, this allows them to have kwargs
- Shuffle the profiler around a bit
- Move `thread_limit` and `polling_interval` to `__init__`; `start` is called programmatically and will never get these args in practice
We were not handling node preparation errors as node errors before. Here's the explanation, copied from a comment that is no longer required:

---

TODO(psyche): Sessions only support errors on nodes, not on the session itself. When an error occurs outside
node execution, it bubbles up to the processor where it is treated as a queue item error.

Nodes are pydantic models. When we prepare a node in `session.next()`, we set its inputs. This can cause a
pydantic validation error. For example, consider a resize image node which has a constraint on its `width`
input field - it must be greater than zero. During preparation, if the width is set to zero, pydantic will
raise a validation error.

When this happens, it breaks the flow before `invocation` is set. We can't set an error on the invocation
because we didn't get far enough to get it - we don't know its id. Hence, we just set it as a queue item error.

---

This change wraps the node preparation step with exception handling. A new `NodeInputError` exception is raised when there is a validation error. This error has the node (in the state it was in just prior to the error) and an identifier of the input that failed.

This allows us to mark the node that failed preparation as errored, correctly making such errors _node_ errors and not _processor_ errors. It's much easier to diagnose these situations. The error messages look like this:

> Node b5ac87c6-0678-4b8c-96b9-d215aee12175 has invalid incoming input for height

Some of the exception handling logic is cleaned up.
@psychedelicious psychedelicious force-pushed the separate-session-runner-into-separate-class branch from e194fc4 to 7652fbc Compare May 23, 2024 23:26
- Add handling for new error columns `error_type`, `error_message`, `error_traceback`.
- Update queue item model to include the new data. The `error_traceback` field has an alias of `error` for backwards compatibility.
- Add `fail_queue_item` method. This was previously handled by `cancel_queue_item`. Splitting this functionality makes failing a queue item a bit more explicit. We also don't need to handle multiple optional error args.
-
@psychedelicious psychedelicious self-requested a review May 24, 2024 02:22
@github-actions github-actions bot added the frontend PRs that change frontend files label May 24, 2024
…etting reported

I had set the cancel event at some point during troubleshooting an unrelated issue. It seemed logical that it should be set there, and didn't seem to break anything. However, this is not correct.

The cancel event should not be set in response to a queue status change event. Doing so can cause a race condition when nodes are executed very quickly.

It's possible that a previously-executed session's queue item status change event is handled after the next session starts executing. The cancel event is set and the session runner sees it aborting the session run early.

In hindsight, it doesn't make sense to set the cancel event here either. It should be set in response to user action, e.g. the user cancelled the session or cleared the queue (which implicitly cancels the current session). These events actually trigger the queue item status changed event, so if we set the cancel event here, we'd be setting it twice per cancellation.
Show error toasts on queue item error events instead of invocation error events. This allows errors that occurred outside node execution to be surfaced to the user.

The error description component is updated to show the new error message if available. Commercial handling is retained, but local now uses the same component to display the error message itself.
@psychedelicious
Copy link
Collaborator

I've updated a few things in this PR:

  • error_type, error_message and error_traceback are new queue item attributes. They are added via DB migration.
  • Add handling for these new error attributes in the processor
  • Use the new error message attributes in the error toast
  • Trigger error toast on queue item failure instead of invocation error, this lets us surface errors that occurred outside the node
  • Fix a race condition where nodes were not marked as having an error (this was introduced earlier in this PR by me)

Working on this has underscored the need for a way to test the processor (and new session runner). I'm not familiar enough with mocking to have a clear idea of how we can do this, but I think it's feasible - especially now that things are modularized.

@psychedelicious psychedelicious merged commit f5a775a into main May 24, 2024
14 checks passed
@psychedelicious psychedelicious deleted the separate-session-runner-into-separate-class branch May 24, 2024 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api frontend PRs that change frontend files python PRs that change python files services PRs that change app services
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants