Break apart session processor and the running of each session into se… #6382

brandonrising · 2024-05-16T17:39:37Z

…parate classes

Summary

Breaks up the session processor, which is currently scary to update, into separate components. It also provides callback passthroughs that let you define tasks to run before and after each session, as well as before and after each node.

QA Instructions

Run invocations of various types and batch sizes. Compare output images of same inputs against those generated on current main. Validate logging/stats are formatted correctly.
Intentionally run malformed graphs and nodes which will error out to ensure error reporting is still correct.

Merge Plan

I will merge once the PR has been thoroughly reviewed, tested, and approved on multiple OS/environments.

Checklist

The PR has a short but descriptive title, suitable for a changelog
Tests added / updated (if applicable)
Documentation added / updated (if applicable)

lstein · 2024-05-18T02:58:45Z

@brandonrising Do you mind taking a look at the multi-GPU support PR #5997 to estimate how difficult it would be to integrate that into what is done here? That PR has done a small amount of refactoring to break the big scary loop into smaller pieces.

invokeai/app/services/session_processor/session_processor_default.py

psychedelicious · 2024-05-22T11:31:33Z

Notes from my changes:

Rearranged the session callbacks to be in the session runner. IMO this makes sense, all the session and “below” logic is in the session runner. All session callbacks are provided on init to the runner.
Added a node error callback to session runner.
Added a nonfatal processor error callback, provided on init to the processor.
Consolidated some of the callback logic. The main loops are now very clean.
Made the callbacks lists and cleaned up some naming. This lets us better separate concerns when we want to do multiple things during the queue item lifecycle - we don’t have to smoosh all logic into a single callback per injection point.
Fixed an issue that would prevent the profiler from ever profiling anything.
Created protocols for each callback. This lets them take kwargs instead of positional args, which is a bit safer.
Some test callbacks are in dependencies.py for smoke testing
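
To illustrate the protocol-based callback lists described in the notes above, here is a minimal sketch. The names `OnBeforeRunNode`, `SessionRunnerSketch`, and the `node_id` / `queue_item_id` parameters are hypothetical and chosen for illustration only; the actual protocols and signatures in this PR may differ.

```python
from typing import List, Optional, Protocol


class OnBeforeRunNode(Protocol):
    """Hypothetical callback protocol; keyword-only args guard against mixed-up positions."""

    def __call__(self, *, node_id: str, queue_item_id: int) -> None: ...


class SessionRunnerSketch:
    """Toy runner (not the real SessionRunner) that takes a list of callbacks on init."""

    def __init__(self, on_before_run_node_callbacks: Optional[List[OnBeforeRunNode]] = None) -> None:
        self._on_before_run_node_callbacks = on_before_run_node_callbacks or []

    def run_node(self, node_id: str, queue_item_id: int) -> None:
        # Each registered callback handles one concern and is called with kwargs only.
        for callback in self._on_before_run_node_callbacks:
            callback(node_id=node_id, queue_item_id=queue_item_id)


calls: List[str] = []


def log_node(*, node_id: str, queue_item_id: int) -> None:
    calls.append(f"log:{node_id}")


def count_node(*, node_id: str, queue_item_id: int) -> None:
    calls.append(f"count:{queue_item_id}")


runner = SessionRunnerSketch(on_before_run_node_callbacks=[log_node, count_node])
runner.run_node(node_id="abc", queue_item_id=1)
```

Because each injection point holds a list, a second concern can be added as another function rather than by growing one callback.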

I resolved a longstanding issue where pydantic validation errors for nodes were handled as processor-level errors, instead of node-level errors.

@brandonrising had almost fixed it in the first iteration of this PR, but errors were being set on the previous node, when they should be on the current node.

To resolve this, some error handling was added to the session.next() method that prepares the nodes. This allows us to get the node in its state just before it failed validation and mark the current node as failed. A new NodeInputError represents this failure.

This makes our error reporting more accurate. It also is a better UX - the error messages are very clear.
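
The idea can be sketched as follows. This is a stdlib-only approximation, not the code from this PR: `ResizeNode`, `set_input`, and `prepare_node` are hypothetical stand-ins, and a plain `ValueError` stands in for the pydantic validation error that the real node models would raise.

```python
from typing import Dict


class ResizeNode:
    """Stand-in for a pydantic node model; width must be greater than zero."""

    def __init__(self, id: str) -> None:
        self.id = id
        self.width = 512

    def set_input(self, name: str, value: int) -> None:
        # Assumption: a ValueError here plays the role of a pydantic validation error.
        if name == "width" and value <= 0:
            raise ValueError("width must be greater than 0")
        setattr(self, name, value)


class NodeInputError(Exception):
    """Carries the node (as it was just before the failure) and the failing input name."""

    def __init__(self, node: ResizeNode, field_name: str) -> None:
        super().__init__(f"Node {node.id} has invalid incoming input for {field_name}")
        self.node = node
        self.field_name = field_name


def prepare_node(node: ResizeNode, inputs: Dict[str, int]) -> ResizeNode:
    for name, value in inputs.items():
        try:
            node.set_input(name, value)
        except ValueError as e:
            # Re-raise with the node attached so the caller can mark this node as failed.
            raise NodeInputError(node, name) from e
    return node


try:
    prepare_node(ResizeNode("node-1"), {"width": 0})
except NodeInputError as e:
    failed_node_id = e.node.id
    message = str(e)
```

With the node attached to the exception, the caller knows which node to mark as errored instead of failing the whole queue item.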

psychedelicious · 2024-05-22T11:35:07Z

Here's how I smoke tested:

Do normal generations, single and multiple iterations
Cancel individual generations and the whole queue
Create a workflow with a single divide integer node, leave it at defaults and invoke. Confirm the error is handled as expected.
Create a workflow with a single integer primitive @ 0, connected to the width of a resize image node. This creates a NodeInputError. It should be handled as expected.
Set profile_graphs: true in the config. It should write out the profiles.

- Add `OnNodeError` and `OnNonFatalProcessorError` callbacks
- Move all session/node callbacks to `SessionRunner` - this ensures we dump perf stats before resetting them and generally makes sense to me
- Remove `complete` event from `SessionRunner`, it's essentially the same as `OnAfterRunSession`
- Remove extraneous `next_invocation` block, which would treat a processor error as a node error
- Simplify loops
- Add some callbacks for testing, to be removed before merge

- Use protocol to define callbacks, this allows them to have kwargs
- Shuffle the profiler around a bit
- Move `thread_limit` and `polling_interval` to `__init__`; `start` is called programmatically and will never get these args in practice

We were not handling node preparation errors as node errors before. Here's the explanation, copied from a comment that is no longer required:

---

TODO(psyche): Sessions only support errors on nodes, not on the session itself. When an error occurs outside node execution, it bubbles up to the processor where it is treated as a queue item error.

Nodes are pydantic models. When we prepare a node in `session.next()`, we set its inputs. This can cause a pydantic validation error. For example, consider a resize image node which has a constraint on its `width` input field - it must be greater than zero. During preparation, if the width is set to zero, pydantic will raise a validation error.

When this happens, it breaks the flow before `invocation` is set. We can't set an error on the invocation because we didn't get far enough to get it - we don't know its id. Hence, we just set it as a queue item error.

---

This change wraps the node preparation step with exception handling. A new `NodeInputError` exception is raised when there is a validation error. This error has the node (in the state it was in just prior to the error) and an identifier of the input that failed.

This allows us to mark the node that failed preparation as errored, correctly making such errors _node_ errors and not _processor_ errors. It's much easier to diagnose these situations. The error messages look like this:

> Node b5ac87c6-0678-4b8c-96b9-d215aee12175 has invalid incoming input for height

Some of the exception handling logic is cleaned up.

…_traceback` to `session_queue` table

- Add handling for new error columns `error_type`, `error_message`, `error_traceback`.
- Update queue item model to include the new data. The `error_traceback` field has an alias of `error` for backwards compatibility.
- Add `fail_queue_item` method. This was previously handled by `cancel_queue_item`. Splitting this functionality makes failing a queue item a bit more explicit. We also don't need to handle multiple optional error args.
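
As a rough illustration of how the three error fields might be populated from an exception, here is a stdlib-only sketch. `QueueItemSketch` and this `fail_queue_item` function are hypothetical and do not reproduce the real queue item model or service method; the `error` property only mimics the backwards-compatible alias described above.

```python
import traceback
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueItemSketch:
    """Hypothetical queue item holding the enriched error fields."""

    item_id: int
    status: str = "in_progress"
    error_type: Optional[str] = None
    error_message: Optional[str] = None
    error_traceback: Optional[str] = None

    @property
    def error(self) -> Optional[str]:
        # Stand-in for the `error` alias kept for backwards compatibility.
        return self.error_traceback


def fail_queue_item(item: QueueItemSketch, exc: BaseException) -> QueueItemSketch:
    """Mark the item as failed and record the type, message, and traceback."""
    item.status = "failed"
    item.error_type = type(exc).__name__  # the class name, not the class object
    item.error_message = str(exc)
    item.error_traceback = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return item


item = QueueItemSketch(item_id=1)
try:
    raise ValueError("boom")
except ValueError as e:
    fail_queue_item(item, e)
```

Keeping failure separate from cancellation means the failing path always receives all three error values, so none of them has to be an optional argument.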

There's a race condition where a canceled session may emit a progress event or two after it's been canceled, and the progress image isn't cleared out. To resolve this, the system slice tracks canceled session ids. When a progress event comes in, we check the cancellations and skip setting the progress if canceled.
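
The cancellation check can be sketched like this. The real fix lives in the frontend system slice; the Python below is only an illustration of the logic, and `SystemStateSketch` with its method names is hypothetical.

```python
from typing import Optional, Set


class SystemStateSketch:
    """Hypothetical stand-in for the UI system slice."""

    def __init__(self) -> None:
        self.canceled_session_ids: Set[str] = set()
        self.progress_step: Optional[int] = None

    def on_session_canceled(self, session_id: str) -> None:
        # Remember the canceled session and clear any progress already shown.
        self.canceled_session_ids.add(session_id)
        self.progress_step = None

    def on_progress(self, session_id: str, step: int) -> None:
        # Late progress events for a canceled session are ignored.
        if session_id in self.canceled_session_ids:
            return
        self.progress_step = step


state = SystemStateSketch()
state.on_progress("s1", 3)
state.on_session_canceled("s1")
state.on_progress("s1", 4)  # arrives after cancellation, so it is skipped
progress_after_late_event = state.progress_step
```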

…string

…etting reported

I had set the cancel event at some point during troubleshooting an unrelated issue. It seemed logical that it should be set there, and didn't seem to break anything. However, this is not correct.

The cancel event should not be set in response to a queue status change event. Doing so can cause a race condition when nodes are executed very quickly. It's possible that a previously-executed session's queue item status change event is handled after the next session starts executing. The cancel event is set and the session runner sees it, aborting the session run early.

In hindsight, it doesn't make sense to set the cancel event here either. It should be set in response to user action, e.g. the user cancelled the session or cleared the queue (which implicitly cancels the current session). These events actually trigger the queue item status changed event, so if we set the cancel event here, we'd be setting it twice per cancellation.
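
A small sketch of the corrected behavior, under stated assumptions: `ProcessorSketch` and its methods are hypothetical, the `after_each_node` hook exists only to simulate an event arriving mid-run, and clearing the event at the start of a run is an assumption rather than something this PR states.

```python
import threading
from typing import Callable, List, Optional


class ProcessorSketch:
    """Hypothetical processor; only user actions set the cancel event."""

    def __init__(self) -> None:
        self.cancel_event = threading.Event()
        self.last_status: Optional[str] = None

    def on_user_canceled(self) -> None:
        # User canceled the item or cleared the queue.
        self.cancel_event.set()

    def on_queue_item_status_changed(self, status: str) -> None:
        # Deliberately does not touch cancel_event: a late status event from a
        # previous session must not abort the session that is running now.
        self.last_status = status

    def run_session(self, node_ids: List[str], after_each_node: Callable[[], None]) -> List[str]:
        self.cancel_event.clear()  # assumption: each run starts with a cleared event
        executed: List[str] = []
        for node_id in node_ids:
            if self.cancel_event.is_set():
                break
            executed.append(node_id)
            after_each_node()  # simulates an event being handled between nodes
        return executed


processor = ProcessorSketch()
late_status_run = processor.run_session(
    ["a", "b", "c"], lambda: processor.on_queue_item_status_changed("completed")
)
user_cancel_run = processor.run_session(["a", "b", "c"], processor.on_user_canceled)
```

In this sketch a stale status change leaves the run untouched, while a user cancellation stops it after the node that was in flight.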

Show error toasts on queue item error events instead of invocation error events. This allows errors that occurred outside node execution to be surfaced to the user. The error description component is updated to show the new error message if available. Commercial handling is retained, but local now uses the same component to display the error message itself.

psychedelicious · 2024-05-24T09:07:43Z

I've updated a few things in this PR:

error_type, error_message and error_traceback are new queue item attributes. They are added via DB migration.
Add handling for these new error attributes in the processor
Use the new error message attributes in the error toast
Trigger error toast on queue item failure instead of invocation error, this lets us surface errors that occurred outside the node
Fix a race condition where nodes were not marked as having an error (this was introduced earlier in this PR by me)

Working on this has underscored the need for a way to test the processor (and new session runner). I'm not familiar enough with mocking to have a clear idea of how we can do this, but I think it's feasible - especially now that things are modularized.

brandonrising requested review from blessedcoolant, psychedelicious and hipsterusername as code owners May 16, 2024 17:39

github-actions bot added the python and services labels May 16, 2024

psychedelicious requested changes May 18, 2024

psychedelicious force-pushed the separate-session-runner-into-separate-class branch from 4fac432 to 17a3a04 Compare May 22, 2024 08:29

github-actions bot added the api label May 22, 2024

brandonrising and others added 15 commits May 24, 2024 09:17

Break apart session processor and the running of each session into se…

e51a302

…parate classes

Run ruff

82957bb

Fix next node calling logic

8edc25d

feat(app): iterate on processor split 2

f7c356d

feat(app): make things in session runner private

cb8e9e1

feat(app): support multiple processor lifecycle callbacks

cef1585

tidy(app): rearrange proccessor

eff3596

tidy(app): "outputs" -> "output"

b1f819a

docs(app): explain why errors are handled poorly

d30c1ad

fix(app): fix logging of error classes instead of class names

80905ff

feat(processor): get user/project from queue item w/ fallback

23b0534

chore: ruff

a55b2f0

fix(processor): restore missing update of session

7652fbc

psychedelicious force-pushed the separate-session-runner-into-separate-class branch from e194fc4 to 7652fbc Compare May 23, 2024 23:26

psychedelicious added 3 commits May 24, 2024 09:28

feat(db): add error_type, error_message, rename error -> `error…

0e81e7b

…_traceback` to `session_queue` table

feat(events): add enriched errors to events

6a34176

psychedelicious added 13 commits May 24, 2024 09:30

feat(processor): update enriched errors & fail_queue_item()

db0ef8d

feat(app): update test event callbacks

19227fe

chore(ui): typegen

9a4c167

feat(ui): handle enriched events

6063487

docs(processor): update docstrings, comments

a98dded

chore: ruff

7d1844e

tidy(queue): delete unused delete_queue_item method

c88de18

tidy(processor): remove test callbacks

169b75b

fix(processor): fix race condition related to clearing the queue

350feee

feat(processor): add debug log stmts to session running callbacks

fb93e68

tidy(ui): remove extraneous condition in socketInvocationError

08a42c3

fix(ui): correctly fallback to error message when traceback is empty …

dc78a0e

…string

psychedelicious self-requested a review May 24, 2024 02:22

psychedelicious requested a review from maryhipp as a code owner May 24, 2024 02:27

github-actions bot added the frontend label May 24, 2024

tidy: remove unnecessary whitespace changes

65e85d1

psychedelicious approved these changes May 24, 2024

psychedelicious added 2 commits May 24, 2024 18:29

psychedelicious merged commit f5a775a into main May 24, 2024
14 checks passed

psychedelicious deleted the separate-session-runner-into-separate-class branch May 24, 2024 10:02
