feat: UI structured representation #495

Draft · wants to merge 17 commits into base: main
Conversation

@LaPetiteSouris (Contributor) commented Sep 14, 2023

What kind of change does this PR introduce?

Introduce the DOMElement class, which fully represents a screen state regardless of OS type. DOMElement is the atomic unit for representing any UI and serves as the lowest level of representation. It is the adapter that "translates" the window state into a uniform representational format.

Summary

Per the RFC on UI Representation, the job of predicting a single action to take on a single screen can be decomposed into the following steps:

  1. Represent the screen into UI Tree
  2. Translate UI Tree into the universal prompt format
  3. Inference with LLMs model using the translated prompt
  4. Translate LLMs models' inference into valid action
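The four steps above can be sketched as a small pipeline. This is a minimal, hypothetical illustration: the names (`UITree`, `predict_action`, the action-string format) are assumptions for the sketch, not the actual OpenAdapt API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UITree:
    """Illustrative stand-in for the UI Tree built from a window state."""
    elements: List[str]

    @classmethod
    def from_window_state(cls, window_state: dict) -> "UITree":
        # Step 1: represent the screen as a UI Tree (flattened here for brevity)
        return cls(elements=window_state.get("elements", []))

    def to_prompt(self) -> str:
        # Step 2: translate the UI Tree into the universal prompt format
        return "UI: " + ", ".join(self.elements)


def predict_action(window_state: dict, llm: Callable[[str], str]) -> dict:
    ui_tree = UITree.from_window_state(window_state)  # 1. screen -> UI Tree
    prompt = ui_tree.to_prompt()                      # 2. UI Tree -> prompt
    completion = llm(prompt)                          # 3. LLM inference
    # 4. translate the completion into a valid action (trivial parser here)
    verb, _, target = completion.partition(" ")
    return {"action": verb, "target": target}


# Usage with a stubbed "LLM" that always answers the same way:
fake_llm = lambda prompt: "click Button:Submit"
action = predict_action({"elements": ["Button:Submit", "TextArea:Name"]}, fake_llm)
print(action)  # {'action': 'click', 'target': 'Button:Submit'}
```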

In the reverse direction, the evaluation process used in model training/tuning/RLHF can be decomposed into several steps.

Given a (window state, action taken) reference pair, and a (current window state, predicted action) inference pair:

  1. Represent the reference window state as a UI Tree.
  2. Based on the action, find the element (usually an actionable component such as a Button, Text Area, Link, etc.) that was interacted with.
  3. Represent the current window state as a UI Tree.
  4. Find the actionable elements in the UI Tree that the predicted action interacts with.
  5. Compare the reference elements and the predicted elements.
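The evaluation steps above can be sketched as follows. This is a hedged illustration: the flat-list tree shape, the `target_id` action field, and the comparison by element id are assumptions made for the sketch, not the project's actual data model.

```python
from typing import List, Optional

# Roles treated as actionable, per the description above (illustrative set)
ACTIONABLE = {"Button", "TextArea", "Link"}


def find_target(ui_tree: List[dict], action: dict) -> Optional[dict]:
    """Steps 2 and 4: find the actionable element an action interacted with."""
    for el in ui_tree:
        if el["role"] in ACTIONABLE and el["id"] == action["target_id"]:
            return el
    return None


def evaluate(ref_tree, ref_action, cur_tree, predicted_action) -> bool:
    ref_el = find_target(ref_tree, ref_action)         # steps 1-2
    pred_el = find_target(cur_tree, predicted_action)  # steps 3-4
    # Step 5: compare reference and predicted elements (here: by id)
    return (ref_el is not None and pred_el is not None
            and ref_el["id"] == pred_el["id"])


tree = [{"role": "Button", "id": "submit"}, {"role": "Text", "id": "title"}]
print(evaluate(tree, {"target_id": "submit"}, tree, {"target_id": "submit"}))  # True
print(evaluate(tree, {"target_id": "submit"}, tree, {"target_id": "title"}))   # False
```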

This DOMElement is the low-level mapping of everything into a single unified UI representation, which can later be translated into prompt language, regardless of UI state or OS type.
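A minimal sketch of what such an OS-agnostic atomic unit could look like. The fields and the `to_prompt_fragment` helper are hypothetical, chosen only to illustrate the idea of a tree of elements that serializes into prompt language; consult the PR diff for the real class.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DOMElement:
    """Hypothetical shape of the OS-agnostic atomic UI unit described above."""
    role: str                                  # e.g. "Button", "TextArea", "Link"
    name: Optional[str] = None                 # accessible label, if any
    bounds: Optional[Tuple[int, int, int, int]] = None  # (x, y, width, height)
    children: List["DOMElement"] = field(default_factory=list)

    def to_prompt_fragment(self, depth: int = 0) -> str:
        """Serialize the subtree into a prompt-friendly indented outline."""
        line = "  " * depth + f"{self.role}({self.name or ''})"
        return "\n".join(
            [line] + [c.to_prompt_fragment(depth + 1) for c in self.children]
        )


window = DOMElement("Window", "Settings", children=[DOMElement("Button", "Save")])
print(window.to_prompt_fragment())
# Window(Settings)
#   Button(Save)
```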

Checklist

  • [x] My code follows the style guidelines of OpenAdapt
  • [x] I have performed a self-review of my code
  • [x] If applicable, I have added tests to prove my fix is functional/effective
  • [ ] I have linted my code locally prior to submission
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation (e.g. README.md, requirements.txt)
  • [x] New and existing unit tests pass locally with my changes

How can your code be run and tested?

Other information

Next step:

  1. Define a uniform/base prompt structure, i.e. to have a class such as:

class PromptGenerator:
    """Generates the prediction pipeline. The structure of the prompt and the
    prompt pipeline is TBD.
    """

griptape AI is a very good candidate for this.

With the implementation of CompletionProvider: in general, a CompletionProvider should take the generated generic prompt (or the generated pipeline) and run it against a specific provider.
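One way to express that provider-agnostic contract is a structural interface. This is a sketch under assumptions: the `complete` method name and the `EchoProvider` stand-in are illustrative, not the project's actual CompletionProvider API.

```python
from typing import Protocol


class CompletionProvider(Protocol):
    """Hypothetical interface: any backend accepts the generic prompt and
    returns a completion, hiding provider-specific details from callers."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Trivial stand-in for a real backend (e.g. an OpenAI- or griptape-based one)."""

    def complete(self, prompt: str) -> str:
        return f"completion for: {prompt}"


def run(provider: CompletionProvider, prompt: str) -> str:
    # The caller depends only on the generic interface, not on the backend.
    return provider.complete(prompt)


print(run(EchoProvider(), "click the Save button"))
# completion for: click the Save button
```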

  2. Write a UITranslator class, which takes a DOMElement and translates it into an operational prompt. This is the "secret glue" that translates the representation of the UI into LLM language.
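The "secret glue" could look roughly like the sketch below. Everything here is an assumption for illustration: the prompt template, the dict-based element shape, and the rendering format are placeholders, since the PR leaves UITranslator as future work.

```python
class UITranslator:
    """Hypothetical translator: turns a DOMElement-like tree into an
    operational prompt for the LLM. Names and format are illustrative."""

    TEMPLATE = (
        "You are controlling a GUI. The current screen is:\n{ui}\n"
        "Reply with the single next action to take."
    )

    def translate(self, element: dict) -> str:
        return self.TEMPLATE.format(ui=self._render(element, 0))

    def _render(self, element: dict, depth: int) -> str:
        # Render one element per line, indenting children under their parent.
        line = "  " * depth + f"<{element['role']} name={element.get('name', '')!r}>"
        children = [self._render(c, depth + 1) for c in element.get("children", [])]
        return "\n".join([line] + children)


prompt = UITranslator().translate(
    {"role": "Window", "name": "Login",
     "children": [{"role": "Button", "name": "OK"}]}
)
print(prompt.splitlines()[1])  # <Window name='Login'>
```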

NOTE The main effort is to unify and standardize the way the UI is interacted with first, then to unify and standardize the way the models are interacted with, i.e. to translate and represent the UI in the LLMs' language.

[Screenshot: 2023-09-14 at 09:33:56]

@LaPetiteSouris LaPetiteSouris marked this pull request as draft September 14, 2023 12:35