Follow up: Means of Evaluation and Instruction Automatic Generation #93

Open
wants to merge 14 commits into
base: main

Contributor

@clean99 clean99 commented Nov 24, 2023

Following up on #70, we now have a potential way to evaluate the results produced by the system. We can also generate instructions automatically via vision comparison.

Our system already has vision comparison, but it is coupled with the rest of the generation process, so we cannot take full advantage of it:

  1. Evaluation: We need a way to evaluate the results our system generates. Vision comparison is well suited to this.
  2. Instruction generation: Today, users compare the original and result images themselves, describe the differences to GPT, and ask it to update the code. With vision comparison we can generate these instructions automatically, so users spend less effort finding differences and typing.

Current flow:

comparison flow

Update flow:

comparison flow new

We will add a “Generate Instruction” button, which inserts the “Auto Generate Instruction” and Eval steps into the flow.

System design

Frontend

  1. Add Generate Instruction button:
new ui

It should call instructionGenerate when the user clicks it.

  2. instructionGenerate
// Must be async because it awaits the screenshot.
async function instructionGenerate() {
  // Capture the currently rendered result to compare against the reference.
  const resultImage = await takeScreenshot();
  const originalImage = referenceImages[0];
  doGenerateInstruction({
    generationType: "update",
    image: originalImage,
    resultImage: resultImage,
  });
}

function doGenerateInstruction(params: InstructionGenerationParams) {
  setAppState(AppState.INSTRUCTION_GENERATING);

  // Merge settings with params (settings win on key conflicts)
  const updatedParams = { ...params, ...settings };

  generateInstruction(
    wsRef,
    updatedParams,
    // Append each streamed token to the instruction text
    (token) => setUpdateInstruction((prev) => prev + token),
    // Replace the text with the final result
    (code) => setUpdateInstruction(code),
    () => setAppState(AppState.CODE_READY)
  );
}
  3. AppState.INSTRUCTION_GENERATING

All buttons on the panel should be disabled and show a loading state while appState is AppState.INSTRUCTION_GENERATING.
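
One way to derive the button state is a small pure helper that maps the app state to `disabled` and `loading` flags. This is only a sketch: the `AppState` members other than `INSTRUCTION_GENERATING` and `CODE_READY`, and the helper name `panelButtonState`, are assumptions for illustration, not the project's actual definitions.

```typescript
// Hypothetical enum: only INSTRUCTION_GENERATING and CODE_READY appear in this PR;
// the other members are assumed for illustration.
enum AppState {
  INITIAL = "INITIAL",
  CODING = "CODING",
  CODE_READY = "CODE_READY",
  INSTRUCTION_GENERATING = "INSTRUCTION_GENERATING",
}

interface PanelButtonState {
  disabled: boolean;
  loading: boolean;
}

// Hypothetical helper: every button on the panel shares the same flags,
// so they are computed once from the app state.
function panelButtonState(appState: AppState): PanelButtonState {
  const busy = appState === AppState.INSTRUCTION_GENERATING;
  return { disabled: busy, loading: busy };
}
```

Each button could then spread the returned flags into its props, which keeps the rule in one place instead of repeating the state check per button.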

Backend

  1. prompt
You are a Frontend Vision Comparison expert,
You are required to compare two website screenshots: the first one is the original site and the second one is a redesigned version.
Your task is to identify differences in elements and their css, focusing on layout, style, and structure.
Do not consider the content(text, placeholder) of the elements, only the elements themselves
Analyze the screenshots considering these categories:

Lack of Elements: Identify any element present in the original but missing in the redesign.
Redundant Elements: Spot elements in the redesign that were not in the original.
Wrong Element Properties: Note discrepancies in element properties like size, color, font, and layout.

Provide a clear conclusion as a list, specifying the element, the mistake, and its location.
In ambiguous cases, suggest a manual review.
Remember, this comparison is not pixel-by-pixel, but at a higher, more conceptual level.

Return only the JSON array in this format:
[
  {
    "element": "name, text, etc.",
    "mistake": "wrong color, wrong size, etc.(strictly use css properties to describe)",
    "improvement": "use #xxx color, use width: xxx px, etc.",
    "location": "header"
  }
]
Do not include markdown "```" or "```JSON" at the start or end.
  2. api
generate-instruction
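
Since the prompt asks for a bare JSON array, whatever consumes the reply probably needs to parse it defensively. The sketch below shows one possible approach, written in TypeScript to match the frontend snippets above; where this parsing actually lives is an assumption, and `parseMistakes` and `formatInstruction` are hypothetical names, not part of this PR.

```typescript
// Shape of one entry in the JSON array requested by the prompt.
interface Mistake {
  element: string;
  mistake: string;
  improvement: string;
  location: string;
}

const MISTAKE_KEYS = ["element", "mistake", "improvement", "location"] as const;

// Hypothetical helper: parse the raw model reply into Mistake entries.
// A model may wrap the array in a markdown fence despite the prompt,
// so a leading and trailing fence is stripped before parsing.
function parseMistakes(raw: string): Mistake[] {
  const fence = "`".repeat(3);
  let text = raw.trim();
  if (text.startsWith(fence)) {
    // Drop the opening fence line (including any language tag).
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
    // Drop the closing fence if present.
    if (text.trimEnd().endsWith(fence)) {
      text = text.trimEnd().slice(0, -fence.length);
    }
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch (_err) {
    return []; // Unparseable reply: treat it as "no instructions".
  }
  if (!Array.isArray(parsed)) return [];
  // Keep only entries where all four expected fields are strings.
  return parsed.filter(
    (item): item is Mistake =>
      typeof item === "object" &&
      item !== null &&
      MISTAKE_KEYS.every(
        (key) => typeof (item as Record<string, unknown>)[key] === "string"
      )
  );
}

// Hypothetical helper: turn the entries into text for the update-instruction box.
function formatInstruction(mistakes: Mistake[]): string {
  return mistakes
    .map((m) => `- ${m.element} (${m.location}): ${m.mistake}. Fix: ${m.improvement}`)
    .join("\n");
}
```

Returning an empty list on malformed output is a design choice for this sketch; surfacing an error to the user instead may be preferable for an experimental feature.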

Eval

Add a count of the mistakes GPT identified in the previous result; this gives the user a rough sense of how well our system is performing.
I will explore more rigorous evaluation and follow up.

mistake
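
As a rough illustration of the count, the reported mistakes could be tallied in total and per location. This is a sketch under assumptions: it presumes the comparison output has already been parsed into entries with the fields from the prompt's JSON format, and `summarizeMistakes` is a hypothetical name.

```typescript
// One entry from the comparison output, using the fields in the prompt's JSON format.
interface Mistake {
  element: string;
  mistake: string;
  improvement: string;
  location: string;
}

interface MistakeSummary {
  total: number;
  byLocation: Record<string, number>;
}

// Hypothetical helper: total number of reported mistakes plus a per-location breakdown.
function summarizeMistakes(mistakes: Mistake[]): MistakeSummary {
  const byLocation: Record<string, number> = {};
  for (const m of mistakes) {
    byLocation[m.location] = (byLocation[m.location] ?? 0) + 1;
  }
  return { total: mistakes.length, byLocation };
}
```

A total that falls across successive update rounds would suggest the result is converging on the reference, though the count is only as reliable as the comparison itself.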

@clean99 clean99 marked this pull request as draft November 24, 2023 02:26
Contributor

sweep-ai bot commented Nov 24, 2023

Apply Sweep Rules to your PR?

  • Apply: All new business logic should have corresponding unit tests.
  • Apply: Refactor large functions to be more modular.
  • Apply: Add docstrings to all functions and file headers.

@clean99 clean99 marked this pull request as ready for review November 24, 2023 03:22
Owner

abi commented Nov 24, 2023

When I messed around with ChatGPT, it hallucinated a lot when it did a visual comparison. I'm curious if your prompt works well. How are the visual comparison results? Are they accurate?

Thanks for exploring this method of improvement.

Contributor Author

clean99 commented Nov 24, 2023

> When I messed around with ChatGPT, it hallucinated a lot when it did a visual comparison. I'm curious if your prompt works well. How are the visual comparison results? Are they accurate?
>
> Thanks for exploring this method of improvement.

It is not bad once it is limited to mistakes in CSS properties. I believe there is room to improve, though, and this will be an experimental feature: the user can skip it or edit the result it generates. I'd like to put it here, and hopefully better prompts will be contributed in the future.

@clean99 clean99 changed the title from "[WIP]Follow up: Means of Evaluation and Instruction Automatic Generation" to "Follow up: Means of Evaluation and Instruction Automatic Generation" Nov 24, 2023
Owner

abi commented Dec 1, 2023

Thank you for this and sorry I'm slow to review it. Will get it in tomorrow.

Owner

abi commented Dec 4, 2023

Finally found some time to try out this PR.

The primary issue I have with merging this in is that I think the quality of the outputs is not good. Here's an example:

Original
Screenshot 2023-11-29 at 2 56 02 PM
Result
Screenshot 2023-12-03 at 6 00 25 PM
Text
Screenshot 2023-12-03 at 6 00 21 PM

As you can see, it gets the colors wrong most obviously. Says there's a background video. Not sure what's going on.

Other than that,

  • Make "Generate instruction" smaller
  • Need to fix textarea so that it's not too big but expands based on text in it perhaps.

Fundamentally, the goal of the user is to make the generated result more like the screenshot through repeated loops of "generate instruction" -> "update" but unfortunately, I don't know if GPT4 vision and this approach work well together.

Would love to hear thoughts on how this can be improved, and your experiences with it.
