Follow up: Means of Evaluation and Instruction Automatic Generation #93

clean99 · 2023-11-24T02:26:44Z

Following up on #70, we now have a potential way to evaluate the results produced by the system. We can also generate instructions automatically through vision comparison.

Our system already has vision comparison, but it is coupled with the other generation steps, so we cannot take full advantage of it:

Evaluation: We need a way to evaluate the results generated by our system. Vision comparison is a good fit for this.
Instruction generation: Today, users compare the original and result images themselves, describe the differences to GPT, and ask it to update. With vision comparison, we can generate these instructions automatically, so users spend less effort finding differences and typing.

Current flow:

Updated flow:

We will add another button, “Generate Instruction”, which inserts the “Auto Generate Instruction” and Eval steps into the flow.

System design

Frontend

Add Generate Instruction button:

It should call instructionGenerate when the user clicks it.

instructionGenerate

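// Must be async because it awaits the screenshot.
async function instructionGenerate() {
  // Screenshot of the currently rendered result.
  const resultImage = await takeScreenshot();
  // The first reference image is the original design.
  const originalImage = referenceImages[0];
  doGenerateInstruction({
    generationType: "update",
    image: originalImage,
    resultImage: resultImage,
  });
}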

function doGenerateInstruction(params: InstructionGenerationParams) {
    setAppState(AppState.INSTRUCTION_GENERATING);

    // Merge settings with params
    const updatedParams = { ...params, ...settings };

    generateInstruction(
      wsRef,
      updatedParams,
      (token) => setUpdateInstruction((prev) => prev + token),
      (code) => setUpdateInstruction(code),
      () => setAppState(AppState.CODE_READY)
    );
}
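The snippet above uses InstructionGenerationParams without showing it. A minimal sketch of what that type might look like follows; the field types (data URL strings) and the mergeParams helper are assumptions for illustration, not taken from the PR's code.

```typescript
// Hypothetical shape of the params passed to doGenerateInstruction.
// Field names come from the snippet above; the string types are an assumption.
interface InstructionGenerationParams {
  generationType: "update";
  image: string;       // original screenshot, assumed to be a data URL
  resultImage: string; // screenshot of the current result, assumed to be a data URL
}

// Hypothetical helper mirroring `{ ...params, ...settings }`:
// on a key collision, the value from settings wins.
function mergeParams<S extends object>(
  params: InstructionGenerationParams,
  settings: S
): InstructionGenerationParams & S {
  return { ...params, ...settings };
}
```

Because settings is spread last, any setting that shares a key with params silently overrides it, which is worth keeping in mind when choosing field names.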

AppState.INSTRUCTION_GENERATING

All buttons on the panel should be disabled and show a loading state while appState is AppState.INSTRUCTION_GENERATING.
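One way to derive the button state from appState is sketched below. Only the two AppState members named in this description are listed; the string values, the ButtonUiState type, and the panelButtonState helper are hypothetical.

```typescript
// Only the members mentioned in this PR description; the real enum may have more.
enum AppState {
  CODE_READY = "CODE_READY",
  INSTRUCTION_GENERATING = "INSTRUCTION_GENERATING",
}

// Hypothetical UI state shared by every button on the panel.
interface ButtonUiState {
  disabled: boolean;
  loading: boolean;
}

// Buttons are disabled and show a spinner only while an instruction is being generated.
function panelButtonState(appState: AppState): ButtonUiState {
  const generating = appState === AppState.INSTRUCTION_GENERATING;
  return { disabled: generating, loading: generating };
}
```

Computing the state in one place keeps the buttons consistent with each other instead of repeating the appState check in each component.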

Backend

prompt

You are a Frontend Vision Comparison expert,
You are required to compare two website screenshots: the first one is the original site and the second one is a redesigned version.
Your task is to identify differences in elements and their css, focusing on layout, style, and structure.
Do not consider the content(text, placeholder) of the elements, only the elements themselves
Analyze the screenshots considering these categories:

Lack of Elements: Identify any element present in the original but missing in the redesign.
Redundant Elements: Spot elements in the redesign that were not in the original.
Wrong Element Properties: Note discrepancies in element properties like size, color, font, and layout.

Provide a clear conclusion as a list, specifying the element, the mistake, and its location.
In ambiguous cases, suggest a manual review.
Remember, this comparison is not pixel-by-pixel, but at a higher, more conceptual level.

Return only the JSON array in this format:
[
  {
    "element": "name, text, etc.",
    "mistake": "wrong color, wrong size, etc.(strictly use css properties to describe)",
    "improvement": "use #xxx color, use width: xxx px, etc.",
    "location": "header"
  },
]
Do not include markdown "```" or "```JSON" at the start or end.
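The JSON array returned by this prompt still has to be turned into instruction text for the update textarea. The sketch below shows one possible conversion. VisionMistake, parseMistakes, and toInstruction are hypothetical names, and the tolerance for code fences and trailing commas is an assumption about how the model may deviate from the requested format (the example array in the prompt itself ends with a trailing comma).

```typescript
// Hypothetical type matching the JSON format requested in the prompt.
interface VisionMistake {
  element: string;
  mistake: string;
  improvement: string;
  location: string;
}

const FENCE = "`".repeat(3); // a markdown code fence, built without typing it literally

function isMistake(value: unknown): value is VisionMistake {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return ["element", "mistake", "improvement", "location"].every(
    (key) => typeof record[key] === "string"
  );
}

// Parse the model output, tolerating a wrapping code fence and trailing commas.
function parseMistakes(raw: string): VisionMistake[] {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
  }
  if (text.endsWith(FENCE)) text = text.slice(0, -FENCE.length);
  // Naive trailing-comma removal; it could also alter a string value containing ",]" or ",}".
  text = text.replace(/,\s*([\]}])/g, "$1");
  const parsed: unknown = JSON.parse(text);
  if (!Array.isArray(parsed)) throw new Error("Expected a JSON array of mistakes");
  return parsed.filter(isMistake); // silently drops malformed entries
}

// Convert the mistakes into a numbered instruction list for the textarea.
function toInstruction(mistakes: VisionMistake[]): string {
  return mistakes
    .map((m, i) => `${i + 1}. ${m.element} (${m.location}): ${m.mistake}. Fix: ${m.improvement}`)
    .join("\n");
}
```

Keeping parsing and formatting separate means the user-facing wording can change without touching the validation logic.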

api

generate-instruction

Eval

Have GPT count the mistakes it made previously; this gives the user a rough sense of how well our system performs.
I will explore more rigorous evaluation and follow up.
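As a rough illustration of a mistake count, the sketch below tallies the entries of the comparison result on the client rather than asking GPT for a number. That is a swapped-in approach, not what this PR implements, and summarizeMistakes and MistakeSummary are hypothetical names.

```typescript
// Hypothetical summary of a vision-comparison result.
interface MistakeSummary {
  total: number;
  byLocation: Record<string, number>;
}

// Count mistakes overall and per reported location (e.g. "header").
function summarizeMistakes(mistakes: { location: string }[]): MistakeSummary {
  const byLocation: Record<string, number> = {};
  for (const m of mistakes) {
    byLocation[m.location] = (byLocation[m.location] ?? 0) + 1;
  }
  return { total: mistakes.length, byLocation };
}
```

Comparing total across successive "generate instruction" and "update" rounds would show whether the result is converging on the original, subject to how reliable the comparison itself is.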

sweep-ai · 2023-11-24T02:27:47Z

Apply Sweep Rules to your PR?

Apply: All new business logic should have corresponding unit tests.
Apply: Refactor large functions to be more modular.
Apply: Add docstrings to all functions and file headers.

abi · 2023-11-24T03:28:05Z

When I messed around with ChatGPT, it hallucinated a lot when it did a visual comparison. I'm curious if your prompt works well. How are the visual comparison results? Are they accurate?

Thanks for exploring this method of improvement.

clean99 · 2023-11-24T03:31:48Z

> When I messed around with ChatGPT, it hallucinated a lot when it did a visual comparison. I'm curious if your prompt works well. How are the visual comparison results? Are they accurate?
>
> Thanks for exploring this method of improvement.

It's not bad once the prompt is limited to mistakes in CSS properties. But I believe there is room to improve, and this will be an experimental feature. The user can choose to skip it or modify the result it generates, so I'd like to put it here, and hopefully better prompts will be contributed in the future.

abi · 2023-12-01T21:20:52Z

Thank you for this and sorry I'm slow to review it. Will get it in tomorrow.

abi · 2023-12-04T00:49:31Z

Finally found some time to try out this PR.

The primary issue I have with merging this in is that I think the quality of the outputs is not good. Here's an example:

Original

Result

Text


As you can see, it most obviously gets the colors wrong. It also says there's a background video. Not sure what's going on.

Other than that,

Make "Generate instruction" smaller
Need to fix textarea so that it's not too big but expands based on text in it perhaps.

Fundamentally, the goal of the user is to make the generated result more like the screenshot through repeated loops of "generate instruction" -> "update" but unfortunately, I don't know if GPT4 vision and this approach work well together.

Would love to hear thoughts on how this can be improved, and your experiences with it.

clean99 added 3 commits November 24, 2023 10:09

feat: add instruction generation interface and type

1d35cb4

feat: add instruction generation frontend interact

2df5d55

feat: add disable when generating instruction

90f26de

clean99 marked this pull request as draft November 24, 2023 02:26

clean99 added 9 commits November 24, 2023 10:51

feat: add prompt

dca17c8

feat: add api

a4c2bfa

feat: update prompt

ef6bd4d

feat: optimaze loading state ui

ff96814

feat: set longer text area

bb96c97

feat: set longer text area

ab7d2d3

feat: update instructions convert

6a06d89

feat: add gpt count mistake number

a89e85e

feat: update generate instruction

e0fb04b

clean99 marked this pull request as ready for review November 24, 2023 03:22

clean99 changed the title ~~[WIP]Follow up: Means of Evaluation and Instruction Automatic Generation~~ Follow up: Means of Evaluation and Instruction Automatic Generation Nov 24, 2023

abi added 2 commits December 3, 2023 17:29

Merge branch 'main' into pr/93

1b9142b

revert textarea height

2609311
