Extract web text directly instead of OCR #51

eshoyuan · 2024-02-18T06:21:09Z

I'm working on something pretty similar to what you guys are doing and had a thought. Why not grab text directly from the web page instead of using OCR? LangChain and LlamaIndex both have tools for this, and there are also some repos for converting HTML to Markdown.

Just a thought. Would love to know what you think!

will-holley · 2024-03-26T20:32:01Z

Seconding the ask for a Motivations section that discusses when to use this in lieu of parsing the DOM.

asim-shrestha · 2024-05-16T17:08:41Z

That's a good question. I'd be curious to see what their approaches and performance are like.

For us, it's very important to retain as much of the visual structure of the page as possible, including the positions of text on the 2D plane. If you use just the HTML and skip actually rendering the page, you lose a lot of this information. We need it because a) we want our agents to reason about and take actions on the page just as we would, and b) automation frameworks require elements to be visible on screen before they can act on them (you cannot "click" on elements that don't actually appear on the page).
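To illustrate what "positions of the text on the 2D plane" can mean in practice, here is a minimal sketch that places text snippets onto a character grid according to their pixel coordinates. It is not this project's implementation: `TextBox`, `render_to_grid`, and the `char_width` / `line_height` values are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """A piece of rendered text and the pixel position of its top-left corner (hypothetical type)."""
    text: str
    x: int
    y: int


def render_to_grid(boxes, char_width=10, line_height=20):
    """Lay text boxes out on a character grid so columns and rows survive as plain text."""
    rows = {}
    for box in boxes:
        row = box.y // line_height
        col = box.x // char_width
        rows.setdefault(row, []).append((col, box.text))
    if not rows:
        return ""

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                # Previous text already reaches this column: keep a single space between items.
                line += " "
            else:
                # Pad with spaces so the text starts at its grid column.
                line = line.ljust(col)
            line += text
        lines.append(line.rstrip())
    return "\n".join(lines)


boxes = [
    TextBox("Name", 0, 0),
    TextBox("Price", 200, 0),
    TextBox("Widget", 0, 20),
    TextBox("$5", 200, 20),
]
grid = render_to_grid(boxes)
print(grid)
```

In the printed grid, "Price" and "$5" start at the same column, so a reader (or a model) can tell they belong together. A flat dump of the DOM text would not carry that alignment.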

For example, suppose you had a scrollable container with 10 child elements, 5 of which overflow and can only be seen by scrolling the parent container. I would imagine the other approaches would include the overflowed elements in the final representation, which we want to avoid (because if an agent tried to click on one of these elements, it would cause an element_not_found error).
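The scrollable-container example can be sketched in a few lines. This is only an illustration under simplifying assumptions (vertical scrolling only, a child counts as visible only when it lies fully inside the container's visible area); `Rect` and `visible_children` are hypothetical names, not part of this project or of any automation framework.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Vertical extent of a rendered box, in pixels (hypothetical type)."""
    top: float
    bottom: float


def visible_children(container, children):
    """Return the indices of children that lie fully inside the container's visible area."""
    return [
        i
        for i, child in enumerate(children)
        if child.top >= container.top and child.bottom <= container.bottom
    ]


# A 250px-tall scrollable container holding ten 50px-tall children.
container = Rect(top=0, bottom=250)
children = [Rect(top=i * 50, bottom=(i + 1) * 50) for i in range(10)]

visible = visible_children(container, children)
print(visible)  # [0, 1, 2, 3, 4]
```

Only the first five children would appear in a representation built from the rendered page. A pure HTML-to-text conversion has no layout information, so it would typically list all ten.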

Hope this makes sense, happy to elaborate further @eshoyuan. (And apologies for the late response) If @will-holley or anyone wants to add this to the README, happy to take a PR!
