Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract web text directly instead of OCR #51

Open
eshoyuan opened this issue Feb 18, 2024 · 2 comments
Open

Extract web text directly instead of OCR #51

eshoyuan opened this issue Feb 18, 2024 · 2 comments

Comments

@eshoyuan
Copy link

I'm working on something pretty similar to what you guys are doing and had a thought. Why not grab text directly from the web instead of using OCR? Langchain and llamaindex both have such tools, and there are also some repos about converting html to markdown.

Just a thought. Would love to know what you think!

@will-holley
Copy link

Seconding the ask for a Motivations section that discusses when to use this in lieu of parsing the DOM.

@asim-shrestha
Copy link
Contributor

Thats a good question. Would be curious to see their approaches and performance is like.

For us, it's very important to contain as much of the visual structure of the page as possible. This includes positions of the text on the 2D plane. Using just the HTML and skipping the actual rendering of the page, you lose a lot of this information. We need this because a) we want our agents to reason about and take actions on the page just as we would, and b) because visibility of elements on screen is required for automation frameworks to actually take actions (you cannot "click" on elements that don't actually appear on the page)

For example, suppose you had a scrollable container element containing 10 child elements total, with 5 elements overflowing and requiring scrolling the parent container to view. I would imagine the other approaches would display the overflowed elements in the ultimate representation, while we want to avoid doing this (Because if an agent were to try and click on these elements, it would cause an element_not_found error)

Hope this makes sense, happy to elaborate further @eshoyuan. (And apologies for the late response) If @will-holley or anyone wants to add this to the README, happy to take a PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants