Problem
Retrieval should accept more types of documents.
Users can (from what I can see) only add PDF documents for Retrieval. This is not ideal for various reasons. "PDF" is very much not a standardized text storage format, and perhaps no PDF application extracts text reliably from even most PDFs. Epub is highly organized, standardized, and readable, and txt and other plain-text documents such as .md should be easy to use with RAG (LLMs have no problem with markdown that I have seen). Even raw html might work.
Most of a user's documents, from personal finance to books in the Humble Bundle collection, are not in usable PDF format.
(see provided code for epub text extraction below)
Success Criteria
Users should be able to perform retrieval functions on any common standardized files they have (which oddly may exclude PDF, which isn't standardized). txt, md, and even epub should be low-hanging fruit; maybe docx, .odf, and rtf, too.
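To illustrate why the plain-text formats look like low-hanging fruit, here is a minimal sketch using only the Python standard library. The names `load_text` and `html_to_text` are hypothetical and are not part of Jan; the sketch assumes UTF-8 files and only covers txt, md, and html.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextOnly(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(markup: str) -> str:
    """Return the visible text of an HTML string, one text node per line."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.parts)


def load_text(path: str) -> str:
    """Hypothetical loader: pick an extraction strategy from the file extension."""
    p = Path(path)
    suffix = p.suffix.lower()
    raw = p.read_text(encoding="utf-8", errors="replace")
    if suffix in (".txt", ".md"):
        return raw  # already plain text; markdown is passed through as-is
    if suffix in (".html", ".htm", ".xhtml"):
        return html_to_text(raw)
    raise ValueError(f"unsupported file type: {suffix}")
```

Formats such as docx, odf, and rtf would need their own extractors, which this sketch does not attempt.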
Additional context
For epub files:
Hopefully code like this (see the most recent version)
https://github.com/lineality/epub_ingestion_python/blob/main/epub_injestion_jsonl_txt_sized_chunks_v21.py
will be useful in letting Jan users use epub books with Jan Retrieval.
This code extracts text from epub books and exports it in a variety of formats (txt, json, jsonl), and it can chunk to specific sizes without cutting words or sentences in half, to better retain meaning.
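The chunking idea can be sketched roughly as follows. This is not the linked script's implementation, only a minimal illustration under stated assumptions: the names `chunk_text`, `split_sentences`, and `_split_words` are hypothetical, chunk size is measured in characters, and sentence boundaries are detected with a naive regex that will missplit abbreviations such as "Dr.".

```python
import re

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences, dropping empty fragments."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def _split_words(sentence, max_chars):
    """Break one oversized sentence on word boundaries only."""
    pieces, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text, max_chars=500):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A sentence longer than max_chars is split between words. A single
    word longer than max_chars is kept whole, so that chunk exceeds the limit.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = _split_words(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

For retrieval, each returned chunk could then be written out as one line of a jsonl file or embedded directly; a token-based size limit would likely fit embedding models better than the character count assumed here.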