feat: users can add epub and txt files for RAG retrieval functions in Jan #1995

lineality · 2024-02-11T20:03:37Z

Problem
Retrieval should accept more types of documents.
Users can (from what I can see) only add PDF documents for Retrieval. This is not ideal for various reasons. "PDF" is very much not a standardized text storage format, and perhaps no PDF application extracts text reliably from even most PDFs. Epub is highly organized, standardized, and readable, and txt and other plain-text documents such as .md should be easy to use with RAG (LLMs have no problem with markdown that I have seen). Even raw html might work.
Most of a user's documents, from personal finance to books in the humble-bundle collection, are not in usable PDF format.
(see provided code for epub text extraction below)

Success Criteria
Users should be able to perform retrieval functions on any common standardized files they have (which oddly may exclude PDF, which isn't standardized). txt, md, and even epub should be low-hanging fruit; maybe docx, .odf, and rtf, too.

Additional context
For epub files:

Hopefully code like this (see most recent version)
https://github.com/lineality/epub_ingestion_python/blob/main/epub_injestion_jsonl_txt_sized_chunks_v21.py

will be useful in letting users of Jan use epub-books with their Jan Retrieval uses.

This code extracts text from epub books and exports the text into a variety of formats: txt, json, jsonl, and can chunk to specific sizes without cutting words or sentences in half, to better retain meaning.
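To illustrate the chunking idea, here is a minimal sketch of splitting text into size-limited chunks at sentence boundaries. It is not taken from the linked script; the names `split_sentences`, `chunk_text`, and `max_chars` are made up for this example, and the sentence splitter is a naive regex that will mis-split abbreviations such as "e.g.".

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A sentence longer than max_chars is split between words instead.
    A single word longer than max_chars is kept whole, so that one
    chunk can exceed the limit.
    """
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        # Fall back to word-level pieces only when the sentence cannot fit.
        pieces = [sentence] if len(sentence) <= max_chars else sentence.split()
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five six! Seven eight nine?"
    for chunk in chunk_text(sample, max_chars=30):
        print(repr(chunk))
```

Each chunk could then be written out as one line of a txt or jsonl file before being embedded for retrieval.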

RichardoC · 2024-04-23T13:57:42Z

Plain text files, such as markdown or config files, would be really useful, as then I can feed a README to the model

Signed-off-by: Richard Tweed <RichardoC@users.noreply.github.com>

lineality added the type: feature request A new feature label Feb 11, 2024

hiro-v added the good first issue Good for newcomers label Feb 13, 2024

louis-jan assigned hiro-v Feb 13, 2024

hiro-v removed their assignment Mar 8, 2024

Van-QA assigned imtuyethan Mar 11, 2024

RichardoC added a commit to RichardoC/jan that referenced this issue Apr 25, 2024

Initial attempt at janhq#1995

3c4dd22

Signed-off-by: Richard Tweed <RichardoC@users.noreply.github.com>

RichardoC linked a pull request Apr 25, 2024 that will close this issue

Feat - DRAFT - Support plain text #2827

Open

3 tasks
