feat: users can add epub and txt files for RAG retrieval functions in Jan #1995

Open
lineality opened this issue Feb 11, 2024 · 1 comment · May be fixed by #2827
Labels: good first issue, type: feature request

Comments

@lineality
Problem
Retrieval should accept more types of documents.
As far as I can see, users can only add PDF documents for Retrieval. This is not ideal for several reasons. PDF is very much not a standardized text storage format, and perhaps no PDF application reliably extracts text from even most PDFs. Epub is highly organized, standardized, and readable, and txt and other plain-text documents such as .md should be easy to use with RAG (LLMs have no problem with markdown, from what I have seen). Even raw HTML might work.
Most of a user's documents, from personal finance records to books in a Humble Bundle collection, are not in a usable PDF format.
(see provided code for epub text extraction below)

Success Criteria
Users should be able to perform retrieval functions on any common standardized files they have (which, oddly, may exclude PDF, since it isn't standardized). txt, md, and even epub should be low-hanging fruit; maybe docx, .odf, and rtf too.

Additional context
For epub files:

Hopefully code like this (see most recent version)
https://github.com/lineality/epub_ingestion_python/blob/main/epub_injestion_jsonl_txt_sized_chunks_v21.py

will be useful in letting Jan users work with epub books in Jan Retrieval.

This code extracts text from epub books and exports it in a variety of formats (txt, json, jsonl). It can also chunk the text to specific sizes without cutting words or sentences in half, to better retain meaning.
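
To make the idea concrete without reproducing the linked script, here is a minimal standard-library sketch of the two steps described above: pulling text out of an epub and chunking it on sentence boundaries. It is an illustration under stated assumptions, not the linked script's implementation and not Jan's API. The names `epub_to_text`, `chunk_text`, `split_sentences`, and `pack` are hypothetical. It assumes the epub's content files are UTF-8 (X)HTML and reads them in file-name order rather than following the spine order declared in the package document, and it splits sentences with a naive punctuation regex.

```python
import re
import zipfile
from html.parser import HTMLParser

# Tags treated as text boundaries so adjacent paragraphs do not run together.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextCollector(HTMLParser):
    """Collects visible text and ignores <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def epub_to_text(path: str) -> str:
    """Hypothetical helper: text of every (X)HTML file in the epub, in file-name order."""
    sections = []
    with zipfile.ZipFile(path) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            collector = TextCollector()
            collector.feed(archive.read(name).decode("utf-8", errors="replace"))
            collector.close()
            text = " ".join("".join(collector.parts).split())  # collapse whitespace
            if text:
                sections.append(text)
    return "\n\n".join(sections)


def split_sentences(text: str) -> list[str]:
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack(units: list[str], max_chars: int) -> list[str]:
    """Greedily join units with spaces so each group stays within max_chars.

    A single unit longer than max_chars is kept whole rather than cut.
    """
    groups, current = [], ""
    for unit in units:
        candidate = f"{current} {unit}" if current else unit
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            groups.append(current)
            current = unit
    if current:
        groups.append(current)
    return groups


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Chunk on sentence boundaries; oversized sentences fall back to word boundaries."""
    units = []
    for sentence in split_sentences(text):
        if len(sentence) <= max_chars:
            units.append(sentence)
        else:
            units.extend(pack(sentence.split(), max_chars))
    return pack(units, max_chars)
```

Usage would look like `chunks = chunk_text(epub_to_text("book.epub"), max_chars=1000)`. Paragraph breaks are flattened to spaces when sentences are joined, which keeps the sketch short; a real ingestion step would probably preserve chapter and paragraph structure as chunk metadata.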

@lineality lineality added the type: feature request A new feature label Feb 11, 2024
@hiro-v hiro-v added the good first issue Good for newcomers label Feb 13, 2024
@hiro-v hiro-v removed their assignment Mar 8, 2024
@RichardoC
Plain text files, such as markdown or config files, would be really useful, as then I could feed a README to the model.

RichardoC added a commit to RichardoC/jan that referenced this issue Apr 25, 2024
Signed-off-by: Richard Tweed <RichardoC@users.noreply.github.com>
@RichardoC RichardoC linked a pull request Apr 25, 2024 that will close this issue
3 tasks
Projects
Status: Icebox
4 participants