[FEATURE] Support for chunking strategies #1081

Open
orpiske opened this issue May 10, 2024 · 4 comments
Labels
enhancement (New feature or request), P3 (Medium priority)

Comments


orpiske commented May 10, 2024

Is your feature request related to a problem? Please describe.

One of the lessons we learned from a project we worked on recently is that there doesn't seem to be good or widespread support for chunking in Java. We were particularly looking for support for different chunking strategies, which could have helped us maximize our ability to store, retrieve, and match data in our vector database.

Describe the solution you'd like

We would like to discuss with the LangChain4j community whether support for chunking is feasible within this project and aligned with the project's goals and feature set.

Describe alternatives you've considered

Among other things, we have considered creating a chunking library as a separate project. However, we believe that adding chunking as part of LangChain4j would result in a better developer experience and would also make it easier for the project to implement strategies that involve LLM-based chunking.

Additional context

If the community believes that this is in line with the project, we are motivated to contribute and help maintain this feature.

orpiske added the enhancement (New feature or request) label May 10, 2024
langchain4j (Owner) commented

Hi @orpiske, this sounds great! We are looking forward to improving this in LC4J, so any contributions are welcome!
Apart from the existing document splitters, we plan to add a Markdown splitter and a semantic splitter in the near future.

What chunking strategies do you have in mind?

orpiske (Author) commented May 10, 2024

It's great news that you have plans for a Markdown (and/or AsciiDoc) splitter and a semantic splitter! Those would have been very useful for our project.

In general, specialized chunkers/splitters (and/or an interface for implementing them) could be particularly helpful for a subset of our data, e.g. so we could deal with YAML, XML, etc.

I also think that API-, service-, or LLM-based chunking, where we defer the chunking to an external service, could be useful.

langchain4j (Owner) commented

The interface you are looking for is DocumentSplitter. Please take a look at it and its existing implementations.
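
As a rough illustration of what a specialized chunking strategy behind such an interface could look like, here is a minimal sketch. It deliberately does not use the LangChain4j types: `ChunkingStrategy` and `ParagraphChunker` are hypothetical stand-ins invented for this example, and the real `DocumentSplitter` contract should be checked in the project sources before adapting anything. The sketch splits text on blank lines and packs paragraphs into chunks up to a character budget.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a splitter contract.
// This is NOT the LangChain4j DocumentSplitter type.
interface ChunkingStrategy {
    List<String> split(String text);
}

// Splits on blank lines, then packs consecutive paragraphs into chunks
// of at most maxChars characters (assumed budget, counted in chars).
final class ParagraphChunker implements ChunkingStrategy {
    private final int maxChars;

    ParagraphChunker(int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.maxChars = maxChars;
    }

    @Override
    public List<String> split(String text) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : text.split("\\n\\s*\\n")) {
            String p = paragraph.trim();
            if (p.isEmpty()) {
                continue; // skip empty pieces caused by leading blank lines
            }
            // Flush the current chunk if appending this paragraph
            // (plus the two-character separator) would exceed the budget.
            if (current.length() > 0
                    && current.length() + 2 + p.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append("\n\n");
            }
            // A single paragraph longer than maxChars is kept whole here;
            // a fuller strategy would split it further.
            current.append(p);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

With a budget of 8 characters, the input "one", "two", "three" (separated by blank lines) yields two chunks: the first two paragraphs packed together, then "three" on its own. A format-aware strategy (YAML, XML) or one that delegates to an external service would presumably plug in behind the same kind of single-method contract.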

orpiske (Author) commented May 10, 2024

The interface you are looking for is DocumentSplitter. Please take a look at it and its existing implementations.

Noted, thanks!
