Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] DX: Add dimension() in EmbeddingModel #1080

Open
Martin7-1 opened this issue May 10, 2024 · 4 comments
Open

[FEATURE] DX: Add dimension() in EmbeddingModel #1080

Martin7-1 opened this issue May 10, 2024 · 4 comments
Labels
DX Developer Experience enhancement New feature or request P2 High priority

Comments

@Martin7-1
Copy link
Contributor

Is your feature request related to a problem? Please describe.

As describe in Redis embedding store spring boot starter. Embedding model should have dimension() method to enable EmbeddingStore to have a default value.

Describe the solution you'd like

In EmbeddingModel, will add method like below:

    /**
     * Get EmbeddingModel's dimension
     *
     * @return dimension
     */
    default int dimension() {
        return embed(TextSegment.from("test")).content().dimension();
    }

Describe alternatives you've considered

Additional context

@Martin7-1 Martin7-1 added the enhancement New feature or request label May 10, 2024
@langchain4j
Copy link
Owner

HI @Martin7-1 thank you, this is long due!

A few thoughts:

  • while it is great to have a default implementation that will work for all implementations of this interface, I think we should go one step further and have an overriden impl of this method for popular EmbeddingModel implementations. In many cases dimensions are known up front, so we can skip embedding of "test". When dimension is not known, this call can be cached.
  • we need to decide finally whether to name it dimensions or dimension (singular). Right now we have both, it is inconsistent.

@langchain4j langchain4j added the P2 High priority label May 10, 2024
@langchain4j langchain4j changed the title [FEATURE] Add dimension() in EmbeddingModel [FEATURE] DX: Add dimension() in EmbeddingModel May 10, 2024
@langchain4j langchain4j added the DX Developer Experience label May 10, 2024
@Martin7-1
Copy link
Contributor Author

HI @Martin7-1 thank you, this is long due!

A few thoughts:

  • while it is great to have a default implementation that will work for all implementations of this interface, I think we should go one step further and have an overriden impl of this method for popular EmbeddingModel implementations. In many cases dimensions are known up front, so we can skip embedding of "test". When dimension is not known, this call can be cached.
  • we need to decide finally whether to name it dimensions or dimension (singular). Right now we have both, it is inconsistent.
  • I think it is a very good idea to override the commonly used EmbeddingModel, and I will try to collect the dimensions of EmbeddingModel from common model providers and implement it. However, for model providers that provide multiple EmbeddingModels, we need to modify the implementation when the model is added or changed. Maybe we need a way to configure it (in file or other format maybe?) to reduce the number of code changes. WDYT?
  • Absolutely it's important to have consistent naming strategy. I prefer dimension because it tends to return just a single number.

@langchain4j
Copy link
Owner

However, for model providers that provide multiple EmbeddingModels, we need to modify the implementation when the model is added or changed. Maybe we need a way to configure it (in file or other format maybe?) to reduce the number of code changes. WDYT?

We can add a mapping (model->dimensions) and a conditional logic in the dimension() method. If the model is known (is in the mapping), return mapped value. If not, do "test" embedding. When new models are added, we could extend the mapping. Not sure about the file, I guess it is fine to have it in the *EmbeddingModel class. WDYT?

Absolutely it's important to have consistent naming strategy. I prefer dimension because it tends to return just a single number.

For me it is more natural to say "dimensions" as in "How many dimensions does this embedding vector have?". Sadly, the naming is inconsistent between providers. But we have much more "dimension" than "dimensions" in our codebase already, so I guess it makes more sense to use "dimension" for new code and migrate all "dimansions" to "dimension".

@Martin7-1
Copy link
Contributor Author

Martin7-1 commented May 13, 2024

We can add a mapping (model->dimensions) and a conditional logic in the dimension() method. If the model is known (is in the mapping), return mapped value. If not, do "test" embedding. When new models are added, we could extend the mapping. Not sure about the file, I guess it is fine to have it in the *EmbeddingModel class. WDYT?

I think it's ok! Another question is that when we do "test" embedding, there are duplicate code(check whether cache -> do embedding -> cache to a variable -> return). If we do not abstract the logic, it seems that there are duplicate variable in all EmbeddingModel.

Currently I want to implement by a abstract class, having a variable named dimension.

/**
 * An Embedding Model which contains
 */
public abstract class AbstractEmbeddingModel implements EmbeddingModel {

    /**
     * embedding model dimension
     */
    protected Integer dimension;

    /**
     * A map contains known model name and its dimension
     *
     * @return A map, key is the common represented model name, value is its dimension
     */
    protected abstract Map<String, Integer> dimensionMap();

    /**
     * Get embedding model's name
     *
     * @return embedding model's name
     */
    protected abstract String modelName();

    @Override
    public int dimension() {
        if (dimension != null) {
            return dimension;
        }

        return dimensionMap().compute(modelName(), (key, value) -> {
            this.dimension = Optional.ofNullable(value).orElse(embed("test").content().dimension());
            return dimension;
        });
    }
}

And then all specific *EmbeddingModel just extend this AbstractEmbeddingModel. WDYT?

For me it is more natural to say "dimensions" as in "How many dimensions does this embedding vector have?". Sadly, the naming is inconsistent between providers. But we have much more "dimension" than "dimensions" in our codebase already, so I guess it makes more sense to use "dimension" for new code and migrate all "dimansions" to "dimension".

OK. I will take a look when I implement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DX Developer Experience enhancement New feature or request P2 High priority
Projects
None yet
Development

No branches or pull requests

2 participants