fix: truncate long document #843

himself65 · 2024-05-16T00:52:12Z

Fixes: #836

There are a few possible ways to solve this issue.

Force-split the text right before embedding.

This is not what the user expects, and it would lose a lot of information.

Split documents inside each loader.

Good, but the content can still overflow the chunk size, because the metadata is added on top of the text and may exceed the chunk size on its own.
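
To illustrate the metadata concern, here is a minimal sketch. It assumes the chunk size is counted in characters (a real embedding limit is usually counted in tokens), and `splitWithMetadataBudget` is a hypothetical helper, not an existing LlamaIndexTS API. It reserves room for the serialized metadata before splitting, and fails when the metadata alone leaves no room for text.

```typescript
// Hypothetical helper (not part of LlamaIndexTS): split `text` so that
// "metadata prefix + chunk" never exceeds `chunkSize` characters.
function splitWithMetadataBudget(
  text: string,
  metadata: Map<string, unknown>,
  chunkSize: number,
): string[] {
  // Serialize metadata the same way for every chunk; empty metadata adds nothing.
  const prefix =
    metadata.size > 0
      ? `${JSON.stringify(Array.from(metadata.entries()))}\n`
      : "";
  const budget = chunkSize - prefix.length;
  if (budget <= 0) {
    // This is the case a loader-level splitter cannot fix by itself.
    throw new Error(
      `Metadata alone (${prefix.length} chars) exceeds chunk size ${chunkSize}`,
    );
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += budget) {
    chunks.push(prefix + text.slice(start, start + budget));
  }
  return chunks;
}
```

A splitter that ignores the prefix would produce chunks of exactly `chunkSize` characters of text, and every one of them would overflow once the metadata is prepended.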

What's more

I think that in the future we should detect the document size dynamically: every time the user updates a document, we should warn or throw an error if it overflows the chunk size.
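
A rough sketch of what such a dynamic check could look like, assuming size is measured in characters. `SizeCheckedDocument` and `OverflowPolicy` are hypothetical names used only for illustration. The check lives in a setter, so it runs on construction and on every later update.

```typescript
// Hypothetical sketch: re-check the size every time the text is updated.
type OverflowPolicy = "warn" | "error";

class SizeCheckedDocument {
  private _text = "";
  private readonly chunkSize: number;
  private readonly policy: OverflowPolicy;

  constructor(text: string, chunkSize: number, policy: OverflowPolicy = "warn") {
    this.chunkSize = chunkSize;
    this.policy = policy;
    // Assign through the setter so the initial value is checked too.
    this.text = text;
  }

  get text(): string {
    return this._text;
  }

  set text(value: string) {
    if (value.length > this.chunkSize) {
      const message = `Document length ${value.length} exceeds chunk size ${this.chunkSize}`;
      if (this.policy === "error") {
        // Reject the update; the previous text stays in place.
        throw new Error(message);
      }
      console.warn(message);
    }
    this._text = value;
  }
}
```

With the `"error"` policy an oversized update is rejected and the old text is kept; with `"warn"` the update is stored and only a warning is logged.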

changeset-bot · 2024-05-16T00:52:15Z

⚠️ No Changeset found

Latest commit: 04967b3

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


vercel · 2024-05-16T00:52:16Z

The latest updates on your projects.

Name	Status	Preview	Comments	Updated (UTC)
llama-index-ts-docs	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	May 16, 2024 5:16am

himself65

Some ideas for a new design of the Node classes:

Make BaseNode an interface
Turn getContent into a property getter
Make metadata a Map<string, JSONValue>
Hide the text property
Memoize the id automatically

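// Assumed imports: createHash from Node's built-in crypto module, and memoize
// from the "memoize" npm package (its options accept a cacheKey function).
import { createHash } from "node:crypto";
import memoize from "memoize";

interface BaseNode {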
  readonly id: string;
  content: string;
}

type JSONObject = { [key: string]: JSONValue };
type JSONArray = JSONValue[];
type JSONValue = string | number | boolean | null | JSONObject | JSONArray;
export type Metadata = Map<string, JSONValue>;

export const Settings = {
  chunkSize: undefined as number | undefined,
};

function chunkSizeCheck(
  originalGetter: () => string,
  context: ClassGetterDecoratorContext,
) {
  return function (this: Document) {
    const content = originalGetter.call(this);
    if (Settings.chunkSize !== undefined) {
      if (content.length > Settings.chunkSize) {
        console.warn(
          `Document (${this.id.substring(
            0,
            8,
          )}) is larger than chunk size: ${content.length}`,
        );
        console.warn(`Truncating content...`);
        console.warn("If you want to disable this warning:");
        console.warn("  1. Set Settings.chunkSize = undefined");
        console.warn("  2. Set Settings.chunkSize to a larger value");
        console.warn(
          "  3. Change the way of splitting content into different chunks",
        );
        return content.slice(0, Settings.chunkSize);
      }
    }
    return content;
  };
}

export class Document implements BaseNode {
  get id(): string {
    return Document.hash(this.text, this.metadata);
  }

  @chunkSizeCheck
  get content(): string {
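    // Prefix the text with serialized metadata only when there is any:
    // JSON.stringify([]) yields "[]", which is always truthy.
    const leading =
      this.metadata.size > 0
        ? `${JSON.stringify([...this.metadata.entries()])}\n`
        : "";
    return `${leading}${this.text}`;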
  }

  constructor(
    private text: string,
    public metadata: Metadata = new Map(),
  ) {}

  static hash = memoize(
    (content: string, metadata: Metadata) => {
      return createHash("sha256")
        .update(content)
        .update(JSON.stringify([...metadata.entries()]))
        .digest("hex");
    },
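    {
      // Assuming cacheKey receives the arguments array: a Map serializes to
      // "{}" under JSON.stringify, so expand its entries to keep the
      // metadata in the cache key.
      cacheKey: ([content, metadata]: [string, Metadata]) =>
        JSON.stringify([content, [...metadata.entries()]]),
    },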
  );
}
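
As a standalone illustration of the content-derived id idea, here is a small sketch without memoization or decorators. `contentId` is a hypothetical helper and the narrowed `Metadata` type is an assumption made to keep the example short. It hashes the text together with the metadata entries, because `JSON.stringify` on a `Map` itself yields `"{}"` and would silently drop the metadata.

```typescript
import { createHash } from "node:crypto";

// Assumption: metadata values are JSON primitives, to keep the sketch short.
type Metadata = Map<string, string | number | boolean | null>;

// Hypothetical helper: derive a stable id from text and metadata.
function contentId(text: string, metadata: Metadata): string {
  // Serialize the entries rather than the Map, and sort by key so the id
  // does not depend on insertion order.
  const entries = Array.from(metadata.entries()).sort(([a], [b]) =>
    a < b ? -1 : a > b ? 1 : 0,
  );
  // Hash one JSON document so text and metadata cannot run into each other.
  return createHash("sha256")
    .update(JSON.stringify([text, entries]))
    .digest("hex");
}
```

Two documents with the same text but different metadata get different ids, while the same text and metadata always produce the same id regardless of the order in which the keys were inserted.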

fix: truncate long document

699f175

vercel bot deployed to Preview May 16, 2024 01:06 View deployment

himself65 requested a review from marcusschiesser May 16, 2024 01:20

fix: deps

04967b3

vercel bot deployed to Preview May 16, 2024 05:16 View deployment

himself65 mentioned this pull request Jun 10, 2024

feat: truncate embedding tokens #918

Open
