fix: truncate long document #843

Open
wants to merge 2 commits into
base: main

Conversation

himself65
Member

@himself65 himself65 commented May 16, 2024

Fixes: #836

There are a few possible ways to solve this issue.

  1. Force-split the text right before embedding.

This is not what the user expects, and it necessarily loses much of the information.

  2. Split documents inside each loader.

Good, but the result can still overflow the chunk size, because the metadata itself may overflow the chunk size.
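To illustrate why splitting inside the loader is not enough on its own, here is a minimal sketch. It assumes the chunk size is counted in characters and that the serialized metadata is prepended to every chunk. `splitWithMetadataBudget` and its parameters are hypothetical names for illustration, not LlamaIndexTS APIs.

```typescript
// Hypothetical helper: splits `text` so that `${metadata}\n${piece}` fits in chunkSize.
// Assumption: sizes are counted in characters, not tokens.
function splitWithMetadataBudget(
  text: string,
  metadata: string,
  chunkSize: number,
): string[] {
  // The metadata and the separating newline consume part of every chunk's budget.
  const budget = chunkSize - metadata.length - 1;
  if (budget <= 0) {
    // The metadata alone overflows the chunk size: no split of the text can help.
    throw new Error(
      `Metadata (${metadata.length} chars) leaves no room in chunk size ${chunkSize}`,
    );
  }
  const pieces: string[] = [];
  for (let start = 0; start < text.length; start += budget) {
    pieces.push(`${metadata}\n${text.slice(start, start + budget)}`);
  }
  return pieces;
}

// 13 chars of metadata + 1 newline leave 6 chars of text per chunk,
// so the 10-char text is split into 2 chunks.
const exampleChunks = splitWithMetadataBudget("abcdefghij", '{"f":"a.csv"}', 20);
```

When the metadata is as large as the chunk size, the helper throws instead of splitting, which is the overflow case described above.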

What's more

I think that in the future we should detect the document size dynamically: every time the user updates a document, we should warn or throw an error if it overflows the chunk size.
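A rough sketch of what such a dynamic check could look like, assuming a character-based chunk size and a setter that runs on every update. `SizeCheckedDocument` and `OverflowPolicy` are hypothetical names, not existing LlamaIndexTS classes, and warnings are collected in an array here only to keep the example self-contained.

```typescript
type OverflowPolicy = "warn" | "error";

// Hypothetical class: re-checks the document size every time the text is updated.
class SizeCheckedDocument {
  readonly warnings: string[] = [];
  private currentText = "";
  private readonly chunkSize: number;
  private readonly policy: OverflowPolicy;

  constructor(text: string, chunkSize: number, policy: OverflowPolicy = "warn") {
    this.chunkSize = chunkSize;
    this.policy = policy;
    this.text = text; // goes through the setter, so the initial value is checked too
  }

  get text(): string {
    return this.currentText;
  }

  set text(value: string) {
    if (value.length > this.chunkSize) {
      const message = `Document has ${value.length} chars, chunk size is ${this.chunkSize}`;
      if (this.policy === "error") {
        // Reject the update; the previous text stays in place.
        throw new Error(message);
      }
      this.warnings.push(message);
    }
    this.currentText = value;
  }
}

const exampleDoc = new SizeCheckedDocument("short", 10);
exampleDoc.text = "a much longer text"; // accepted, but recorded in exampleDoc.warnings
```

With the `"error"` policy the oversized update is rejected instead, so the caller has to split the content before storing it.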


changeset-bot bot commented May 16, 2024

⚠️ No Changeset found

Latest commit: 04967b3

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types



vercel bot commented May 16, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
llama-index-ts-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 16, 2024 5:16am

Member Author

@himself65 himself65 left a comment


Some ideas about the new design of the Node classes:

  1. Turn BaseNode into an interface
  2. Make getContent a value getter
  3. Make metadata a Map<string, JSONValue>
  4. Hide the text property
  5. Auto-memoize the id
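
import { createHash } from "node:crypto";
// Assumption: `memoize` is the "memoize" npm package, whose `cacheKey` option matches the call below.
import memoize from "memoize";

interface BaseNode {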
  readonly id: string;
  content: string;
}

type JSONObject = { [key: string]: JSONValue };
type JSONArray = JSONValue[];
type JSONValue = string | number | boolean | null | JSONObject | JSONArray;
export type Metadata = Map<string, JSONValue>;

export const Settings = {
  chunkSize: undefined as number | undefined,
};

function chunkSizeCheck(
  originalGetter: () => string,
  context: ClassGetterDecoratorContext,
) {
  return function (this: Document) {
    const content = originalGetter.call(this);
    if (Settings.chunkSize !== undefined) {
      if (content.length > Settings.chunkSize) {
        console.warn(
          `Document (${this.id.substring(
            0,
            8,
          )}) is larger than chunk size: ${content.length}`,
        );
        console.warn(`Truncating content...`);
        console.warn("If you want to disable this warning:");
        console.warn("  1. Set Settings.chunkSize = undefined");
        console.warn("  2. Set Settings.chunkSize to a larger value");
        console.warn(
          "  3. Change the way of splitting content into different chunks",
        );
        return content.slice(0, Settings.chunkSize);
      }
    }
    return content;
  };
}

export class Document implements BaseNode {
  get id(): string {
    return Document.hash(this.text, this.metadata);
  }

  @chunkSizeCheck
  get content(): string {
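    // JSON.stringify([]) is "[]", which is truthy, so check the size explicitly
    // to avoid prefixing "[]\n" when there is no metadata.
    const leading =
      this.metadata.size > 0 ? JSON.stringify([...this.metadata.entries()]) : "";
    return `${leading ? `${leading}\n` : ""}${this.text}`;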
  }

  constructor(
    private text: string,
    public metadata: Metadata = new Map(),
  ) {}

  static hash = memoize(
    (content: string, metadata: Metadata) => {
      return createHash("sha256")
        .update(content)
        .update(JSON.stringify([...metadata.entries()]))
        .digest("hex");
    },
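    {
      // JSON.stringify serializes a Map as "{}", which would drop the metadata
      // from the cache key. Assumes cacheKey receives the arguments array.
      cacheKey: ([content, metadata]) =>
        JSON.stringify([content, [...metadata.entries()]]),
    },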
  );
}
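
One detail behind `JSON.stringify([...this.metadata.entries()])` in the draft above: calling `JSON.stringify` on a `Map` directly yields `"{}"`, because a Map has no enumerable own properties. The sketch below shows the difference. `serializeMetadata` is a hypothetical helper name, and sorting the keys is an extra assumption beyond the draft, added so that insertion order does not change the serialized form (and therefore the hash).

```typescript
type JSONValue =
  | string
  | number
  | boolean
  | null
  | { [key: string]: JSONValue }
  | JSONValue[];
type Metadata = Map<string, JSONValue>;

// Hypothetical helper: serializes a Map-based metadata object deterministically.
function serializeMetadata(metadata: Metadata): string {
  // Sort entries by key so two Maps with the same content serialize identically.
  const entries = Array.from(metadata.entries()).sort(([a], [b]) =>
    a < b ? -1 : a > b ? 1 : 0,
  );
  return JSON.stringify(entries);
}

const exampleMetadata: Metadata = new Map<string, JSONValue>([
  ["source", "a.csv"],
  ["row", 3],
]);

const naive = JSON.stringify(exampleMetadata); // "{}": the entries are lost
const stable = serializeMetadata(exampleMetadata); // '[["row",3],["source","a.csv"]]'
```

Any memoization key or hash derived from the metadata needs the entry-based form; the naive form would make all documents with the same text share one id regardless of metadata.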


Successfully merging this pull request may close these issues.

[Bug]: PapaCSVReader concatRows=true fails for some .csv files
1 participant