Sharepoint server incremental sync #2335

Open
wants to merge 3 commits into base: main

Conversation

sjors101
Contributor

@sjors101 sjors101 commented Apr 3, 2024

Closes #2334

This allows users to run incremental syncs with the Sharepoint server connector. It uses the default timestamp cursor from the framework. Tested against Sharepoint 2019 (16.0.0.10337: 1). The tests could use some improvement.

What is included in the incremental sync:

  • New objects (sites / libraries / lists / list_items / drive_items) since last sync
  • Modified objects (sites / libraries / lists / list_items / drive_items) since last sync

What is not included in the incremental sync:

  • Deleted objects (sites / libraries / lists / list_items / drive_items)

Checklists

Pre-Review Checklist

  • this PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check config.yml.example)
  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • if there is no GH issue, please create it. Each PR should have a link to an issue
  • this PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v7.13.2, v7.14.0, v8.0.0)

Member

@artem-shelkovnikov artem-shelkovnikov left a comment

Thank you for your contribution!

I skimmed through the code and am now wondering: does the incremental sync here optimise calls to Sharepoint Server, or does it filter the data entirely in the connector?

@sjors101
Contributor Author

sjors101 commented Apr 3, 2024

Thank you for your contribution!

I skimmed through the code and am now wondering: does the incremental sync here optimise calls to Sharepoint Server, or does it filter the data entirely in the connector?

It filters out data in the connector. This still requires a lot of calls to the Sharepoint API, since we need to retrieve the modified timestamp of each object. Because 'all' Sharepoint objects are in a list, we can filter on the metadata before downloading the actual drive_items / attachments, which is a big win.
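
As a rough illustration of the connector-side filtering described above, here is a minimal sketch rather than the code in this PR. The helper names (`parse_ts`, `changed_since`), the `Modified` key, and the sample items are assumptions made for the example; the idea is only that list metadata is compared against the cursor before any download is attempted.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as '2024-04-03T09:30:00Z' (hypothetical format)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def changed_since(items, cursor):
    """Yield only the items whose 'Modified' timestamp is later than the cursor."""
    cursor_ts = parse_ts(cursor)
    for item in items:
        if parse_ts(item["Modified"]) > cursor_ts:
            yield item


# Hypothetical list metadata; no file content has been downloaded yet.
items = [
    {"Id": 1, "Modified": "2024-04-01T10:00:00Z"},
    {"Id": 2, "Modified": "2024-04-03T09:30:00Z"},
]

# Only item 2 changed after the cursor, so only it would be downloaded.
to_download = [item["Id"] for item in changed_since(items, "2024-04-02T00:00:00Z")]
```

The metadata calls still happen for every object; the saving is in skipping the download of unchanged drive_items / attachments.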

@artem-shelkovnikov
Member

Theoretically it should already work like this; incremental syncs are already enabled for the Sharepoint Server connector:

https://github.com/elastic/connectors/blob/main/connectors/sources/sharepoint_server.py#L437-L442:

Since get_docs_incrementally is not specified in the connector, it will use the get_docs function and skip documents whose _timestamp field is not later than the _timestamp of the copy already saved in the index.
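
The default behaviour described above could be sketched roughly as follows. This is an illustration of the rule as stated in this comment, not the framework's actual code: `docs_to_index` and the sample data are hypothetical, and timestamps are assumed to be ISO 8601 strings in one fixed format so that string comparison matches chronological order.

```python
def docs_to_index(incoming, existing_timestamps):
    """Yield documents that are new or have a later _timestamp than the indexed copy.

    existing_timestamps maps document _id to the _timestamp already in the index.
    """
    for doc in incoming:
        known = existing_timestamps.get(doc["_id"])
        # Skip documents whose timestamp is not later than the indexed one.
        if known is not None and doc["_timestamp"] <= known:
            continue
        yield doc


# Hypothetical index state and incoming documents from get_docs.
existing = {"a": "2024-04-01T10:00:00Z", "b": "2024-04-01T10:00:00Z"}
incoming = [
    {"_id": "a", "_timestamp": "2024-04-01T10:00:00Z"},  # unchanged, skipped
    {"_id": "b", "_timestamp": "2024-04-03T09:30:00Z"},  # modified, kept
    {"_id": "c", "_timestamp": "2024-04-02T08:00:00Z"},  # new, kept
]

kept_ids = [doc["_id"] for doc in docs_to_index(incoming, existing)]
```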

Did you by chance test whether your implementation is faster, or delivers different results, than our default incremental sync for this source?

Successfully merging this pull request may close these issues.

Sharepoint server incremental sync
2 participants