[Refactor][Bitbucket_Server] Speed up PR collector/extractor #7457
Comments
Hi @sstojak1, it's a valid refactor, but may I know why the same data keeps being added to the RAW_PULL_REQUEST_TABLE after each run?
Hi @Startrekzky I would say that this is because the BB Server API doesn't support a date query parameter, so you cannot fetch only the PRs that were updated or created after the last import job. Because of that, DevLake imports all PRs on every job run (compare with BB Cloud, which avoids this since its API does support a date query parameter).
Based on the information you provided, the collector would likely benefit from the simpler ApiCollector, since it purges related records from the raw data table before saving new PR information. That should be more efficient than the StatefulApiCollector in this context.
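To make the difference concrete, here is a minimal, self-contained sketch of the two raw-table behaviors being discussed. This is not DevLake's actual API; `rawRecord`, `appendOnlyCollect`, and `purgeThenCollect` are hypothetical names that only illustrate append-only collection versus purge-before-save:

```go
package main

import "fmt"

// rawRecord stands in for a row in a raw data table such as
// _raw_bitbucket_server_api_pull_requests (illustrative only).
type rawRecord struct {
	PrID int
}

// appendOnlyCollect mimics the current behavior: since the Bitbucket
// Server API has no "updated since" filter, every run re-fetches all
// PRs and appends them, so duplicates accumulate across runs.
func appendOnlyCollect(table []rawRecord, prs []int) []rawRecord {
	for _, id := range prs {
		table = append(table, rawRecord{PrID: id})
	}
	return table
}

// purgeThenCollect mimics the plain-ApiCollector idea described above:
// purge the existing raw records first, then save the fresh fetch, so
// the table size stays bounded at one copy of the data.
func purgeThenCollect(table []rawRecord, prs []int) []rawRecord {
	table = table[:0] // purge existing raw records for this table
	for _, id := range prs {
		table = append(table, rawRecord{PrID: id})
	}
	return table
}

func main() {
	prs := make([]int, 1000)
	for i := range prs {
		prs[i] = i + 1
	}

	var appendTable, purgeTable []rawRecord
	for run := 0; run < 10; run++ {
		appendTable = appendOnlyCollect(appendTable, prs)
		purgeTable = purgeThenCollect(purgeTable, prs)
	}
	fmt.Println("append-only rows:", len(appendTable)) // 10000
	fmt.Println("purge-first rows:", len(purgeTable))  // 1000
}
```

After ten runs the append-only table holds ten copies of every PR, while the purge-first table holds exactly one, which is the efficiency gain being suggested.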
@klesh That might work! Is there a way to run DevLake locally in debug mode? I'd like to step through the ApiCollector implementation to understand its impact on the rest of the steps for importing BB Server data...
Yes, sure. You may follow this guide. In case you want to execute specific subtasks, you may go to the
remember to change the plugin name and arguments accordingly.
@klesh can you please assign this task to me? I have started to work on it |
@sstojak1 Thanks for the reminder. Done. 🤝
What and why to refactor
What are you trying to refactor? Why should it be refactored now?
The pr_collector for Bitbucket Server consistently adds the same data to the RAW_PULL_REQUEST_TABLE after each run.
Consequently, the extractApiPullRequests step slows down because it has to sift through every record in the raw table, duplicates included.
For instance, if a repository has 1000 pull requests, after 10 job runs, the raw table will contain 10,000 rows, and extractApiPullRequests will have to process each of these records.
Was there a need to keep all of that historical raw API data, making deletion infeasible?
Describe the solution you'd like
How to refactor?
Related issues
Please link any other related issues here.
Additional context
Add any other context or screenshots about the feature request here.
How to recreate:
Run Collect Data for Bitbucket Server more than once and observe the size of the _raw_bitbucket_server_api_pull_requests table.