Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add logging #2329

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Add logging #2329

wants to merge 6 commits into from

Conversation

moxarth-elastic
Copy link
Collaborator

@moxarth-elastic moxarth-elastic commented Apr 3, 2024

Part of #2299

This PR includes logging for the following connectors:

  1. Azure Blob Storage
  2. Confluence
  3. MS SQL
  4. Oracle
  5. PostgreSQL

Also, there are few changes in confluence as per this PR: #2326

Drive folder for log files with DEBUG log_level: https://drive.google.com/drive/folders/13-P99PjgjFdjI9hrmPsuhNst6Kdu5C0-

Checklists

Pre-Review Checklist

  • this PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check config.yml.example)
  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • if there is no GH issue, please create it. Each PR should have a link to an issue
  • this PR has a thorough description
  • Covered the changes with automated tests

Copy link
Contributor

@timgrein timgrein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement so far! Left some comments

@@ -714,6 +726,7 @@ async def fetch_server_space_permission(self, space_key):
return {}

url = URLS[SPACE_PERMISSION].format(space_key=space_key)
self._logger.debug(f"Fetching permissions for '{space_key}' space")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be info log-level as it indicates progress?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this should be info.

Comment on lines 258 to 260
self._logger.info(
"Fetching all tables as the configuration field 'tables' is set to '*' or advanced sync rule is enabled."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._logger.info(
"Fetching all tables as the configuration field 'tables' is set to '*' or advanced sync rule is enabled."
)
self._logger.info(
"Fetching all tables as the configuration field 'tables' is set to '*' or advanced sync rules are enabled."
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also split the log message without concatenating two possible reasons with or. This adds ambiguity IMO.

Comment on lines 343 to 345
self._logger.debug(
f"Streaming records from database for table: {table} or using query: {query}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not the biggest fan of using or inside a log statement as it adds ambiguity. I think we should have two distinct log messages depending on the value of query. You could also move the assignment of the parameter for get_cursor also into this condition to not check it twice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, i've bifurcated this and made it dependant on the value of query.

@@ -553,6 +568,9 @@ async def fetch_documents_from_query(self, tables, query):
Yields:
Dict: Document to be indexed
"""
self._logger.debug(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be info level as it indicates progress?

@@ -218,6 +219,9 @@ async def ping(self):
async def get_tables_to_fetch(self, is_filtering=False):
tables = configured_tables(self.tables)
if is_wildcard(tables) or is_filtering:
self._logger.info(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reasoning as in mssql.py: I think we should've two distinct log messages instead of using one containing or.

@@ -537,6 +546,9 @@ async def fetch_documents_from_query(self, tables, query):
Yields:
Dict: Document to be indexed
"""
self._logger.debug(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be info level?

@artem-shelkovnikov
Copy link
Member

Please do not combine multiple connectors in same PR.

As for the changes - can you include example logs for the sync before/after for each connector with INFO and separately with DEBUG log level? E.g.:

  • ABS Before INFO
  • ABS After INFO
  • ABS Before DEBUG
  • ABS After DEBUG
  • Confluence Before INFO
  • Confluence After INFO
  • etc...

@moxarth-elastic
Copy link
Collaborator Author

Please do not combine multiple connectors in same PR.

@artem-shelkovnikov We're planning to update the logging functionality across roughly 20 connectors.
During our weekly meeting, we discussed with @DianaJourdan that bundling the changes for every 5 connectors into a single pull request would streamline the process.
Since the changes are confined to the logger functionality, reviewing the updates for 5 connectors at a time should not be overly time-consuming or complicated. This approach avoids the complexity of simultaneously managing 20 open pull requests per connector.
Consequently, we will proceed by consolidating the updates into four separate pull requests, each encompassing 5 connectors.

Let me know if you have differing opinion for raising four PRs each containing 5 connector.

@moxarth-elastic
Copy link
Collaborator Author

can you include example logs for the sync before/after for each connector with INFO and separately with DEBUG log level?

I think we can inspect the logs from E2E tests in Buildkite rather than creating four separate log files (including before/after logs for both INFO and DEBUG levels). Generating four files for approximately 20 connectors seems very time-consuming.

@artem-shelkovnikov
Copy link
Member

I think we can inspect the logs from E2E tests in Buildkite rather than creating four separate log files (including before/after logs for both INFO and DEBUG levels). Generating four files for approximately 20 connectors seems very time-consuming.

That is my concern as well, but also reviewing logs for E2E tests in Buildkite side by side for each reviewer is time consuming, I think it's fair to at least attach the links to E2E test runs here for each connector before/after. Also E2E only sends DEBUG logs, which produces a lot more than INFO, manually filtering these outs is huge work as well.

Without actually looking at logs generated we cannot tell if logs are good now or not.

@parth-elastic
Copy link
Collaborator

Without actually looking at logs generated we cannot tell if logs are good now or not.

Indeed, your point sounds reasonable. How about a hybrid approach, where we generate a before/after and an Info/Debug log file for just one connector within each PR for all the 4 PRs.
Since the logs for different connectors are likely to be similar, this method would enable us to check the log changes effectively with reduced effort. WDYT?

@artem-shelkovnikov
Copy link
Member

Since the logs for different connectors are likely to be similar, this method would enable us to check the log changes effectively with reduced effort. WDYT?

But I think also the point is that for some connectors (like S3/GCS/ABS) they will likely be similar, so for such connectors we can do that, but for different style connectors (S3 vs Jira vs Github) expected logs are gonna be different. We can try to break connectors by log simplicity into grouped PRs, but I'm also not super sure about this. Let me bring this up with the team

@parth-elastic
Copy link
Collaborator

parth-elastic commented Apr 3, 2024

the point is that for some connectors (like S3/GCS/ABS) they will likely be similar, so for such connectors we can do that

We can organize our pull requests by grouping similar types of connectors together. For instance, this PR includes 3 database connectors, so we could produce before/after logs specifically for MSSQL as a representative. In a subsequent PR, if we bundle together connectors like S3, GCS, Box, OneDrive, and Dropbox, which are all storage services, we can create a single set of logs for one of them as an example. We can apply the same strategy for future PRs, streamlining our logging efforts by focusing on a representative connector from each group.

@artem-shelkovnikov
Copy link
Member

I've opened the discussion with the team, we'll get back to you about it shortly

@artem-shelkovnikov
Copy link
Member

Hey @moxarth-elastic, as mentioned in Slack, can we have logs in DEBUG log_level for each connector changed here to continue the review?

@moxarth-elastic
Copy link
Collaborator Author

Hey @moxarth-elastic, as mentioned in Slack, can we have logs in DEBUG log_level for each connector changed here to continue the review?

@artem-shelkovnikov Let me upload the log files for the connectors of this PR, so you can continue your review

@moxarth-elastic
Copy link
Collaborator Author

@artem-shelkovnikov You can check the log file from this drive folder: https://drive.google.com/drive/folders/13-P99PjgjFdjI9hrmPsuhNst6Kdu5C0-

@artem-shelkovnikov artem-shelkovnikov requested a review from a team April 15, 2024 16:29
@artem-shelkovnikov
Copy link
Member

Can you break this PR into smaller PRs per connector? I've checked the logs and will have feedback for each connector, addressing all those in one PR will be pretty tough IMO

@moxarth-elastic
Copy link
Collaborator Author

Can you break this PR into smaller PRs per connector? I've checked the logs and will have feedback for each connector, addressing all those in one PR will be pretty tough IMO

@artem-shelkovnikov Split this PR into smaller connector-wise PRs
ABS - #2530
Confluence - #2526
Oracle - #2528
MSSQL - #2527
PostgreSQL - #2529

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants