
Add exception catch for redis connection failure #1261

Open · wants to merge 7 commits into master

Conversation

@corynezin (Contributor) commented May 25, 2020

Resolves #1153
Resolves (Maybe?) #998

@codecov (bot) commented May 25, 2020

Codecov Report

Merging #1261 (704083d) into master (bd0fcc1) will increase coverage by 1.16%.
The diff coverage is 95.23%.


@@            Coverage Diff             @@
##           master    #1261      +/-   ##
==========================================
+ Coverage   93.84%   95.01%   +1.16%     
==========================================
  Files          22       43      +21     
  Lines        2341     5841    +3500     
==========================================
+ Hits         2197     5550    +3353     
- Misses        144      291     +147     
Impacted Files Coverage Δ
rq/worker.py 88.90% <91.66%> (-2.11%) ⬇️
tests/test_worker.py 97.40% <100.00%> (ø)
rq/exceptions.py 88.23% <0.00%> (-11.77%) ⬇️
rq/registry.py 96.95% <0.00%> (-1.83%) ⬇️
rq/scheduler.py 96.68% <0.00%> (-0.19%) ⬇️
rq/logutils.py 100.00% <0.00%> (ø)
rq/decorators.py 100.00% <0.00%> (ø)
rq/cli/helpers.py 86.36% <0.00%> (ø)
rq/serializers.py 100.00% <0.00%> (ø)
... and 24 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bd0fcc1...704083d.

@corynezin marked this pull request as ready for review May 25, 2020 19:17
rq/worker.py Outdated
except redis.exceptions.ConnectionError as conn_err:
self.log.error('Could not connect to Redis instance: %s '
'Retrying...', conn_err)
time.sleep(1)
Maybe it would be better to use an exponential backoff approach here instead of retrying every 1 second?

Contributor Author
Ok, added

self.log.error('Could not connect to Redis instance: %s '
'Retrying...', conn_err)
time.sleep(connection_wait_time)
connection_wait_time *= self.exponential_backoff_factor
Collaborator
Let's limit the connection_wait_time to 60 seconds max. So it'll try every 1, 2, 4, 8 ... up to 60 seconds.
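(For reference, a minimal standalone sketch of the capped exponential backoff being discussed — the helper name, parameters, and defaults below are illustrative assumptions, not the PR's actual code:)

import time

import redis


def call_with_retry(fetch, log, backoff_factor=2.0, max_wait=60.0):
    # Keep calling `fetch()` until it succeeds, sleeping 1, 2, 4, 8, ...
    # seconds between failed attempts, but never more than `max_wait` seconds.
    wait = 1.0
    while True:
        try:
            return fetch()
        except redis.exceptions.ConnectionError as conn_err:
            log.error('Could not connect to Redis instance: %s '
                      'Retrying in %d seconds...', conn_err, wait)
            time.sleep(wait)
            wait = min(wait * backoff_factor, max_wait)

(After a successful call the worker also resets its wait time back to 1 second, which is what the else branch in the diff below does.)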

rq/worker.py Outdated
@@ -567,14 +567,15 @@ def dequeue_job_and_maintain_ttl(self, timeout):
self.set_state(WorkerStatus.IDLE)
self.procline('Listening on ' + qnames)
self.log.debug('*** Listening on %s...', green(qnames))

connection_wait_time = 1.0;
Collaborator
Please remove the use of semicolons.

rq/worker.py Outdated
time.sleep(connection_wait_time)
connection_wait_time *= self.exponential_backoff_factor
else:
connection_wait_time = 1.0;
Collaborator
Please remove the use of semicolons.

rq/worker.py Outdated
@@ -591,6 +592,13 @@ def dequeue_job_and_maintain_ttl(self, timeout):
break
except DequeueTimeout:
pass
except redis.exceptions.ConnectionError as conn_err:
self.log.error('Could not connect to Redis instance: %s '
'Retrying...', conn_err)
Collaborator
Please add Retrying in <wait_time> seconds...
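(A sketch of the requested wording, with the current wait time interpolated into the message; variable names follow the diff above and this is illustrative only, not the merged code:)

self.log.error('Could not connect to Redis instance: %s '
               'Retrying in %d seconds...', conn_err, connection_wait_time)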

rq/worker.py Outdated
@@ -1,4 +1,4 @@
# -*- coding: utf-8 -*-
#wait_time_after_connection_failure -*- coding: utf-8 -*-
Collaborator
Why does this line need to be changed?

Contributor Author
My mistake!

@Asrst (Contributor) commented Dec 2, 2020

Hi @selwin,
I came to this thread from issue #998.
May I know why this PR has not been merged into master yet? I can make the required changes and provide a fix.

rq workers do not automatically retry the connection when it is lost or when the redis-server is restarted for maintenance. Until the rq worker is restarted, it keeps raising ConnectionError and failing. Lately this has become a serious issue in production. Hoping this PR can fix it.

@selwin (Collaborator) commented Dec 2, 2020

@Asrst because my comments haven't been addressed.

@rkvf01 commented Jan 7, 2021

@selwin Is there anything we can do, since it has been in this state since June 2020?

@selwin (Collaborator) commented Jan 8, 2021

Yes, if my comments on this PR are addressed, this is also something I'd like to see merged in.

@corynezin (Contributor Author)
I think I have resolved the issues now; please let me know if there is anything else!

Labels: None yet
Projects: None yet
Development: Successfully merging this pull request may close these issues:
Add exception catching for when Redis server is down
5 participants