-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osd: skip upgrade if migration is pending #14196
Conversation
pkg/operator/ceph/cluster/osd/osd.go
Outdated
// skip update of all the OSDs if the migration of OSD is pending. | ||
if c.replaceOSD != nil { | ||
logger.Info("skipping update of all the OSDs since OSD migration is pending") | ||
updateQueue.Clear() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only want to skip the reconcile of OSDs that are still on the old/unsupported format. For example, what if all the OSD migration is completed, except one OSD gets stuck and fails the migration? We don't want to block the update of all the OSDs that have successfully migrated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might need some more changes. I'll take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the changes to only skip reconcile of the OSDs that are still on the old/unsupported format and need migration. Testing looks good and added details in the description.
37a4749
to
a152bb4
Compare
a152bb4
to
5c9fca9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think things here look good. I have some questions to make sure my understanding is correct, and one suggestion to avoid having to do nil pointer checks for any future changes.
pkg/operator/ceph/cluster/osd/osd.go
Outdated
// identify OSDs that need migration to a new backend store | ||
err := c.replaceOSDForNewStore() | ||
osdsToSkipReconcile, err := controller.GetDaemonsToSkipReconcile(c.clusterInfo.Context, c.context, c.clusterInfo.Namespace, OsdIdLabelKey, AppName) | ||
if err != nil { | ||
return errors.Wrapf(err, "failed to replace an OSD that needs migration to new backend store in namespace %q", namespace) | ||
logger.Warningf("failed to get osds to skip reconcile. %v", err) | ||
} | ||
|
||
osdsToSkipReconcile, err := controller.GetDaemonsToSkipReconcile(c.clusterInfo.Context, c.context, c.clusterInfo.Namespace, OsdIdLabelKey, AppName) | ||
// replace OSDs for a new backing store | ||
osdsToBeReplaced, err := c.replaceOSDForNewStore() | ||
if err != nil { | ||
logger.Warningf("failed to get osds to skip reconcile. %v", err) | ||
return errors.Wrapf(err, "failed to replace OSD for new backing store %q in namespace %q", c.spec.Storage.Store.Type, namespace) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like these 2 blocks are changing positions. What is the behavior that results from switching the order here? I'm having trouble understanding what the purpose might be, so I imagine it must be related to side-effects of execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No side-effects. I'll move it back to the original order.
// getOSDReplaceInfo returns an existing OSD that needs to be replaced for a new backend store | ||
func (c *Cluster) getOSDReplaceInfo() (*OSDReplaceInfo, error) { | ||
osdReplaceInfo, err := GetOSDReplaceConfigMap(c.context, c.clusterInfo) | ||
func (c *Cluster) replaceOSDForNewStore() (*OSDReplaceInfoList, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be best to have this return a non-pointer list. Looking at usage of this function's output, I see a lot of checks for nil followed immediately by some other check. I believe that future code maintenance will be slightly less likely to introduce bugs if we don't have to worry about non-nil return values in the caller.
func (c *Cluster) replaceOSDForNewStore() (*OSDReplaceInfoList, error) { | |
func (c *Cluster) replaceOSDForNewStore() (OSDReplaceInfoList, error) { |
pkg/operator/ceph/cluster/osd/osd.go
Outdated
// replaceOSDForNewStore deletes an existing OSD deployment and saves the information in the configmap | ||
func (c *Cluster) deleteOSDDeployment(osdID int) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function name in doc comment and actual function name don't match. It looks like the rest of the text is still correct based on my interpretation, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a manual test to consider for validating these changes... Restart the operator in the middle of a migration, then confirm that the non-migrated OSDs will still be migrated, the migrated OSDs won't be migrated again, and the in-progress migration finishes.
5c9fca9
to
d4efb2b
Compare
Tested with following scenarios:
1 and 3 works fine. For 2, the config map is not deleted and in the next reconcile operator will migrate the same OSD again. This is not desirable (although all the OSDs get migrated without issues at the end). One way to fix is to always delete the configMap at the end of the OSD reconcile without checking if |
d4efb2b
to
3d76974
Compare
This fixes the migration of the OSD that was previously migrated. Pushed the changes in the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for working through those manual test scenarios. Just a couple nits to consider.
3d76974
to
30a0532
Compare
Currently we allow upgrade of other OSDs while migration of OSD, due to change in backing store, is pending. This will not work if the updated OSD ceph version does not support the currently applied backing store. So rook will skip any OSD upgrade if the OSD migration is pending. Signed-off-by: sp98 <sapillai@redhat.com>
30a0532
to
322549c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes look good.
While I have larger-scale thoughts that the OSD code is quite complicated and could be improved with some code tidying, that isn't really an issue that this PR can solve.
Currently we allow upgrade of other OSDs while migration of OSD, due to change in backing store, is pending. This will not work if the updated OSD ceph version does not support the currently applied backing store. So rook will skip any OSD upgrade if the OSD migration is pending.
Testing:
ceph
image and thestore
type in the cephCluster CR at the same time.operator-logs.txt
cephCluster CR status before upgrade:
cephCluster CR status after upgrade.
Ceph status
Checklist: