Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Associate restore snapshot task to parent mount task #108705

Conversation

nicktindall
Copy link
Contributor

@nicktindall nicktindall commented May 16, 2024

Implemented the test as an integration test that polls the tasks API to confirm the association of parent/child tasks. Thanks for the feedback @DaveCTurner.

Closes #105830

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah mocking this much stuff is pretty bad IMO, it will get in the way of future changes/refactorings to these APIs I'm sure.

I would suggest doing this as a org.elasticsearch.xpack.searchablesnapshots.BaseSearchableSnapshotsIntegTestCase which gives you real implementations of all this stuff. Submit a task to the master which blocks, something like this:

clusterService.submitUnbatchedStateUpdateTask("block master service", new ClusterStateUpdateTask() {
@Override
public ClusterState execute(ClusterState currentState) throws Exception {
safeAwait(cyclicBarrier); // notify test that we are blocked
safeAwait(cyclicBarrier); // wait to be unblocked by test
return currentState;
}
@Override
public void onFailure(Exception e) {
throw new AssertionError("block master service", e);
}
});

(get the clusterService using internalCluster().getCurrentMasterNodeInstance(ClusterService.class))

That will give you time to call the list-tasks API and verify that the mount task is the parent of the restore task.

@nicktindall nicktindall added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >bug labels May 17, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

@nicktindall nicktindall added the Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. label May 17, 2024
@nicktindall nicktindall force-pushed the fix/105830_link_restore_snapshot_task_to_parent_mount branch from a34015d to c5bdd8b Compare May 17, 2024 05:44
safeAwait(cyclicBarrier); // Unblock the master thread
if (response != null) {
response.actionGet();
assertAcked(indicesAdmin().prepareDelete(indexName));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actionGet and delete index here were an attempt to fix a failure that occurred in CI in a teardown, one or both of these may be necessary. Happy to remove the delete if it's not necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, yeah, it seems we have to wait for the mount to complete before finishing the test or else there's a race between cleaning up the test cluster and the mount process creating the internal .snapshot-blob-cache index.

I expect there's no need to manually delete the test index here tho but it doesn't hurt either. Would you add a code comment about why we need these nonobvious cleanup steps?

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test looks great. I left a few pointers but nothing substantial.

safeAwait(cyclicBarrier); // Unblock the master thread
if (response != null) {
response.actionGet();
assertAcked(indicesAdmin().prepareDelete(indexName));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, yeah, it seems we have to wait for the mount to complete before finishing the test or else there's a race between cleaning up the test cluster and the mount process creating the internal .snapshot-blob-cache index.

I expect there's no need to manually delete the test index here tho but it doesn't hurt either. Would you add a code comment about why we need these nonobvious cleanup steps?

} finally {
safeAwait(cyclicBarrier); // Unblock the master thread
if (response != null) {
response.actionGet();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This'd be as good a time as any to stop using these infinite waits in tests, doing something like this and then using safeGet() instead:

diff --git a/test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java b/test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java
index 80f9f2abea1..94c89318718 100644
--- a/test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java
+++ b/test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java
@@ -2154,6 +2154,10 @@ public abstract class ESTestCase extends LuceneTestCase {
     public static <T> T safeAwait(SubscribableListener<T> listener) {
         final var future = new PlainActionFuture<T>();
         listener.addListener(future);
+        return safeGet(future);
+    }
+
+    private static <T> T safeGet(Future<T> future) {
         try {
             return future.get(SAFE_AWAIT_TIMEOUT.millis(), TimeUnit.MILLISECONDS);
         } catch (InterruptedException e) {

@DaveCTurner DaveCTurner dismissed their stale review May 17, 2024 09:27

comments addressed

- use assertBusy to reduce task-get complexity
- don't wait forever for mount to complete
- document additional required cleanup
- add ESTestCase#safeGet and use it
@nicktindall nicktindall marked this pull request as ready for review May 20, 2024 00:13
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label May 20, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few tiny style points.

}

private TaskInfo getTaskForActionFromMaster(String action) {
var ltr = new ListTasksRequest().setDetailed(true).setNodes(internalCluster().getMasterName()).setActions(action);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would rather not use abbrs. for var. names wherever poss. :)

Here I think I'd just inline the variable, it's only used in one place - saves having to think up a name at all.

var ltr = new ListTasksRequest().setDetailed(true).setNodes(internalCluster().getMasterName()).setActions(action);
ListTasksResponse response = client().execute(TransportListTasksAction.TYPE, ltr).actionGet();
int matchingTasks = response.getTasks().size();
assertEquals(String.format(Locale.ROOT, "Expected a single task for action %s, got %d", action, matchingTasks), 1, matchingTasks);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd use assertThat(..., hasSize(1)) here - that should give the same information on failure as what you've got here but without needing a hand-crafted message.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nicktindall nicktindall merged commit deb3ef9 into elastic:main May 21, 2024
15 checks passed
@nicktindall nicktindall deleted the fix/105830_link_restore_snapshot_task_to_parent_mount branch May 21, 2024 07:28
jedrazb pushed a commit to jedrazb/elasticsearch that referenced this pull request May 21, 2024
* Associate restore snapshot task to parent mount task

Closes elastic#105830
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. Team:Distributed Meta label for distributed team v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cluster:admin/snapshot/mount does not link to its child cluster:admin/snapshot/restore task
4 participants