roachtest: backup-restore/online-restore failed #124330

Closed
cockroach-teamcity opened this issue May 17, 2024 · 3 comments · Fixed by #124348
Labels: branch-master (failures on the master branch), C-test-failure (broken test, automatically or manually discovered), O-roachtest, O-robot (originated from a bot), T-disaster-recovery

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented May 17, 2024

roachtest.backup-restore/online-restore failed with artifacts on master @ 855b9cc97afa3df4f7e17f928c04ab0834b2630c:

(monitor.go:154).Wait: monitor failure: backup 1_round-trip-test-backup_cluster: error verifying online restore: download job 969459392374472708 did not download all data
test artifacts and logs in: /artifacts/backup-restore/online-restore/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-38828

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels May 17, 2024
@msbutler msbutler self-assigned this May 17, 2024
@msbutler
Collaborator

I can repro this, and interestingly it only occurs after a cluster restore. I annotated the error, and we're only missing about 2 MB of data. I wonder if this check is failing because we don't download the temp system db from the pre-restore phase.

msbutler added a commit to msbutler/cockroach that referenced this issue May 17, 2024
This patch adds the pre restore data spans to the list of spans to download.
While these pre restore spans map to data in the temporary system table
database that are then rewritten to the actual system table, the download job
ought to download all external data linked into the cluster out of principle.

Fixes cockroachdb#124330

Release note: none
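
The idea in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the actual backupccl code, which is not shown in this thread: the `Span` and `restoreDetails` types and the `spansToDownload` function are hypothetical names introduced only for the example. The point is that the download list is built from both the pre-restore spans (temporary system database data) and the main restore spans, so nothing linked into the cluster is left behind.

```go
package main

import "fmt"

// Span is a hypothetical stand-in for a key span [Start, End).
type Span struct {
	Start, End string
}

// restoreDetails is a hypothetical, simplified view of what a cluster
// restore tracks. It is not the real job details type.
type restoreDetails struct {
	PreRestoreSpans []Span // data linked into the temporary system database
	MainSpans       []Span // data linked in for user tables
}

// spansToDownload returns every span whose external data was linked into
// the cluster, including the pre-restore spans that were previously skipped.
func spansToDownload(d restoreDetails) []Span {
	out := make([]Span, 0, len(d.PreRestoreSpans)+len(d.MainSpans))
	out = append(out, d.PreRestoreSpans...)
	out = append(out, d.MainSpans...)
	return out
}

func main() {
	d := restoreDetails{
		PreRestoreSpans: []Span{{Start: "tmp-system/a", End: "tmp-system/b"}},
		MainSpans:       []Span{{Start: "table/100", End: "table/101"}},
	}
	for _, sp := range spansToDownload(d) {
		fmt.Printf("download [%s, %s)\n", sp.Start, sp.End)
	}
}
```

With the pre-restore spans omitted from the list, a verification that checks for any remaining external data would still find the small amount of temporary system database data, which matches the roughly 2 MB discrepancy described in the comment above.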
@msbutler
Collaborator

my theory was correct. fix here #124348

@cockroach-teamcity
Member Author

roachtest.backup-restore/online-restore failed with artifacts on master @ 5d013285fa696f53df2abb39f44ffc777125fe1c:

(monitor.go:154).Wait: monitor failure: backup 2_round-trip-test-backup_cluster: error verifying online restore: download job 970308748052201474 did not download all data
test artifacts and logs in: /artifacts/backup-restore/online-restore/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@msbutler msbutler removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label May 20, 2024
craig bot pushed a commit that referenced this issue May 21, 2024
119416: pkg/util/eventagg: general aggregation framework for reduction of event cardinality r=dhartunian a=abarganier

**Reviewer note: review commit-wise**

The eventagg package is (currently) a proof of concept ("POC") that aims to provide an easy-to-use library that standardizes the way in which we aggregate Observability event data in CRDB. The goal is to eventually emit that data as "exhaust" from CRDB, which downstream systems can consume to build Observability features that do not rely on CRDB's own availability to aid in debugging & investigations. Additionally, we want to provide facilities for code within CRDB to consume this same data, such that it can also power features internally.

This pull request contains work to create the aggregation mechanism in `pkg/util/eventagg`.

These facilities provide a way of aggregating notable events to reduce cardinality, before performing further processing and/or structured logging.

In addition to the framework, a toy SQL Stats example is provided in `pkg/sql/sqlstats/aggregate.go`, which shows the current developer experience when using the APIs.

See `pkg/util/eventagg/doc.go` for more details

Since this feature is currently experimental, it's gated by the `COCKROACH_ENABLE_STRUCTURED_EVENTS` environment variable, which is disabled by default.

---

Release note: none

Epic: CRDB-35919

123120: ui: Highlight unavailable ranges in red on the summary bar with nonzero r=abarganier a=theloneexplorerquest

Modify the summary bar to change the color of unavailable ranges. When the number of unavailable ranges is greater than zero, it will be displayed in red; if it is zero, it will be green.

Fix: #122014

Release note (ui): Changed the color of unavailable ranges on the summary bar to red when nonzero; ranges are green when zero.

124160: roachtest: add test for admission control disk bandwidth  r=sumeerbhola a=aadityasondhi

This test runs a single node target cluster that has two workloads
running on it. The lower priority (qos=background) is very bandwidth
intensive, and without the AC bandwidth limiter would saturate the
provisioned bandwidth (controlled using cgroups).

This test shows how setting the cluster setting
`kvadmission.store.provisioned-bandwidth` limits the disk bandwidth
usage of lower priority work and shapes it at the value set in the
setting.

Fixes #121576.

Release note: None


124293: tools: switch md5 cmd name based on existence  r=dt a=dt

Release note: none.
Epic: none.

124348: backupccl: download pre restore data in cluster restore r=dt a=msbutler

This patch adds the pre restore data spans to the list of spans to download.
While these pre restore spans map to data in the temporary system table
database that are then rewritten to the actual system table, the download job
ought to download all external data linked into the cluster out of principle.

Fixes #124330

Release note: none

124403: roachtest: use first transient error when checking for flakes r=srosenberg a=renatolabs

Previously, roachtest would only look at the outermost error in a chain that matched a `TransientError` (or `ErrorWithOwnership`) when checking for flakes. However, that is in most cases *not* what we want: if a transient error wraps another transient error, the actual reason for the failure is the original (wrapped) error.

Informs: #123887

Release note: None

124486: kvclient: add WithFiltering option to rangefeed client r=nvanbenschoten,msbutler a=stevendanna

This adds a WithFiltering option to the rangefeed client that passes through the option to the underlying rangefeed.

Epic: none
Release note: None

124491: raft: remove RawNode.TickQuiesced r=pav-kv a=nvanbenschoten

This commit removes the `(*RawNode).TickQuiesced` method. The method was deprecated back in etcd-io/raft#62 and has not been in use since 2018.

Epic: None
Release note: None

Co-authored-by: Alex Barganier <abarganier@cockroachlabs.com>
Co-authored-by: theloneexplorerquest <theloneexplorerquest@gmail.com>
Co-authored-by: Aaditya Sondhi <20070511+aadityasondhi@users.noreply.github.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in be876f3 May 21, 2024
Disaster Recovery Backlog automation moved this from Backlog to Done May 21, 2024
msbutler added a commit to msbutler/cockroach that referenced this issue Jun 3, 2024
This patch adds the pre restore data spans to the list of spans to download.
While these pre restore spans map to data in the temporary system table
database that are then rewritten to the actual system table, the download job
ought to download all external data linked into the cluster out of principle.

Fixes cockroachdb#124330

Release note: none

2 participants