Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stress test fail: Expected State - assertion IsDone failed #813

Open
Yuval-Ariel opened this issue Jan 23, 2024 · 1 comment
Open

stress test fail: Expected State - assertion IsDone failed #813

Yuval-Ariel opened this issue Jan 23, 2024 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@Yuval-Ariel
Copy link
Contributor

Yuval-Ariel commented Jan 23, 2024

db_stress: db_stress_tool/expected_state.cc:399: virtual rocksdb::{anonymous}::ExpectedStateTraceRecordHandler::~ExpectedStateTraceRecordHandler(): Assertion `IsDone()\' failed.
#7  0x00007f9a52aea729 in __assert_fail_base (fmt=0x7f9a52c80588 "%s%s%s:%u: %s%sAssertion `%s\' failed.\
%n", assertion=0x55f5bbb04740 "IsDone()", file=0x55f5bbb04600 "db_stress_tool/expected_state.cc", line=399, function=<optimized out>) at assert.c:92
92\tassert.c: No such file or directory.
#8  0x00007f9a52afbfd6 in __GI___assert_fail (assertion=assertion@entry=0x55f5bbb04740 "IsDone()", file=file@entry=0x55f5bbb04600 "db_stress_tool/expected_state.cc", line=line@entry=399, function=function@entry=0x55f5bbb046a0 "virtual rocksdb::{anonymous}::ExpectedStateTraceRecordHandler::~ExpectedStateTraceRecordHandler()") at assert.c:101
101\tin assert.c
#9  0x000055f5bb8abd48 in rocksdb::(anonymous namespace)::ExpectedStateTraceRecordHandler::~ExpectedStateTraceRecordHandler (this=<optimized out>, __in_chrg=<optimized out>) at db_stress_tool/expected_state.cc:399
399\t  ~ExpectedStateTraceRecordHandler() { assert(IsDone()); }
#10 0x000055f5bb8ce028 in rocksdb::(anonymous namespace)::ExpectedStateTraceRecordHandler::~ExpectedStateTraceRecordHandler (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/9/ext/atomicity.h:69
69\t    return __result;
#11 rocksdb::(anonymous namespace)::ExpectedStateTraceRecordHandler::~ExpectedStateTraceRecordHandler (this=<optimized out>, __in_chrg=<optimized out>) at db_stress_tool/expected_state.cc:399
399\t  ~ExpectedStateTraceRecordHandler() { assert(IsDone()); }
#12 std::default_delete<rocksdb::(anonymous namespace)::ExpectedStateTraceRecordHandler>::operator() (this=<optimized out>, __ptr=<optimized out>) at /usr/include/c++/9/bits/unique_ptr.h:81
81\t\tdelete __ptr;
#13 std::default_delete<rocksdb::(anonymous namespace)::ExpectedStateTraceRecordHandler>::operator() (this=<optimized out>, __ptr=<optimized out>) at /usr/include/c++/9/bits/unique_ptr.h:75
75\t      operator()(_Tp* __ptr) const
#14 std::unique_ptr<rocksdb::(anonymous namespace)::ExpectedStateTraceRecordHandler, std::default_delete<rocksdb::(anonymous namespace)::ExpectedStateTraceRecordHandler> >::reset (__p=<optimized out>, this=<optimized out>) at /usr/include/c++/9/bits/unique_ptr.h:402
402\t\t  get_deleter()(std::move(__p));
#15 rocksdb::FileExpectedStateManager::Restore (this=<optimized out>, db=<optimized out>) at db_stress_tool/expected_state.cc:670
670\t      handler.reset(new ExpectedStateTraceRecordHandler(seqno - saved_seqno_,
#16 0x000055f5bba6b7c5 in rocksdb::SharedState::Restore (db=<optimized out>, this=0x6120000001c0) at db_stress_tool/db_stress_test_base.cc:396
396\t    ProcessRecoveredPreparedTxns(shared);
#17 rocksdb::StressTest::FinishInitDb (this=0x61c000000080, shared=0x6120000001c0) at db_stress_tool/db_stress_test_base.cc:384
384\t    Status s = shared->Restore(db_);
#18 0x000055f5bb9ed07f in rocksdb::RunStressTestImpl (shared=0x6120000001c0) at db_stress_tool/db_stress_driver.cc:77
77\t  stress->FinishInitDb(shared);
#19 0x000055f5bb9f258d in rocksdb::RunStressTest (shared=shared@entry=0x6120000001c0) at db_stress_tool/db_stress_driver.cc:213
213\t  bool result = RunStressTestImpl(shared);
#20 0x000055f5bb8993f3 in rocksdb::db_stress_tool (argc=<optimized out>, argv=<optimized out>) at db_stress_tool/db_stress_tool.cc:397
397\t  if (RunStressTest(shared.get())) {
[2024-01-22 19:00:28.564882] ERROR: db_stress (pid=41123) failed before kill: exitcode=-6, signal=SIGABRT

further documentation in ~/expected_err in instance 173

There are several issues:

ExpectedStateTraceRecordHandler IsDone Assertion

During FileExpectedStateManager::Restore(DB* db), a ExpectedStateTraceRecordHandler is created with max_write_ops which is the gap between the current db seq number and the seq found on the state file. The trace record replayer replays the trace file and increments num_write_ops_ on each operation. IsDone checks that num_write_ops_ == max_write_ops_.
When using use_txn=1 and txn_write_policy=1, there are about half num_write_ops than the max. This always happens.

use_txn with reopen > 1 - fails with corruption.

After reopen, a new state is created with a seq > 0. This happens since we added the trace_ops flag.
However, When db_stress is rerun on the db, it recognizes theres a state to recover from and a trace file too. But the seq from the db is 0 which is lower than in the state. (in

if (seqno < saved_seqno_) {
return Status::Corruption("DB is older than any restorable expected state");
)
Cmd to reproduce:

                "--threads=1",
                "--use_txn=1",
                "--skip_verifydb=true",
                "-continuous_verification_interval=0",
                "--clear_column_family_one_in=0",
                "--reopen=1",

Txn with write policy 1 creates a trace file which is unreadable.

All of the above came up since we've added #797 which creates a trace file by default.

@Yuval-Ariel Yuval-Ariel added the bug Something isn't working label Jan 23, 2024
@Yuval-Ariel Yuval-Ariel self-assigned this Jan 23, 2024
Yuval-Ariel added a commit that referenced this issue Jan 24, 2024
This should be reverted once #813 is handled
@Yuval-Ariel
Copy link
Contributor Author

Yuval-Ariel commented Jan 25, 2024

for the IsDone assertion failure, my findings are:
When the stress test resumes and finds a trace file and state file with a seq number prefix e.g. 0.state ,
The stress test tries to verify the trace file holds all the changes done to the db since that seq number. It does that by replaying the trace file and counting how many put, delete and rangedelete there were. Then it compares that to how many seq numbers have changed since that seq on the file (IsDone assertion).
The assumption that the gap in seq numbers should be equivalent to the number of write and delete operations recorded in the trace file seems very weak to me. The txn db for example doesnt keep those assumptions and it seems like there are other instances where this assumption is false.

since we want to keep the trace file for debugging, for now, the workaround can be that we'll avoid this assertion and only print the numbers (num_write_ops_ and max_write_ops_) for future debugging.

workaround will be done in - #816

Yuval-Ariel added a commit that referenced this issue Jan 28, 2024
Adding a trace file by default in PR #797 has revealed some incompatibilities between the trace file and several configurations (more details in #813). Keep the trace file and remove the IsDone assertion.
Yuval-Ariel added a commit that referenced this issue Jan 28, 2024
Adding a trace file by default in PR #797 has revealed some incompatibilities between the trace file and several configurations (more details in #813). Keep the trace file and remove the IsDone assertion.
Yuval-Ariel added a commit that referenced this issue Jan 29, 2024
Adding a trace file by default in PR #797 has revealed some incompatibilities between the trace file and several configurations (more details in #813). Keep the trace file and remove the IsDone assertion.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant