Compile with libcxx using clang-17 #557

Open
wants to merge 481 commits into
base: libcxx

Conversation

JinHai-CN
Contributor

@JinHai-CN JinHai-CN commented Feb 7, 2024

What problem does this PR solve?

Fix many compilation errors.

Issue link:

What is changed and how it works?

Code changes

  • Has Code change

Check List

Tests

  • Unit test
  • Integration test

@JinHai-CN JinHai-CN added the labels ci (PR can be tested) and wip (work in progress) on Feb 7, 2024
Ognimalf and others added 28 commits March 26, 2024 17:47
### What problem does this PR solve?

Add a parameter type check in the Python SDK when building a `knn` query,
to avoid a Thrift exception and a failure to disconnect.

Issue link:None

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Python SDK impacted, Need to update PyPI
Refactor TableEntry::MemIndexInsert

Issue link:infiniflow#432

### Type of change

- [x] Refactoring
### What problem does this PR solve?

Add benchmark for python insertion.

Issue link:None

### Type of change

- [x] Test cases
### What problem does this PR solve?

Fix unit test corruption for the index skiplist writer.
Clean up leftover code from infiniflow#878.

Issue link:infiniflow#813

### Type of change

- [x] Refactoring
infiniflow#883)

### What problem does this PR solve?

Add basic unit tests for PostingMerger
Fix the memory leak caused by ColumnIndexIterator not destroying
doc_list_slice_ when no memory pool is used

Issue link:[infiniflow#813](infiniflow#813)

### Type of change

- [x] Bug Fix
- [x] Test cases

---------

Co-authored-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?

Heap buffer overflow happens during offline index building

Issue link:infiniflow#887

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
…w#879)

### What problem does this PR solve?

1. Rename the function from Append to AppendData.
2. Make the delete flag of the base entry a const attribute. This member is
supposed to be immutable.

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?

1. Add: recycle log files after a checkpoint. WAL logs are recycled after
both kinds of checkpoint, and delta checkpoints are recycled after a full
checkpoint.
2. Refactor: the parsing of log filenames.
3. Fix: cleanup bug after replay (infiniflow#727).
 Cleanup may encounter a non-empty directory in the following case:
 3.1. Create a table and add some blocks. Write a delta catalog file.
 3.2. Drop the table and write another delta catalog file.
 3.3. Replay the delta catalog files. The "add some block" ops are pruned
in the replay stage, so the blocks do not appear in the catalog tree.
 3.4. On cleanup, the table directory is not empty, because the blocks
cannot be found in the catalog.
 3.5. Solution: cleanup now removes the whole directory.
4. Fix: a forced (manual) checkpoint now recycles log files.
5. Fix: a checkpoint should always write the WAL (even if the checkpoint
file is empty).
6. Add: unit tests for recycling.
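
The directory cleanup fix in item 3 can be illustrated with the C++17 standard library: `std::filesystem::remove` refuses to delete a non-empty directory, while `std::filesystem::remove_all` deletes it recursively. This is only a minimal sketch; `CleanupTableDir` is a hypothetical helper, not the project's actual cleanup code.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: remove a dropped table's directory even if it still
// contains block files that the replayed catalog no longer knows about.
// Returns the number of removed files and directories, or 0 on error.
inline std::uintmax_t CleanupTableDir(const fs::path &table_dir) {
    std::error_code ec;
    const std::uintmax_t removed = fs::remove_all(table_dir, ec);
    return ec ? 0 : removed;
}
```

Calling `fs::remove` on the same directory would fail while any leftover block file remains, which matches the "non-empty directory" symptom described above.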

TODO:
record all checkpoint file path in the last checkpoint file.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
- [x] Test cases
### What problem does this PR solve?

Memory is not recycled during offline index building.
Outlier tokens (length >= 1024), which lead to buffer overflow, are now
discarded.
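
A minimal sketch of the outlier filter described above, assuming tokens are plain strings. `kMaxTokenLen` and `DropOutlierTokens` are hypothetical names chosen for illustration; only the 1024 threshold comes from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed limit, taken from the "length >= 1024" wording above.
constexpr std::size_t kMaxTokenLen = 1024;

// Keep only tokens shorter than kMaxTokenLen. Longer tokens are discarded
// instead of being copied into a fixed-size buffer.
inline std::vector<std::string> DropOutlierTokens(const std::vector<std::string> &tokens) {
    std::vector<std::string> kept;
    kept.reserve(tokens.size());
    for (const std::string &tok : tokens) {
        if (tok.size() < kMaxTokenLen) {
            kept.push_back(tok);
        }
    }
    return kept;
}
```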

Issue link:infiniflow#889
Issue link:infiniflow#890

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
)

### Type of change

- [x] Refactor

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?

HTTP API: SHOW SEGMENT

Issue link:infiniflow#779

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

Signed-off-by: morphes1995 <morphes1995@gmail.com>
### What problem does this PR solve?

Problem:
1. A `buffer_object` can appear multiple times in the `gc_queue_` of
`BufferManager`. This conflicts with cleanup, which removes the
`buffer_object` itself and leaves `gc_queue_` holding a dangling pointer.
2. Queue blocked: currently, `wal_manager` produces a checkpoint task and
sends it to `background_processer` when the WAL log grows too large. This
can block the task queue in `background_processer`. Consider the case where
`background_processer` is handling a checkpoint task and the checkpoint txn
is waiting to commit. The txn needs `WalManager::Flush()` to commit, but
`Flush()` is submitting a new checkpoint task to `background_processer`
while the queue is full. The queue can only free a slot once the current
txn commits, and the txn can only commit once `Flush()` returns, which is
itself waiting on the queue.
3. After `table_entry` is cleaned up, the raw pointers saved in the index
file workers become dangling, which causes a segmentation fault when the
workers are destructed.

What has been done:
1. Add a flag to ensure that each `buffer_object` appears only once in
`gc_queue_`.
2. Add a new status to indicate that an object is ready for cleanup.
3. Add more unit tests for the new state machine, data insertion, and so on.
4. Change the raw pointer to a `shared_ptr` in the index file worker.
5. Add a global variable in `WalManager` that records whether a checkpoint
txn is being processed, ensuring that at most one checkpoint transaction is
in progress.
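
The flag in item 1 and the checkpoint guard in item 5 can be sketched as follows. All types here (`BufferObj`, `GcQueue`, `CheckpointGuard`) are hypothetical stand-ins rather than the project's classes, and the queue is shown single-threaded for brevity; a real buffer manager would protect it with a lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>

// Hypothetical stand-in for a buffer object tracked by the buffer manager.
struct BufferObj {
    bool in_gc_queue = false; // true while the object sits in the GC queue
};

// GC queue that holds each object at most once.
class GcQueue {
public:
    // Returns true if the object was enqueued, false if it was already queued.
    bool Push(const std::shared_ptr<BufferObj> &obj) {
        if (obj->in_gc_queue) {
            return false;
        }
        obj->in_gc_queue = true;
        queue_.push_back(obj);
        return true;
    }

    // Pops the oldest object, or returns nullptr when the queue is empty.
    std::shared_ptr<BufferObj> Pop() {
        if (queue_.empty()) {
            return nullptr;
        }
        std::shared_ptr<BufferObj> obj = queue_.front();
        queue_.pop_front();
        obj->in_gc_queue = false;
        return obj;
    }

    std::size_t Size() const { return queue_.size(); }

private:
    std::deque<std::shared_ptr<BufferObj>> queue_;
};

// Guard that lets at most one checkpoint transaction run at a time.
class CheckpointGuard {
public:
    // Returns true only for the caller that flips the flag from false to true.
    bool TryBegin() {
        bool expected = false;
        return in_progress_.compare_exchange_strong(expected, true);
    }

    void End() { in_progress_.store(false); }

private:
    std::atomic<bool> in_progress_{false};
};
```

Holding the objects through `shared_ptr` also means the queue can never outlive what it points to, which is the same reasoning as item 4.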


Issue link:infiniflow#849

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
- [x] Test cases

---------

Co-authored-by: shenyushi <shenyushi99@qq.com>
### What problem does this PR solve?
* Fix the memory leak problem caused by column_index_iterator calling
Next() multiple times
* Fix the problem of incorrect doc_id when merging index
* Add basic unit tests for ColumnIndexMerger

Issue link:[infiniflow#813](infiniflow#813)

### Type of change
- [x] Bug Fix
- [x] Test cases
…#900)

### What problem does this PR solve?

What has been done:
1. Add a fulltext index test for the Python SDK.
2. Modify `query_builder` in the Python SDK to support aggregate fields in
SELECT.
3. Add a new status for empty arguments in aggregate functions to avoid
service crashes.

Issue link:infiniflow#682

### Type of change

- [X] Bug Fix (non-breaking change which fixes an issue)
- [x] Test cases
- [x] Python SDK impacted, Need to update PyPI
### What problem does this PR solve?

Use a blocking queue in the WAL manager instead of polling, which improves
TXN performance by about 100x.
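
A minimal sketch of a blocking queue built on `std::mutex` and `std::condition_variable`, to show the general technique: the consumer sleeps until an item arrives instead of polling. `BlockingQueue` here is a hypothetical illustration, not the queue class used by the WAL manager.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>

template <typename T>
class BlockingQueue {
public:
    // Adds an item and wakes one waiting consumer.
    void Enqueue(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(item));
        }
        not_empty_.notify_one();
    }

    // Blocks until an item is available, then returns the oldest one.
    T Dequeue() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop_front();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::deque<T> items_;
};
```

A consumer thread would call `Dequeue()` in a loop and be woken only when a producer calls `Enqueue()`, so no CPU time or latency is lost to a polling interval.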

### Type of change

- [x] Refactoring
- [x] Performance Improvement

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?

Show specific block of segment

Issue link:infiniflow#779

### Type of change

- [x] New Feature (non-breaking change which adds functionality)

Signed-off-by: morphes1995 <morphes1995@gmail.com>
### What problem does this PR solve?

When data is deleted from a table, the `row_count` is miscalculated, which
causes errors such as wrong results for `COUNT` queries.

Issue link:infiniflow#731

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Test cases
…erger runs correctly (infiniflow#904)

### What problem does this PR solve?

* Add GenerateParagraphs to generate data for the ColumnIndexMerger test
* Add a test for ColumnIndexMerger with large data
* Fix the merge failure when the data is large
* Fix a file_reader bug
* Refactor ColumnIndexMergerTest
* Refactor PostingMerger::Merge

Issue link:infiniflow#813

### Type of change

- [x] Bug Fix
- [x] Refactoring
- [x] Test cases

---------

Co-authored-by: Zhichang Yu <yuzhichang@gmail.com>
### What problem does this PR solve?

Add block max info for fulltext index skiplist

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Refactoring
…infiniflow#902)

Record AddSegmentIndexStore and AddChunkIndexStore in table_entry.cpp
Close infiniflow#798

Issue link:infiniflow#798

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)

---------

Co-authored-by: salieri <lomlieri@gmail.com>
### What problem does this PR solve?

Currently, when storage is initialized, segment data is read to recreate the
mem index, and a fake txn is created.

This fake txn is committed several times and is never begun. This
TXN is also managed by txn_manager.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?

1. Enable stemming for the standard analyzer
2. Enable simdbitpacking for the posting codec (only for uint32 compression,
because it does not support uint16 right now)
3. Refactor the lifecycle of the memory pool for MemoryIndexer, and disable
the memory pool for the postings of MemoryIndexer, because it causes
corruption under concurrent reads during indexing when the memory pool is
destroyed

Issue link:infiniflow#868

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
- [x] Performance Improvement
- [x] Test cases
)

### What problem does this PR solve?

1. A checkpoint takes too much time.
2. During that time, too many CatalogDeltaOps sent in the txn commit bottom
half cause a traffic jam.
3. The checkpoint txn commit also needs to execute, but the commit bottom
thread is occupied by txns sending CatalogDeltaOps.
 
Solution:
Move the sending of CatalogDeltaOp from the TXN commit bottom to after the
TXN commit bottom.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
OPTIMIZE merges all chunks of a SegmentIndexEntry into one.

Issue link:infiniflow#366

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?

The service prints many error logs because it tries to delete a non-empty
directory during cleanup. This happens because the cleanup of `buffer_obj`
is currently delayed into `Free`, although the data file of `buffer_obj`
can be deleted immediately when cleanup is triggered.


Related PR link: infiniflow#861
 
Issue link:None

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Jin Hai <haijin.chn@gmail.com>
…#913)

### What problem does this PR solve?

Unit test for catalog: replay_table_single_index_named_db

### Type of change

- [x] Test cases
### What problem does this PR solve?

Currently `count(*)` returns the `row_count` stored in `table_entry`, which
is incorrect under multiple concurrent read-write requests.

Fixed by merging `count(*)` and `count(col)`.

Issue link:infiniflow#731

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
### What problem does this PR solve?

Add a TableIndexReaderCache member to table_entry
Reuse previous IndexReader cache if possible

Issue link:infiniflow#641

### Type of change

- [x] Refactoring
- [x] Performance Improvement
JinHai-CN and others added 30 commits May 16, 2024 13:27
### What problem does this PR solve?

- Add an error code and message for failures when loading the tokenizer file.
- Move exception raising out of the constructor.
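
One common way to keep a constructor from throwing is two-phase initialization: the constructor only stores its arguments, and a separate load step reports failure through a status value. The sketch below assumes that pattern; `ErrorCode`, `Status`, and `Tokenizer` are hypothetical names and do not reflect the project's actual types or error codes.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <utility>

// Hypothetical error codes and status type, for illustration only.
enum class ErrorCode { kOk = 0, kTokenizerFileNotFound = 1 };

struct Status {
    ErrorCode code;
    std::string message;
    bool ok() const { return code == ErrorCode::kOk; }
};

class Tokenizer {
public:
    // The constructor only stores the path and never fails on a bad file.
    explicit Tokenizer(std::string path) : path_(std::move(path)) {}

    // Loading is a separate step that reports failure through a Status.
    Status Load() {
        std::ifstream in(path_);
        if (!in.is_open()) {
            return Status{ErrorCode::kTokenizerFileNotFound,
                          "cannot open tokenizer file: " + path_};
        }
        loaded_ = true;
        return Status{ErrorCode::kOk, ""};
    }

    bool loaded() const { return loaded_; }

private:
    std::string path_;
    bool loaded_ = false;
};
```

The caller decides how to react to a failed `Load()`, for example by logging the message and raising a recoverable error at a higher level.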

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?

1. Refactor: implement the compaction operator.
2. Refactor: use the scheduler to execute compaction.
3. Refactor: the fragment plan is now a DAG formed by connecting multiple
plan trees.
4. Fix: add the delete to the compaction to-delete list in
`SegmentEntry::Commit` instead of `SegmentEntry::DeleteData`, because it
must be atomic with the `SegmentEntry::max_row_ts_` update.
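
The atomicity requirement in item 4 can be sketched with a single lock that covers both updates, so a reader never sees one without the other. `Segment` below is a hypothetical, heavily simplified class and not the project's `SegmentEntry`; the member names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

using TxnTimeStamp = std::uint64_t;

// Hypothetical, simplified segment: not the project's SegmentEntry.
class Segment {
public:
    struct Snapshot {
        TxnTimeStamp max_row_ts;
        std::size_t pending_deletes;
    };

    // Commit publishes the new max timestamp and registers the deleting txn
    // under one lock, so both changes become visible together.
    void CommitDelete(std::uint64_t txn_id, TxnTimeStamp commit_ts) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (commit_ts > max_row_ts_) {
            max_row_ts_ = commit_ts;
        }
        pending_deletes_.push_back(txn_id);
    }

    // Reads both values under the same lock for a consistent view.
    Snapshot Read() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return Snapshot{max_row_ts_, pending_deletes_.size()};
    }

private:
    mutable std::mutex mutex_;
    TxnTimeStamp max_row_ts_ = 0;
    std::vector<std::uint64_t> pending_deletes_;
};
```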

Issue link:infiniflow#1182

### Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Refactoring
- [x] Test cases
### What problem does this PR solve?
Should provide empty load meta info for MergeAggregate node

Issue link:infiniflow#1211

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
- [x] Test cases
Refactor benchmark script

### Type of change

- [x] Refactoring
Updated benchmark.md
- [x] Documentation Update
### What problem does this PR solve?

1. Compacting a full table causes a bug.
2. Setting the compact state in parallel causes a bug.

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
### What problem does this PR solve?

Japanese morphological analyzer is added

Issue link:infiniflow#1137

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
### What problem does this PR solve?

Add the MatchTensorScan operator.
Currently it only supports exhaustive MaxSim top-n search.

Change SQL syntax:
SEARCH MATCH -> SEARCH MATCH TEXT    (fulltext top-k search)
SEARCH KNN -> SEARCH MATCH VECTOR   (embedding KNN search)

Add SQL syntax:
SEARCH MATCH TENSOR (currently only supports exhaustive MaxSim top-n search)
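
For readers unfamiliar with MaxSim, the sketch below shows an exhaustive top-n search under two assumptions: a tensor is a list of equal-length embeddings, and similarity is the inner product. `MaxSim` and `TopN` are hypothetical helper names and do not describe how MatchTensorScan is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

using Vec = std::vector<float>;
using Tensor = std::vector<Vec>; // assumed: a list of equal-length embeddings
using Scored = std::pair<std::size_t, float>; // (document index, score)

inline float Dot(const Vec &a, const Vec &b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// MaxSim: for each query embedding, take the best-matching document
// embedding by inner product, then sum those maxima.
inline float MaxSim(const Tensor &query, const Tensor &doc) {
    if (doc.empty()) {
        return 0.0f;
    }
    float total = 0.0f;
    for (const Vec &q : query) {
        float best = -std::numeric_limits<float>::infinity();
        for (const Vec &d : doc) {
            best = std::max(best, Dot(q, d));
        }
        total += best;
    }
    return total;
}

// Exhaustive top-n: score every document, then keep the n highest scores,
// sorted by descending score.
inline std::vector<Scored> TopN(const Tensor &query, const std::vector<Tensor> &docs, std::size_t n) {
    std::vector<Scored> scored;
    scored.reserve(docs.size());
    for (std::size_t i = 0; i < docs.size(); ++i) {
        scored.emplace_back(i, MaxSim(query, docs[i]));
    }
    n = std::min(n, scored.size());
    std::partial_sort(scored.begin(), scored.begin() + static_cast<std::ptrdiff_t>(n), scored.end(),
                      [](const Scored &a, const Scored &b) { return a.second > b.second; });
    scored.resize(n);
    return scored;
}
```

Because every document is scored, the cost grows with the number of documents times the product of query and document tensor sizes, which is why the description calls the search exhaustive.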

Issue link:infiniflow#1179

### Type of change

- [x] New Feature (non-breaking change which adds functionality)
- [x] Breaking Change (fix or feature that could cause existing
functionality not to work as expected)
- [x] Refactoring
- [x] Test cases
### What problem does this PR solve?

Refactor: log the message with LOG_ERROR before raising a recoverable error

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?

Issue link:infiniflow#1137

### Type of change

- [x] Documentation Update
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?

_Briefly describe what this PR aims to solve. Include background context
that will help reviewers understand the purpose of the PR._

### Type of change

- [x] Refactoring

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Signed-off-by: Jin Hai <haijin.chn@gmail.com>
### What problem does this PR solve?

1. Remove some code from the benchmark directory
2. Change table restricts container from std::unordered_set to std::set.

### Type of change

- [x] Refactoring

---------

Signed-off-by: Jin Hai <haijin.chn@gmail.com>
Signed-off-by: Jin Hai <haijin.chn@gmail.com>