Vector storage suggestions #38

thomas-yanxin · 2023-04-26T04:15:22Z

No description provided.

sujianwei1 · 2023-04-26T04:24:17Z

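Qdrant can be run in several modes, and there are subtle differences depending on the one you choose. The options are:
Local mode, no server required
On-premise server deployment
Cloud deployment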
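
The three modes map onto different ways of constructing a client. A minimal sketch, assuming the `qdrant-client` Python package; the helper names `client_kwargs` and `make_client`, and every path, URL and key below, are placeholders introduced for illustration and do not come from this thread:

```python
def client_kwargs(mode: str) -> dict:
    """Return keyword arguments for qdrant_client.QdrantClient for one mode.

    Hypothetical helper; all paths, URLs and keys are placeholders.
    """
    if mode == "local":
        # Local mode: no server, data is persisted under this directory.
        return {"path": "/tmp/local_qdrant"}
    if mode == "server":
        # Self-hosted server listening on the default HTTP port.
        return {"url": "http://localhost:6333"}
    if mode == "cloud":
        # Managed cluster: the URL and API key come from the cloud console.
        return {"url": "https://YOUR-CLUSTER-URL:6333", "api_key": "YOUR_API_KEY"}
    raise ValueError(f"unknown mode: {mode!r}")


def make_client(mode: str):
    # Imported lazily so client_kwargs works without the package installed.
    from qdrant_client import QdrantClient

    return QdrantClient(**client_kwargs(mode))
```

Keeping the mode in one place like this means the rest of the code can stay the same when moving from local files to a server or a cloud cluster.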

sujianwei1 · 2023-04-26T04:27:27Z

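In local mode, without a Qdrant server, the vectors can also be stored on disk so that they persist between runs.

    from langchain.vectorstores import Qdrant

    # docs and embeddings are assumed to be defined earlier
    qdrant = Qdrant.from_documents(
        docs,
        embeddings,
        path="/tmp/local_qdrant",
        collection_name="my_documents",
    )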

sujianwei1 · 2023-04-26T04:37:13Z

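That should work. If you want to reuse an existing collection, you can always create the Qdrant instance yourself and pass it a QdrantClient instance that carries the connection details.

    import qdrant_client
    from langchain.vectorstores import Qdrant

    # Open the same on-disk storage that was used when the collection was created
    client = qdrant_client.QdrantClient(
        path="/tmp/local_qdrant", prefer_grpc=True
    )
    qdrant = Qdrant(
        client=client, collection_name="my_documents",
        embedding_function=embeddings.embed_query
    )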

sujianwei1 · 2023-04-26T04:39:04Z

This is a bit like loading the previously stored data at startup, so that nothing is ever lost.

sujianwei1 · 2023-04-26T04:43:22Z

Retrieval

    query = "What did the president say about Ketanji Brown Jackson"
    found_docs = qdrant.similarity_search(query)
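
Conceptually, a similarity search embeds the query and returns the stored vectors closest to it. The sketch below shows that idea as a brute-force scan with cosine similarity. It is an illustration only: `cosine_similarity` and `top_k` are hypothetical helpers, the tiny 2-D vectors stand in for real embeddings, and a vector database would normally use an index and whatever distance metric the collection was configured with.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 4) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for document embeddings.
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = top_k([1.0, 0.1], doc_vectors, k=2)
print([index for index, _ in hits])  # indices of the two closest documents
```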

sujianwei1 · 2023-04-26T04:51:26Z

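For loading documents in bulk, take a look at this function:

    from langchain.document_loaders import DirectoryLoader

    # solidity_root is the directory that holds the .txt files
    loader = DirectoryLoader(solidity_root, glob="**/*.txt")
    docs = loader.load()

Split the documents into chunks:

    split_docs = text_splitter.split_documents(docs)

Then embed the chunks and store them in the vector database:

    # vectorstore stands for a vector store class; persist_directory is the parameter used by Chroma, for example
    vectorstore = vectorstore.from_documents(split_docs, embeddings, persist_directory=persist_directory)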
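
The `text_splitter` used above is not defined in this comment. As an illustration of what the splitting step does, here is a minimal sketch of a fixed-size character splitter with overlap. `split_text` is a hypothetical helper written for this note, not LangChain's implementation, and the default sizes are arbitrary.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[str]:
    """Cut text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters so that a sentence
    cut at a boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each resulting chunk is what gets embedded and written to the vector database, so the chunk size directly controls how much text a single search hit returns.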

HkkSimple · 2023-04-26T05:17:11Z

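If you need to store vectors for a large existing corpus of documents, caching with a disk-based database may be the better choice.
As far as I can tell, GPTCache is built on this concept and is designed specifically for LLMs, so it may work out of the box. (https://github.com/zilliztech/GPTCache)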

online2311 · 2023-04-26T05:35:09Z

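Milvus Lite (https://github.com/milvus-io/milvus-lite) is fully compatible with Milvus
and can be embedded in a Python application: pip install milvus (https://pypi.org/project/milvus/).
That makes it easy to switch to Milvus in a production environment later, so it covers both cases.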

benli2023 · 2023-04-29T05:44:24Z

Here is my code. Could you check whether anything is wrong with it? It fails to retrieve similar Chinese text.

    import os

    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.vectorstores import Qdrant
    from qdrant_client import QdrantClient
    from tqdm import tqdm

    def qdrant(docs_path):
        texts = []
        text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
        for doc in tqdm(os.listdir(docs_path)):
            if doc.endswith('.txt'):
                with open(f'{docs_path}/{doc}', 'r', encoding='utf-8') as f:
                    doc_data = f.read()
                # extend rather than assign, so chunks from every file are kept
                texts.extend(text_splitter.split_text(doc_data))
        Qdrant.from_texts(
            texts, embeddings,
            metadatas=[{"source": f"{i}-doc"} for i in range(len(texts))],
            host="localhost",
            prefer_grpc=False,
            collection_name="Finance",
        )

    client = QdrantClient(host="localhost", port=6333)
    embeddings = HuggingFaceEmbeddings(model_name='/home/ubuntu/models/GanymedeNil_text2vec-large-chinese')

    qdrant = Qdrant(client, 'Finance', embeddings.embed_query)

    documents = qdrant.similarity_search("test", 4)

    for doc in documents:
        print(doc.page_content)
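
One thing that may be worth checking in the code above: `CharacterTextSplitter` splits on a blank line (`"\n\n"`) by default, so a Chinese document without blank lines can end up as a single very large chunk, which tends to hurt retrieval. Whether that is the cause here is not confirmed anywhere in this thread. A minimal sketch of splitting on Chinese sentence punctuation instead, where `split_chinese` is a hypothetical helper and the chunk size is arbitrary:

```python
import re
from typing import List

# A sentence is a run of non-terminator characters plus an optional terminator.
_SENTENCE = re.compile(r"[^。！？；\n]+[。！？；]?")


def split_chinese(text: str, chunk_size: int = 100) -> List[str]:
    """Group whole sentences into chunks of roughly chunk_size characters."""
    sentences = [s.strip() for s in _SENTENCE.findall(text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


sample = "今天天气很好。我们去公园吧！你觉得呢？"
print(split_chinese(sample, chunk_size=12))
```

A sentence longer than `chunk_size` is kept whole in this sketch rather than cut in the middle.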

thomas-yanxin · 2023-04-29T12:15:06Z

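We are currently trying out Qdrant and will do a more detailed survey later.

References:

Vector database showdown: benchmarks on million-scale data (向量数据库大PK｜来自百万级数据的基准测试)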

zhugexinxin · 2023-06-25T04:13:16Z

Is there an API for incrementally updating collections?
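
On the incremental-update question: LangChain vector stores expose `add_texts` and `add_documents` for appending to an existing collection, and the Qdrant client has an `upsert` operation that inserts or overwrites points by ID. The sketch below only illustrates upsert semantics with an in-memory dict. `TinyStore` and its methods are hypothetical and belong to neither library, and a real store would hold embedding vectors alongside the text.

```python
import hashlib
from typing import Dict, List


class TinyStore:
    """Toy key-value store used to illustrate upsert-by-ID semantics."""

    def __init__(self) -> None:
        self.points: Dict[str, str] = {}

    @staticmethod
    def point_id(source: str, chunk_index: int) -> str:
        # A stable ID derived from file name and chunk position, so that
        # re-indexing the same file overwrites points instead of duplicating them.
        return hashlib.sha1(f"{source}:{chunk_index}".encode("utf-8")).hexdigest()

    def upsert(self, source: str, chunks: List[str]) -> int:
        """Insert new chunks and overwrite changed ones; return how many were written."""
        written = 0
        for index, chunk in enumerate(chunks):
            pid = self.point_id(source, index)
            if self.points.get(pid) != chunk:
                self.points[pid] = chunk
                written += 1
        return written


store = TinyStore()
first = store.upsert("a.txt", ["alpha", "beta"])
second = store.upsert("a.txt", ["alpha", "beta v2", "gamma"])
print(first, second, len(store.points))
```

The design choice that matters is the deterministic ID: with random IDs, every re-run would add duplicates rather than update the existing entries.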

thomas-yanxin mentioned this issue Apr 26, 2023

Could we consider adding the Qdrant vector database to load documents in bulk and get the effect of a local knowledge base? #13

Closed

thomas-yanxin added the enhancement New feature or request label Apr 26, 2023

thomas-yanxin pinned this issue Apr 26, 2023
