Discussion about the disk I/O bottleneck issue #341

Closed
PhoebusSi opened this issue Apr 28, 2024 · 2 comments
@PhoebusSi
Hi!
I am Qingyi Si, an AI researcher in the Data Storage Research Department at Huawei. Our team has recently been collecting storage requirements for the LLM era.
We have been following the Open-sora project that you lead. The report_02.md updated on the 25th mentions that you encountered “a disk I/O bottleneck for training and data processing at the same time”. This caught our interest: we would like to understand exactly where and how the bottleneck occurs, and we may be able to help solve it in the near future. Would it be convenient for you to discuss this with us further?
We have three questions about this issue:
1. Which data processing stage hit the I/O bottleneck, or had high I/O requirements?
2. What does “training and data processing at the same time” in the report refer to? Do the data processing pipeline and the Sora training run on the same compute cluster?
3. Is the disk I/O bottleneck caused by insufficient CPU cores, excessive disk I/O requests, slow communication, or something else? What specific symptoms did you observe?

I look forward to your reply.
All the best!

Qingyi Si
WeChat/Phone: 13161685288

github-actions bot commented May 6, 2024

This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label May 6, 2024
@zhengzangw
Collaborator

We have replied to you by email.
