feat(op): support varlen npu flash attention #209
base: develop
Conversation
Force-pushed from a257888 to 14f0106.
Force-pushed from 82758c5 to dbc6869.
@@ -472,11 +604,13 @@ def _qkv_without_cu_seqlens(self, qkv, softmax_scale=None, causal=None, key_padd
         return _torch_fixedlen_qkvpacked_attn(qkv, self.dropout, softmax_scale, causal, key_padding_mask)

     @forward.register(conditions=(str(QKVPackType.KVPACKED), str(CuSeqlenType.WithOut)))
-    def _q_kv_without_cu_seqlens(self, q, kv, softmax_scale=None, causal=None, key_padding_mask=None):
+    def _q_kv_without_cu_seqlens(
+        self, q, kv, softmax_scale=None, causal=None, key_padding_mask=None, use_flash_attn=True
There is no need to make use_flash_attn a parameter of the attention forward. That would force the model to know whether the underlying operator is flash attention, and we don't want to push that complexity onto model developers. This should be handled automatically by the operator-selection system or as part of the configuration.
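A minimal sketch of the reviewer's suggestion, where the backend is chosen from configuration instead of a forward argument. The use_flash_attn config flag, the _select_kvpacked_attn helper, the fixed-length attention function names, and the gpc import path are all assumptions for illustration, not this repository's actual API:

# Illustrative sketch: pick the attention operator from config so the model's
# forward signature never needs a use_flash_attn argument.
from internlm.core.context import global_context as gpc  # assumed import path


def _select_kvpacked_attn():
    # Hypothetical selection logic; the real operator-selection system may differ.
    if getattr(gpc.config.model, "use_flash_attn", False):
        return _flash_fixedlen_kvpacked_attn
    return _torch_fixedlen_kvpacked_attn


@forward.register(conditions=(str(QKVPackType.KVPACKED), str(CuSeqlenType.WithOut)))
def _q_kv_without_cu_seqlens(self, q, kv, softmax_scale=None, causal=None, key_padding_mask=None):
    # The signature stays as before; the backend choice happens internally.
    attn_impl = _select_kvpacked_attn()
    return attn_impl(q, kv, self.dropout, softmax_scale, causal, key_padding_mask)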
@@ -341,7 +341,7 @@ def output_hook(self, module: Embedding1D, args: Any, output: Tuple[Any]) -> Tup
         """
         _emb_dim = 2  # [bsz, seqlen, emb_dim]

-        return gather_forward_split_backward(output, self._parallel_mode, dim=_emb_dim)
+        return gather_forward_split_backward(output, self._parallel_mode, dim=_emb_dim), DUMMY_HANDLE_CONST
Why is DUMMY_HANDLE_CONST added here? This function is used as a register_forward_hook. Same for the ones below.
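For reference on the hook contract behind this comment, a minimal runnable sketch in plain PyTorch (not this repository's code): a non-None return value from a hook registered via register_forward_hook replaces the module's output, so returning a (output, DUMMY_HANDLE_CONST) tuple would change what every caller of the module receives.

import torch
import torch.nn as nn

layer = nn.Linear(4, 4)


def output_hook(module, args, output):
    # Whatever a forward hook returns (if not None) replaces the module output,
    # so returning (output, handle) here would leak the tuple to all callers.
    return output * 2  # illustrative transformation only


hook_handle = layer.register_forward_hook(output_hook)
y = layer(torch.randn(2, 4))  # y is the value returned by output_hook
hook_handle.remove()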
@@ -78,18 +77,15 @@ def _nyi_attn(func_name, *args, **kwargs):  # pylint: disable=W0613


 def _flash_float32_compatibility_wrapper(input_idxs: Tuple, flash_func: Callable, *args, **kwargs):
-    if gpc.config.model.dtype is torch.float32:
Why was the if removed here?
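For context, a minimal sketch of what the float32 compatibility guard under discussion typically does, assuming the flash kernel only accepts fp16/bf16 inputs; gpc is the global context object already referenced in the diff above, and the bf16 cast target is an assumption rather than the PR's exact behavior:

from typing import Callable, Tuple

import torch

# gpc: the repository's global parallel context, assumed to be imported
# the same way as elsewhere in this file.


def _flash_float32_compatibility_wrapper(input_idxs: Tuple, flash_func: Callable, *args, **kwargs):
    # Sketch only: cast the selected tensor args to bf16 before calling the flash
    # kernel and cast the result back, but only when the model runs in float32.
    if gpc.config.model.dtype is torch.float32:
        cast_args = [
            arg.to(torch.bfloat16) if i in input_idxs else arg for i, arg in enumerate(args)
        ]
        return flash_func(*cast_args, **kwargs).to(torch.float32)
    return flash_func(*args, **kwargs)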
-def _npu_varlen_qkvpacked_attn(
-    qkv: torch.Tensor, cu_seqlens, max_seqlen, dropout_p, softmax_scale=None, causal=False  # pylint: disable=W0613
+def __npu_varlen_qkvsplited_attn(
__npu -> _npu: use a single underscore.
Motivation
Support varlen flash attention for torch_npu.
Modification
Removed the padding and unpadding operations applied to q and k when the data is packed (see the sketch below).
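A rough sketch of the idea behind this change, with illustrative shapes and a hypothetical npu_varlen_attn kernel name: a varlen operator consumes the packed q/k/v tensors together with cu_seqlens directly, so the pad/unpad round trip is no longer needed.

import torch

# Illustrative per-sample sequence lengths for three packed samples.
seqlens = torch.tensor([5, 3, 7], dtype=torch.int32)
cu_seqlens = torch.nn.functional.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
max_seqlen = int(seqlens.max())

total_tokens, num_heads, head_dim = int(seqlens.sum()), 8, 64
q = torch.randn(total_tokens, num_heads, head_dim)
k = torch.randn(total_tokens, num_heads, head_dim)
v = torch.randn(total_tokens, num_heads, head_dim)

# With a varlen kernel (hypothetical name npu_varlen_attn), the packed tensors
# are passed straight through together with cu_seqlens:
# out = npu_varlen_attn(q, k, v, cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
#                       max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen)
# Previously the packed data had to be padded to [bsz, max_seqlen, ...] before a
# fixed-length kernel and unpadded afterwards; that round trip is what this PR removes.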
BC-breaking (Optional)
Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here and update the documentation.
Checklist
Before PR:
After PR: