Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dynamic embedding may lead to segmentation faults #2006

Open
sherlockkenan opened this issue May 15, 2024 · 1 comment
Open

dynamic embedding may lead to segmentation faults #2006

sherlockkenan opened this issue May 15, 2024 · 1 comment

Comments

@sherlockkenan
Copy link

sherlockkenan commented May 15, 2024

Hi, We've recently been trying to use the dynamic embedding feature in torchrec contrib, but we have encountered a few challenges. The process may result in a segmentation fault.

the code like,as mentioned in readme, using a tde.wrap

item_emb_config = torchrec.EmbeddingConfig(
        name="item",
        embedding_dim=args.hidden_units,
        num_embeddings=embedding_num + 1,
        feature_names=["seq", "pos", "neg"])
  dataloader , model= tde.wrap( "redis://127.0.0.1:6379/?num_threads=20", dataloader, model, 
       {"_model.item_emb": [item_emb_config]}, num_prefetch=0)

  dataloader = iter(dataloader)

we find two problems that may cause this

1) fetch_notifications_ is not thread-safe.
2)Tensor data might be recycled during pushing operations.

For the first issue, the fetch_notifications_ variable , defined herehttps://github.com/pytorch/torchrec/blob/main/contrib/dynamic_embedding/src/tde/ps.h#L108 , is accessed in both the pull and sync fetch functions. As these functions are called in different threads, there might be a thread safety issue.

For the second issue, the tensor (https://github.com/pytorch/torchrec/blob/main/contrib/dynamic_embedding/src/tde/ps.cpp#L122) generated by concat in the push function is a temporary variable. The subsequent call to io.push is asynchronous and does not copy the data (
https://github.com/pytorch/torchrec/blob/main/contrib/dynamic_embedding/src/tde/ps.cpp#L139) . Therefore, the data might have been recycled by the time it is executed asynchronously, leading to access to invalid data.

@sherlockkenan
Copy link
Author

sherlockkenan commented May 15, 2024

submit a pull requests that may solve above problems
#1947

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant