Removing samples containing profanity #51

Open
vmurahari3 opened this issue Jun 29, 2019 · 1 comment
Comments

@vmurahari3

Do you think it makes sense to remove samples containing profanity?

@matthen
Contributor

matthen commented Jul 1, 2019

In general there is a lot of questionable language in the Reddit dataset: it is totally unfiltered, and we include all subreddits, even 'nsfw' ones. It is still natural language, and a potentially useful learning signal, though of course we need to be careful about how the resulting model is used.

We could perhaps add flags to the pipeline for filtering based on the nsfw label etc. These would be off by default.
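
As a rough illustration of what such a flag might look like, here is a minimal sketch. It is not code from this repository: the flag name `--filter_nsfw`, the `nsfw` key on each sample, and the helper names are all assumptions made for the example. The flag is off by default, so every sample is kept unless it is passed explicitly.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical opt-in NSFW filter flag."""
    parser = argparse.ArgumentParser()
    # store_true means the flag defaults to False (filtering off).
    parser.add_argument(
        "--filter_nsfw",
        action="store_true",
        help="Drop samples whose (assumed) 'nsfw' label is true.",
    )
    return parser


def keep_sample(sample, filter_nsfw):
    """Return True if the sample should stay in the dataset."""
    if not filter_nsfw:
        return True
    # Assumption: each sample is a dict carrying a boolean 'nsfw' label.
    # Samples without the label are treated as not NSFW.
    return not sample.get("nsfw", False)


def filter_samples(samples, filter_nsfw=False):
    """Apply the optional NSFW filter to a list of samples."""
    return [s for s in samples if keep_sample(s, filter_nsfw)]


if __name__ == "__main__":
    args = build_parser().parse_args()
    samples = [
        {"text": "hello there", "nsfw": False},
        {"text": "something explicit", "nsfw": True},
    ]
    for sample in filter_samples(samples, args.filter_nsfw):
        print(sample["text"])
```

A profanity filter could follow the same pattern with a second opt-in flag, though a word-list check would be a cruder signal than the label.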
