
Vocab Error when running 'Interpreting Classical Text Classification models' Notebook #176

Chris-hughes10 opened this issue Jul 2, 2021 · 3 comments


@Chris-hughes10

I am using a clean Python 3.7 environment on Ubuntu with interpret-text installed via pip, and I hit an error when walking through the 'Interpreting Classical Text Classification models' notebook; I have made no changes to the code.

When attempting to fit the model, on the line:

classifier, best_params = explainer.fit(X_train, y_train)

I get the following error:

/anaconda/envs/interpret/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:489: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn("The parameter 'token_pattern' will not be used"
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-47f4fc43855d> in <module>
----> 1 classifier, best_params = explainer.fit(X_train, y_train)

/anaconda/envs/interpret/lib/python3.7/site-packages/interpret_text/experimental/classical.py in fit(self, X_str, y_train)
     92         :rtype: list
     93         """
---> 94         X_train = self._encode(X_str)
     95         if self.is_trained is False:
     96             if self.model is None:

/anaconda/envs/interpret/lib/python3.7/site-packages/interpret_text/experimental/classical.py in _encode(self, X_str)
     61         :rtype: array_like (ndarray, pandas dataframe). Same rows as X_str
     62         """
---> 63         X_vec, _ = self.preprocessor.encode_features(X_str)
     64         return X_vec
     65 

/anaconda/envs/interpret/lib/python3.7/site-packages/interpret_text/experimental/common/utils_classical.py in encode_features(self, X_str, needs_fit, keep_ids)
    129         # needs_fit will be set to true if encoder is not already trained
    130         if needs_fit is True:
--> 131             self.vectorizer.fit(X_str)
    132         if isinstance(X_str, str):
    133             X_str = [X_str]

/anaconda/envs/interpret/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in fit(self, raw_documents, y)
   1167         """
   1168         self._warn_for_unused_params()
-> 1169         self.fit_transform(raw_documents)
   1170         return self
   1171 

/anaconda/envs/interpret/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
   1201 
   1202         vocabulary, X = self._count_vocab(raw_documents,
-> 1203                                           self.fixed_vocabulary_)
   1204 
   1205         if self.binary:

/anaconda/envs/interpret/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
   1131             vocabulary = dict(vocabulary)
   1132             if not vocabulary:
-> 1133                 raise ValueError("empty vocabulary; perhaps the documents only"
   1134                                  " contain stop words")
   1135 

ValueError: empty vocabulary; perhaps the documents only contain stop words
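For context, the traceback bottoms out in CountVectorizer, which raises this ValueError whenever its tokenizer produces no tokens for any document. A minimal sketch that reproduces both the warning and the error (broken_tokenizer is hypothetical, standing in for whatever the notebook's preprocessor tokenizer is returning):

```python
# Reproduce the failure mode: a tokenizer that yields no tokens leaves
# CountVectorizer with an empty vocabulary, so fit() raises ValueError.
from sklearn.feature_extraction.text import CountVectorizer

def broken_tokenizer(doc):
    return []  # stands in for a tokenizer that silently returns nothing

vectorizer = CountVectorizer(tokenizer=broken_tokenizer)
vectorizer.fit(["some text", "more text"])
# UserWarning: The parameter 'token_pattern' will not be used ...
# ValueError: empty vocabulary; perhaps the documents only contain stop words
```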

Am I missing something obvious here?


RitaDS commented Oct 1, 2021

Hi @Chris-hughes10, I was having the same problem, and in my case it was related to the environment. Which libraries and versions are you using?
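For comparison, something like this prints the versions most relevant here (pkg_resources ships with setuptools, so it works on Python 3.7; the strings are the pip distribution names):

```python
# Report the installed versions of the packages most likely involved.
import pkg_resources

for pkg in ("interpret-text", "scikit-learn", "spacy"):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, "not installed")
```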

@imatiach-msft
Collaborator

I had a similar issue; installing an older version of the spacy package (2.3.7) from PyPI fixed it. It looks like the tokenizer code needs to be updated to work with the latest spacy.
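For anyone else hitting this, the workaround is a one-liner (2.3.7 is simply the version that worked here; a different pin may be needed as spacy evolves):

```
pip install spacy==2.3.7
```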

@imatiach-msft
Collaborator

See related issue: #182
