MNT _weighted_percentile supports np.nan values #29034

base: main
Conversation
Some early feedback.
sklearn/utils/stats.py
Outdated
```
@@ -41,6 +41,8 @@ def _weighted_percentile(array, sample_weight, percentile=50):
    sorted_weights = np.take_along_axis(sample_weight, sorted_idx, axis=0)

    # Find index of median prediction for each sample
    # nan_mask = np.isnan(array)  # ???
    # weight_cdf[nan_mask] = 0
```
I think we need to zero the weights of nan values in the tiled version of `sample_weight`, before computing the cumsum, so that missing entries in a given column do not contribute to the empirical CDF estimated for that column.
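A minimal sketch of that idea for a single column; the variable names here are illustrative, not the actual ones in `stats.py`:

```python
import numpy as np

values = np.array([0.2, np.nan, 0.5, np.nan, 0.9])
weights = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Zero out the weights of NaN entries before the cumulative sum, so that
# missing values do not contribute to the empirical CDF of the column.
masked_weights = np.where(np.isnan(values), 0.0, weights)
weight_cdf = np.cumsum(masked_weights)

# The total mass is the sum of the non-missing weights only: 1 + 3 + 5.
assert weight_cdf[-1] == 9.0
```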
Note that when the weights are already passed as a 2d array, we should be careful not to mutate the user-provided input and instead make a copy of the weights before setting some of the entries to zero (I think).
Yes, this will eliminate the issue that varying weights for values that are nans could influence `_weighted_percentile` to deliver a different result.

> Note that when the weights are already passed as a 2d array, we should be careful not to mutate the user-provided input and instead make a copy of the weights before setting some of the entries to zero (I think).

I decided to set the weights from nan-values to 0 on `sorted_weights` (not on `sample_weight`), and I believe I thus don't need to make a copy, correct?
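That reasoning seems right: `np.take_along_axis` returns a freshly allocated array, so zeroing entries of `sorted_weights` cannot mutate the caller's weights. A quick sketch with illustrative data:

```python
import numpy as np

sample_weight = np.array([[1.0, 2.0], [3.0, 4.0]])
sorted_idx = np.argsort(sample_weight, axis=0)

# take_along_axis allocates a new array; it never aliases the input.
sorted_weights = np.take_along_axis(sample_weight, sorted_idx, axis=0)
sorted_weights[0, :] = 0.0

# The user-provided weights are untouched.
assert sample_weight[0, 0] == 1.0 and sample_weight[0, 1] == 2.0
```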
sklearn/utils/tests/test_stats.py
Outdated
```python
array[np.random.rand(*array.shape) < 0.8] = np.nan
weights = np.random.randint(1, 6, size=(100, 3))
```
Use the `rand` and `randint` methods of the `rng` instance instead of the methods from `np.random` that rely on a shared global `RandomState` instance.

This is necessary to isolate the behavior of this test from any side effect on the shared global numpy `RandomState` instance. Otherwise it makes it really hard to debug randomly failing tests.
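Concretely, the suggestion looks something like the following, assuming the test holds a seeded `RandomState` instance (the seed here is arbitrary):

```python
import numpy as np

rng = np.random.RandomState(42)

array = rng.rand(100, 3)
# Draw from the local rng, not the shared global np.random state:
array[rng.rand(*array.shape) < 0.8] = np.nan
weights = rng.randint(1, 6, size=(100, 3))

assert array.shape == (100, 3) and weights.shape == (100, 3)
```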
Oh thank you for that explanation. It makes total sense now.
sklearn/utils/tests/test_stats.py
Outdated
```python
def test_weighted_percentile_nan():
    """Test that np.nan is not returned as a value."""
```
We need to test something stronger: we need to check that calling `_weighted_percentile` on an array with nan values returns the same results as calling `_weighted_percentile` on a filtered version of the data, where all missing values and the corresponding weights have been removed.

This is easy to test for the following cases:
- 1d data and 1d weights;
- 2d data and 2d weights;

but it's a bit more challenging to implement and test for the:
- 2d data with shared 1d weights.
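For the 1d case, the suggested equivalence check could look roughly like the sketch below; `naive_weighted_percentile` is a hypothetical stand-in for illustration, not scikit-learn's actual implementation:

```python
import numpy as np

def naive_weighted_percentile(values, weights, percentile=50):
    # Hypothetical 1d stand-in: return the lowest value whose cumulative
    # weight reaches the requested fraction of the total weight.
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w)
    return v[np.searchsorted(cdf, percentile / 100 * cdf[-1])]

rng = np.random.RandomState(0)
values = rng.rand(50)
weights = rng.randint(1, 6, size=50).astype(float)
nan_mask = rng.rand(50) < 0.3

# NaN-aware call: zero the weights of NaN entries and push them to the end
# of the sort order (via +inf) so they can never be selected.
values_nan = np.where(nan_mask, np.inf, values)
weights_nan = np.where(nan_mask, 0.0, weights)
with_nans = naive_weighted_percentile(values_nan, weights_nan)

# Reference: drop the missing values and their weights entirely.
filtered = naive_weighted_percentile(values[~nan_mask], weights[~nan_mask])
assert with_nans == filtered
```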
sklearn/utils/stats.py
Outdated
```python
    "One feature contains too many NaN values. This error should "
    "actually either raise within the API, or there needs to be some "
    "validation here before to make sure it cannot happen."
)
```
I think it's fine to return a `np.nan` valued percentile (maybe with a warning?) if some array is all-nan valued.
I agree and have removed this from here (and also repaired the buggy loop).

But some further thoughts; I am not sure how important it is to act on them:

The reason I had put it here was that I had tested this on the branch from the nans-in-`SplineTransformer` PR where we add `test_spline_transformer_handles_all_nans` (quite a special case anyway), and this made it raise in a not very informative way. I think the error needs to be handled somehow, but within `SplineTransformer`, so that `_weighted_percentile` can stay flexible in case it's used differently in another context at some point.

For the other cases, we could raise a warning similar to how it's done in numpy:

/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/numpy/lib/_nanfunctions_impl.py:1623: RuntimeWarning: All-NaN slice encountered return fnb._ureduce(a, ...

But should we add a warning within a private method? I'm a bit hesitant, as this could be confusing for users who might not be aware of `_weighted_percentile` being used internally. Maybe this warning should instead be raised within the realm of whatever API method is using it, if need be. Candidates might be `AbsoluteError.fit_intercept_only()`, `PinballLoss.fit_intercept_only()`, `HuberLoss.fit_intercept_only()`, `KBinsDiscretizer.fit()`, and maybe I should add similar all-nan-column tests and see if they raise or if we want to add a warning. Does that make sense?
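For reference, the numpy warning mentioned above is easy to reproduce; this is only an illustration of the all-NaN-slice behavior, not code from the PR:

```python
import warnings
import numpy as np

X = np.full((5, 2), np.nan)
X[:, 0] = [1.0, 2.0, 3.0, 4.0, 5.0]  # one valid column, one all-NaN column

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = np.nanpercentile(X, 50, axis=0)

# The all-NaN column yields nan, and numpy emits a RuntimeWarning
# ("All-NaN slice encountered") instead of raising an error.
assert result[0] == 3.0 and np.isnan(result[1])
assert any("All-NaN slice" in str(w.message) for w in caught)
```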
I have looked at this again and went through all the places where `_weighted_percentile` gets used.

The losses I mentioned above (only internal use) as well as `set_huber_delta()` and `median_absolute_error()` find a percentile from within `y_true` or `y_true - some_other_1darray`, where I think it should be pretty obvious to the user if all values were nan. This is why I think it's okay without a warning.

`KBinsDiscretizer` and `SplineTransformer` use it on the whole `X` within their fit methods, and I think we could add a warning there for all-nan columns if this kind of validation is not already done somewhere else.
sklearn/utils/stats.py
Outdated
```python
while bool(np.isnan(percentile).any()) and (
    percentile_idx[np.isnan(percentile)].all() > 0
```
Should `bool(np.isnan(percentile).any())` be an if-condition inside the loop for better readability? It doesn't really constitute part of the looping condition.

Edit: I accidentally published that review comment too early. But to answer my own question: yes, I would think so.
I don't understand the `percentile_idx[np.isnan(percentile)].all() > 0` condition. `percentile_idx[np.isnan(percentile)]` is an array of (non-binary) integers, so calling `.all()` on it should be equivalent to `0 not in percentile_idx[np.isnan(percentile)]`, but then I do not see the need to compare that boolean to 0 with `> 0`.

Maybe you meant `(percentile_idx[np.isnan(percentile)] > 0).all()` instead?
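The operator-ordering pitfall is easy to demonstrate on a small array (illustrative values, not from the PR):

```python
import numpy as np

idx = np.array([0, 3, 5])

assert not idx.all()          # contains a 0, so .all() collapses to False
assert not (idx.all() > 0)    # False > 0 is still False: the > 0 is a no-op
assert not (idx > 0).all()    # the elementwise version happens to agree here...
assert (idx > 0).any()        # ...but .any() is True: some index is above 0
```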
Oh yes, that's been odd code, sorry for this.

What I meant in words is "while we're still above the lowest possible index for all the indices we're about to manipulate, enter the while loop...", which would be `(percentile_idx[np.isnan(percentile)] > 0).all()`. I will change that and modify the comment.

The idea of excluding `bool(np.isnan(percentile).any())` from the loop condition does not work (it enters an infinite loop when no nans are present).

Edit: Actually, I realized it should be `(percentile_idx[np.isnan(percentile)] > 0).any()`.
I am addressing your comments in my latest push, @ogrisel, thank you.

That test now fails because it is an actual test. Through it I became aware that we also need to mask out the nans for calculating the `percentile_idx`; simply ignoring nans within the indexing tasks is not enough. I am still trying to figure out how to do that using masked arrays, pushing the current state of my attempts.

Edit: I now found out what I did wrong before: I had created the nan mask on the input array, but it should have been on the sorted one. With the corrected nan mask we can indeed calculate `percentile_idx`, and using a masked array does not seem to be necessary anymore. I fixed the errors below and all the tests pass.
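The mask-alignment issue can be sketched in a few lines (illustrative data; NaN sorts to the end under `np.argsort`):

```python
import numpy as np

array = np.array([3.0, np.nan, 1.0, 2.0])
sorted_idx = np.argsort(array)  # NaN is placed last: [2, 3, 0, 1]

wrong_mask = np.isnan(array)              # aligned with the *input* order
right_mask = np.isnan(array[sorted_idx])  # aligned with the *sorted* order

assert wrong_mask.tolist() == [False, True, False, False]
assert right_mask.tolist() == [False, False, False, True]
```

Any index computed on the sorted data (like `percentile_idx`) must use the mask in sorted order, which is why building it on the raw input silently gave wrong results.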
What does this implement/fix? Explain your changes.
This PR adds support for `nan` input into `_weighted_percentile`, which is used within `SplineTransformer`, for instance.

Any other comments?
This is for @ogrisel since we had talked about that:

This does not deliver equivalent results as `np.nanpercentile` in the current dev-version, because they were different before: `_weighted_percentile` finds the next lower percentile, while `np.nanpercentile` seems to find its percentiles differently (closest?). The more variance in the `sample_weights`, the more important this effect gets.

Here an example without any nans and without weights:
output:
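The gap between the two conventions can be sketched on plain unweighted data; this uses numpy's `method` parameter (available since numpy 1.22) and is not the original snippet from this comment:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# np.percentile linearly interpolates between data points by default...
print(np.percentile(data, 50))                  # 2.5

# ...while a "lower"-style percentile returns the next data point at or
# below the requested rank, similar to what _weighted_percentile does.
print(np.percentile(data, 50, method="lower"))  # 2.0
```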
While setting `percentile = 67` leads to the same results.

Edit: What I wrote is actually not true. The differing output was due to not setting the sample weights of nan values to 0 before. Now that this is done, we get the same results, it seems.