MNT _weighted_percentile supports np.nan values #29034

base: main
Conversation
Some early feedback.
sklearn/utils/stats.py
Outdated
```
@@ -41,6 +41,8 @@ def _weighted_percentile(array, sample_weight, percentile=50):
    sorted_weights = np.take_along_axis(sample_weight, sorted_idx, axis=0)

    # Find index of median prediction for each sample
    # nan_mask = np.isnan(array)  # ???
    # weight_cdf[nan_mask] = 0
```
I think we need to zero the weights of nan values in the tiled version of `sample_weight`, before computing the cumsum, so that missing entries in a given column do not contribute to the empirical CDF estimated for that column.
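A minimal sketch of that idea for a single column; the variable names here are illustrative, not the actual ones in `stats.py`:

```python
import numpy as np

values = np.array([0.2, np.nan, 0.5, np.nan, 0.9])
weights = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Zero out the weights of NaN entries before the cumulative sum, so that
# missing values do not contribute to the empirical CDF of the column.
masked_weights = np.where(np.isnan(values), 0.0, weights)
weight_cdf = np.cumsum(masked_weights)

# The total mass is the sum of the non-missing weights only: 1 + 3 + 5.
assert weight_cdf[-1] == 9.0
```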
Note that when the weights are already passed as a 2d array, we should be careful not to mutate the user-provided input and instead make a copy of the weights before setting some of the entries to zero (I think).
Yes, this will eliminate the issue that varying weights for values that are nans could influence `_weighted_percentile` to deliver a different result.

> Note that when the weights are already passed as a 2d array, we should be careful not to mutate the user-provided input and instead make a copy of the weights before setting some of the entries to zero (I think).

I decided to set the weights from nan-values to 0 on `sorted_weights` (not on `sample_weight`), and I believe I thus don't need to make a copy, correct?
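That reasoning seems right: `np.take_along_axis` returns a freshly allocated array, so zeroing entries of `sorted_weights` cannot mutate the caller's weights. A quick sketch with illustrative data:

```python
import numpy as np

sample_weight = np.array([[1.0, 2.0], [3.0, 4.0]])
sorted_idx = np.argsort(sample_weight, axis=0)

# take_along_axis allocates a new array; it never aliases the input.
sorted_weights = np.take_along_axis(sample_weight, sorted_idx, axis=0)
sorted_weights[0, :] = 0.0

# The user-provided weights are untouched.
assert sample_weight[0, 0] == 1.0 and sample_weight[0, 1] == 2.0
```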
sklearn/utils/tests/test_stats.py
Outdated
```python
array[np.random.rand(*array.shape) < 0.8] = np.nan
weights = np.random.randint(1, 6, size=(100, 3))
```
Use the `rand` and `randint` methods of the `rng` instance instead of the methods from `np.random` that rely on a shared global `RandomState` instance.

This is necessary to isolate the behavior of this test from any side effect on the shared global numpy `RandomState` instance. Otherwise it makes it really hard to debug randomly failing tests.
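Concretely, the suggestion looks something like the following, assuming the test holds a seeded `RandomState` instance (the seed here is arbitrary):

```python
import numpy as np

rng = np.random.RandomState(42)

array = rng.rand(100, 3)
# Draw from the local rng, not the shared global np.random state:
array[rng.rand(*array.shape) < 0.8] = np.nan
weights = rng.randint(1, 6, size=(100, 3))

assert array.shape == (100, 3) and weights.shape == (100, 3)
```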
Oh thank you for that explanation. It makes total sense now.
sklearn/utils/tests/test_stats.py
Outdated
```python
def test_weighted_percentile_nan():
    """Test that np.nan is not returned as a value."""
```
We need to test something stronger: we need to check that calling `_weighted_percentile` on an array with nan values returns the same results as calling `_weighted_percentile` on a filtered version of the data, where all missing values and the corresponding weights have been removed.

This is easy to test for the following cases:
- 1d data and 1d weights;
- 2d data and 2d weights;

but it's a bit more challenging to implement and test for the:
- 2d data with shared 1d weights.
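For the 1d case, the suggested equivalence check could look roughly like the sketch below; `naive_weighted_percentile` is a hypothetical stand-in for illustration, not scikit-learn's actual implementation:

```python
import numpy as np

def naive_weighted_percentile(values, weights, percentile=50):
    # Hypothetical 1d stand-in: return the lowest value whose cumulative
    # weight reaches the requested fraction of the total weight.
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w)
    return v[np.searchsorted(cdf, percentile / 100 * cdf[-1])]

rng = np.random.RandomState(0)
values = rng.rand(50)
weights = rng.randint(1, 6, size=50).astype(float)
nan_mask = rng.rand(50) < 0.3

# NaN-aware call: zero the weights of NaN entries and push them to the end
# of the sort order (via +inf) so they can never be selected.
values_nan = np.where(nan_mask, np.inf, values)
weights_nan = np.where(nan_mask, 0.0, weights)
with_nans = naive_weighted_percentile(values_nan, weights_nan)

# Reference: drop the missing values and their weights entirely.
filtered = naive_weighted_percentile(values[~nan_mask], weights[~nan_mask])
assert with_nans == filtered
```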
sklearn/utils/stats.py
Outdated
```python
    "One feature contains too many NaN values. This error should "
    "actually either raise within the API, or there needs to be some "
    "validation here before to make sure it cannot happen."
)
```
I think it's fine to return a `np.nan` valued percentile (maybe with a warning?) if some array is all-nan valued.
I agree and have removed this from here (and also repaired the buggy loop).

But some further thoughts; I am not sure how important it is to act on them:

The reason I had put it here was that I had tested this on the branch from the nans-in-`SplineTransformer` PR where we add `test_spline_transformer_handles_all_nans` (quite a special case anyway), and this made it raise in a not very informative way. I think the error needs to be handled somehow, but within `SplineTransformer`, so that `_weighted_percentile` can stay flexible in case it's used differently in another context at some point.

For the other cases, we could raise a warning similar to how it's done in numpy:

/home/stefanie/.pyenv/versions/3.12.2/envs/scikit-learn_dev/lib/python3.12/site-packages/numpy/lib/_nanfunctions_impl.py:1623: RuntimeWarning: All-NaN slice encountered return fnb._ureduce(a, ...

But should we add a warning within a private method? I'm a bit hesitant, as this could be confusing for users who might not be aware of `_weighted_percentile` being used internally. Maybe this warning should instead be raised within the realm of whatever API method is using it, if need be. Candidates might be `AbsoluteError.fit_intercept_only()`, `PinballLoss.fit_intercept_only()`, `HuberLoss.fit_intercept_only()`, `KBinsDiscretizer.fit()`, and maybe I should add similar all-nan-column tests and see if they raise or if we want to add a warning. Does that make sense?
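For reference, the numpy warning mentioned above is easy to reproduce; this is only an illustration of the all-NaN-slice behavior, not code from the PR:

```python
import warnings
import numpy as np

X = np.full((5, 2), np.nan)
X[:, 0] = [1.0, 2.0, 3.0, 4.0, 5.0]  # one valid column, one all-NaN column

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = np.nanpercentile(X, 50, axis=0)

# The all-NaN column yields nan, and numpy emits a RuntimeWarning
# ("All-NaN slice encountered") instead of raising an error.
assert result[0] == 3.0 and np.isnan(result[1])
assert any("All-NaN slice" in str(w.message) for w in caught)
```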
I have looked at this again and went through all the places where `_weighted_percentile` gets used.

The losses I mentioned above (only internal use) as well as `set_huber_delta()` and `median_absolute_error()` find a percentile from within `y_true` or `y_true - some_other_1darray`, where I think it should be pretty obvious to the user if all values were nan. This is why I think it's okay without a warning.

`KBinsDiscretizer` and `SplineTransformer` use it on the whole `X` within their fit methods, and I think we could add a warning there for all-nan columns if this kind of validation is not already done somewhere else.
sklearn/utils/stats.py
Outdated
```python
while bool(np.isnan(percentile).any()) and (
    percentile_idx[np.isnan(percentile)].all() > 0
```
Should `bool(np.isnan(percentile).any())` be an if-condition inside the loop for better readability? It doesn't really constitute part of the looping condition.

Edit: I accidentally published that review comment too early. But to answer my own question: yes, I would think so.
I don't understand the `percentile_idx[np.isnan(percentile)].all() > 0` condition. `percentile_idx[np.isnan(percentile)]` is an array of (non-binary) integers, so calling `.all()` on it should be equivalent to `0 not in percentile_idx[np.isnan(percentile)]`, but then I do not see the need to compare that boolean to 0 with `> 0`.

Maybe you meant `(percentile_idx[np.isnan(percentile)] > 0).all()` instead?
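The operator-ordering pitfall is easy to demonstrate on a small array (illustrative values, not from the PR):

```python
import numpy as np

idx = np.array([0, 3, 5])

assert not idx.all()          # contains a 0, so .all() collapses to False
assert not (idx.all() > 0)    # False > 0 is still False: the > 0 is a no-op
assert not (idx > 0).all()    # the elementwise version happens to agree here...
assert (idx > 0).any()        # ...but .any() is True: some index is above 0
```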
Oh yes, that's been odd code, sorry for this.

What I meant in words is "while we're still above the lowest possible index for all the indices we're about to manipulate, enter the while loop...", which would be `(percentile_idx[np.isnan(percentile)] > 0).all()`. I will change that and modify the comment.

The idea of excluding `bool(np.isnan(percentile).any())` from the loop condition does not work (it enters an infinite loop when no nans are present).

Edit: Actually, I realized it should be `(percentile_idx[np.isnan(percentile)] > 0).any()`.
I am addressing your comments in my latest push, @ogrisel, thank you.

That test now fails because it is an actual test. Through it I became aware that we also need to mask out the nans for calculating the `percentile_idx`; simply ignoring nans within the indexing tasks is not enough. I am still trying to figure out how to do that using masked arrays, pushing the current state of my attempts.

Edit: I now found out what I did wrong before: I had created the nan mask on the input array, but it should have been on the sorted one. With the corrected nan mask we can indeed calculate `percentile_idx`, and using a masked array does not seem to be necessary anymore. I fixed the errors below and all the tests pass.
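The mask-alignment issue can be sketched in a few lines (illustrative data; NaN sorts to the end under `np.argsort`):

```python
import numpy as np

array = np.array([3.0, np.nan, 1.0, 2.0])
sorted_idx = np.argsort(array)  # NaN is placed last: [2, 3, 0, 1]

wrong_mask = np.isnan(array)              # aligned with the *input* order
right_mask = np.isnan(array[sorted_idx])  # aligned with the *sorted* order

assert wrong_mask.tolist() == [False, True, False, False]
assert right_mask.tolist() == [False, False, False, True]
```

Any index computed on the sorted data (like `percentile_idx`) must use the mask in sorted order, which is why building it on the raw input silently gave wrong results.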
What does this implement/fix? Explain your changes.
This PR adds support for `nan` input into `_weighted_percentile`, which is used within `SplineTransformer`, for instance.

Any other comments?
This is for @ogrisel since we had talked about that:

This does not deliver equivalent results as `np.nanpercentile` in the current dev-version, because they were different before: `_weighted_percentile` finds the next lower percentile, while `np.nanpercentile` seems to find its percentiles differently (closest?). The more variance in the `sample_weights`, the more important this effect gets.

Here an example without any nans and without weights:
output:
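The gap between the two conventions can be sketched on plain unweighted data; this uses numpy's `method` parameter (available since numpy 1.22) and is not the original snippet from this comment:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# np.percentile linearly interpolates between data points by default...
print(np.percentile(data, 50))                  # 2.5

# ...while a "lower"-style percentile returns the next data point at or
# below the requested rank, similar to what _weighted_percentile does.
print(np.percentile(data, 50, method="lower"))  # 2.0
```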
While setting `percentile = 67` leads to the same results.

Edit: What I wrote is actually not true. The differing output was due to not setting the sample weights of nan values to 0 before. Now that this is done, we get the same results, it seems.