about an error in the multi-task regression #199

Open
yxnyu opened this issue Dec 25, 2023 · 10 comments

@yxnyu

yxnyu commented Dec 25, 2023

Hi,

Thanks for your wonderful work. I would like to ask about an error I get in multi-task regression with Uni-Mol.

My data (SMILES plus label columns) looks like this:

SMILES,TARGET_0,TARGET_1,TARGET_2,TARGET_3,TARGET_4,TARGET_5,TARGET_6,TARGET_7,TARGET_8,TARGET_9,TARGET_10,TARGET_11,TARGET_12,TARGET_13,TARGET_14,TARGET_15,TARGET_16,TARGET_17,TARGET_18,TARGET_19,TARGET_20,TARGET_21,TARGET_22,TARGET_23,TARGET_24,TARGET_25,TARGET_26,TARGET_27,TARGET_28,TARGET_29,TARGET_30,TARGET_31,TARGET_32,TARGET_33,TARGET_34,TARGET_35,TARGET_36,TARGET_37,TARGET_38,TARGET_39,TARGET_40,TARGET_41,TARGET_42,TARGET_43,TARGET_44,TARGET_45,TARGET_46,TARGET_47,TARGET_48,TARGET_49,TARGET_50,TARGET_51,TARGET_52,TARGET_53,TARGET_54,TARGET_55,TARGET_56,TARGET_57,TARGET_58,TARGET_59,TARGET_60,TARGET_61,TARGET_62,TARGET_63,TARGET_64,TARGET_65,TARGET_66,TARGET_67,TARGET_68,TARGET_69,TARGET_70,TARGET_71,TARGET_72,TARGET_73,TARGET_74,TARGET_75,TARGET_76,TARGET_77,TARGET_78,TARGET_79,TARGET_80,TARGET_81,TARGET_82,TARGET_83,TARGET_84,TARGET_85,TARGET_86,TARGET_87,TARGET_88,TARGET_89,TARGET_90,TA

There are 140k SMILES and 700+ labels.
However, when I run it on Bohrium, Uni-Mol shows this:
image
But I checked my dataset:
image
I also tried converting my dataset to float16 and running Uni-Mol, but the result is the same.

I carefully checked the output: training is fine, but validation fails.
image

The full output is:
`2023-12-25 18:01:23 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:23 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:23 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:24 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:24 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:24 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:24 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:25 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:25 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:25 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:25 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:25 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:26 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:26 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:26 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:26 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:26 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:27 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:27 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-25 18:01:32 | unimol/data/conformer.py | 62 | INFO | Uni-Mol(QSAR) | Start generating conformers...
129817it [09:26, 228.99it/s]
2023-12-25 18:11:00 | unimol/data/conformer.py | 66 | INFO | Uni-Mol(QSAR) | Failed to generate conformers for 0.00% of molecules.
2023-12-25 18:11:00 | unimol/data/conformer.py | 68 | INFO | Uni-Mol(QSAR) | Failed to generate 3d conformers for 8.93% of molecules.
2023-12-25 18:11:00 | unimol/train.py | 88 | INFO | Uni-Mol(QSAR) | Output directory already exists: ./uv
2023-12-25 18:11:00 | unimol/train.py | 89 | INFO | Uni-Mol(QSAR) | Warning: Overwrite output directory: ./uv
2023-12-25 18:11:01 | unimol/models/unimol.py | 116 | INFO | Uni-Mol(QSAR) | Loading pretrained weights from /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/weights/mol_pre_all_h_220816.pt
2023-12-25 18:11:01 | unimol/models/nnmodel.py | 103 | INFO | Uni-Mol(QSAR) | start training Uni-Mol:unimolv1
val: 100%|██████████| 51/51 [00:14<00:00, 4.01it/s, Epoch=Epoch 1/20, loss=1.0418]

ValueError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 clf.fit('/personal/updated_combined_data_uv2_float16.csv')

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/train.py:56, in MolTrain.fit(self, data)
54 self.trainer = Trainer(save_path=self.save_path, **self.config)
55 self.model = NNModel(self.data, self.trainer, **self.config)
---> 56 self.model.run()
57 scalar = self.data['target_scaler']
58 y_pred = self.model.cv['pred']

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/models/nnmodel.py:120, in NNModel.run(self)
117 if fold > 0:
118 # need to initalize model for next fold training
119 self.model = self._init_model(**self.model_params)
--> 120 _y_pred = self.trainer.fit_predict(
121 self.model, traindataset, validdataset, self.loss_func, self.activation_fn, self.save_path, fold, self.target_scaler)
122 y_pred[te_idx] = _y_pred
124 if 'multiclass_cnt' in self.data:

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/tasks/trainer.py:157, in Trainer.fit_predict(self, model, train_dataset, valid_dataset, loss_func, activation_fn, dump_dir, fold, target_scaler, feature_name)
154 batch_bar.close()
155 total_trn_loss = np.mean(trn_loss)
--> 157 y_preds, val_loss, metric_score = self.predict(
158 model, valid_dataset, loss_func, activation_fn, dump_dir, fold, target_scaler, epoch, load_model=False, feature_name=feature_name)
159 end_time = time.time()
160 total_val_loss = np.mean(val_loss)

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/tasks/trainer.py:254, in Trainer.predict(self, model, dataset, loss_func, activation_fn, dump_dir, fold, target_scaler, epoch, load_model, feature_name)
252 inverse_y_preds = target_scaler.inverse_transform(y_preds)
253 inverse_y_truths = target_scaler.inverse_transform(y_truths)
--> 254 metric_score = self.metrics.cal_metric(
255 inverse_y_truths, inverse_y_preds, label_cnt=label_cnt) if not load_model else None
256 else:
257 metric_score = self.metrics.cal_metric(
258 y_truths, y_preds, label_cnt=label_cnt) if not load_model else None

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/utils/metrics.py:197, in Metrics.cal_metric(self, label, predict, nan_value, threshold, label_cnt)
195 def cal_metric(self, label, predict, nan_value=-1.0, threshold=0.5, label_cnt=None):
196 if self.task in ['regression', 'multilabel_regression']:
--> 197 return self.cal_reg_metric(label, predict, nan_value)
198 elif self.task in ['classification', 'multilabel_classification']:
199 return self.cal_classification_metric(label, predict, nan_value)

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/utils/metrics.py:175, in Metrics.cal_reg_metric(self, label, predict, nan_value)
172 metric, _, _ = metric_value
173 def nan_metric(label, predict): return cal_nan_metric(
174 label, predict, nan_value, metric)
--> 175 res_dict[metric_type] = nan_metric(label, predict)
177 return res_dict

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/utils/metrics.py:173, in Metrics.cal_reg_metric..nan_metric(label, predict)
--> 173 def nan_metric(label, predict): return cal_nan_metric(
174 label, predict, nan_value, metric)

File /opt/conda/lib/python3.8/site-packages/unimol-0.0.2-py3.8.egg/unimol/utils/metrics.py:49, in cal_nan_metric(y_true, y_pred, nan_value, metric_func)
47 _mask = mask[:, i]
48 if not (~_mask).all():
---> 49 result.append(metric_func(
50 y_true[:, i][_mask], y_pred[:, i][_mask]))
51 return np.mean(result)

File /opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py:63, in _deprecate_positional_args.._inner_deprecate_positional_args..inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
65 # extra_args > 0
66 args_msg = ['{}={}'.format(name, arg)
67 for name, arg in zip(kwonly_args[:extra_args],
68 args[-extra_args:])]

File /opt/conda/lib/python3.8/site-packages/sklearn/metrics/_regression.py:335, in mean_squared_error(y_true, y_pred, sample_weight, multioutput, squared)
274 @_deprecate_positional_args
275 def mean_squared_error(y_true, y_pred, *,
276 sample_weight=None,
277 multioutput='uniform_average', squared=True):
278 """Mean squared error regression loss.
279
280 Read more in the :ref:User Guide <mean_squared_error>.
(...)
333 0.825...
334 """
--> 335 y_type, y_true, y_pred, multioutput = _check_reg_targets(
336 y_true, y_pred, multioutput)
337 check_consistent_length(y_true, y_pred, sample_weight)
338 output_errors = np.average((y_true - y_pred) ** 2, axis=0,
339 weights=sample_weight)

File /opt/conda/lib/python3.8/site-packages/sklearn/metrics/_regression.py:89, in _check_reg_targets(y_true, y_pred, multioutput, dtype)
55 """Check that y_true and y_pred belong to the same regression task.
56
57 Parameters
(...)
86 the dtype argument passed to check_array.
87 """
88 check_consistent_length(y_true, y_pred)
---> 89 y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
90 y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)
92 if y_true.ndim == 1:

File /opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py:63, in _deprecate_positional_args.._inner_deprecate_positional_args..inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
65 # extra_args > 0
66 args_msg = ['{}={}'.format(name, arg)
67 for name, arg in zip(kwonly_args[:extra_args],
68 args[-extra_args:])]

File /opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py:720, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
716 raise ValueError("Found array with dim %d. %s expected <= 2."
717 % (array.ndim, estimator_name))
719 if force_all_finite:
--> 720 _assert_all_finite(array,
721 allow_nan=force_all_finite == 'allow-nan')
723 if ensure_min_samples > 0:
724 n_samples = _num_samples(array)

File /opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py:103, in _assert_all_finite(X, allow_nan, msg_dtype)
100 if (allow_nan and np.isinf(X).any() or
101 not allow_nan and not np.isfinite(X).all()):
102 type_err = 'infinity' if allow_nan else 'NaN, infinity'
--> 103 raise ValueError(
104 msg_err.format
105 (type_err,
106 msg_dtype if msg_dtype is not None else X.dtype)
107 )
108 # for object dtype data, we only check for NaNs (GH-13254)
109 elif X.dtype == np.dtype('object') and not allow_nan:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').`

I am wondering what I can do. I have carefully cleaned and checked my dataset and made sure it is fine, but I do not know whether Uni-Mol can handle such a large dataset.
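A check along these lines may help confirm whether every label really is finite and within float32 range before training. It is a minimal sketch using only the Python standard library. The function name `find_bad_labels` and the sample rows are made up for illustration, and it assumes the first column is `SMILES`, every other column is a numeric target, and an empty cell means a missing label.

```python
import csv
import io
import math

FLOAT32_MAX = 3.4028234663852886e38  # largest finite float32 value

def find_bad_labels(csv_file, smiles_col="SMILES"):
    """Return (row_number, column, raw_value) for every label that is
    non-numeric, NaN, infinite, or too large for float32."""
    bad = []
    reader = csv.DictReader(csv_file)
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, raw in row.items():
            # Skip surplus fields, the SMILES column, and empty (missing) cells.
            if column is None or column == smiles_col or raw is None or raw.strip() == "":
                continue
            try:
                value = float(raw)
            except ValueError:
                bad.append((row_number, column, raw))
                continue
            if not math.isfinite(value) or abs(value) > FLOAT32_MAX:
                bad.append((row_number, column, raw))
    return bad

# Tiny made-up example; for a real file use: with open(path, newline="") as f: ...
sample = io.StringIO(
    "SMILES,TARGET_0,TARGET_1\n"
    "CCO,0.5,1e40\n"
    "CCN,nan,0.0\n"
    "CCC,,2.5\n"
)
problems = find_bad_labels(sample)
print(problems)
```

An empty result only says the raw labels are clean; values produced later by a scaling step can still be non-finite.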

@KaiChen-lr

I had the same problem.

@Naplessss
Contributor

Hi, you can use the latest image (unimol-qsar:v0.5); the multi-task regression bugs should be fixed in the latest version.

@yxnyu
Author

yxnyu commented Dec 26, 2023

> Hi, you can use the latest image (unimol-qsar:v0.5); the multi-task regression bugs should be fixed in the latest version.

Thank you for your reply. I just tried it, but it still fails. I used several approaches to investigate the problem. I found that when I set all my values to zero, Uni-Mol can compute the validation metrics. I also tried a truncated version of my dataset with 4000 molecules, but that failed as well. I found that when Bohrium shows the following, validation does not work:

2023-12-26 16:41:32 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-26 16:41:32 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-26 16:41:32 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-26 16:41:32 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-26 16:41:32 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-26 16:41:32 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer.
2023-12-26 16:41:32 | unimol/data/datascaler.py | 83 | INFO | Uni-Mol(QSAR) | Auto select power transformer
image

image

Validation fails. I believe all my data is float32, so maybe the problem is in the data scaler? I have no idea how to solve it.
image

@yxnyu
Author

yxnyu commented Dec 26, 2023

https://drive.google.com/file/d/1HmEPKFl6Vn5r6AY9_OzYTrrXR9CA2lXX/view?usp=drive_link
Here is the link to the CSV of the truncated version.

@HongshuaiWang1
Contributor

Our testing confirms that v0.5 has removed the power transformer. Please check again which image you selected. In addition, we tested this data and found that it is very sparse, so we recommend changing the targetscaler to ‘none’ to avoid value overflow during the standard-deviation (std) standardization step.
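To illustrate one way standardization can go wrong on sparse targets, consider a label column that is constant (for example, all zeros) within a fold. The sketch below is plain NumPy with made-up arrays, not Uni-Mol's actual scaler code: with a standard deviation of zero, a naive z-score divides 0 by 0 and yields NaN, which is the kind of value the `ValueError` in the traceback above rejects.

```python
import numpy as np

# Made-up sparse targets: column 0 is all zeros, column 1 varies.
y = np.array([[0.0, 1.0],
              [0.0, 2.0],
              [0.0, 3.0],
              [0.0, 4.0]], dtype=np.float32)

mean = y.mean(axis=0)
std = y.std(axis=0)  # the std of the constant column is exactly 0

# Naive standardization: 0 / 0 in column 0 produces NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    z_naive = (y - mean) / std

# Per-column check: is every standardized value finite?
finite_per_column = np.isfinite(z_naive).all(axis=0)
print(finite_per_column)
```

Whether this exact mechanism is what happens inside Uni-Mol is an assumption; the maintainers only state that the overflow occurs during std standardization.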

@yxnyu
Author

yxnyu commented Dec 27, 2023

> After our testing, it has been confirmed that v0.5 has removed the power transformer. Please check the image you selected again. In addition, we tested this data and found that this data is very sparse, so it is recommended to change the targetscaler to ‘none’ to avoid the problem of value overflow during the std standardization process.

Thanks for the information. I also think the sparsity affects Uni-Mol. However, I just tried setting the targetscaler to ‘none’, like this:
image
image
The result seems to be the same. If your test succeeded, could you share some tips or settings?

I confirmed that v0.5 was selected.

@yxnyu
Author

yxnyu commented Dec 27, 2023

Thanks! I used target_normalize='none' and the training is OK!

@HongshuaiWang1
Contributor

> > After our testing, it has been confirmed that v0.5 has removed the power transformer. Please check the image you selected again. In addition, we tested this data and found that this data is very sparse, so it is recommended to change the targetscaler to ‘none’ to avoid the problem of value overflow during the std standardization process.
>
> Thanks for your information. I also think the sparse problem has an impact on unimol. However, I just try the targetscaler to ‘none’ like image image And the result seems like the same? If you have successfully test, would you like to share some tips or setting?
>
> I comfirmed that v0.5 was successfully selected.

Sorry. The actual name of the target scaler parameter in the interface is 'target_normalize'; you can set it to 'none'. I think that will work.
reg = MolTrain(......
    target_normalize='none',
    ...... )

We are also working on a function that automatically handles overflow values, so that this situation is handled well.

@Naplessss
Contributor

BTW, you can improve your data preprocessing by incorporating domain expertise; this may involve manual normalization of the target variable, anomaly detection, and other specialized techniques. Uni-Mol has some automatic preprocessing strategies, but they do not cover every case.
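As one possible form of the manual normalization mentioned here, the targets could be standardized per column before training, and Uni-Mol then run with `target_normalize='none'`. This is a sketch under assumptions: `fit_target_stats`, `normalize_targets` and `denormalize_targets` are hypothetical helper names, missing labels are assumed to be stored as NaN, and columns with zero variance are left unscaled instead of being divided by zero.

```python
import numpy as np

def fit_target_stats(y):
    """Per-column mean and std, ignoring NaN (missing) labels."""
    mean = np.nanmean(y, axis=0)
    std = np.nanstd(y, axis=0)
    # Constant columns have std 0; use 1.0 so the division below is safe.
    std = np.where(std > 0, std, 1.0)
    return mean, std

def normalize_targets(y, mean, std):
    """Z-score each column; NaN labels stay NaN."""
    return (y - mean) / std

def denormalize_targets(z, mean, std):
    """Map normalized values (for example predictions) back to original units."""
    return z * std + mean

# Made-up targets: column 0 is constant, column 1 has one missing label.
y = np.array([[0.0, 1.0],
              [0.0, np.nan],
              [0.0, 3.0]], dtype=np.float64)

mean, std = fit_target_stats(y)
z = normalize_targets(y, mean, std)
restored = denormalize_targets(z, mean, std)
```

The statistics should be computed on the training split only and reused for validation and prediction, so that predictions can be mapped back with `denormalize_targets`.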

@yxnyu
Author

yxnyu commented Dec 28, 2023

Thanks! For such a big dataset, I am wondering whether the Uni-Mol tool has parameters for multi-GPU training?
