How to add pre-processing to PMML #16199

Open
wendycwong opened this issue May 16, 2024 · 5 comments
@wendycwong
Contributor

wendycwong commented May 16, 2024

A customer wants to add simple pre-processing to an XGBoost mojo. The catch is:

  1. The customer has an old mojo built with an earlier H2O-3 version;
  2. The customer has converted the mojo to PMML.

What I know is that we can add pre-processing to the current model and guard it with a flag, so that it is disabled by default.

The question is whether the converted PMML model can work with such a flag. However, the mojo to PMML conversion does not use any of the code in the genmodel directory, so this route would not work: anything added to genmodel will not translate into PMML.

@wendycwong wendycwong self-assigned this May 16, 2024
@wendycwong wendycwong changed the title Investigate if generic model can be saved as mojo again. Investigate old mojo being loaded with new h2o-3 version and add new mojo features this way. May 25, 2024
@wendycwong
Contributor Author

According to @narasimhard, the customer has already written a library to translate H2O-3 mojo to PMML:

They currently use a Java environment for the conversion. Here is a reference: https://github.com/jpmml/jpmml-h2o?tab=readme-ov-file#the-java-side-of-operations

java -jar pmml-h2o-example/target/pmml-h2o-example-executable-1.2-SNAPSHOT.jar --mojo-input mojo.zip --pmml-output mojo.pmml
It uses the JAR pmml-h2o-example-executable-1.2-SNAPSHOT.jar.

Using IntelliJ, I was able to generate PMML from an H2O-3 mojo using their org.jpmml.h2o.example.Main.java.

@wendycwong
Contributor Author

My idea here is to add more arguments to Main.java: if a specific argument, --fill-missing-values, is present, we will generate the PMML file with the preprocessing enabled.
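A minimal sketch of how such a flag could be recognized, using only the JDK. The class name ConverterArgs and its fields are hypothetical, and the real Main.java may well use a different argument parser. Only --mojo-input and --pmml-output come from the command shown earlier; --fill-missing-values is the proposed addition.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical holder for the converter's command-line options. */
final class ConverterArgs {
    String mojoInput;
    String pmmlOutput;
    boolean fillMissingValues = false; // preprocessing stays disabled by default

    /** Parses the arguments; unknown flags are rejected rather than ignored. */
    static ConverterArgs parse(String... args) {
        ConverterArgs parsed = new ConverterArgs();
        Iterator<String> it = Arrays.asList(args).iterator();
        while (it.hasNext()) {
            String arg = it.next();
            switch (arg) {
                case "--mojo-input":
                    parsed.mojoInput = requireValue(arg, it);
                    break;
                case "--pmml-output":
                    parsed.pmmlOutput = requireValue(arg, it);
                    break;
                case "--fill-missing-values": // proposed flag, takes no value
                    parsed.fillMissingValues = true;
                    break;
                default:
                    throw new IllegalArgumentException("Unknown argument: " + arg);
            }
        }
        return parsed;
    }

    private static String requireValue(String flag, Iterator<String> it) {
        if (!it.hasNext()) {
            throw new IllegalArgumentException(flag + " requires a value");
        }
        return it.next();
    }
}
```

Because the flag defaults to false, existing invocations without --fill-missing-values would keep producing the same PMML as before.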

From my reading on PMML, it is very easy to add missing value replacement: you add it to the mining schema.

Screenshot 2024-06-03 at 7 26 17 AM
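To make the target concrete, here is a small sketch that builds a MiningSchema fragment with the JDK's DOM API and prints it as XML. The attribute names missingValueReplacement and missingValueTreatment come from the PMML MiningSchema specification; the class name, the field names and the replacement values are made up for illustration, and this is not how jpmml-h2o builds its schema.

```java
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: renders a MiningSchema with missing value replacements. */
final class MiningSchemaSketch {
    /** @param replacements field name mapped to the value used when the field is missing */
    static String build(Map<String, String> replacements) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element schema = doc.createElement("MiningSchema");
            doc.appendChild(schema);
            for (Map.Entry<String, String> entry : replacements.entrySet()) {
                Element field = doc.createElement("MiningField");
                field.setAttribute("name", entry.getKey());
                // The value substituted by the scoring engine when the input is missing.
                field.setAttribute("missingValueReplacement", entry.getValue());
                // Optional hint describing how the replacement was derived.
                field.setAttribute("missingValueTreatment", "asValue");
                schema.appendChild(field);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not render MiningSchema", e);
        }
    }
}
```

Calling MiningSchemaSketch.build(Map.of("age", "35")) yields a single MiningField element for the hypothetical field age carrying both attributes.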

@wendycwong
Contributor Author

You can also look at the overview of variable scoping in PMML:
Screenshot 2024-06-03 at 7 27 16 AM

@wendycwong wendycwong changed the title Investigate old mojo being loaded with new h2o-3 version and add new mojo features this way. How to add pre-processing to PMML Jun 3, 2024
@wendycwong
Contributor Author

wendycwong commented Jun 3, 2024

The encodeSchema() method in Converter.java generates the DataField (DataDictionary) part of the PMML.

The call Model pmmlModel = encodeModel(schema) in Converter.java encodes the model information. In addition, encoder.encodePMML() is eventually called to do the actual encoding.

This in turn takes us to the encodePMML method of ModelEncoder.class. Because this is a compiled .class file, we do not have access to the actual .java file and cannot make changes there.

The change I recommend is in the encodeModel method of ModelEncoder.class. In particular, we will need to add to the MiningModel, where the mining schema resides.

For information on how the mining schema is specified, check out: https://dmg.org/pmml/v4-0-1/MiningSchema.html.

However, before we can add to the mining schema, we need to set up the proper TransformationDictionary in TransformationDictionary.class. It looks like there are three types of transforms: extensions, defineFunctions, and derivedFields. The TransformationDictionary is defined at line 51 of PMMLEncoder.class and is derived from the derivedFields and defineFunctions fields of H2OEncoder.

However, the encodeSchema() method in Converter.java does not add any derivedFields or defineFunctions. This is the first place to add something for the TransformationDictionary. Note that derivedFields and defineFunctions are members of the H2OEncoder class. Around line 69 of Converter.java, encoder.createSchema() is called, which (eventually) leads to the Schema() constructor of Schema.class. Note that nothing regarding derivedFields or defineFunctions is set there. We will probably need to figure out how to add those features to encoder.createSchema().

Inside XGBoostMojoModelConverter.java, at line 68, we do have a stream of MissingValueFeature for each feature. If you step into MissingValueFeature.class, you will see that if a feature's missing value is of interest, getDerivedName is called to create a new name of the form isMissing_feature_name.

Following the XGBoost mojo to PMML path, when XGBoostMojoModelConverter.java calls learner.encodeModel(options, xgbSchema), we end up in Learner.class, which is from org/jpmml/xgboost/lLearninger/encodeModel/Learner.class. This is an external library. Note that we have a MiningModel after the this.encodeModel() call at line 401.

Next, in Converter.java, the call to encoder.encodePMML() at line 99 leads to line 54 of ModelEncoder.class, and stepping in arrives at the encodePMML method at line 39, where we have:

  • if dataFields and derivedFields are not disjoint, an error is thrown. This should not happen.
  • lines 49-61: the transformationDictionary is generated from derivedFields and defineFunctions.
  • a PMML model is returned.
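The checks in this list can be pictured with plain JDK collections. This is only an illustration of the flow described above, not the jpmml-converter source: the class and method names are hypothetical, and the rule that the dictionary is only needed when derivedFields or defineFunctions is non-empty is my assumption.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of the steps listed for encodePMML. */
final class EncodePmmlFlowSketch {
    /** Step 1: a name may not be both a data field and a derived field. */
    static void requireDisjoint(Set<String> dataFields, Set<String> derivedFields) {
        if (!Collections.disjoint(dataFields, derivedFields)) {
            Set<String> clash = new TreeSet<>(dataFields);
            clash.retainAll(derivedFields); // keep only the colliding names for the message
            throw new IllegalArgumentException("Field names used twice: " + clash);
        }
    }

    /** Step 2 (assumption): a TransformationDictionary is only needed if there is something to put in it. */
    static boolean needsTransformationDictionary(Set<String> derivedFields, List<String> defineFunctions) {
        return !derivedFields.isEmpty() || !defineFunctions.isEmpty();
    }
}
```

With a data field age and a derived field isMissing_age the check passes; reusing the name age on both sides raises the error.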

Back at line 55 of ModelEncoder.class, the call model = this.encodeModel(model) goes to the encodeModel method in ModelEncoder.class. This is where the MiningModel is created if transformers are not empty. Unfortunately, we skip the part of the code where the MiningSchema is invoked: lines 76 to 84. At line 82, createModel leads to the createModelChain() method of MiningModelUtil.class.

The createModelChain() method of MiningModelUtil.class eventually calls createMiningSchema(models). We need to somehow enable the missing value schema creation in the createMiningSchema method.

New idea:
If we look at the (PMMLModel) pmml returned at line 67 of ModelEncoder.class, we see the following fields of interest:

Screenshot 2024-06-03 at 3 10 43 PM

This means that missingValueReplacement is present in the miningFields of the miningSchema, which is part of the MiningModel. If we can change it somewhere, we are golden.
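One possible place to change it, sketched under stated assumptions, is after the PMML has been written out: parse the XML and set the attribute on each MiningField of interest. This is a swapped-in technique using only the JDK's DOM API, not a change inside the encoder and not the jpmml-model object API. The class name, field names and replacement values are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical post-processor that patches MiningField elements in a PMML document. */
final class PmmlMissingValuePatcher {
    /** @param replacements field name mapped to its missing value replacement */
    static String patch(String pmmlXml, Map<String, String> replacements) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted input cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(pmmlXml)));
            // Matches MiningField elements at any depth, including nested segment schemas.
            NodeList fields = doc.getElementsByTagName("MiningField");
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                String replacement = replacements.get(field.getAttribute("name"));
                if (replacement != null) {
                    field.setAttribute("missingValueReplacement", replacement);
                    field.setAttribute("missingValueTreatment", "asValue");
                }
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not patch PMML", e);
        }
    }
}
```

Fields that are not in the map, such as the target, are left untouched, so the patch only affects the predictors that were explicitly listed.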

@wendycwong
Contributor Author

Inside MiningSchema.class, there are various fields along with corresponding URLs (http://www.dmg.org/PMML-4_4) that describe them. I hope the URLs make clear how to set those fields so as to set the missing value replacements for the various predictors.
