How to add pre-processing to PMML #16199
According to @narasimhard, the customer has already written a library to translate an H2O-3 MOJO to PMML. They currently run the conversion in a Java environment; for reference, see https://github.com/jpmml/jpmml-h2o?tab=readme-ov-file#the-java-side-of-operations:
    java -jar pmml-h2o-example/target/pmml-h2o-example-executable-1.2-SNAPSHOT.jar --mojo-input mojo.zip --pmml-output mojo.pmml
Using IntelliJ, I was able to generate PMML from an H2O-3 MOJO using their org.jpmml.h2o.example.Main.java.
The encodeSchema() method in Converter.java generates the information for the DataField (DataDictionary) part of the PMML. The call Model pmmlModel = encodeModel(schema) in Converter.java encodes the model information. It eventually calls encoder.encodePMML() to do the actual encoding, which takes us to the encodePMML method of ModelEncoder.class. Because this is a compiled .class file, we do not have access to the actual .java file and cannot make changes there.
The final action I recommend is in the encodeModel method of ModelEncoder.class. In particular, we will need to add to the MiningModel, where the mining schema resides. For information on how the mining schema is specified, check out: https://dmg.org/pmml/v4-0-1/MiningSchema.html.
However, before we can add to the mining schema, we need to set up the proper TransformationDictionary in TransformationDictionary.class. It looks like there are three types of transforms: extensions, defineFunctions and derivedFields. The TransformationDictionary is defined in ln 51 of PMMLEncoder.class and is derived from two fields of H2OEncoder: derivedFields and defineFunctions. However, the encodeSchema() method in Converter.java does not add any derivedFields or defineFunctions. This is the first place to add something regarding the TransformationDictionary. Note that derivedFields and defineFunctions are members of the H2OEncoder class.
Around ln 69 of Converter.java, encoder.createSchema() is called, which (eventually) leads to the Schema() constructor of Schema.class. Note that nothing regarding derivedFields or defineFunctions is set here. We will probably need to figure out how to add those features into encoder.createSchema().
Inside XGBoostMojoModelConverter.java, at line 68, we do have a stream of MissingValueFeature for each feature.
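To make the TransformationDictionary target concrete, here is a minimal sketch of the kind of entry it could hold: a DerivedField that flags whether a feature is missing, using the PMML built-in function isMissing. It builds the XML with the JDK's DOM API only, not with the JPMML classes, so it shows the intended output rather than how H2OEncoder would produce it. The class name, the feature name "age", and the use of the isMissing_ prefix as the field name are assumptions for illustration, and the PMML namespace is omitted for brevity.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of jpmml-h2o: shows the PMML fragment we are after.
class IsMissingDerivedFieldSketch {

    /** Builds a TransformationDictionary with one isMissing_<feature> DerivedField. */
    static Document build(String feature) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element dictionary = doc.createElement("TransformationDictionary");
            doc.appendChild(dictionary);

            // The derived field is a boolean flag computed from the raw feature.
            Element derivedField = doc.createElement("DerivedField");
            derivedField.setAttribute("name", "isMissing_" + feature);
            derivedField.setAttribute("optype", "categorical");
            derivedField.setAttribute("dataType", "boolean");
            dictionary.appendChild(derivedField);

            // <Apply function="isMissing"> is a PMML built-in function.
            Element apply = doc.createElement("Apply");
            apply.setAttribute("function", "isMissing");
            derivedField.appendChild(apply);

            Element fieldRef = doc.createElement("FieldRef");
            fieldRef.setAttribute("field", feature);
            apply.appendChild(fieldRef);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build the PMML fragment", e);
        }
    }

    /** Serializes the DOM document to an indented XML string without the XML declaration. */
    static String toXml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize the PMML fragment", e);
        }
    }

    public static void main(String[] args) {
        // "age" is a placeholder feature name.
        System.out.println(toXml(build("age")));
    }
}
```

Whatever route is taken inside the converter, the resulting mojo.pmml can be checked against a fragment of this shape.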
If you step into MissingValueFeature.class, you will see that if a feature's missing value is of interest, getDerivedName is called to create a new name of the form isMissing_feature_name. Following the XGBoost MOJO to PMML path, when XGBoostMojoModelConverter.java calls learner.encodeModel(options, xgbSchema), we end up in Learner.class (org/jpmml/xgboost/Learner.class), which comes from an external library. Note that we have a MiningModel after the this.encodeModel() call in line 401. Next, in Converter.java, the call to encoder.encodePMML() in line 99 leads to line 54 of ModelEncoder.class, and if you step in you will arrive at ln 39, method encodePMML, where we have:
Back at line 55 of ModelEncoder.class, the call model = this.encodeModel(model) goes to the encodeModel method in ModelEncoder.class. This is where the MiningModel is created if the transformers are not empty. Unfortunately, in our case execution skips the part of the code where the MiningSchema is invoked: lines 76 to 84. In line 82, createModel leads to the createModelChain() method of MiningModelUtil.class, which eventually calls createMiningSchema(models). We need to somehow enable the missing value schema creation in the createMiningSchema method.
New idea: the missingValueReplacement is present in the miningFields of the miningSchema, which is part of the MiningModel. If we can change it somewhere, we are golden.
Inside MiningSchema.class, there are various fields, along with the corresponding URLs (http://www.dmg.org/PMML-4_4) that describe them. I hope the URLs make it clear how to set those fields in order to set the missing value replacements for the various predictors.
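As one possible way to act on the missingValueReplacement idea without touching the compiled encoder classes, the generated PMML could be post-processed. The sketch below is only an assumption about how that might look: it uses the JDK's DOM API, not JPMML, to set the missingValueReplacement attribute (defined by the PMML MiningField element) on every MiningField whose name appears in a map. The class name, the sample document, and the replacement values are hypothetical. It also touches MiningFields in nested segment schemas, which may or may not be what is wanted for an XGBoost model.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical post-processing step, not part of jpmml-h2o.
class MissingValueReplacementSketch {

    /** Parses a PMML string into a namespace-aware DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the PMML document", e);
        }
    }

    /**
     * Sets missingValueReplacement on every MiningField named in the map.
     * Returns the number of MiningField elements that were changed.
     */
    static int applyReplacements(Document pmml, Map<String, String> replacements) {
        // "*" matches any namespace, so this works across PMML schema versions.
        NodeList fields = pmml.getElementsByTagNameNS("*", "MiningField");
        int changed = 0;
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            String replacement = replacements.get(field.getAttribute("name"));
            if (replacement != null) {
                field.setAttribute("missingValueReplacement", replacement);
                changed++;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // Tiny stand-in for the converter output; a real mojo.pmml is much larger.
        String xml = "<PMML xmlns=\"http://www.dmg.org/PMML-4_4\" version=\"4.4\">"
                + "<MiningModel functionName=\"regression\"><MiningSchema>"
                + "<MiningField name=\"age\"/><MiningField name=\"income\"/>"
                + "</MiningSchema></MiningModel></PMML>";
        Document doc = parse(xml);
        // Placeholder replacement value for the placeholder feature "age".
        int changed = applyReplacements(doc, Map.of("age", "0"));
        System.out.println("MiningField elements updated: " + changed);
    }
}
```

The trade-off is that this edits the output rather than the conversion itself, so it would have to be rerun after every conversion.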
A customer wants to add simple pre-processing to an XGBoost MOJO. However, here is the catch:
What I know is that we can add pre-processing to the current model and put it behind a flag, so that it is disabled by default.
The question is whether the translated PMML can work with such a flag. During the MOJO to PMML conversion, none of the code in the genmodel directory is used, so this route would not work: anything added to genmodel will not translate into PMML.