
Can't write parquet of mixed types #159

Open
zenazn opened this issue Jul 13, 2022 · 4 comments

Comments

@zenazn

zenazn commented Jul 13, 2022

(Thank you so much for this project! I was about to write the very same thing and was fortunate enough to stumble upon your version)

The following code fails in version v0.3.1

// Assumed imports (not shown in the original report); the exact
// parquet-wasm entry point depends on the bundle target (node/esm/bundler).
import { tableFromArrays, tableToIPC } from 'apache-arrow';
import { Compression, Encoding, WriterPropertiesBuilder, writeParquet2 } from 'parquet-wasm';

const table = tableFromArrays({
  name: ['carl'],
  fav_number: [3],
});
const arrowBytes = tableToIPC(table, 'file');
const writerProps = new WriterPropertiesBuilder()
  .setCompression(Compression.SNAPPY)
  .setEncoding(Encoding.PLAIN_DICTIONARY) // <-- encoding line
  .build();
const bytes = writeParquet2(arrowBytes, writerProps);
return bytes;

In particular, it throws the following error: External format error: Invalid argument error: The datatype Float64 cannot be encoded by PlainDictionary.

If I comment out the encoding line above, I instead get the following error: External format error: Not yet implemented: Dictionary arrays only support dictionary encoding

I don't think any single parquet encoding works for both strings and numbers—instead, encodings need to be settable or inferred per field, which requires an API change of some sort to WriterPropertiesBuilder.

@kylebarron
Owner

I don't think any single parquet encoding works for both strings and numbers

I think more specifically... some parquet encodings cannot be used for both strings and numbers. In the test case, read-write-read works:

const initialTable = tableFromIPC(wasm.readParquet2(arr));
const writerProperties = new wasm.WriterPropertiesBuilder().build();
const parquetBuffer = wasm.writeParquet2(
  tableToIPC(initialTable, "file"),
  writerProperties
);
const table = tableFromIPC(wasm.readParquet2(parquetBuffer));

for this table that has both strings and numbers:

"str": pa.array(["a", "b", "c", "d"], type=pa.string()),
"uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
"int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
"bool": pa.array([True, True, False, False], type=pa.bool_()),

when using the default Plain encoding:

let encoding = arrow2::io::parquet::write::Encoding::Plain;

But it makes sense that dictionary encodings won't work on some data types.

On the arrow1 side, metadata for column encoding is supported natively:

Self(self.0.set_column_encoding(column_path, value.to_arrow1()))

On the arrow2 side, we store a single encoding:

encoding: arrow2::io::parquet::write::Encoding,

so we'd need to store some sort of mapping from column name to encoding (with a separate fallback encoding). The writer properties don't have any knowledge of the dataset schema, so they wouldn't be able to catch mismatched column names or similar.

@zenazn
Author

zenazn commented Jul 14, 2022

Oh interesting! It looks like Arrow is inferring dictionaries for strings in tableFromArrays instead of plain values, which your example parquet file uses:

https://github.com/apache/arrow/blob/e766828c699c6c74eba3b8c5de99e541017b8b9e/js/src/factories.ts#L142

I think I can work around this! Although, of course, it would be great to have support for arrow dictionaries turning into one of the parquet dictionary encodings by default.

@kylebarron
Owner

Ah interesting. My test case isn't testing writing a table initially created in arrow.js, because it loads a table saved from pyarrow.

Regardless, as I mentioned above, I think it would be good to support column-specific encodings. They're already supported in arrow1; we just need to change encoding to a hashmap in the arrow2 writer properties.

@zenazn
Author

zenazn commented Jul 14, 2022

Agreed!

FWIW, forcing the type Utf8 for strings (instead of Dictionary<...>) is a decent workaround!
