Quick question before I fall back on trial and error, because my dataset is quite big: does anyone know whether changing the order of the partitioning attributes on a partitioned Parquet writer determines the partition folder hierarchy? The doc says it’s the schema order that determines the partition hierarchy, and I’m hoping that manually editing the writer attribute order, rather than keeping the reader’s attribute order, will set the hierarchy. Thanks.
Best answer by bruceharold
So here is what I found. Sorting on the partition fields is mandatory unless your data already arrives sorted, otherwise you’ll get many fragmented output files per partition leaf. The writer field order determines the partition folder hierarchy. You can exceed the Windows file handle limit if you partition on fields with a large number of unique values. It also helps usability a lot to replace characters that are problematic on Windows (such as colons) or on S3 (such as spaces) with, say, underscores, so partition folder names don’t get URI-encoded, which is ugly to read.
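To make those points concrete, here is a minimal sketch in plain Python. It does not call the parquet writer or any parquet library; it only models how folder paths would come out. The `field=value` folder naming, the set of replaced characters, and the helper names `sanitize`, `partition_path` and `group_by_partition` are all assumptions made for illustration.

```python
import re
from itertools import groupby

# Assumed set of characters to replace: Windows-reserved ones plus whitespace.
_UNSAFE = re.compile(r'[<>:"/\\|?*\s]')


def sanitize(value):
    """Replace each unsafe character with an underscore."""
    return _UNSAFE.sub("_", str(value))


def partition_path(record, fields):
    """Build a nested folder path; the order of `fields` sets the hierarchy."""
    return "/".join(f"{name}={sanitize(record[name])}" for name in fields)


def group_by_partition(records, fields):
    """Sort on the partition fields, then yield (path, rows) once per leaf."""
    def key(record):
        return tuple(sanitize(record[name]) for name in fields)

    # groupby only merges adjacent rows, so unsorted input would split a leaf
    # into several groups (the fragmented output files described above).
    for _, rows in groupby(sorted(records, key=key), key=key):
        rows = list(rows)
        yield partition_path(rows[0], fields), rows


# Hypothetical sample data, not taken from the original post.
records = [
    {"region": "North West", "captured": "2024-01-05T10:30", "id": 1},
    {"region": "South", "captured": "2024-01-05T10:30", "id": 2},
    {"region": "North West", "captured": "2024-01-05T10:30", "id": 3},
]

for path, rows in group_by_partition(records, ["region", "captured"]):
    print(path, [row["id"] for row in rows])
```

Swapping the field list to `["captured", "region"]` puts the `captured` folder at the top level instead, which mirrors the field-order behaviour reported above.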
@virtualcitymatt The partition hierarchy is indeed set by the writer field order, which is great news! I’m following up by looking into compression options, with SNAPPY ahead at present. I’ll SUM.