Solved

Partitioned Parquet writing - partition attribute order


bruceharold
Contributor

Hi

Quick question before I fall back on trial and error, since my dataset is quite big: does anyone know whether changing the partitioning attribute order on a partitioned Parquet writer determines the partition folder hierarchy?  The doc says it’s the schema order that determines the partition hierarchy, so I’m hoping that manually editing the writer attribute order will override the reader’s attribute order and set the hierarchy.  Thanks.


3 replies

virtualcitymatt
Celebrity

Let us know what you find out, Bruce.


bruceharold
Contributor
  • Author
  • Contributor
  • January 28, 2025

@virtualcitymatt The partition hierarchy is indeed set by the writer field order, which is great news!  I’m following up by looking into compression options, with SNAPPY ahead at present.  I’ll summarize what I find.
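To picture what "field order sets the hierarchy" means, here is a minimal Python sketch, not the writer's actual code. It assumes Hive-style `field=value` folder names, and `partition_path` is a hypothetical helper made up for illustration.

```python
from pathlib import PurePosixPath


def partition_path(record, partition_fields):
    """Build a relative folder path with one level per partition field.

    Assumption: folders are named Hive-style as field=value. The order of
    partition_fields stands in for the writer's field order.
    """
    parts = [f"{field}={record[field]}" for field in partition_fields]
    return PurePosixPath(*parts)


record = {"country": "NZ", "year": 2024, "id": 7}

# Same record, two field orders, two different folder hierarchies.
print(partition_path(record, ["country", "year"]))  # country=NZ/year=2024
print(partition_path(record, ["year", "country"]))  # year=2024/country=NZ
```

Reordering the list is the only change between the two calls, which mirrors reordering the attributes on the writer.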


bruceharold
Contributor
  • Author
  • Contributor
  • Best Answer
  • January 30, 2025

So here is what I found.  Sorting on the partition fields is mandatory, or you’ll get many fragmented output files per partition leaf (unless your data already arrives sorted).  The writer field order determines the partition folder hierarchy.  You can blow the Windows file handle limit if you partition on fields with large numbers of unique values.  It helps usability a lot if you replace characters that are illegal or awkward on Windows or S3, such as colons (Windows) or spaces (S3), with (say) underscores, so partition folder names don’t get URI-encoded, which is ugly to read.
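The sorting and character-replacement points can be sketched in plain Python. This is an illustration under stated assumptions, not the writer's behaviour: `sanitize` and `count_fragments` are hypothetical helpers, the character set is a rough guess at what is worth replacing, and the fragment count assumes a streaming writer that starts a new file whenever the partition key changes.

```python
import re
from itertools import groupby

# Assumption: characters reserved on Windows, plus whitespace, are replaced.
ILLEGAL = re.compile(r'[<>:"/\\|?*\s]')


def sanitize(value):
    """Replace each problem character in a partition value with an underscore."""
    return ILLEGAL.sub("_", str(value))


def count_fragments(records, partition_fields):
    """Count contiguous runs of identical partition keys.

    Under the stated assumption (a new file whenever the key changes),
    this is the number of output files produced.
    """
    def key(rec):
        return tuple(rec[f] for f in partition_fields)

    return sum(1 for _ in groupby(records, key=key))


records = [
    {"region": "north", "stamp": "10:30 AM"},
    {"region": "south", "stamp": "10:31 AM"},
    {"region": "north", "stamp": "10:32 AM"},
    {"region": "south", "stamp": "10:33 AM"},
]

print(sanitize(records[0]["stamp"]))  # 10_30_AM
print(count_fragments(records, ["region"]))  # 4 runs when unsorted
by_region = sorted(records, key=lambda r: r["region"])
print(count_fragments(by_region, ["region"]))  # 2 runs, one per partition
```

With only two partitions the difference is small, but on a large unsorted dataset the number of runs, and so of fragments, grows with how often the key flips.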

