Hi

Quick question before I fall back on trial and error (my dataset is quite big): does anyone know whether changing the order of the partitioning attributes on a partitioned Parquet writer determines the partition folder hierarchy? The docs say it's the schema order that determines the partition hierarchy. I'm hoping that manually editing the writer's attribute order will override the order inherited from the reader and set the hierarchy. Thanks.

Let us know what you find out, Bruce.


@virtualcitymatt The partition hierarchy is indeed set by the writer field order, which is great news! I'm following up by looking into compression options, with SNAPPY ahead at present. I'll post a summary.


So here is what I found. Sorting on the partition fields is mandatory unless your data already arrives sorted; otherwise you'll get many fragmented output files per partition leaf. The writer field order determines the partition folder hierarchy. You can exceed the Windows file handle limit if you partition on fields with large numbers of unique values. It helps usability a lot to replace characters that are illegal or awkward in paths, such as colons (Windows) or spaces (S3), with (say) underscores, so the partition folder names don't get URI-encoded, which is ugly to read.
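
To make the findings above concrete, here is a minimal Python sketch. It is not the writer's actual implementation. It assumes Hive-style `field=value` folder names and a simplified streaming writer that starts a new file whenever the partition key changes; the field names, sample records, and the set of replaced characters are all made up for illustration.

```python
import re
from itertools import groupby

# Assumption: characters worth replacing in folder names. The colon is
# illegal on Windows; whitespace tends to get URI-encoded in S3 keys.
UNSAFE = re.compile(r'[<>:"/\\|?*\s]')

def sanitize(value):
    """Replace path-unsafe characters with underscores."""
    return UNSAFE.sub("_", str(value))

def partition_path(record, field_order):
    """Build a Hive-style folder path; field_order sets the hierarchy."""
    return "/".join(f"{f}={sanitize(record[f])}" for f in field_order)

def count_fragments(records, field_order):
    """Simulate a streaming writer that opens a new output file every
    time the partition key changes between consecutive records."""
    keys = (partition_path(r, field_order) for r in records)
    return sum(1 for _ in groupby(keys))

# Hypothetical sample data, deliberately interleaved (unsorted).
records = [
    {"region": "North West", "year": 2023, "time": "10:30"},
    {"region": "South", "year": 2024, "time": "11:00"},
    {"region": "North West", "year": 2023, "time": "12:15"},
    {"region": "South", "year": 2024, "time": "09:45"},
]

order = ["year", "region"]  # the first field becomes the top-level folder

first_path = partition_path(records[0], order)
swapped_path = partition_path(records[0], list(reversed(order)))

unsorted_files = count_fragments(records, order)
sorted_records = sorted(records, key=lambda r: tuple(r[f] for f in order))
sorted_files = count_fragments(sorted_records, order)

print(first_path)
print(swapped_path)
print(unsorted_files, sorted_files)
```

In this toy model, the interleaved input produces one file fragment per record, while sorting on the partition fields first collapses that to one file per partition leaf. Reversing `order` flips which field becomes the top-level folder.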