Solved

Partitioned Parquet writing - partition attribute order

2 months ago
January 27, 2025
3 replies
47 views


bruceharold
Contributor
336 replies

Hi

Quick question before I fall back on trial and error, because my dataset is quite big: does anyone know whether changing the partitioning attribute order on a partitioned Parquet writer determines the partition folder hierarchy? The doc says it’s the schema order that determines the partition hierarchy, and I’m hoping to override the reader’s attribute order. Specifically, I’m hoping that manually editing the writer attribute order will determine the partition hierarchy. Thanks.



virtualcitymatt
Celebrity
1831 replies
2 months ago
January 28, 2025

Let us know what you find out, Bruce.


bruceharold
Author
Contributor
336 replies
2 months ago
January 28, 2025

@virtualcitymatt The partition hierarchy is indeed set by the writer field order, which is great news! I’m following up by looking into compression options, with SNAPPY ahead at present. I’ll post a summary.
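For anyone who wants to picture what "field order sets the hierarchy" means, here is a minimal Python sketch. It assumes Hive-style `field=value` folder names, which is an assumption for illustration and not something confirmed in this thread; `partition_path` and the sample record are hypothetical and not part of FME.

```python
def partition_path(record, partition_fields):
    """Build a nested folder path, one level per partition field, in the given order.

    Assumes Hive-style `field=value` folder names (an illustrative assumption).
    """
    return "/".join(f"{name}={record[name]}" for name in partition_fields)


# Hypothetical sample record; the field names are made up for the example.
record = {"country": "CA", "year": 2024, "month": 1}

# The same record lands in a different folder hierarchy depending on field order.
by_country_first = partition_path(record, ["country", "year", "month"])
by_year_first = partition_path(record, ["year", "month", "country"])

print(by_country_first)  # country=CA/year=2024/month=1
print(by_year_first)     # year=2024/month=1/country=CA
```

The point is only that the first field in the list becomes the top-level folder, so reordering the fields reorders the folder levels.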


bruceharold
Author
Contributor
336 replies
Best Answer
2 months ago
January 30, 2025

So here is what I found. Sorting on the partition fields is mandatory, or you’ll get many fragmented output files per partition leaf (unless your data already arrives sorted). The writer field order determines the partition folder hierarchy. You can exceed the Windows file handle limit if you partition on fields with large numbers of unique values. It also helps usability a lot if you replace characters that are illegal or awkward on Windows or S3, such as colons (Windows) or spaces (S3), with (say) underscores, so partition folder names don’t get URI-encoded, which is ugly to read.
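The two data-preparation points above (sort on the partition fields, and clean up partition values) can be sketched in plain Python. This is a rough model under stated assumptions, not how the FME writer works internally: `sanitize`, `count_runs`, the character set in `_UNSAFE`, and the sample records are all hypothetical. Each contiguous run of identical partition keys stands in for one potential output fragment.

```python
import re
from itertools import groupby

# Assumed set of troublesome characters: Windows-reserved characters plus whitespace.
_UNSAFE = re.compile(r'[<>:"/\\|?*\s]')


def sanitize(value):
    """Replace characters that are illegal or awkward in folder names with underscores."""
    return _UNSAFE.sub("_", str(value))


def count_runs(records, partition_fields):
    """Count contiguous runs of identical partition keys, in stream order."""
    keys = (tuple(r[f] for f in partition_fields) for r in records)
    return sum(1 for _ in groupby(keys))


# Hypothetical records arriving interleaved rather than sorted.
records = [
    {"region": "North West", "time": "09:30"},
    {"region": "South", "time": "10:00"},
    {"region": "North West", "time": "09:30"},
    {"region": "South", "time": "10:00"},
]
fields = ["region", "time"]

unsorted_runs = count_runs(records, fields)
sorted_records = sorted(records, key=lambda r: tuple(r[f] for f in fields))
sorted_runs = count_runs(sorted_records, fields)

print(unsorted_runs, sorted_runs)  # 4 2
print(sanitize("North West"), sanitize("09:30"))  # North_West 09_30
```

With only two distinct partition keys, the unsorted stream switches partitions four times while the sorted stream touches each partition once, which is the fragmentation effect described above in miniature.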
