Question

Parquet File not Readable over s3 because of slashes

  • November 27, 2024
  • 3 replies
  • 66 views

mccunee
Contributor

I have many Parquet files stored in an S3 bucket. When I use the S3 connector to download a Parquet file, I can read it successfully, but passing the S3 URI (“s3://”) directly to the reader raises an error:

PARQUET reader: PARQUET reader: Failed to open file 's3:\my-bucket-s3-test\fme-test\test.parquet' for reading.  Please ensure that the file exists and you have sufficient privileges to read it
Failed to obtain any schemas from reader 'PARQUET' from 1 datasets. This may be due to invalid datasets or format accessibility issues due to licensing, dependencies, or module loading. See logfile for more information

When I try to pass the url:

PARQUET reader: PARQUET reader: Failed to open file 'HTTPS:\s3.us-west-2.amazonaws.com\my-bucket-s3-test\fme-test\test.parquet' for reading.  Please ensure that the file exists and you have sufficient privileges to read it

 

When I use the FeatureReader, regardless of whether I specify a web connection, it converts forward slashes to backslashes. How do I work around this?

 

Context:

There will be thousands of files in this bucket. I am using “list” to generate features containing the path names, then using Automations to process the files in this bucket. I do not want to download all of these files, given the memory strain, and I need to use attributes to pass the URIs to the next part of the process.
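For anyone who ends up with the backslash form of the path in an attribute, here is a minimal Python sketch of the string handling only. It rebuilds a canonical `s3://bucket/key` URI from the mangled value shown in the error message and splits it into bucket and key. The function names `normalize_s3_uri` and `split_s3_uri` are hypothetical, not part of FME, and this does not by itself make the PARQUET reader accept the URI. It also assumes object keys contain no literal backslashes.

```python
def normalize_s3_uri(raw: str) -> str:
    """Rebuild 's3://bucket/key' from a path whose slashes were turned into backslashes."""
    # Assumption: keys never contain a real backslash, so every one can be flipped.
    text = raw.strip().replace("\\", "/")
    scheme, sep, rest = text.partition(":")
    if not sep or scheme.lower() != "s3":
        raise ValueError(f"not an S3 URI: {raw!r}")
    # Collapse whatever slashes followed the scheme into exactly two.
    return "s3://" + rest.lstrip("/")


def split_s3_uri(raw: str) -> tuple[str, str]:
    """Return (bucket, key) for an S3 URI, mangled or not."""
    uri = normalize_s3_uri(raw)
    bucket, _, key = uri[len("s3://"):].partition("/")
    if not bucket or not key:
        raise ValueError(f"missing bucket or key: {raw!r}")
    return bucket, key


if __name__ == "__main__":
    mangled = r"s3:\my-bucket-s3-test\fme-test\test.parquet"
    print(normalize_s3_uri(mangled))
    print(split_s3_uri(mangled))
```

The bucket and key pair is the shape most S3 download tools expect, so it may be more useful to carry those two values as attributes than the full URI.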

3 replies

hkingsbury
Celebrity
  • Celebrity
  • November 28, 2024

My understanding of how s3 works in FME is that you do need to download each file before being able to use it in a reader.

The approach I’d take in this scenario is:

  • Get a list of all the files in the bucket
  • Explode them to individual features
  • In a custom transformer (set to group by the file name)
    • Download the file
    • Read it
    • Delete the downloaded file
    • Perform any required analysis/processing
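Outside of FME, the download, read, delete loop above can be sketched in plain Python as follows. This is an illustration of the pattern under stated assumptions, not FME's implementation: `process_one_at_a_time` is a hypothetical helper, and the `download` and `read` callables are placeholders you would supply (for example, wrappers around boto3's `download_file` and pyarrow's `read_table`). Only one file is on disk at any time, which is the point of the approach.

```python
import os
import tempfile
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def process_one_at_a_time(
    keys: Iterable[str],
    download: Callable[[str, str], None],  # hypothetical: (key, local_path) -> None
    read: Callable[[str], T],              # hypothetical: local_path -> parsed data
) -> Iterator[T]:
    """Download each key to a temp file, read it, then delete the temp file."""
    for key in keys:
        # mkstemp returns an open descriptor; close it so download can write the file.
        fd, local_path = tempfile.mkstemp(suffix=".parquet")
        os.close(fd)
        try:
            download(key, local_path)
            yield read(local_path)
        finally:
            # Runs when the consumer asks for the next item, or if an error occurs.
            os.remove(local_path)
```

Because the function is a generator, the temporary file is removed before the next one is downloaded, so disk usage stays at one file regardless of how many keys are listed.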

mccunee
Contributor
  • Author
  • Contributor
  • December 3, 2024

OK, thank you @hkingsbury, the clarification is much appreciated. I did figure something out, but it still downloads the Parquet file to temp (which slightly undermines the purpose of cloud-optimized formats). In the Apache Parquet format, the drop-down arrow has a “browse web - select from s3” option. This reformats the URL into a form that Safe can read, which isn’t a typical S3 URL. I can then modify it in a text editor to use different inputs. This is kind of clunky, however, and the FeatureReader UI makes it really tough to modify S3 URIs. If any developers read this: improvements to the S3 browsing and parsing would be much appreciated.


hkingsbury
Celebrity
  • Celebrity
  • December 4, 2024

It would be worth creating an idea for this: https://community.safe.com/ideas



