
Hi all,

I have a question about how FME and S3 interact with each other. I'm working with a large amount of data that will likely grow very fast (terabytes). To work out how we could store the data efficiently and later process it with FME, I'm running some basic tests with Amazon S3. I can store those terabytes in Amazon, but when I try to work with them using the S3ObjectLister and S3Downloader, I find that you really have to download the data to process it, so all in all you need extra storage on your server or desktop. So, for example, if I want to work with 5 GB of cloud data stored in S3, do I first need to download it locally and then work normally in FME? I have not found a way to connect the S3Downloader directly to a FeatureReader so that I don't need to download all the data and can process it on the fly. I'm not sure whether I explained this correctly, or whether I'm missing something in the FME documentation that would let me do it another way...

Sounds like you should look into FME Cloud. Since it runs on Amazon, it has lightning-fast access to S3.



I had the same initial idea, but I am still wondering whether it actually makes any difference, in the sense that the FME Cloud instance will still have to download the data first...

 

 



Yeah, that's the way FME works, everything happens locally...

 

 


There is no API for us to, in general, access data directly from S3 in FME readers. In all cases, we must first download whatever we're going to work on, then do the work, then clean up. Even the new FME 2017 web-as-a-filesystem feature, which lets us read "directly" from Dropbox, Box, Google Drive, etc., does this same two-step dance. But it still simplifies things for users.
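The download, work, clean up sequence described above can be sketched in plain Python. This only illustrates the pattern and is not how FME implements it: `process_remote_object`, `fetch`, and `work` are hypothetical names, and `fetch` stands in for whatever actually pulls the object down from S3.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def process_remote_object(
    fetch: Callable[[Path], None],
    work: Callable[[Path], T],
    filename: str = "object.dat",
) -> T:
    """Download to scratch space, process the local copy, then clean up."""
    # The scratch directory (and the downloaded copy inside it) is deleted
    # when the `with` block exits, even if `work` raises an exception.
    with tempfile.TemporaryDirectory() as scratch:
        local_path = Path(scratch) / filename
        fetch(local_path)        # step 1: pull the object down to local disk
        return work(local_path)  # step 2: do the work on the local copy
```

The point of the sketch is that scratch space is only needed for as long as one object is being processed, so the peak local storage is set by the largest object handled at a time rather than by the size of the whole bucket.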



If the download is the bulk of the translation time, using an FME Cloud instance (or any FME Server running on AWS) would have a massive impact. When you pull data from S3 and stay within an AWS region, I have seen download speeds of up to 60 MB/s (5 GB of data would take about 90 seconds). Downloading across the general internet obviously depends on your connection, but you are unlikely to ever get over 20 MB/s (probably more like 1-2 MB/s) even with the quickest connection.
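As a rough check on those numbers, here is a small back-of-the-envelope calculation. It assumes decimal units (1 GB = 1000 MB) and a constant transfer rate, which real transfers will not have; `transfer_seconds` is just an illustrative helper, and the rates are the ones quoted above.

```python
def transfer_seconds(size_gb: float, rate_mb_per_s: float) -> float:
    """Seconds needed to move size_gb gigabytes at rate_mb_per_s MB/s."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_gb * 1000 / rate_mb_per_s  # decimal units: 1 GB = 1000 MB


# Rates taken from the post: 60 MB/s inside AWS, 20 MB/s best case and
# about 2 MB/s typical over the general internet.
for label, rate in [
    ("within an AWS region", 60),
    ("fast internet link", 20),
    ("typical internet link", 2),
]:
    print(f"{label}: {transfer_seconds(5, rate):.0f} s")
```

For 5 GB this gives roughly 83 seconds inside AWS, 250 seconds at 20 MB/s, and 2500 seconds (about 42 minutes) at 2 MB/s.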

 

 


As Dale says, you will always have to pull the data down, so the goal is to run FME as close to the data as possible. I would recommend using FME Cloud (which runs on AWS), or FME running on AWS. This means that when you pull data from S3 you stay on the Amazon network and get download/upload speeds up to 10x faster.

Failing that, here are a few other things to try. In FME 2017 we completely overhauled all the AWS transformers, including migrating to the latest SDK, so I would try it in 2017. We also added support for S3 acceleration, which promises to speed up downloads and uploads dramatically. I haven't done any benchmarks yet, but it would be worth a try.


Thanks all, I will evaluate the different options shown here.



Great news :)

 

I'm currently reading files into FME directly from S3 (within AWS) over HTTP since that offered the cleanest solution for us and the performance was also quite good. With the new transformers, would you recommend using the S3Downloader first, followed by a FeatureReader? So the question is: will the new transformer beat regular HTTP transfer in terms of speed?

 

 



It should. I think the Java SDK we use is a wrapper on top of HTTP, but they might optimize how they chunk and send the data. The S3 acceleration should definitely improve things. Maybe give it a go and let me know :-)

 

 

To read files in, I would use an S3Downloader followed by a FeatureReader, yes.
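For comparison, the plain HTTP approach mentioned in the question boils down to streaming the object to a local file in chunks. The sketch below uses only the Python standard library; it is not the S3Downloader or the AWS SDK, `download_to_file` is a hypothetical helper, and it assumes the object is reachable at a URL without extra authentication.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to_file(url: str, destination: Path, chunk_size: int = 1024 * 1024) -> int:
    """Stream `url` to `destination` in fixed-size chunks; return bytes written."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        # copyfileobj reads and writes chunk_size bytes at a time, so the
        # whole object never has to fit in memory.
        shutil.copyfileobj(response, out, chunk_size)
    return destination.stat().st_size
```

A call would look like `download_to_file("https://example-bucket.s3.amazonaws.com/data.zip", Path("data.zip"))`, where the bucket URL is a placeholder. Either way the file still lands on local disk before a FeatureReader can open it, so any speed difference comes from how the transfer itself is chunked and routed.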

 

 

 



Thanks Stewart, that's good to know. We'll try that out soon!


S3 is an object storage, not block storage like a traditional file server. There are a number of o


One other thing to consider: a package like ExpanDrive could be helpful. Behind the scenes it still pulls the file down and then simulates direct access to it, but it does manage the temporary file space. FME will work just fine on any data accessed via the simulated drive that ExpanDrive (or any tool like it) serves up.

 

 

