Solved

Speeding up a workspace runner



Hi folks,

This is my first post here, so apologies if this is a basic question. I have a quick one about a workspace runner operation I am using. The project works as intended, but I'd like to make it more efficient.

It takes two files: the workspace runner passes file A to a workspace, which matches variables in A against a static file, file B, and then writes the result to file C. This all makes sense and works beautifully.

However, it would be nice if I could read file B once, as opposed to every time the workspace runner fires off a process. It's a static file and quite big, and I have 17,000+ file As to process, so loading file B for each of them is a massive drain on time.

Is it possible to 'preload' a file for a workspace runner? I've looked everywhere but haven't found anything that leads to an answer.

Best answer by erik_jan 26 January 2017, 17:47


14 replies


Could you read file B and write it out as an FFS file? FFS (FME Feature Store) is FME's internal file format and is typically faster to read than other formats.

Create a (temporary) workspace that reads file B and writes it with the FFS writer. Then use the FFS file in your process instead of the original file B.


No, you can't, unfortunately.

But you should consider the format of file B: some formats are an order of magnitude faster to read than others, so it can make a big difference. You could try pre-processing file B into the native FFS format and see if that helps.

If that doesn't improve the performance, a more general analysis of your workflow is in order. For instance, if you're repeatedly using a FeatureMerger on large datasets, you might benefit from first loading your data into a relational database and using SQL join queries on indexed fields (either as a view or from an SQLExecutor/SQLCreator). It can make a significant difference.
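
To make the database idea concrete, here is a minimal sketch in plain Python using the standard-library sqlite3 module rather than FME itself. The table names (a, b), the column names (key, label, payload) and the sample rows are all made up for illustration; inside FME the equivalent query would be issued from something like an SQLExecutor. It loads B once, indexes the join field, and then joins a small A against it.

```python
import sqlite3

# Hypothetical stand-ins for file B (large, static) and one small file A.
file_b_rows = [(f"K{i:06d}", f"value-{i}") for i in range(100_000)]
file_a_rows = [("K000042", "a-first"), ("K099999", "a-second"), ("MISSING", "a-third")]

# ":memory:" keeps the demo self-contained; a file path would persist B between runs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE b (key TEXT, payload TEXT)")
conn.executemany("INSERT INTO b VALUES (?, ?)", file_b_rows)
# The index on the join field is what keeps each lookup cheap.
conn.execute("CREATE INDEX idx_b_key ON b (key)")

conn.execute("CREATE TEMP TABLE a (key TEXT, label TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?)", file_a_rows)

# Inner join: only A records that have a match in B come back.
matched = conn.execute(
    "SELECT a.key, a.label, b.payload "
    "FROM a JOIN b ON b.key = a.key "
    "ORDER BY a.key"
).fetchall()
print(matched)
```

The point of the sketch is only the shape of the approach: B is loaded and indexed once, and every later match touches just the rows it needs.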


In addition, you might want to look at the FeatureReader transformer.

That transformer allows you to read only the part of file B that you need for a given file A (and potentially not the complete file B).


I can feel a blog post coming on from Mark Ireland about memoisation of workspaces via FFS or SQLite files.....


Thanks, I'll give this a go. I started out reading file B as a SHP and realised that was a silly thing to do; when I switched to a CSV it was a lot faster. I'll try FFS and report back.

Thanks, I'll bear this in mind for the future. I may change my process if this becomes something we do more than once a year.


Thanks, will do.

Well... I think it partly depends on what the workspace is doing. You say it "matches variables in A to another static file, file B" - does that mean a join on an attribute key? Maybe with the FeatureMerger? And how many features are there? Are there fewer features in A than B? E.g. 1,000 features in A are matched against 1,000,000 in B?

If so you could replace the FeatureMerger with a Joiner transformer. You'd have to put B into a database format of some sort, even just temporarily, but then you wouldn't have to read all 1,000,000 features from B just to match 1,000 of them.

Similarly you could use a FeatureReader or SQLExecutor or any other transformer that could retrieve selected features from dataset B. So that's one technique: reading B the same number of times but reducing the amount of data you need to read from it.

That's the first law of performance, btw. Performance is defined as "useful work" carried out, so if you're reading data that isn't used, it's not useful work!
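
As a rough illustration of that first technique outside FME, the sketch below fetches only the B rows whose key appears in A, instead of reading all of B. It is plain Python with the standard-library sqlite3 module; the table b, its columns, the helper name fetch_matching and the sample keys are all assumptions invented for the example, not anything a FeatureReader or Joiner exposes.

```python
import sqlite3

def fetch_matching(conn, keys, chunk_size=500):
    """Return {key: payload} for only the requested keys, never the whole table."""
    found = {}
    keys = list(keys)
    # Chunk the keys so the number of bound parameters per query stays small
    # (SQLite caps it, and the exact limit depends on the build).
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        query = f"SELECT key, payload FROM b WHERE key IN ({placeholders})"
        found.update(conn.execute(query, chunk).fetchall())
    return found

# Hypothetical stand-in for dataset B, already loaded into a database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE b (key TEXT PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO b VALUES (?, ?)",
    ((f"K{i:05d}", f"payload-{i}") for i in range(50_000)),
)

# The handful of keys found in one file A; "NOPE" has no match in B.
a_keys = ["K00010", "K49999", "NOPE"]
result = fetch_matching(conn, a_keys)
print(result)
```

Because key is the primary key, each lookup goes through an index, so the work done scales with the size of A rather than the size of B.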

Anyway, to read B only once, I think you would need to have just one run of the workspace and load all of the data from all of the A files at one time. Basically one big process. How many features are in each A file? Are they non-spatial or very complex geometry? That will be the big issue.

I'm assuming you have your current setup - to read each A file separately - because combined there is too much data. But if not, and you are only doing it as a quick way to batch process data, there are other solutions. Consider reading all of the A files at once, then using a Dataset Fanout on the output to split them back out again. Group-by parameters in transformers can help keep each A file separate if necessary.

So that's a second technique: reading all of the A files at once, so you only need to read B once.
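
Here is a small sketch of that second technique in plain Python, as an analogy only (it is not FME code). It assumes, purely for illustration, that both A and B are CSV files with a key column and that B also has a payload column; the function names load_lookup and process_all are made up. B is read once into a dictionary, every A file is processed in the same run, and one output file is written per input file, which plays the role of the fanout.

```python
import csv
import tempfile
from pathlib import Path

def load_lookup(b_path):
    """Read file B once into a dict keyed on the join attribute."""
    with open(b_path, newline="") as fh:
        return {row["key"]: row["payload"] for row in csv.DictReader(fh)}

def process_all(a_dir, b_path, out_dir):
    """Match every A file against B in a single run, one output file per A file."""
    lookup = load_lookup(b_path)  # B is read exactly once
    out_dir.mkdir(exist_ok=True)
    written = []
    for a_path in sorted(a_dir.glob("*.csv")):
        out_path = out_dir / a_path.name  # "fan out" by source file name
        with open(a_path, newline="") as src, open(out_path, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(["key", "payload"])
            for row in csv.DictReader(src):
                if row["key"] in lookup:
                    writer.writerow([row["key"], lookup[row["key"]]])
        written.append(out_path)
    return written

# Demo with throwaway files standing in for B and two A files.
work = Path(tempfile.mkdtemp())
b_path = work / "b.csv"
b_path.write_text("key,payload\nk1,alpha\nk2,beta\nk3,gamma\n")
a_dir = work / "a"
a_dir.mkdir()
(a_dir / "a1.csv").write_text("key\nk1\nk9\n")
(a_dir / "a2.csv").write_text("key\nk3\n")

outputs = process_all(a_dir, b_path, work / "c")
print([p.name for p in outputs])
```

Whether this is practical depends on the same question raised above: all of B has to fit in memory for the duration of the run.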

Finally, and this is where my imagination starts to exceed my knowledge, I wonder if you can set up a continuously running workspace. That workspace reads file B and then receives features from all of the file As one at a time. It keeps running as long as you are passing file A features to it.

We do have transformers like that - TCPIPReceiver, SQSReceiver, etc. I'm partly unsure because I don't know how a workspace like that deals with a reader. I assume it reads the data once and then - if you have a group-based transformer - holds that data for processing against any incoming feature, but I don't know for sure. Also, of course, I don't really know what action you are carrying out or how many features you have, or how often you run this process (because this would take a lot of setting up, but could save you lots of time if you repeat the process daily).

So, that's the third technique: a continuous process that just listens for new A features to compare against B.
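
The general shape of such a continuous process can be sketched in plain Python. This is only an analogy under stated assumptions: it says nothing about how TCPIPReceiver or SQSReceiver actually behave, and the names matcher, inbox and outbox, as well as the sample data, are invented for the example. A single long-running worker holds B in memory and matches each batch of A keys as it arrives, until it is told there is no more data.

```python
import queue
import threading

def matcher(lookup, inbox, outbox):
    """Long-running worker: B is already in memory, A batches arrive one at a time."""
    while True:
        batch = inbox.get()
        if batch is None:  # sentinel: no more A data, so shut down
            break
        name, keys = batch
        outbox.put((name, {k: lookup[k] for k in keys if k in lookup}))

# Stand-in for B, loaded once before any A data arrives.
lookup = {f"k{i}": i * i for i in range(1000)}

inbox, outbox = queue.Queue(), queue.Queue()
worker = threading.Thread(target=matcher, args=(lookup, inbox, outbox))
worker.start()

# Each put() plays the role of one file A being passed to the running process.
inbox.put(("a1", ["k2", "k3", "zzz"]))
inbox.put(("a2", ["k10"]))
inbox.put(None)
worker.join()

results = [outbox.get() for _ in range(2)]
print(results)
```

With one worker and a first-in, first-out queue, results come back in the order the batches were sent.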

There are some other things like this, but they're fairly specialized; for example if you are doing something with a DEM, the SurfaceModeller can save its internal workings as a file for re-use at a later point. And I think there are some web formats/transformers that will cache data locally, so you don't have to keep rereading it from a remote source again and again. But I don't know if they will help you here.

Anyway, I hope something here helps. If you can give us a few more details about what you are doing exactly, then it might be that there's something we can do to help. Other than that, as others have said, use a good, fast format. I seem to recall trying out different formats at one time, and MicroStation v8 was the fastest, but whether that's still true, and whether I included a comparison with FFS, I don't know.

If I have any more bright (or even not-so-bright) ideas, I'll let you know,

Regards

Mark

PS: Thanks for letting me know about this one Bruce. Not quite a blog, but getting on for blog length!

This is a good question, by the way. The sort that I would think other folk here should vote up as being interesting (hint, hint). I see lots of votes for answers, but few for questions, which is a shame because I think it encourages quality questions.

Okay Mark, I got the hint and followed your advice.


Thanks for taking the time to write such a detailed response, @Mark2AtSafe!

I took your first two paragraphs (as well as the others' comments) on board. You guessed the parameters and scale of my problem correctly: each file A has anywhere from 10 to 100 records, and file B has close to 500k records.

I created a simplified CSV version of file B and used the Joiner, and had much better results. It's zipping along quite fast now.

I will definitely explore the use of a database if this becomes a regular task. Right now I'm happy to let my 18,000 files take 24 hours to run, as this is a once-a-year task at the moment.

Reading all the file As at once was my initial plan, but they are XML and contain many nodes that I want to extract, so it gets a bit tricky. I found the system just couldn't handle my ambition, although I put this down to the machine I have FME on being under-resourced. This is how I got to my current workspace runner setup!

Thanks once again!

Omar
Indeed a very good question. Upvote!

How are you performing the match? If you store file B in a database, you can use a Joiner to perform the match, and the speed will probably increase a lot!

Great suggestion indeed!