
Hello,

I wonder if anyone has experience in both FME and Pentaho. I have several years of experience with FME, but know nothing about Pentaho.

Currently our team of Pentaho users is considering a switch to FME. I have shown them the possibilities, and when it comes to functionality, FME seems very promising! However, Pentaho seems to be 6 times faster than FME... This might be a showstopper and we want to know what is causing it. Does anyone have experience getting better performance writing to (or reading from) Oracle datasets? I have never felt that FME is slow in reading from or writing to Oracle, but when I saw the speed of Pentaho, I was surprised.

We did a test of an ETL script that reads about 912,000 records from an Oracle database and writes the result directly to another Oracle table. In short: SQLCreator (type Oracle) --> FeatureWriter (type Oracle). Duration in FME: 5 - 6 minutes; duration in Pentaho: about 45 seconds. We were reading from and writing to the very same databases, using the very same SQL statement. Pentaho and FME were both running on comparable hardware (VMs), same OS, same CPUs, same amount of RAM, same disk space (and speed), same network, etc.

After this first test I ran several tests reading from and writing to different file formats (JDBC (Oracle), ESRI File Geodatabase, FFS) and checking various settings in FME Desktop. It didn't surprise me that the fastest process was the FFS --> FFS translation, timed at 2 minutes 0 seconds. The translation from Oracle Non-Spatial to Oracle Non-Spatial took 5 minutes 16 seconds on FME Desktop and about 4 minutes 9 seconds on FME Server. Performance via JDBC dropped to 13 - 14 minutes.

Is FME simply 'slow' compared to Pentaho - which I cannot imagine - or am I simply missing some important 'tuning' settings for Oracle? Can anybody help me?

Thanks in advance,

Frank van Doorne

Hoi @frankvandoorne,

Don't know much about Pentaho, but I do (a little) about FME... a possible explanation is that FME reads all the features into memory, and it could be that Pentaho does not do that.

Hopefully somebody with more knowledge of both applications can chime in..

Itay


I am very interested to see how others have tuned FME for greater performance as well. I have found it to be slow when reading an entire database, in my case reading from an ESRI geodatabase to a PostGIS database. But if I use the FeatureReader transformer with a spatial filter rather than the direct reader, it seems to work much quicker.


Hi @frankvandoorne, thanks for sharing.

What version of Oracle and FME are we talking about?

It seems like you have done FME justice by considering the usual suspect factors (disk/RAM/CPU/OS/hardware/network). Not much I can add there; it seems like none of these are the cause.

I am wondering how Pentaho works with the data. Is it actually reading the data back to itself (to the local machine it resides on), or is it possibly extracting the data directly from one database to the other, in other words, using the database resources? As an example, if I used FME to read the 912,000 records and write them to a CSV file, this would take more time than asking Oracle to output the data to a CSV file and then having the other database ingest that CSV file. In fact, when working with point cloud data, using CSV files, you can load GBs of data in no time.

So, in this case I am also wondering: is Pentaho using a database link between the two Oracle databases? If I were to ask FME to send SQL to the source database that utilizes a database link to the other system, it would also be a lot faster.
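For example (just a sketch, with hypothetical table and link names), a single statement like this lets Oracle move the rows itself, so nothing travels through the ETL client:

  -- src_db_link and the table names are placeholders
  INSERT /*+ APPEND */ INTO target_copy
  SELECT * FROM source_table@src_db_link;
  COMMIT;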

Other things I might consider are the block read size on the Reader (try tweaking this) and the Writer settings for Bulk Write Size and Features Per Transaction; both are found in the Advanced section. Try separating the reads and writes to maximize performance.

Perhaps Pentaho has been able to find the sweet spot for these databases, as it is the incumbent data integration tool, or it may not be reading the data in the same way FME is (full retrieval, full transfer).

You might want to see if you can approach the Oracle DBA to monitor/trace the sessions of both tools... this would certainly shed light on what SQL is being sent to the database.
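For instance (a sketch only; 123 and 456 are placeholder SID and SERIAL# values), the DBA could switch on a SQL trace for the tool's session with Oracle's DBMS_MONITOR package:

  BEGIN
    DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 123, serial_num => 456,
                                      waits => TRUE, binds => TRUE);
  END;
  /
  -- run the ETL job, then:
  BEGIN
    DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => 123, serial_num => 456);
  END;
  /

The resulting trace file (formatted with tkprof) shows exactly what SQL each tool sends and where the time goes.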

I am very curious to hear more about what you learn.

Cheers,

 

Steve

I am very interested to see how others have tuned FME for greater performance as well. I have found it to be slow when reading an entire database, in my case reading from an ESRI geodatabase to a PostGIS database. But if I use the FeatureReader transformer with a spatial filter rather than the direct reader, it seems to work much quicker.

Hi @deanhowell2009, Reading from Esri Geodatabase? Is this using the Geodatabase Reader? If so, yes, we go through ArcObjects and it does seem to be slow compared to other formats with the same data volume.

In years gone past, the ArcSDE30 Reader & Writer were always much more performant, but they are no longer supported. I don't know exactly what happens with the Geodatabase Reader/Writer, but there seems to be a lot more going on at the database/client than just reading and writing when using those formats.

Could you expand a little on your use of the spatial filter with the FeatureReader? Are you passing several polygons and the FeatureReader returns the data within or are you just passing one large polygon that returns all the records?


Hi @deanhowell2009, Reading from Esri Geodatabase? Is this using the Geodatabase Reader? If so, yes, we go through ArcObjects and it does seem to be slow compared to other formats with the same data volume.

In years gone past, the ArcSDE30 Reader & Writer were always much more performant, but they are no longer supported. I don't know exactly what happens with the Geodatabase Reader/Writer, but there seems to be a lot more going on at the database/client than just reading and writing when using those formats.

Could you expand a little on your use of the spatial filter with the FeatureReader? Are you passing several polygons and the FeatureReader returns the data within or are you just passing one large polygon that returns all the records?

Thanks for your reply @steveatsafe, in this instance I am using features to refine the data searched and returned. This approach, rather than loading in the whole dataset and then searching, really speeds up the whole process.


It seems that Pentaho has a bulk write mode: https://wiki.pentaho.com/display/EAI/Oracle+Bulk+Loader I am guessing that this is what makes the Oracle load fast. If you can create a similar load file with FME and use the Oracle SQL Loader, the performance might be better: https://www.safe.com/integrate/oracle-sql-loader/ @frankvandoorne

This is similar to the CSV approach mentioned by others for loading data quickly into a database. Postgres + CSV is likewise much faster than using the Postgres writer. But you still create the files and start the loading with FME!
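As an illustration of the Postgres + CSV route (a sketch only; the table and file path are placeholders), the load itself is a single COPY statement once FME has written the file:

  -- COPY reads a file on the database server; from a client machine, psql's \copy does the same
  COPY my_table (id, name, updated_at)
  FROM '/data/export.csv'
  WITH (FORMAT csv, HEADER true);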


Hoi @frankvandoorne,

Don't know much about Pentaho, but I do (a little) about FME... a possible explanation is that FME reads all the features into memory, and it could be that Pentaho does not do that.

Hopefully somebody with more knowledge of both applications can chime in..

Itay

Hello @itay,

I thought about that explanation; it seems the most probable reason. Unfortunately, our organization lacks under-the-hood technical knowledge of Pentaho.

I'll do some more research, based on reactions of @steveatsafe and @sigtill.

Thanks, Frank


Hi @frankvandoorne, thanks for sharing.

What version of Oracle and FME are we talking about?

It seems like you have done FME justice by considering the usual suspect factors (disk/RAM/CPU/OS/hardware/network). Not much I can add there; it seems like none of these are the cause.

I am wondering how Pentaho works with the data. Is it actually reading the data back to itself (to the local machine it resides on), or is it possibly extracting the data directly from one database to the other, in other words, using the database resources? As an example, if I used FME to read the 912,000 records and write them to a CSV file, this would take more time than asking Oracle to output the data to a CSV file and then having the other database ingest that CSV file. In fact, when working with point cloud data, using CSV files, you can load GBs of data in no time.

So, in this case I am also wondering: is Pentaho using a database link between the two Oracle databases? If I were to ask FME to send SQL to the source database that utilizes a database link to the other system, it would also be a lot faster.

Other things I might consider are the block read size on the Reader (try tweaking this) and the Writer settings for Bulk Write Size and Features Per Transaction; both are found in the Advanced section. Try separating the reads and writes to maximize performance.

Perhaps Pentaho has been able to find the sweet spot for these databases, as it is the incumbent data integration tool, or it may not be reading the data in the same way FME is (full retrieval, full transfer).

You might want to see if you can approach the Oracle DBA to monitor/trace the sessions of both tools... this would certainly shed light on what SQL is being sent to the database.

I am very curious to hear more about what you learn.

Cheers,

 

Steve

Hi @steveatsafe,

Thanks for your response. For what it is worth, I used the latest build of FME 2018 Desktop and an Oracle 12 database. The exact question is indeed 'how Pentaho works with the data'. I'll follow your advice and approach our DBA!!! Why didn't I think of it myself 😉.

I'll let you know, when I have learned something more.

Frank


It seems that Pentaho has a bulk write mode: https://wiki.pentaho.com/display/EAI/Oracle+Bulk+Loader I am guessing that this is what makes the Oracle load fast. If you can create a similar load file with FME and use the Oracle SQL Loader, the performance might be better: https://www.safe.com/integrate/oracle-sql-loader/ @frankvandoorne

This is similar to the CSV approach mentioned by others for loading data quickly into a database. Postgres + CSV is likewise much faster than using the Postgres writer. But you still create the files and start the loading with FME!

This seems to be very useful information! Thanks a lot.


I am very interested to see how others have tuned FME for greater performance as well. I have found it to be slow when reading an entire database, in my case reading from an ESRI geodatabase to a PostGIS database. But if I use the FeatureReader transformer with a spatial filter rather than the direct reader, it seems to work much quicker.

@deanhowell2009, you wrote that you used a spatial filter with the FeatureReader. Did you also use a spatial filter with the direct Reader? Or, in other words, was the same number of records retrieved? Using a spatial filter dramatically speeds up retrieval.


@deanhowell2009, you wrote that you used a spatial filter with the FeatureReader. Did you also use a spatial filter with the direct Reader? Or, in other words, was the same number of records retrieved? Using a spatial filter dramatically speeds up retrieval.

Hello @frankvandoorne, when I started the model I was reading in three main datasets: one had close to 1 million records, another 600,000, and the final one close to 300,000. Reading directly was taking minutes as it processed each record, but changing to a spatial filter took seconds to achieve the same result.


Hello @itay,

I thought about that explanation; it seems the most probable reason. Unfortunately, our organization lacks under-the-hood technical knowledge of Pentaho.

I'll do some more research, based on reactions of @steveatsafe and @sigtill.

Thanks, Frank

Hoi @frankvandoorne,

The suggestion of @sigtill is very interesting (mind you the SQL Loader only works with non-spatial data) and @steveatsafe has some great pointers.

I'll keep an eye on this topic since I am very interested to know what you find.

Groet,

Itay

 


Hello @frankvandoorne, when I started the model I was reading in three main datasets: one had close to 1 million records, another 600,000, and the final one close to 300,000. Reading directly was taking minutes as it processed each record, but changing to a spatial filter took seconds to achieve the same result.

Hello @deanhowell2009, that seems logical to me. I think you are retrieving records in two different ways. When you use a normal Reader WITHOUT any spatial filtering and compare that with a FeatureReader WITH a spatial filter, the latter will be much faster. You can - if supported by the specific Reader - apply a spatial filter on a normal Reader by setting the 'bounding box coordinates'. But the FeatureReader is better, since you can use any geometry as a spatial filter, not only a bounding box.


Hello,

I thought it would be polite to inform you about the following... a sort of closing of this case...

After some more investigation, I am convinced that FME could do the job for our BI department. Using FME, however, would require a fundamentally different way of thinking about and processing data. FME and Pentaho are 'just two different worlds'. A different architecture requires a different approach from the beginning.

It all comes down to 'parallel processing' and dividing the workload. No big deal of course. For various reasons (legal, corporate IT policy), we decided to use another ETL tool for BI that we already have. So I am NOT going to invest more time comparing "theoretical performance issues". FME suits our GIS demands very well! ;-)

Greetz,

Frank


Even after 3 years this thread seems relevant to me. I have done tons of Pentaho integrations myself, and recently, in my current role, I am implementing FME Server (I don't know anything about FME Desktop). I have implemented FME Server in my org using Docker on AWS. I have a very simple use case: read data from Denodo and load it into an Oracle table. I was going to do that in Pentaho and thought, wait, why not try my hand at FME. With the help of my SME on FME, I built a workspace that reads data from Denodo (nothing fancy): 1,000 rows (features, as FME calls them) with 10 columns. Loading this data into Oracle takes a painful 12 minutes in FME. I built the same flow in Pentaho and it was done in a few seconds.

 

(Screenshots: 2022-04-01_14-48-05, 2022-04-01_14-46-34.) Now I am not sure if I am missing something basic with FME Workbench, but I did try to play with Bulk Write Size and Features Per Transaction without any success. And my data has nothing to do with spatial; it is plain old int, date, and string data.

 

I hope I am missing something here, because if we have such a huge difference in performance we have to take a closer look at alternatives.


Based on your screenshots, I understand you are using FME Desktop (not Server) and Pentaho.

It all comes down to where the data is located and where the processes are running; they should be close to each other.

FME: CPU time was only 3.5 seconds and user time 47 seconds, so the rest of the time was consumed 'outside' FME: transferring data over the network or Internet, waiting for the database. When I work on our local network, loading 1,000 rows to Oracle costs me only a few seconds in FME too.


Based on your screenshots, I understand you are using FME Desktop (not Server) and Pentaho.

It all comes down to where the data is located and where the processes are running; they should be close to each other.

FME: CPU time was only 3.5 seconds and user time 47 seconds, so the rest of the time was consumed 'outside' FME: transferring data over the network or Internet, waiting for the database. When I work on our local network, loading 1,000 rows to Oracle costs me only a few seconds in FME too.

Agreed, in my test case both tests were performed under similar conditions: same source, same destination, and on the same host. FME took 12 minutes because it was processing 10 rows at a time when loading data into Oracle.


Agreed, in my test case both tests were performed under similar conditions: same source, same destination, and on the same host. FME took 12 minutes because it was processing 10 rows at a time when loading data into Oracle.

Strange. Not having the slightest knowledge of Pentaho, I have a lot of questions:

  • Did you actually install FME Desktop on the same machine as Pentaho?
  • Does the Oracle database have a lot of (business) logic, check constraints, etc. that take a lot of time?
  • Maybe Pentaho disables indexes automatically and rebuilds them without the user/developer noticing?
  • In FME you can execute SQL statements before/after the Writer to remove/rebuild indexes or disable/enable constraints (see the sketch below).
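For example (a sketch only; the index and constraint names are placeholders), the before/after SQL could look like:

  -- before the load
  ALTER TABLE target_table DISABLE CONSTRAINT target_table_fk1;
  ALTER INDEX target_table_ix1 UNUSABLE;

  -- after the load
  ALTER INDEX target_table_ix1 REBUILD;
  ALTER TABLE target_table ENABLE CONSTRAINT target_table_fk1;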

Maybe there are Oracle Client issues?

 

 

 

