Skip to main content

Note:
The suggested idea Is created as a poc and are used with combinations of fme transformers and it works fast.   To summarize
1. Create temp table
2. Bulk insert into temp table
3. SQL queries between temp table and other tables within the server.
4. remove temp table

 



Problem Statement:

When FME handles large datasets and performs joins with features from FME flow with a databaes table, particularly with databases containing many attributes, the process can become time-consuming. This is due to the row-by-row attribute comparison FME typically employs, and the potential communication overhead when interacting with the database.

Current State:

FME handles joins within its own environment. While functional, this approach can lead to performance bottlenecks when dealing with substantial data volumes and complex join conditions, especially when interacting with external databases. This one by one comparision applies for both the existing databas joiner and sql executor ( SQL executor is fast if all the data is within the sql server, not if you want to join data from the fme flow with sql, then its one by one)

Proposed Feature:

I propose the introduction of a new transformer, proposed name: the " Large Dataset Join ", which would seamlessly integrate with various database systems. This transformer would leverage the database's native capabilities for handling joins, leading to significant performance.

Here's how it would work:

  1. Bulk Insert into Temporary Table: The transformer would efficiently transfer large datasets from FME to a temporary table within the target database using bulk insert operations. This feature already exist in FME and can be re-used.
  2. Database-Side Join: The join operation itself would be executed directly within the database, utilizing the temporary table and the target table. Users would have the flexibility to specify join conditions using either SQL queries or a visual interface.
  3. Retrieve Joined Data: Only the necessary joined data would be retrieved back into FME, minimizing data transfer overhead.
  4. Automatic Temporary Table Management: The transformer would handle the creation and cleanup of the temporary table, streamlining the process for the user.

Key Benefits:

  • Significant performance improvement for large dataset joins, particularly with databases containing many attributes.
  • Reduced processing burden on FME, freeing up resources for other tasks.
  • Streamlined workflow for leveraging the database's optimized join capabilities.

 

Interesting idea, but conceptually it sounds a lot like what the InlineQuerier does. Did you run a benchmark against that?


@redgeographics  thanks for your answer.

Well conceptually yes. Inlinequirer does not read any databases though .But it exists due to speed of course. So my suggestion/idea is a hybrid between inlinequirer and sql executor I would say. 


For us we boosted the time really much. We are using alot of data. I would say from 10 minutes to 2-3 minutes.  It was a major improvement 


@ronnie.utter if you have a routine that works for you, why does it need to be implemented as a transformer?

How I would do it: get the keys from features in the workspace, and assemble them in batches of 1000, concatenated. The query the database to get the joinees using SqlExecutor (in batches, instead of one by one). The use a FeatureJoiner to join joiner/joinee.

An argument to do the joining in the workspace instead of your proposed method: catch errors, mismatched joined etc


@nordpil 

 

 

Thank you for your suggestion! I think your approach with batching keys and using SqlExecutor and FeatureJoiner is a solid method for improving performance without needing new transformers. It offers flexibility for handling errors and mismatches directly within FME, which is valuable for users needing that level of control.

 

However, while I have a working solution myself, creating a custom transformer would make it easier for others to adopt this method without needing to manually set up batch processing, SQL queries, and joins. A dedicated transformer would streamline the process, making it more user-friendly and accessible to a broader audience, especially those who may not be as familiar with the technical aspects with queries. 

 

In short, both approaches have their merits—yours offers flexibility, while mine prioritizes performance and ease of use. Perhaps a balance between the two could be ideal!