
I am researching options for a GIS data catalog and data lineage system. There are some GIS data catalog tools out there (e.g., Voyager), but they lack a way to store ETL lineage information. For example, I'd want to record that the Streets feature class in my Oracle geodatabase is synced to an AGOL feature layer named "City Streets" every night at 3 AM.

 

Some of the data lineage tools on the market can connect to ETL applications (Informatica, Azure Data Factory, etc.) and scan them to get source/target information from all the tasks configured in the system. I haven't seen Safe/FME show up as being supported by any of these products. So I'm curious if anyone's come across a product that can scan for this stuff. Alternatively there's always manually populating these systems by reviewing all the FME Server job logs and investigating source/targets in the workspaces.

Hi, a bit random to message a year later, but I was wondering if you got any further on scanning FME data lineage, other than via manual population.

 

I have the same use case and am struggling to find any answers beyond manual population.

 

Did you end up building a manual population process?


Hello! Also hoping anyone has anything on this. I'm working on a solution that reads FMW files to determine data sources and creates lineage to send to the Apache Atlas API for our data catalog. However, this requires custom development, and all organizational FMWs need to be stored in an accessible place. If there is anything from Safe on a way to catalog all workbenches, that would be a huge help.
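For anyone attempting the same thing, here's a rough sketch of how the FMW-scanning part could look. It assumes reader/writer datasets show up in the workspace text as `SourceDataset_*` / `DestDataset_*` macros, which is what I've seen in FMWs I've opened in a text editor, but the patterns will need adjusting for your files:

```python
import re
from pathlib import Path

# Assumed patterns: .fmw files are plain text and declare reader/writer
# datasets as DEFAULT_MACRO SourceDataset_* / DestDataset_* lines.
# Verify against your own workspaces before relying on this.
SOURCE_RE = re.compile(r'^DEFAULT_MACRO\s+(SourceDataset_\S+)\s+(.*)$', re.M)
DEST_RE = re.compile(r'^DEFAULT_MACRO\s+(DestDataset_\S+)\s+(.*)$', re.M)

def extract_lineage(fmw_text: str) -> dict:
    """Return the source and destination datasets referenced in one workspace."""
    return {
        "sources": {m.group(1): m.group(2).strip() for m in SOURCE_RE.finditer(fmw_text)},
        "targets": {m.group(1): m.group(2).strip() for m in DEST_RE.finditer(fmw_text)},
    }

def scan_folder(folder: str) -> dict:
    """Scan every .fmw under a folder and map workspace filename -> lineage."""
    return {p.name: extract_lineage(p.read_text(errors="ignore"))
            for p in Path(folder).rglob("*.fmw")}
```

The output dict is then easy to reshape into whatever entity format your catalog's API expects.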


I never made any progress on this, and it's fallen by the wayside as a priority. Never heard anything from Safe, but I haven't specifically asked our reps yet. I haven't heard of a lineage/catalog tool that comes preconfigured to scan something like FME Server and harvest lineage info.

 

Interesting that you're reading FMW files. For us, one issue is that some workspaces on our FME Server are dynamic. For those, different scheduled runs of the same workspace will move different datasets. So at 8:01 AM, Workspace 1 reads dataset A and writes to dataset B. Then at 8:02 AM, Workspace 1 reads dataset C and writes to dataset D. The nuance of sources and targets therefore lives in FME Server's scheduling system, not in the workspace itself.


I've raised a support ticket with Safe to see if this is something they've looked into before, couldn't find anything in other documentation on this portal or online in general, I'll let you know when they respond!

 

The dynamic workspace problem isn't something I considered, and I wonder (fear) that's something I'll come across here as well. My first thought is to create a log file from within the workspace, so whenever it runs it records the start time, each dataset it reads from, each dataset it writes to, and the end time. Then a script reads that log file each day to see which datasets are connected where and enters that information into the data catalog via its API. That would require a catalog that accepts manual entries, which Apache Atlas and Microsoft Purview can do from my research, but I'm not sure how that would fit into any other data catalog system.
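To make the log-and-harvest idea concrete, here's a sketch. The log line format below is invented for illustration (use whatever your workspaces actually write), and the Atlas `qualifiedName` scheme is made up, but the `Process` entity shape with `inputs`/`outputs` object-ID references matches the Atlas v2 entity API:

```python
# Assumed log format, one pipe-delimited line per event:
#   RUN_START|Workspace1|2024-05-01T08:01:00
#   READ|Workspace1|dataset_A
#   WRITE|Workspace1|dataset_B
#   RUN_END|Workspace1|2024-05-01T08:01:30

def parse_run_log(lines):
    """Group READ/WRITE events into one lineage record per workspace run."""
    runs, current = [], {}
    for line in lines:
        event, workspace, value = line.strip().split("|")
        if event == "RUN_START":
            current[workspace] = {"workspace": workspace, "start": value,
                                  "inputs": [], "outputs": []}
        elif event == "READ":
            current[workspace]["inputs"].append(value)
        elif event == "WRITE":
            current[workspace]["outputs"].append(value)
        elif event == "RUN_END":
            run = current.pop(workspace)
            run["end"] = value
            runs.append(run)
    return runs

def to_atlas_process(run):
    """Shape one run as an Atlas v2 'Process' entity (qualifiedName scheme is invented)."""
    return {"entity": {"typeName": "Process", "attributes": {
        "qualifiedName": f"fme://{run['workspace']}@{run['start']}",
        "name": run["workspace"],
        "inputs": [{"typeName": "DataSet", "uniqueAttributes": {"qualifiedName": d}}
                   for d in run["inputs"]],
        "outputs": [{"typeName": "DataSet", "uniqueAttributes": {"qualifiedName": d}}
                    for d in run["outputs"]]}}}
```

The daily script would then POST each payload to Atlas's `/api/atlas/v2/entity` endpoint. Because each run is its own record, this also handles the dynamic-workspace case where the same workspace moves different datasets on different runs.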


Hi, great to see this issue coming up a bit more in the community.

I have had a chat with Safe Technical staff while running a trial of FME Form. My own finding was that the Workspace Reader is the most likely candidate for extracting the structure of Workspaces etc, essentially so you can build a CSV/JSON file of the structure you need to represent in our Data Catalog product.

Safe Technical Support stated that they also felt the Workspace Reader is probably the best fit for our needs, rather than the Flow REST APIs.

 

As mentioned by others, there are not many (if any?) applications that can scan FME, so we are currently faced with developing a custom solution. In the case of Informatica Cloud Data Governance and Catalog (CDGC), there is a mechanism to build a custom scanner. Essentially, this means you define a metadata model representing the FME structure (Workspace, Reader, Writer, Feature Type, etc.), develop another process in FME that extracts the workspace components, then map that output to the data model you built earlier. This can then run as a scheduled process that re-populates the model on a regular basis.
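The "map extracted components to your metadata model" step is mostly a flattening exercise. Here's a minimal sketch: the column names and the workspace dict shape are invented for illustration, and should be matched to whatever model you actually define in your catalog's custom-scanner import:

```python
import csv
import io

def to_model_rows(workspace):
    """Yield one tabular row per reader/writer component of a workspace dict."""
    for reader in workspace.get("readers", []):
        yield [workspace["name"], "Reader", reader["format"], reader["dataset"]]
    for writer in workspace.get("writers", []):
        yield [workspace["name"], "Writer", writer["format"], writer["dataset"]]

def to_csv(workspaces):
    """Flatten a list of extracted workspaces into one CSV for catalog import."""
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(["workspace", "component_type", "format", "dataset"])
    for ws in workspaces:
        w.writerows(to_model_rows(ws))
    return buf.getvalue()
```

Scheduling this as an FME job that re-extracts and re-exports the CSV on a regular basis gives you the refresh loop described above.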



That's good to know. I wasn't aware of that reader. It seems like it just reads .fmw files, though, and doesn't have an option to read workspaces from an FME Server endpoint. The ones our analysts have loaded into Server are the only ones I'd consider enterprise/production and would want in a data catalog. Maybe there's a way to scan the backend of the server, since I'm assuming the .fmw files behind the services live there. And then the FME Server scheduler would be where the metadata for the data-sync frequency is stored.
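On the scheduler point: FME Flow's REST API v3 does expose a `/fmerest/v3/schedules` endpoint, so the sync-frequency metadata should be harvestable without touching the backend file system. A sketch below; the endpoint path and `fmetoken` auth header are real, but the exact field names in the response (`repository`, `workspace`, `cron`) are from memory, so verify them against your server's API docs:

```python
import json
from urllib.request import Request, urlopen

def fetch_schedules(host: str, token: str) -> dict:
    """GET the schedule list from an FME Flow (Server) instance."""
    req = Request(f"https://{host}/fmerest/v3/schedules",
                  headers={"Authorization": f"fmetoken token={token}",
                           "Accept": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

def sync_frequencies(schedules_json: dict) -> dict:
    """Map '<repository>/<workspace>' -> its cron/recurrence for the catalog."""
    return {f"{s.get('repository')}/{s.get('workspace')}":
            s.get("cron") or s.get("recurrence")
            for s in schedules_json.get("items", [])}
```

Joining this mapping with lineage extracted from the workspaces themselves would give you both the source/target edges and the "synced nightly at 3 AM" metadata the original post asked for.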

