FME's Python interface for feature attribute manipulation seems to be oriented mainly towards small datasets, since attributes can only be accessed field by field and feature by feature. Python has a very large data science ecosystem built around processing very large tables, so it's often practical to load data into a dataframe and run computations on the dataset as a whole, both to benefit from vectorization and other performance optimizations and to give people in the data science field their familiar tools.
Right now, loading features into a dataframe (I'm talking hundreds of columns and hundreds of thousands of rows, millions of cells) is both slow and very error-prone for a few reasons:
- Loss of schema information: getAttributeType() in the Python API can't return the full set of FME attribute types, so dynamic workspaces can't properly handle things like dates without a lot of extra work around schema detection and handling.
- Difficult-to-parse timestamp format: Python's standard library and most dataframe libraries can only parse fractional seconds up to 6 decimals, while FME goes up to 9 and DateTimeNow() notably produces 7. This is a difficult problem on its own, but time series are common in the data science field, so it's especially noticeable here (see the first sketch after this list).
- Attribute access is per-feature and field-by-field (slow): with no way to do bulk access on features, the data is accessed and converted value by value (save for lists, which can be read as a whole). If you have several million fields to read, that's a lot of slow Python code running in very tight loops (see the second sketch after this list).
- Null and missing values are distinct and need to be checked separately: getAttribute() only returns None for missing values. For a null value it returns an empty string, which can throw off dataframe libraries, so every value susceptible of being null also needs an isAttributeNull() check, adding to the overhead.
- Feature output is also per-feature: writing results back out is just as slow, and I believe it also blocks while waiting on downstream transformers.
- Using files to pass data between transformers is a lot to ask: you can use FeatureReader and FeatureWriter transformers to create temporary files so that Python could load and dump these features in one shot, but even if I could get it to work (I haven't), it wouldn't be worth the dozen extra transformers needed and all the visual clutter that creates.
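To make the timestamp point concrete, here is a minimal sketch; the timestamp value is made up, but it follows FME's yyyymmddhhmmss.f format with the 7 fractional digits that DateTimeNow() produces:

```python
from datetime import datetime

# Illustrative value only: an FME datetime with 7 fractional digits.
fme_stamp = "20240315123045.1234567"

try:
    datetime.strptime(fme_stamp, "%Y%m%d%H%M%S.%f")
except ValueError:
    # %f only accepts up to 6 digits, so parsing fails outright.
    # One lossy workaround: truncate to microsecond precision first.
    truncated = fme_stamp[:fme_stamp.index(".") + 7]
    parsed = datetime.strptime(truncated, "%Y%m%d%H%M%S.%f")
    print(parsed)  # 2024-03-15 12:30:45.123456
```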
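And here is roughly what the per-value access pattern looks like today inside a PythonCaller. The class follows the stock PythonCaller template, but the buffering strategy, the assumption of a uniform schema across features, and calling pyoutput() from close() are simplifications for the sketch:

```python
import fmeobjects
import pandas as pd

class FeatureProcessor(object):
    def __init__(self):
        self.columns = {}    # attribute name -> list of cell values
        self.features = []   # buffered features, re-emitted in close()

    def input(self, feature: fmeobjects.FMEFeature):
        # Every cell is fetched with its own Python call...
        for name in feature.getAllAttributeNames():
            # ...and needs its own null check, because getAttribute()
            # returns an empty string (not None) for null values.
            if feature.isAttributeNull(name):
                value = None
            else:
                value = feature.getAttribute(name)
            self.columns.setdefault(name, []).append(value)
        self.features.append(feature)

    def close(self):
        df = pd.DataFrame(self.columns)
        # ... vectorized work on df would go here ...
        for feature in self.features:
            self.pyoutput(feature)
```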
The easy way to resolve this, in my eyes, is simply to provide a new Python transformer (the titular DataframeTransformer) that handles the conversion of features to and from dataframes and presents the user's code with nothing but said dataframe.
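As a rough illustration of what I mean, user code in such a transformer could boil down to something like this; the process() entry point, its dataframe argument, and the column names are all hypothetical:

```python
# Entirely hypothetical DataframeTransformer user code, for illustration.
def process(df):
    # The whole feature stream arrives as one table, so the work is a
    # single vectorized expression instead of a per-feature loop.
    df["speed_kmh"] = df["distance_m"] / df["duration_s"] * 3.6
    return df
```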
There have been efforts lately to create a standard dataframe interchange protocol throughout the Python ecosystem, to encourage interoperability across libraries and avoid locking users into a specific one; Pandas supports both producing and consuming it. To avoid depending on Pandas and/or its APIs, fmeobjects could expose some hypothetical "FeatureDataframe" object and leave it up to users to load it into their dataframe library of choice.
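For example, any object implementing the protocol's __dataframe__() method can already be consumed by pandas 1.5+; below, a polars frame stands in for the hypothetical FeatureDataframe:

```python
# The interchange protocol in action between two real libraries; a
# hypothetical fmeobjects FeatureDataframe exposing __dataframe__()
# could be consumed the same way.
import pandas as pd
import polars as pl

pl_df = pl.DataFrame({"id": [1, 2], "name": ["a", "b"]})  # any producer
pd_df = pd.api.interchange.from_dataframe(pl_df)          # any consumer
```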