Open

Support for Apache Arrow for Python Caller

Related products: Transformers

raean
Contributor

It would be great if the PythonCaller supported the Apache Arrow in-memory data exchange format as input to the Python script, and allowed an Apache Arrow object to be returned by the script as well. There would be a number of advantages to this.

  1. Remove the need to install fmeobjects into my Python environment.
  2. Make the Python interface cleaner and easier to use: just convert the Apache Arrow object into your data frame. This would lower the barrier to using Python in workflows.
  3. Scripts could more easily be developed and tested in IDEs like VS Code and PyCharm, making script development far nicer.
  4. Arrow is supported by many libraries, e.g. TensorFlow, pandas, Polars, DuckDB, PySpark, etc.
  5. It would make it easier to embed machine learning models into workflows.
  6. It potentially opens up the integration of other languages into FME, because the Arrow format is language agnostic.

4 replies

vlroyrenn
Enthusiast
  • June 11, 2024

Related Idea:

There are many ways FME could go about this, and it’s unclear to me at this time what can be done without breaking compatibility with existing flows. Another alternative to passing Apache Arrow file handles/paths would be for FME to create some hypothetical FMEPythonFeatureTable object that’s compatible with the Dataframe Interchange Protocol, and pass that instead. This protocol defines a Python-level interface to a dataframe object that is deliberately very similar to Apache Arrow, but without actually depending on the library or any other specific one.


raean
Contributor
  • Author
  • June 12, 2024

I think changing the current PythonCaller would not be good. I did a bit more thinking about this, and you could have something like the InlineQuerier: maybe a new transformer, the PythonInLineQuerier, where you generate an Apache Arrow interface. You then select your virtual env. There could maybe be an advanced option to provide a conda-lock file that builds the environment if it does not exist, which would be useful for sharing and deployment. A possible use: say I had developed a machine learning algorithm that predicts missing values in a table. I could feed the table into the model and get the results back. I know I could do this in other ways, like writing the data out and then using a SystemCaller to run a .bat or PowerShell script, or using a WorkspaceRunner with a startup script, but that is clunky. It could possibly be language agnostic.


vlroyrenn
Enthusiast
  • June 12, 2024
raean wrote:

I think changing the current PythonCaller would not be good. I did a bit more thinking about this, and you could have something like the InlineQuerier: maybe a new transformer, the PythonInLineQuerier, where you generate an Apache Arrow interface. You then select your virtual env. There could maybe be an advanced option to provide a conda-lock file that builds the environment if it does not exist, which would be useful for sharing and deployment. A possible use: say I had developed a machine learning algorithm that predicts missing values in a table. I could feed the table into the model and get the results back. I know I could do this in other ways, like writing the data out and then using a SystemCaller to run a .bat or PowerShell script, or using a WorkspaceRunner with a startup script, but that is clunky. It could possibly be language agnostic.

The Python execution environment is mostly orthogonal to how feature input and output are handled. Virtual environment support has been requested for a while, but there is still no news:

hkingsbury

Until then, you’ll have to deal with having a single Python environment (per user) for FME Form and a separate one for FME Flow (remember to keep them in sync, FME doesn’t keep track of execution dependencies) that you manually install Python libs into. For local development, the best you can do is use fme-packager to connect your venv with your local FME import path.

The Apache Arrow advantages I have in mind are more performance-oriented and focused on efficient and practical dataframe processing, because correctly loading data into a dataframe is a very error-prone and convoluted process for something that’s probably not an uncommon use case.


LizAtSafe
Safer
  • August 9, 2024
New → Open
