We're currently integrating FME with Databricks. In a vanilla Databricks environment, everything works as expected—both reading from and writing to Unity Catalog tables using the Databricks Reader and Writer in FME.
However, in our organizational Databricks environment, we're encountering an issue:
Reading works fine using the same credentials (tested with both OAuth and PAT).
Writing, however, gets stuck in an infinite polling loop in FME, with no error message returned.
The same credentials and target table work fine for writing when using a Python script in the same environment, so the issue seems to be specific to how FME writes data (a rough sketch of such a script is included below).
Given that writing works in the vanilla setup but not in our organizational environment, I suspect the issue is related to permissions or Unity Catalog access controls that are more tightly restricted in the corporate setup.
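For context, the standalone write that succeeds looks roughly like the sketch below, using the databricks-sql-connector package. Connection details, catalog/schema, and table name are placeholders, not our exact code:

# Rough sketch of the kind of standalone Python write that works with the
# same credentials outside FME (databricks-sql-connector). All connection
# details and the table name are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder workspace
    http_path="/sql/1.0/warehouses/abc123def456",                  # placeholder SQL warehouse
    access_token="dapi-xxxxxxxxxxxxxxxx",                          # placeholder PAT
) as connection:
    with connection.cursor() as cursor:
        # plain INSERT into a Unity Catalog table (catalog.schema.table)
        cursor.execute(
            "INSERT INTO my_catalog.my_schema.my_table (id, name) "
            "VALUES (1, 'written outside FME')"
        )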
My questions:
What exact permissions are required in Databricks for FME's Databricks Writer to write to Unity Catalog tables?
What method does FME use under the hood to perform write operations to Databricks/Unity Catalog?
Environment details:
FME version: 2024.2.2
Platform: Azure Databricks
Authentication: Both OAuth and PAT tested
Target: Unity Catalog tables
Thanks for any guidance or insight you can provide!
Staging Upload Credentials

When writing features to a table in Databricks, the writer requires a staging location. For the Databricks writer, there is the option to use a Unity Catalog Volume, Amazon S3, or Azure Data Lake Storage Gen2 (ADLS).
If you decide to use Amazon S3 or Azure Data Lake Storage Gen2, your Databricks workspace MUST have access to the cloud storage location you use.
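If you go the Unity Catalog Volume route, the staging volume is just another securable object that can be created and granted with SQL. A minimal sketch, assuming placeholder catalog/schema/volume and principal names (shown here via the Python SQL connector, but any SQL client works):

# Sketch: create a Unity Catalog Volume to serve as the writer's staging
# location and grant the writing principal access to it. All names and
# connection details are placeholders.
from databricks import sql

conn = sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abc123def456",                  # placeholder
    access_token="dapi-xxxxxxxxxxxxxxxx",                          # placeholder
)
with conn.cursor() as cur:
    cur.execute("CREATE VOLUME IF NOT EXISTS my_catalog.my_schema.fme_staging")
    # the principal FME authenticates as must be able to read and write the volume
    cur.execute(
        "GRANT READ VOLUME, WRITE VOLUME "
        "ON VOLUME my_catalog.my_schema.fme_staging TO `fme-service-principal`"
    )
conn.close()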
It appears the grants can be applied either through the Databricks UI or with SQL statements, which I'm sure you could build into FME via an SQLExecutor or SQLCreator, for example:
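As a hedged sketch of the kind of grants typically involved for a write (the exact set depends on your governance setup; all names are placeholders), these statements could be run once by an admin or issued from an SQLExecutor:

# Typical Unity Catalog privileges a writing principal needs: access to the
# catalog and schema plus SELECT/MODIFY on the target table. Placeholder
# names throughout; adjust to your environment.
from databricks import sql

conn = sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abc123def456",                  # placeholder
    access_token="dapi-xxxxxxxxxxxxxxxx",                          # placeholder
)
grants = [
    "GRANT USE CATALOG ON CATALOG my_catalog TO `fme-service-principal`",
    "GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `fme-service-principal`",
    "GRANT SELECT, MODIFY ON TABLE my_catalog.my_schema.my_table TO `fme-service-principal`",
]
with conn.cursor() as cur:
    for stmt in grants:
        cur.execute(stmt)
conn.close()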
I guess I found the reason it didn't work. With 1,394 tables and 33,000 columns, the catalog was too large for the FME Databricks Writer. Before the actual write, the Databricks Writer reads the metadata for all columns and times out running this query:
DEBUG: http.client send: {"language": "sql", "clusterId": "", "contextId": "", "command": <the SQL below, truncated in the log>}

SELECT
    ST.schema_name, ST.table_name, ST.table_type, ST.is_insertable_into, ST.data_source_format,
    C.column_name, C.is_nullable, C.full_data_type, C.ordinal_position, C.is_updatable
FROM
    (SELECT S.schema_name, T.* FROM
        (`dev_sandbox`.information_schema.schemata AS S)
    LEFT JOIN
        (SELECT table_schema, table_name, table_type, is_insertable_into, data_source_format FROM `dev_sandbox`.information_schema.tables) AS T
    ON S.schema_name = T.table_schema WHERE S.schema_name IN ("sandbox")
    ) AS ST
LEFT JOIN
    (SELECT table_schema, table_name, column_name, is_nullable, full_data_type, ordinal_position, is_updatable FROM `dev_sandbox`.information_schema.columns) AS C
ON ST.schema_name = C.table_schema AND ST.table_name = C.table_name AND ST.data_source_format = 'DELTA'
ORDER BY schema_name, table_name, o
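As a rough way to confirm this metadata scan is the bottleneck, a comparable information_schema query can be timed outside FME with a sketch like the following (connection details are placeholders; the query is a simplified version of the one in the log):

# Time a simplified version of the metadata query FME issues before writing:
# a join of information_schema.tables and information_schema.columns for the
# target catalog/schema. Connection details are placeholders.
import time
from databricks import sql

conn = sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abc123def456",                  # placeholder
    access_token="dapi-xxxxxxxxxxxxxxxx",                          # placeholder
)
query = """
    SELECT T.table_schema, T.table_name, C.column_name, C.full_data_type
    FROM `dev_sandbox`.information_schema.tables AS T
    LEFT JOIN `dev_sandbox`.information_schema.columns AS C
      ON T.table_schema = C.table_schema AND T.table_name = C.table_name
    WHERE T.table_schema IN ('sandbox')
"""
start = time.time()
with conn.cursor() as cur:
    cur.execute(query)
    rows = cur.fetchall()
print(f"{len(rows)} column rows returned in {time.time() - start:.1f} s")
conn.close()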
It works with a smaller catalog (10 tables). So the question now is: what are the possibilities for making it work with the large catalog?