A bit of a subjective question, but what are fast file-based geospatial data formats to read in FME?
I have a set of geodatabases (10GB) covering the whole country, and it’s located locally on the C-drive of the FME Flow server. Users specify a bounding box to extract the data for their area of interest.
To make the current setup faster, I have played with indexes and made sure that I’m doing a bounding box intersect on the FeatureReader, etc. But it still takes at least 10-15 minutes to get the output.
One other idea that I’m playing with is to use a different format for the source data instead of geodatabase. GeoPackage got a mention somewhere; FlatGeoBuf apparently is fast; and I have heard good stories about GeoParquet. Any other formats I should consider?
Has anyone done any comparison on this? Any pitfalls / drawbacks I should be aware of?
At the moment I’m looking more at file-based formats, as formats that require additional (database) software would involve IT too much.
I will do testing myself as well, and post some findings here later.
Hi
Its a great question. Particularly I like to use Database (Oracle, PostGis).
IMO, you want speed, efficiency and flexibility from a file based storage: Use Spatialite
Spatialite:
- Natively compresses data and automatically varies field width storage space, no matter what the Table Field width definition is. It will consistently be disk space effieicnt, and in testing has been nearly as good a Parquet with max compression setting.
- It uses SQL in the backend to read the data. There are LOTs of ways to parameterise the queries to it. Way more than ESRI has provided with their FGDB interface. Can even use it in SQLExecutor as opposed to a Reader.
- It can natively be used within QGIS, ArcGIS (although ArcMap before 10.6 does not use the Spatial Index properly when layer rendering…..haven’t tried in Eg. ArcGIS Pro)
- Is portable. All data in one file.
SQL Functions can use/be parameterised within SQLExecutor
https://www.gaia-gis.it/gaia-sins/spatialite-sql-5.1.0.html
Surprisingly also, it appears from testing, Safe has enabled searching the spatial index in Spatialite…..this is not a straightforward query to build, and if you want to do this manually in a SQLExecutor then see Alessandros’ guide (Spatialite’s lead developer) here in using the Virtual Spatial Index inside a SQL statement:
https://www.gaia-gis.it/fossil/libspatialite/wiki?name=SpatialIndex
Below I’ve set up a monster Cadastral dataset, with a 9,600,000 polygon feature table with 38 fields all populated with data and written into both a Spatialite test file and a test FGDB to compare.
Given a FeatureReader a 1 km x 1 km box to search for Intersecting Polygons
This returns 27,500 intersecting polygons from the search of the 9,600,000 features.
- Spatialite FeatureReader = 2 seconds (!)
- FileGeoDB Open API FeatureReader = 403 seconds (to be honest, it appears the API didn’t use any spatial index at all)
- FileGeoDB Proprietary (ie. Need ArcGIS installed/licenced on FME system) = 5 seconds
Expanding to a 10 km x 10 km box get 77,200 intersecting features
- Spatialite FeatureReader = 4 seconds
- FileGeoDB Open API FeatureReader = 455 seconds (Confirms API through FeatureReader does not appear to use any spatial index)
- FileGeoDB Proprietary FeatureReader (ie. Need ArcGIS installed/licenced on FME system) = 7 seconds
Other formats more in progress to watch for the future will be DuckDB (haven’t tested if FME have enabled spatial index search), which may eventually pass Spatialite’s capabilities...it doesn’t quite yet have as flexible spatial API. GeoParquet from what can see developers have not yet decided if/when to implement a Spatial Index and how to include that in the standard for writing and accessing and would suspect be behind Spatialite and DuckDB for the near-future.
Beyond just its spatial capabilities below, it has proven to be hugely scalable and performant from its SQLite base. I have separately built a 1 TB SQLite database time series event logger with 100s of millions of timestamped events in it. If indexed properly in the way SQLite optimises its query searches (ie. Build proper Multi-field indices and give hints to the query on how to use it), it finds and returns rows within the date/time WHERE clauses within seconds. Although it does take quite a bit of experimentation/learning how SQLite works to get to this scale of query performance.
Expanding to a 10 km x 10 km box get 77,200 intersecting features
- Spatialite FeatureReader = 4 seconds
- FileGeoDB Open API FeatureReader = 455 seconds (Confirms API through FeatureReader does not appear to use any spatial index)
- FileGeoDB Proprietary FeatureReader (ie. Need ArcGIS installed/licenced on FME system) = 7 seconds.
Sounds like this could be my issue! We can’t use the Esri GDB Reader on our Flow server, because Esri wants another ‘million bucks’ for their licence, so we’re using the OpenAPI GDB Reader, which - like in your tests - takes several minutes to process rather than seconds.
Still curious to hear from people who have used GeoPackage, GeoParquet or FlatGeoBuf (or other?). My main criteria is the spatial filtering speed: getting a 10x10km bounding box extract from a collection of 20 to 50 feature classes.
Noting
Running the same 9,600,000 features from a GeoPackage database and querying with a FeatureReader 10 x 10 km box gave exactly the same execution times as a Spatialite FeatureReader: 4 seconds. To be expected given the fundamental SQL run in the background both FME is hitting a SQLite R*Tree spatial index in what would be expected to be in almost identical ways.
Also as expected, the data encoding is also virtually identical: The difference in file size between the Spatialite extended SQLite database, and the GeoPackage extended SQLite database was less than 0.1%
The only difference is in application intent: Spatialite was meant to be a flexible enterprise type database file that supports a comprehensive range of SQL Spatial functions that can be executed directly against its API and do this at database level rather than at client application level. Hence why FME also has the alternative of using any of its Spatial SQL functions within the WHERE Parameter or in a custom SELECT statement run through SQLExecutor.
GeoPackage is instead aimed primarily at being an application data store rather than an enterprise database. Due to its SQLite base, it does retain the ability to run SQL against it, but without the ability to use the full Spatialite Library against it…….although interestingly in FME there is partial support to do this, as an FME Spatialite Reader/SQLExecutor will read/select from a GeoPackage encoded SQLite database table.
I wish I could give more than one upvote to
I wish I could give more than one upvote to
This :)
As promised, I did some testing today. I took one of the geodatabases containing 23 different feature classes all covering the whole country. And I measured how long it took to extract a small bounding box (~5 x 5km).
Open API GDB Reader = 6 minutes
Esri GDB Reader = 1 minute
This really seems to indicate that the Open API GDB Reader doesn’t seem to take advantage of the spatial indexing.
But unfortunately, the Esri GDB Reader is not available to us due to licence/financial constraints. Hence this question for alternative formats
GeoParquet = 15 minutes. Size-wise this blew everything else out of the water, with a 75%(!) reduction. But performance-wise it was far too slow. I didn’t try it with DuckDB. But it looks like GeoParquet still has a bit to go before becoming a serious contender.
GeoPackage = 10 seconds(!) if reading all from one .gpkg file; 15 seconds if reading 23 separate .gpkg files.
FlatGeoBuf = 15 seconds.
Both are lightning fast. Very impressive. File sizes were also in similar order of magnitude, so I could’ve used either. I decided to go for FlatGeoBuf, because GeoPackage converts all attribute names to lowercase, which may be an issue for the users. It’s not something I could change, so I assume it’s a spec of the .gpkg format.
I still need to implement the final solution, but something that took about 10 minutes, now will take less than 1 minute. Great!
And finally, a big shout out for
I decided to go for FlatGeoBuf, because GeoPackage converts all attribute names to lowercase, which may be an issue for the users. It’s not something I could change, so I assume it’s a spec of the .gpkg format.
Hmmm, I wonder if that is an FME version problem? SQLite does not have an field name case limitation, it’s pretty flexible, and neither does GeoPackage. The encoding standard shows all example tables with lower case, but there is no explicit requirement in the encoding standard to use lower case.
It is recommended to use lower case names and no spaces etc. , but that is not a GeoPackage constraint, but rather this gets recommended for maximum interoperability for applications consuming the data where some applications may not work properly given GeoPackage is aimed as being a kind of universal data packager/data interchange format.
https://www.geopackage.org/guidance/getting-started.html#:~:text=Creating%20a%20GeoPackage
Eg. Can write to GeoPackage with Upper Case + Spaces fine (and read it back into FME)
This includes Esri-specific attribute names, like OBJECTID, Enabled, Shape_Length, which I could see as being a potential issue if users want their cut-out in GDB or shapefile format.
We’re using FME 2024.1.1.0
Just to add my 10cents: afaik Spatialite cannot handle arcs (they get stroked) whereas Geopackage can handle arcs.
This includes Esri-specific attribute names, like OBJECTID, Enabled, Shape_Length, which I could see as being a potential issue if users want their cut-out in GDB or shapefile format.
Interesting, that almost sounds like a bit of a bug with FME2024 then? As above, there is no requirement in the Geopackage specification that I can see for changing the case, removing spaces etc. and the format technology supports it.
It sounds like a misinterpretation of the specification by Safe. The spec says “SHOULD” but this is not “MUST”. If going to do this in the Writer, then this should be a Parameter option set by user rather than enforce it. As you say, one of the main interchange / interops Geopackage will need to work with is ESRI Geodatabases that don’t enforce this as mandatory either, so why it shouldn’t be enforced on Geopackage writing by FME.