
We have the following stack:

  • FME reads XML, then inserts into and deletes from SDE.
  • Data stored in a non-versioned SDE geodatabase on Azure SQL Server.
  • ArcGIS Enterprise serving a MapService based on the SDE data.
  • Internal WebGIS using the MapService.

Most of this works fine, but when we clean up large datasets it takes a long time to complete. What I know is mostly self-taught, so I wonder if I missed something vital.

 

The task is to delete records from roughly 20 quite large featureclasses where projectname = 'projectname'.

What I do now:

  • Using a FeatureReader, I read from all featureclasses with a WHERE clause, reading only one feature per featureclass.
  • Now I know which featureclasses contain data from this specific project. Next I use these features in a FeatureWriter set to delete, with Row Selection matching on the ProjectName column.

I think this is the right approach: let the database do the work. I expect it effectively runs delete from featureclass where projectname = 'projectname'.

 

The thing is, the column 'projectname' has an attribute index to accelerate performance in the web viewer. So now I wonder: does it delete a row, update the index, delete another row, and so on, and is this killing performance?

 

Alternatives I can think of:

  • Use SQLExecutors instead of FeatureWriters; maybe more control, but probably downsides as well? Not sure about this.
  • Remove the attribute indices before deleting records, then delete, then restore the attribute indices (sketched below).
  • Get the ObjectIDs of the features to be deleted using a SQLExecutor and a SELECT statement, then delete those records using a SQLExecutor.
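
For the second alternative, the rough idea would be something like this in SQL (just a sketch; the table, index and column names are made up for illustration):

-- drop the attribute index so it isn't maintained row-by-row during the bulk delete
drop index ix_projectname on dbo.my_featureclass;

-- delete all rows for the project in one statement
delete from dbo.my_featureclass where projectname = 'projectname';

-- recreate the attribute index afterwards
create index ix_projectname on dbo.my_featureclass (projectname);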

For the best possible performance in such scenarios, it is vital to reduce the amount of data transferred between FME and your database as much as possible.

Consider three different alternatives in how to approach this scenario:

  1. Read the entire contents of the database tables, filter with e.g. a Tester and delete the selected records in the database
  2. Read only the primary key (e.g. OBJECTID) and as few fields as needed for the Tester, then continue as above
  3. Instruct the database, with SQL, which records to delete. Example "delete from my_table where my_column = 1"

 

For number one, you'll probably end up retrieving a lot of attributes that simply aren't needed. Even worse, if the table contains complex geometries and you're not using them in your workspace, there is a lot of wasted effort (=time) just to handle these. If there is a lot of data (think volume rather than number of records), this will translate into a saturated database connection and more memory needed by FME. This is a scenario that works great for very small datasets and is very easy to implement for beginners, but it can be a performance disaster for large data volumes.

 

For number two, you're hopefully transferring (much) less data, but if you have, let's say, 20 million records and you only want to delete a small number of them, you'll be heavily penalized, since you still have to wait for all 20 million records to be transferred over the network and processed locally by FME, in the worst case even exhausting local memory and forcing FME into the dreaded "optimizing memory" messages in the log. This scenario typically gets worse the bigger the initial data volume is and the greater the disparity between records read and records deleted (i.e. deleting small amounts from huge datasets).

 

Finally, for number three you're practically reading nothing into FME (other than perhaps the criteria for the WHERE clause), meaning that FME can get straight to telling the database what to do. The database will then do its merry little dance while FME twiddles its fingers, and then announce that it's done. If the fields referenced in the WHERE clause are properly indexed, this can be extremely fast. This is a best-case scenario, but not always possible: you may not know in advance which records to delete unless you compare with some external dataset, e.g. from an XML file.
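
In your case this could be as simple as one SQLExecutor per featureclass, or a single SQLExecutor fed one initiator feature per featureclass with the table name carried as an attribute. A sketch, assuming hypothetical attribute names _table_name and _project (the @Value() substitution into the SQL statement is standard SQLExecutor behaviour):

-- delete all rows for the project in the featureclass named by the initiator feature
delete from @Value(_table_name)
where projectname = '@Value(_project)'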

 

Since alternative 3 sometimes isn't possible, what options do we have to avoid alternative 2? Let's say you receive an XML file that contains all the records to delete. If the dataset is very large, but the number of records to delete is fairly limited, it can actually be much more efficient to write the list of objects to delete into a temporary table in the database, and then let the database do the deletion based on a join. E.g. if we have a list of 100 primary keys to delete from a table containing 20 million records, we can write the 100 keys to a temporary table and let the database use that as a filter, rather than reading the 20 million records into FME first. Example SQLExecutor, executed only once after the 100 records have been written:

delete b
from my_big_table b
inner join my_temp_table t on b.pk = t.pk

So in this case we've managed to "invert" the methodology compared to scenario 1 above: we've only handled the 100 records to delete in FME, while not touching the 20 million records in the database. The performance improvement can, in many cases, be mind-blowing.

 

This doesn't directly answer your question, but hopefully it's still helpful for how to approach issues like these :-)



Thanks, I did not think about the last solution. I like it and will try.



I'm glad if my longer-than-usual rambling was helpful :-)

Since you're storing your main data on SQL Server, you could look into its built-in mechanism for temporary tables; they're very easy to use and simplify your work: https://www.sqlservertutorial.net/sql-server-basics/sql-server-temporary-tables/
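
A rough sketch of how that could look, with everything in a single SQLExecutor so the session-scoped temporary table stays visible (the table and key names are placeholders, and the literal key values stand in for whatever FME has extracted from the XML):

-- create a session-scoped temporary table and fill it with the keys to delete
create table #keys_to_delete (pk int primary key);
insert into #keys_to_delete (pk) values (101), (102), (103);

-- let the database do the deletion based on the join
delete b
from my_big_table b
inner join #keys_to_delete t on b.pk = t.pk;

-- the temporary table disappears automatically when the connection is closed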

 


I'm facing a similar situation currently. Some of the SDE tables are versioned and archived, so those have to go through SDE. However, some of the tables are non-versioned and not archived. For the latter, deleting straight from the database (SQLExecutor) is much faster than going through SDE.



Thanks! Any downsides to deleting without SDE knowing? Do I need to recalculate statistics after running the deletes? Rebuild indexes?



In theory it shouldn't be necessary; the database handles both indexes and statistics automatically. There are some cases, however, where rebuilding the spatial index could make sense, in particular if the changes to the content would trigger a change in grid size. Realistically speaking, you'll need to benchmark both with and without rebuilding and see what makes sense for your specific case.
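
If you do decide to rebuild after a large delete, something along these lines would do it in SQL Server (index and table names are placeholders):

-- rebuild a specific index (spatial or attribute) on the table
alter index ix_projectname on dbo.my_featureclass rebuild;

-- refresh the optimizer statistics for the table
update statistics dbo.my_featureclass;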



Many thanks again. I'm trying really hard not to demolish anything, and my understanding of SDE is limited. Grid size won't be affected as the extents are set to country size.

