Question

Massive number of points and polygon areas

  • 23 September 2015
  • 17 replies
  • 9 views

Badge
Hi,

I have a massive number of points within a country (15+ million) representing addresses [1 massive CSV file with lat/lon],

and I also have a huge number of area polygons (30k+) covering various locations [multiple .tab files].

I need to get the number of points inside each polygon area.

What would be the most optimised way to achieve this?

I am thinking of a WorkspaceRunner approach: selecting each polygon area one by one and mapping it against the 15+ million points. However, that would take a very long time.

Is there a quick spatial filtering approach that minimizes how many points are read, say reading only within a certain area rather than through all 15 million points?

Thanks in advance.

Kind regards,

Jubert

17 replies

Userlevel 4
Badge +13
Hi,

I would convert the CSV points into FFS (with spatial index) and use that, with a bounding box, as input for the overlay.
Userlevel 2
Badge +17
Hi,

In this case, since the number of polygons is not so large, I would try the PointOnAreaOverlayer transformer.

First, add a reader to read the 30k+ polygons from the MapInfo files.

Second, add a reader to read the 15+ million points from the CSV file. Create point geometries through the CSV reader parameter settings or a VertexCreator. Also, it's better to remove unnecessary attributes from the point features.

Make sure that the MapInfo reader is at the top in the Navigator, so that the polygons will be read first. This is very important.

Add a PointOnAreaOverlayer transformer:

- send the polygons to the Area port,
- send the points to the Point port,
- set the Area First parameter to "Yes", to avoid consuming huge amounts of memory caching the points.

Then every polygon output from the Area port will have an "_overlaps" attribute which stores the number of points inside it.

Takashi
Badge
Hi Takashi/ETLS-Itay,

Thanks, though I am still not clear yet. The idea is that I need the output as multiple files [CSVs or TAB files], one per individual polygon, with the points included in it.

Thus the desired output is 30k+ CSV or TAB files.

I would also need to know the data contained in each point, not only the summary of how many points were covered (which is presented in the "_overlaps" field).


Note also that the 30k polygons overlap, thus I won't have the option to combine them into one single file.

The idea is to create a "spatial filter" so FME doesn't read the 15 million lines for every polygon.

Potentially I am thinking of using the x/y of the polygon and drawing a circle boundary around it, as I know the distance limits. Instead of reading the 15 million lines, FME would read the CSV spatially and read only within that circle boundary, not the whole lot.

Is that something FME can do? So basically there is a focus area instead of reading the whole lot.

How would I do it? What is the optimized way?

Thanks in advance!

Kind regards,

Jubert
Userlevel 2
Badge +17
Neither CSV nor MapInfo TAB supports a spatial querying mechanism, so you cannot avoid reading the 15+ million points at least once. But just once is enough: FME performs spatial indexing internally when spatial relationships need to be tested.

Therefore, I think the PointOnAreaOverlayer would be one of the most efficient solutions in this case, if you read the polygons first and also set the "Area First" parameter to "Yes".
Badge
Thanks Takashi,

The 30k polygons that I am going to use also change every now and then, while the 15+ million points only change once every 3 months, so it would be relatively OK to read them at least once.

Can I force FME to convert the 15+ million points to a file format which does support spatial querying, or save the one-time run internally in FME? How do I do that?

So if I use the 30k+ polygons as my input on the Area port of the PointOnAreaOverlayer, and the newly converted 15+ million point file (with spatial querying support) as the points, would that do the trick?

Thank you so much in advance!

Kind regards,

Jubert
Badge
Hi ETLS-Itay/Takashi,

Could you elaborate more on the "bounding box" part?

What I did was convert the 15+ million points to an FFS file (it was split into 6 files: 5 .ffs files and 1 .fsi). It finished successfully, with spatial index [took about ~20 mins to run].

Thus my inputs to the PointOnAreaOverlayer would be the FFS file (which contains the 15 million points) on the Point port, and the 30k+ polygons on the Area port.

I tried running only a few sample polygons (30 only) but it took forever to run.

I tried looking for a bounding box transformer but could only find the following:

1. BoundingBoxAccumulator
2. BoundingBoxReplacer

but they don't make sense to me at the moment.

I also followed Takashi's recommendation to set "Area First" to "Yes", and it reads that input first, which works fine.

I would appreciate your expertise on this... thanks in advance.

Kind regards,

Jubert
Userlevel 2
Badge +17
Sorry, I was wrong. MapInfo TAB does support a spatial index, according to the help doc.

Anyway, if you save the points into a dataset supporting a spatial index, you can perform relatively efficient spatial querying with a FeatureReader for each polygon. See the help on the FeatureReader (Spatial Filter parameter).

The performance depends on the format. I guess that a spatial database (e.g. PostGIS, Oracle Spatial, etc.) would be faster than file-based formats.

However, I'm not sure whether "30k spatial queries" would be better than the PointOnAreaOverlayer approach...
Userlevel 2
Badge +17
Have you tried a spatial database + FeatureReader?
Badge
Hi Takashi,

I tried converting the 15 million points to FFS, which supports spatial indexing, and added a FeatureReader between my reader and writer.

For 30 polygons it took 2 minutes to read everything, which I believe is acceptable.

The output was a CSV file with a "_matches_records" field added to the original polygons. I am assuming that this is the number of matched records, i.e. points within the polygons.

What I need, though, is a CSV that includes all the parameters/details of each of the matched records, not only the number of records included.

So for example: say that 23796 was the matched-records count. I am assuming that these are the points coinciding with the polygon.

How do I put that into the CSV to make 23796 lines with the same columns, but with the details of each record that matches (keeping columns A, B, C, D the same for all rows, plus additional columns matching those of the 15 million points which I loaded in the FeatureReader)?

Thanks in advance!

Kind regards,

Jubert
Userlevel 2
Badge +17
If you need to write the attributes of every point corresponding to a polygon into the CSV file, you should use the point features output from the <Generic> port of the FeatureReader.

In this case, I think you will have to merge attributes of the polygon feature (Initiator) onto the point features, because you are using a polygon attribute as the destination CSV file name. You can control this with parameters in the "Advanced > Attribute and Geometry Handling" section of the FeatureReader parameters dialog.

Also, note that there could be conflicts between polygon and point attribute names. If there are conflicts, you may have to resolve them appropriately depending on your requirements.
Badge
Hi Takashi, I think I was able to pull it off. However, it took 1+ hour to process 30 polygons. I will try to minimize the number of attributes on the 15 million points; hopefully that will enhance the performance.

It is only very quick to get the summary "_matches_records": roughly 5 records per second for that single line.

I am running another round now. I guess I am in a trial-and-error phase at the moment.
Userlevel 2
Badge +17
That's too slow; 30k polygons would never complete...

Here is another approach, using a database. I expect that a database would be much faster than FFS.

Preparing:

Save the points into a PostGIS database table (e.g. "point_table"); a rough loading sketch follows below.
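For instance, a minimal sketch of this preparing step, assuming a WGS84 lat/lon CSV. The "point_table" and "geom" names match the query further down; attr1..attr3 are hypothetical stand-ins for the actual CSV columns:

-----
-- Hypothetical point table; attr1..attr3 stand in for the real CSV columns.
create table point_table (
  id serial primary key,
  attr1 text,
  attr2 text,
  attr3 text,
  geom geometry(Point, 4326)
);

-- A GiST spatial index is what keeps each polygon query from
-- scanning all 15+ million rows.
create index point_table_geom_idx on point_table using gist (geom);
-----

The loading itself could be done with an FME PostGIS writer or any bulk-load tool.
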
Workspace:

(1) Read the polygons with a MapInfo reader.

(2) Extract each polygon geometry as OGC Well-Known Text with a GeometryExtractor and store it in an attribute (e.g. "_geometry").

(3) Add a SQLExecutor to query "point_table" and fetch the records of the points which are inside the polygon. The SQL statement looks like this:

-----
select attr1, attr2, attr3,
  '@Value(fme_basename)' as dest_csv_name
from point_table
where ST_Contains('@Value(_geometry)'::geometry, geom)
-----
attr1, attr2, ... indicate the attributes of the points; replace them with the actual field names. Select only the minimal desired attributes here (and not the geometry column) to avoid geometry creation and increase performance.

"dest_csv_name" stores the fme_basename value of the Initiator polygon. It can be used as the destination CSV file name in the CSV writer feature type. If you need other polygon attributes, you can add them in the same way.

The where clause performs the spatial filtering using a PostGIS function; see ST_Contains (http://postgis.net/docs/ST_Contains.html) to learn more. A tiny standalone example follows below.
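Just to illustrate the argument order (the containing geometry comes first; the WKT literals are hypothetical):

-----
-- Returns true: the point lies inside the square polygon.
select ST_Contains(
  ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'),
  ST_GeomFromText('POINT(5 5)')
);
-----
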
And then, expose the attribute names manually with the "Attributes to Expose" parameter.

You can set the "Combine Attributes" parameter to "Result Attributes Only", because the required polygon attributes (i.e. fme_basename and the others) have already been merged by the SQL statement.
Userlevel 2
Badge +17
And set the "Combine Geometry" parameter to "Result Geometry Only" to prevent the polygon geometry from being copied onto every point record. Then the SQLExecutor creates no geometries.
Badge
Hi Takashi, yes it is slow. I think most of the time is consumed in writing the CSV file. I will try to run a higher number of sample polygons over the weekend, and I will try the PostGIS method on Monday. Hopefully it solves the problem. Thanks so much for your help, I really appreciate it.

Kind regards,

Jubert
Badge +3

I would advise using a Tiler set to a number of tiles.

Then do a SpatialRelator between the tiles and your polygons.

Send those tiles which intersect to a PointOnAreaOverlayer. Test this using a Tester with the test clause related_candidates > 0.

Then use a PointOnAreaOverlayer grouped by the column and row attributes from the Tiler.

This allows the use of parallel processing and speeds things up considerably. I mostly set it to "aggressive" so I can still do some decent surfing. :)

Userlevel 2
Badge +16

I would insert these records into some kind of spatial database (if possible); you could use Oracle, SQL Server or PostGIS for that. After doing so, I would use a SQL statement to calculate the join in the database, along the lines of the sketch below.
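For instance, a minimal sketch of that in-database join, assuming the "point_table" from above plus a hypothetical "polygon_table" with a "name" column, both with GiST-indexed "geom" columns:

-----
-- Count the points falling inside each polygon in a single pass;
-- the spatial indexes avoid scanning all 15+ million points per polygon.
select p.name, count(pt.id) as point_count
from polygon_table p
left join point_table pt
  on ST_Contains(p.geom, pt.geom)
group by p.name;
-----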

Badge +2

...as for the desired output of "30k+ CSV files": use the Feature Type Fanout on the CSV writer. This will allow you to split each set of points into separate CSV files. There's a good tutorial on Feature Type Fanout, or perhaps this article.
