Hi,

 

I have a huge dataset to write to text files.

Each line has a file_id and an identifier. This identifier is not unique and can be repeated.

 

I must write everything in a maximum of 100 files.

I must write a maximum of 250 lines per file.

To do that, since each file must have an id, I thought I could simply use the file_id in the file name when writing.

Here is what I did:

  • I use a Counter to add a _count attribute to each line.
  • I divide this _count attribute by 250 to create the file_id attribute.
  • I use AttributeRounder to round the file_id.
  • I write the text files with a FeatureWriter, setting the filename to "file_@Value(file_id)".

Once I've done this, I can write a maximum of 250 lines in each text file.
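
For reference, that computation boils down to an integer division; here is a minimal Python sketch of the same idea (illustrative only, not FME syntax; the count starts at 0, as the Counter does by default):

def assign_naive(lines, per_file=250):
    # lines: list of dicts in reading order; mimics Counter + division + rounding down
    for count, line in enumerate(lines):
        line["file_id"] = count // per_file          # lines 0-249 -> file 0, and so on
        line["filename"] = "file_%d" % line["file_id"]
    return lines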

However, all lines with the same identifier must be in the same file, and with just the steps above, that's not the case.

So I thought about:

  • using an Aggregator to group by the identifier and create a list of file_id values.
  • using ListRangeExtractor to extract the maximum file_id in the list.
  • using AttributeManager to set file_id with the maximum.

I thought I was done with this, but then I realised that, since the file_id was already computed, I was just adding more lines at the beginning of the next text file, so that file ends up with more than 250 lines...

 

What should I do, since I can't use loops in FME? Is there a simpler way to do this?

 

Here is an image to help you understand, in case my explanations are not clear:

[image: points]

Can you sort by Identifier and also File_id, with File_id in descending order? Then you just need to look for when the identifier has changed from the previous record, using the Adjacent Features capability of the AttributeManager, and increment a file name counter when it changes.
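
For what it's worth, the equivalent logic in plain Python would be something like this rough sketch (it assumes the records are already sorted by identifier; the attribute names come from the question, not from any FME API):

def assign_by_identifier_change(records):
    file_counter = 0
    prev = None
    for rec in records:
        # Adjacent Features equivalent: compare with the previous record
        if prev is not None and rec["identifier"] != prev:
            file_counter += 1        # identifier changed: increment the file counter
        rec["file_id"] = file_counter
        prev = rec["identifier"]
    return records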


Are there more than 100 unique identifiers? Is it possible to have more than 250 records with the same identifier?

 

If the answer to both is no, you could fanout on the identifier:

SM.txt: a,b,c

TER.txt: def

If you need the filename to be an incrementing number, then sort your data by identifier, and then use adjacent feature handling to create a file id that increments every time the identifier changes value.

 

 

[image: incrementOnChange]

If your 100-file constraint means that you need more than one type of identifier in the same file, then things become more complicated. It essentially becomes a bin packing problem where your bins have a capacity of 250 and each unique identifier is an item with a weight equal to the number of features with that identifier.
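
For illustration, a first-fit decreasing heuristic is one common way to approximate such a bin packing; a minimal Python sketch, assuming the weight (feature count) per identifier has already been computed (all names here are hypothetical):

def pack_identifiers(weight_by_identifier, capacity=250):
    # assumes no single identifier has more than `capacity` features
    bins = []                        # each bin: [remaining space, list of identifiers]
    # place the heaviest identifiers first (first-fit decreasing)
    for ident, weight in sorted(weight_by_identifier.items(),
                                key=lambda kv: kv[1], reverse=True):
        for b in bins:
            if b[0] >= weight:       # first bin with enough room left
                b[0] -= weight
                b[1].append(ident)
                break
        else:
            bins.append([capacity - weight, [ident]])   # open a new bin (file)
    # map each identifier to its bin index, i.e. its file_id
    return {ident: i for i, b in enumerate(bins) for ident in b[1]}

With the 100-file limit, you would also need to check afterwards that no more than 100 bins were opened.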

 



I'm not sure about the number of occurrences, but I doubt there would be more than 10 data lines with the same identifier. Still, I cannot fanout on the identifier, since I need several different identifiers in one file. It's only when approaching the 250th line that I must be careful that data with the same identifier are not separated into different files.



Indeed, I have started using a "weight" attribute that contains the number of occurrences of each identifier, but so far none of my attempts work...



Thank you for the Adjacent Features tip, I did not know about that.

I tried your method; however, when changing the file id, I have the following problem: if, near the 250th line of data, there are data lines with the same identifier, they are indeed moved from file N to file N+1 and stay together. But in that case, file N+1 has more than 250 lines, since those lines were added at its beginning. I'm trying to manage this using a "weight" attribute that contains the number of occurrences of an identifier, but so far it does not work...


This can be done with a recursive SQL command (see example below), using the InlineQuerier.

In the attached workspace I use a simple data set containing a row number as identifier (RowNumber, starting at 1) and a number of items (NumItems). The workspace also contains a Private Parameter, $(MaxItemsInBin), defining the maximum number of items in a bin; it is set to 100.

I found the idea for the SQL command (written for SQL Server) here; I simply rewrote it for my dataset in SQLite.

with recursive
    C ("cRowNumber", "cNumItems", "cCustomSum")
as ( --Get first record
    select "RowNumber", "NumItems", "NumItems"
    from "Data"
    where "Data"."RowNumber" = 1
    
    union all
    --You add the new value of NumItems to the sum of the previous record
    --If the running sum remains less than the maximum capacity of the bin or it reaches exactly the maximum capacity of the bin 
    --   then there is nothing to do
    --If the running sum overtakes the maximum capacity of the bin, the new value will be part of a new bin and so 
    --   the new value is added to the multiple of $(MaxItemsInBin) that has been overtaken
    select 
        "RowNumber",
        "NumItems",
        case 
            when (((C."cCustomSum" + "Data"."NumItems") / $(MaxItemsInBin) = C."cCustomSum" / $(MaxItemsInBin)) or ((C."cCustomSum" + "Data"."NumItems") % $(MaxItemsInBin) = 0)) 
                then (C."cCustomSum" + "Data"."NumItems") 
            else ((C."cCustomSum" + "Data"."NumItems") / $(MaxItemsInBin) * $(MaxItemsInBin)) + "Data"."NumItems" 
        end
    from "Data"
    inner join C on "Data"."RowNumber" = C."cRowNumber" + 1
)
 
select 
    "cRowNumber", "cNumItems", "cCustomSum", ("cCustomSum" - 1) / $(MaxItemsInBin) as cBinNumber
from C
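
To see what the CASE expression does, the same running-sum logic can be replayed in a small Python sketch (the item counts below are made-up values; the maximum per bin is 100, as in the workspace):

MAX_ITEMS = 100
sizes = [40, 30, 50, 90, 10]         # hypothetical NumItems, one per RowNumber

custom_sum = 0
for row, n in enumerate(sizes, start=1):
    candidate = custom_sum + n
    if candidate // MAX_ITEMS == custom_sum // MAX_ITEMS or candidate % MAX_ITEMS == 0:
        custom_sum = candidate       # the group still fits in the current bin
    else:
        # the group overflows: jump to the next multiple of MAX_ITEMS, then add it
        custom_sum = (candidate // MAX_ITEMS) * MAX_ITEMS + n
    print(row, n, custom_sum, (custom_sum - 1) // MAX_ITEMS)

This prints bin numbers 0, 0, 1, 2, 2: the 50-item group does not fit behind the first 70 items, so the running sum jumps to 150 and the group starts bin 1.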

[image: bin_packing_results]

Special thanks to @jdh for mentioning "bin packing", which triggered my search for an SQL-based solution.


I'm thinking you need a running total of how many records are in the current file. Using the adjacent feature functionality, you can look ahead at the next group size and decide whether it will fit in the current file. If not, start a new file and reset the running total to zero. Use a VariableSetter and VariableRetriever to keep the running total.
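
In plain Python terms, that look-ahead logic would be something like the sketch below (the 250-line cap comes from the question; in the workspace, the VariableSetter/VariableRetriever pair would carry file_id and the running total between features):

def assign_files(groups, max_lines=250):
    # groups: list of (identifier, records) pairs, records grouped per identifier
    file_id, running_total = 0, 0
    for identifier, records in groups:
        if running_total + len(records) > max_lines:
            file_id += 1             # the whole group would overflow: start a new file
            running_total = 0
        running_total += len(records)
        for rec in records:
            rec["file_id"] = file_id
    # the 100-file limit is not enforced here; verify file_id < 100 afterwards
    return groups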

