Question

How to filter a csv file based on the last row in FME

  • 28 September 2017
  • 28 replies
  • 11 views

I have a csv file looking as below

ATT1 ATT2 ATT3 New Attribute Fri Jun 24 08:01:00 2016 1.46676E+12 A Sat Jun 25 08:01:00 2016 1.46684E+12 A Thu Jul 07 08:01:00 2016 1.46788E+12 A Fri Jul 08 08:01:00 2016 1.46796E+12 A Sat Jul 09 08:01:00 2016 1.46805E+12 A Sun Jul 10 08:01:00 2016 1.46814E+12 B

 

Mon Jul 11 08:01:00 2016 1.46822E+12 B

 

Sat Sep 24 08:01:00 2016 1.4747E+12 B

 

Sun Sep 25 08:01:00 2016 1.47479E+12 B

 

I want to set the "New Attribute" to 'Yes' based on the "ATT3" value of the last row.

Or i want to filter the records based on the ATT3 value of the last row.


28 replies

Userlevel 2
Badge +16

AttributeCreator has the option to look a number of rows ahead or back.

Have a look at the Adjacent feature handling:

In a Tester you can then do: If ATTR3 from this feature != ATTR3 from next feature the ATTR4 = "Yes" or use the Conditional value when creating ATTR4

Badge +14

Or use the Text Reader and set the parameter to read from the bottom, then use the Sampler to sample off the first feature (actually the last).

Userlevel 2
Badge +16

Or use the Text Reader and set the parameter to read from the bottom, then use the Sampler to sample off the first feature (actually the last).

This will get you the last record of the entire dataset, not per ATTR3 value.

 

 

Userlevel 2
Badge +17

Hi @sobanmughal, a pair of VariableSetter and VariableRetriever in conjunction with a FeatureHolder can be used to add the last ATT3 to every feature as an attribute. You can then compare ATT3 with the last ATT3 for each feature.

Hi @sobanmughal, a pair of VariableSetter and VariableRetriever in conjunction with a FeatureHolder can be used to add the last ATT3 to every feature as an attribute. You can then compare ATT3 with the last ATT3 for each feature.

Thanks @takashi I have a million files to read, is there any simpler method with less number of transformers?

 

AttributeCreator has the option to look a number of rows ahead or back.

Have a look at the Adjacent feature handling:

In a Tester you can then do: If ATTR3 from this feature != ATTR3 from next feature the ATTR4 = "Yes" or use the Conditional value when creating ATTR4

Thanks @erik_jan can you explain in bit more detail as I am not very proficient in FME

 

Hi @sobanmughal, a pair of VariableSetter and VariableRetriever in conjunction with a FeatureHolder can be used to add the last ATT3 to every feature as an attribute. You can then compare ATT3 with the last ATT3 for each feature.

I am reading multiple files at a time and this will filter all attributes based on the last attribute of the last file.
Badge +6
This will get you the last record of the entire dataset, not per ATTR3 value.

 

 

But I think it's a good idea, use AttributeSplitter can more easily from the entire line ATT3, if there are many lines of the original CSV, set the text reader's Max Features to Read=1, can only read a line as the last line of the ATT3, such execution efficiency will be higher.

 

filterbylastrowvalue.fmwt

 

 

Userlevel 2
Badge +17

Hi @sobanmughal, a pair of VariableSetter and VariableRetriever in conjunction with a FeatureHolder can be used to add the last ATT3 to every feature as an attribute. You can then compare ATT3 with the last ATT3 for each feature.

You are right. With the workflow I provided above, only the last ATT3 in the final file will be used when multiple CSV files are processed at once.

 

In order to process for each file individually, make the variable name unique to each file. For example, expose "fme_dataset" in the CSV reader and use the dataset name as suffix of the variable name, like this.

 

to be continued.
Userlevel 2
Badge +17

Hi @sobanmughal, a pair of VariableSetter and VariableRetriever in conjunction with a FeatureHolder can be used to add the last ATT3 to every feature as an attribute. You can then compare ATT3 with the last ATT3 for each feature.

0684Q00000ArLoUQAV.png

However, if the number of input files was very large, the workflow could consume huge memory space, since the FeatureHolder should hold all input features before starting output.

 

If there were millions files to be process at once, I would use a PythonCaller with this script. Note that you still need to expose "fme_dataset" in the CSV reader.

 

# PythonCaller Script Example
class FeatureProcessor(object):
    def __init__(self):
        self.features = []
        self.prevDataset = None
        
    def input(self, feature):
        dataset = feature.getAttribute('fme_dataset')
        if dataset != self.prevDataset:
            self.flush()
            self.prevDataset = dataset
        self.features.append(feature)
        
    def close(self):
        self.flush()
        
    def flush(self):
        if self.features:
            lastAtt3 = self.features[-1].getAttribute('ATT3')
            for feature in self.features:
                if feature.getAttribute('ATT3') == lastAtt3:
                    feature.setAttribute('New Attribute', 'Yes')
                self.pyoutput(feature)
            self.features = [] 
Userlevel 2
Badge +17
This will get you the last record of the entire dataset, not per ATTR3 value.

 

 

Agree with @taojunabc. In addition, if there are multiple CSV files to be processed at once, you can expose "fme_dataset" in the reader feature types and set it to the Group By parameter in the FeatureMerger.

 

Hi @sobanmughal, a pair of VariableSetter and VariableRetriever in conjunction with a FeatureHolder can be used to add the last ATT3 to every feature as an attribute. You can then compare ATT3 with the last ATT3 for each feature.

Where the Python caller needs to be inserted?

 

My workspace looks like below.

 

capture5.png

 

 

Userlevel 4
Badge +30
Where the Python caller needs to be inserted?

 

My workspace looks like below.

 

capture5.png

 

 

@sobanmughal, i believe after your Reader CSV

 

Userlevel 2
Badge +17

Hi @sobanmughal, a pair of VariableSetter and VariableRetriever in conjunction with a FeatureHolder can be used to add the last ATT3 to every feature as an attribute. You can then compare ATT3 with the last ATT3 for each feature.

You can replace the VariableSetter, FeatureHolder, and VariableRetriever in your workspace with a single PythoCaller containing the script I posted above. The script adds a new attribute called 'New Attribute' which stores 'Yes' if ATT3 of the feature is equal to the ATT3 of the last feature for each file.

 

Expose the attribute name 'New Attribute' via the 'Attributes to Expose' parameter in the PythonCaller. You can then filter the features output from the PythonCaller by testing whether the 'New Attribute' stores 'Yes' or not.

 

Userlevel 2
Badge +17

Hi @sobanmughal, a pair of VariableSetter and VariableRetriever in conjunction with a FeatureHolder can be used to add the last ATT3 to every feature as an attribute. You can then compare ATT3 with the last ATT3 for each feature.

0684Q00000ArLoUQAV.png

Besides, if all the features whose ATT3 were NOT equal to the last ATT3 can be discarded, you can modify the 'flush' method like this, so that the PythonCaller outputs only features whose ATT3 were equal to the last ATT3, rather than adding 'New Attribute'. If you do so, the subsequent filtering process can also be removed.

 

    def flush(self):
        if self.features:
            lastAtt3 = self.features[-1].getAttribute('ATT3')
            for feature in self.features:
                if feature.getAttribute('ATT3') == lastAtt3:
                    self.pyoutput(feature)
            self.features = []

 

Badge +2
Thanks @erik_jan can you explain in bit more detail as I am not very proficient in FME

 

Hi @sobanmughal, I think takashi and 1spatialDave's suggestions were based on the understanding that you want to use the row at the end of a file to set the new attribute's value. But if by last row, you mean the row above the row you want to set value for, then, @erik_jan has got the right idea for you. The documentation would have some more details, but here is a screenshot on how you'd set this up in the AttributeCreator:

 

Hope this helps clarify things a bit.

 

Xiaomeng

 

Userlevel 4
Badge +13

Given that you're facing huge data volumes, I'd be VERY tempted to read all the data twice.

I'd suggest using @takashi's idea of the variables, but extending to having the data read twice.

The first time (in the first reader you put down), you'd set a Variable whose name was the fme_dataset and whose value was the value of Attr3. When the datasets are exhausted, you've got your variables all set.

The second time (in the second reader) you'd retrieve that Variable and do your test and then act accordingly.

Basically, reading CSV is very cheap (especially in FME 2017 and later). So its better NOT to hold on to lots of data in FME land and instead just read the CSVs twice.

My sample workspace & data are here: lastrecord.fmwt

Given that you're facing huge data volumes, I'd be VERY tempted to read all the data twice.

I'd suggest using @takashi's idea of the variables, but extending to having the data read twice.

The first time (in the first reader you put down), you'd set a Variable whose name was the fme_dataset and whose value was the value of Attr3. When the datasets are exhausted, you've got your variables all set.

The second time (in the second reader) you'd retrieve that Variable and do your test and then act accordingly.

Basically, reading CSV is very cheap (especially in FME 2017 and later). So its better NOT to hold on to lots of data in FME land and instead just read the CSVs twice.

My sample workspace & data are here: lastrecord.fmwt

Great suggestion @daleatsafe.

 

However, if the number of input files was very large, the workflow could consume huge memory space, since the FeatureHolder should hold all input features before starting output.

 

If there were millions files to be process at once, I would use a PythonCaller with this script. Note that you still need to expose "fme_dataset" in the CSV reader.

 

# PythonCaller Script Example
class FeatureProcessor(object):
    def __init__(self):
        self.features = []
        self.prevDataset = None
        
    def input(self, feature):
        dataset = feature.getAttribute('fme_dataset')
        if dataset != self.prevDataset:
            self.flush()
            self.prevDataset = dataset
        self.features.append(feature)
        
    def close(self):
        self.flush()
        
    def flush(self):
        if self.features:
            lastAtt3 = self.features[-1].getAttribute('ATT3')
            for feature in self.features:
                if feature.getAttribute('ATT3') == lastAtt3:
                    feature.setAttribute('New Attribute', 'Yes')
                self.pyoutput(feature)
            self.features = [] 
@takashi the script works fine. However, I have come across with another problem where aggregator holds on to all the data before it outputs to the writer. I want to take average for each file and the pass on to the writer.capture6.png

 

Userlevel 2
Badge +17
@takashi the script works fine. However, I have come across with another problem where aggregator holds on to all the data before it outputs to the writer. I want to take average for each file and the pass on to the writer.capture6.png

 

If you are using the Aggregator to perform something for each source file, you can set "Yes" to the "Input is Ordered by Group" parameter in the Aggregator to process for each group.

 

 

@takashi the script works fine. However, I have come across with another problem where aggregator holds on to all the data before it outputs to the writer. I want to take average for each file and the pass on to the writer.capture6.png

 

That was easy:)

 

Thanks!!!!

 

 

Hi @sobanmughal, a pair of VariableSetter and VariableRetriever in conjunction with a FeatureHolder can be used to add the last ATT3 to every feature as an attribute. You can then compare ATT3 with the last ATT3 for each feature.

0684Q00000ArLoUQAV.png

This is the final script for PythonCaller Transformer

 

 

import fme
import fmeobjects


class FeatureProcessor(object):
    def __init__(self):
        self.features = []
        self.prevDataset = None
        
    def input(self, feature):
        dataset = feature.getAttribute('fme_dataset')
        if dataset != self.prevDataset:
            self.flush()
            self.prevDataset = dataset
        self.features.append(feature)
        
    def close(self):
        self.flush()
        
    def flush(self):
        if self.features:
            lastAtt3 = self.features[-1].getAttribute('Ordered Rate')
            for feature in self.features:
                if feature.getAttribute('Ordered Rate') == lastAtt3:
                    self.pyoutput(feature)
        self.features = []
 

 

Userlevel 4
Badge +13
Great suggestion @daleatsafe.

 

BTW I just wish there was a way to use the Sampler to get the very last record only. That would be even a bit more efficient. I'm pretty tempted to make that change to our codebase I must say...

 

 

Badge +6

Except for the advice above, I also thought of a approach that uses InlineQuerier.

Of course, this approach is not the most efficient, but provides another way to think about it.

if you have a lot of csv files to read, I think that using WorkspaceRunner would be more efficient than using Group By.

filterbylastrowvalue2.fmwt

Userlevel 1
Badge +4

Hi!

 

I wonder if using Pointclouds could have a performance-impact positively/negatively here ?

For instance like the screenshot below: You can get the last value in the text-file with PointCloud-operations. (Don't think you can do the actual filtering without changing the component to some Integer though. Would be cool though.).

 

 

Now - this if the case really is the last record in the file.

@daleatsafe , @xiaomengatsafe?

Reply