
Hi,

I'm processing almost 200K records with more than 30 attributes, and at one point in my workflow I validate that they're unique by some attribute with an AttributeValidator. This validation is taking around 4020 seconds.

Is there any other transformer that is faster than this one for such validations?

PS: I added the profiling result after running the workspace.

Thanks

Update: @MattAtSafe

I added a simplified workspace with only the validations. This process is taking 17 min for 198k records.

Here: I uploaded the file I'm using to test to Google Drive because it's too heavy.

https://drive.google.com/open?id=0B44BAUsLmbR8ZVkxRGl6SGotNGc

Any help would be appreciated

Hi @rrdlpl, 4000 seconds is pretty slow! Instead of an AttributeValidator you could try using a DuplicateFilter. You might find it more performant, as it works in a different way. Would you be able to share the workspace and the original log file with us?

Using the DuplicateFilter may not give you exactly what you're after; however, it should give you a starting place and remove the blockage.
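The core idea behind the DuplicateFilter can be sketched in plain Python, outside FME: a single pass with a set of already-seen keys, which is O(n) instead of repeated lookups. This is a minimal illustration of the technique, not FME's actual implementation; the function name and record shape are hypothetical.

```python
def split_unique(records, key):
    """Split records into (uniques, duplicates) by a key function,
    keeping the first occurrence of each key -- the same one-pass,
    seen-set idea a DuplicateFilter applies per feature."""
    seen = set()
    uniques, duplicates = [], []
    for rec in records:
        k = key(rec)
        if k in seen:
            duplicates.append(rec)
        else:
            seen.add(k)
            uniques.append(rec)
    return uniques, duplicates
```

Because set membership is a constant-time check, this stays fast even at 200K records.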


Hello rrdlpl,

Have you tried concatenating the attributes that need to be unique into one temporary attribute and then validating on that new temporary attribute?

 

I think this alone can already enhance the performance of your FME workspace.
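The concatenation suggestion above can be sketched in plain Python. Note the separator: joining with an uncommon character avoids false duplicates that a bare concatenation would create. The function name and record shape are hypothetical illustrations, not part of the original workspace.

```python
def composite_key(record, fields, sep='\x1f'):
    """Join the attributes that must be unique into one temporary key.
    The ASCII unit-separator avoids false matches such as
    ('AB', 'C') vs ('A', 'BC'), which a plain '' join would conflate."""
    return sep.join(str(record.get(f, '')) for f in fields)
```

The resulting key can be written to a temporary attribute and fed to a single-attribute uniqueness check.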
Hi @rrdlpl great to hear that this is now taking less time. Are you now getting the result you expect?

17 min for simple validations on only that many records is too slow IMHO.

I did some testing to see what was taking so long in the PrepareFields (AttributeManager) transformer.

I used FME 2017.1 and profiled each of the following:

1- Disabled everything past the DuplicateFilter

2- Removed attribute trimming in PrepareFields

3- Added attribute trimming in the DataMapping PythonCaller using a set for the list of attributes to trim

At this point, I just tried to identify how much time was taken by specific tasks in PrepareFields.

4- Removed Regex in PrepareFields

5- Removed the two date related attribute creation in PrepareFields

6- Removed conditional date creation in PrepareFields

At this point, the only stuff remaining in PrepareFields is ReplaceString and UpperCase calls.

Step | Timer Context            | User CPU Time
1    | FileReader_PrepareFields | 284.499024
1    | f_4 (PythonCaller)       | 111.447114
2    | FileReader_PrepareFields | 223.611833
2    | f_4 (PythonCaller)       | 112.33632
3    | FileReader_PrepareFields | 222.379425
3    | f_4 (PythonCaller)       | 117.624754
4    | FileReader_PrepareFields | 220.866216
4    | f_4 (PythonCaller)       | 118.935162
5    | FileReader_PrepareFields | 99.87184
5    | f_4 (PythonCaller)       | 115.48754
6    | FileReader_PrepareFields | 40.029857
6    | f_4 (PythonCaller)       | 115.955543

Using specific transformers like the AttributeTrimmer could be a first step, but you seem comfortable with Python, and if top performance is your goal, it may be worth doing the date handling in Python as well.
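As a sketch of what "doing the date stuff in Python" could look like: a small helper inside the PythonCaller that converts a source date string to FME's YYYYMMDD form. The function name and the MM/DD/YYYY input format are assumptions here; adjust the format string to match the real source data.

```python
from datetime import datetime

def to_fme_date(value, in_fmt='%m/%d/%Y'):
    """Parse a source date string and return it as YYYYMMDD;
    return None when the value is empty or malformed.
    The input format is a hypothetical example, not the
    original workspace's actual format."""
    if not value:
        return None
    try:
        return datetime.strptime(value.strip(), in_fmt).strftime('%Y%m%d')
    except ValueError:
        return None
```

Doing this inside the existing PythonCaller avoids an extra transformer pass over all 198k features.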

Latest version of the Python code:

import fme
import fmeobjects
import json

# Template Function interface:
# When using this function, make sure its name is set as the value of
# the 'Class or Function to Process Features' transformer parameter
def processFeature(feature):
    pass


# Template Class Interface:
# When using this class, make sure its name is set as the value of
# the 'Class or Function to Process Features' transformer parameter
class FeatureProcessor(object):
    def __init__(self):
        self.dataMapping = json.loads(fme.macroValues['DataMappingJSON'])
        self.loadFields()

    def loadFields(self):
        self.customFieldsAttributes = {}
        self.sourceAttributes = {}
        for field in self.dataMapping:
            source = field['sourceColumn']
            destiny = field['id']
            if field['isCustom']:
                self.customFieldsAttributes[source] = destiny
            else:
                self.sourceAttributes[source] = destiny
        # A set gives O(1) membership tests when deciding which fields to trim
        self.attributesToTrim = {'ATTENDING_SCHOOL_CODE', 'ZONED_SCHOOL_CODE', 'ENROLLED_SCHOOL_CODE', 'SUMMER_SCHOOL_CODE', 'STUDENT_ID', 'GRADE_CODE', 'GENDER', 'ETHNICITY_CODE', 'DATE_OF_BIRTH', 'STREET', 'SPECIAL_ED', 'WHEEL_CHAIR', 'SAFETY_VEST_HARNESS', 'SEATBELT', 'CAR_SEAT', 'BOOSTER_SEAT', 'WALKER', 'TINTED_WINDOWS', 'AIR_COND', 'AIDE_REQ', 'NURSE', 'OXYGEN', 'SERVICE_ANIMAL'}

    def input(self, feature):
        newFeature = fmeobjects.FMEFeature()
        customs = []
        # Read all column names of the incoming feature
        incomingColumnNames = feature.getAllAttributeNames()
        for sourceField in incomingColumnNames:
            # Trim fields (guard against missing/None values)
            if sourceField in self.attributesToTrim:
                value = feature.getAttribute(sourceField)
                if value is not None:
                    feature.setAttribute(sourceField, value.strip())
            cellValue = feature.getAttribute(sourceField)
            if sourceField in self.customFieldsAttributes:
                # Custom fields are collected in a list to be serialized as JSON later
                destiny = self.customFieldsAttributes[sourceField]  # map the sourceColumn to id
                customs.append({'key': destiny, 'value': cellValue})
            elif sourceField in self.sourceAttributes:
                destiny = self.sourceAttributes[sourceField]  # map the sourceColumn to id
                newFeature.setAttribute(destiny, cellValue)
        newFeature.setAttribute('custom_fields_json', json.dumps(customs))
        newFeature.setAttribute('csv_line_number', feature.getAttribute('csv_line_number'))  # csv_line_number comes with each feature after reading the file
        self.pyoutput(newFeature)

    def close(self):
        pass


@larry This is great! Very interesting to see the effect of removing the Date stuff.

 

