
Hi,

I'm processing almost 200K records with more than 30 attributes, and at one point in my workflow I validate that they're unique by some attribute with an AttributeValidator. This validation is taking around 4020 seconds.

Is there any other transformer that is faster than this one for such validations?

PS: I added the profiling result after running the workspace.

Thanks

Update: @MattAtSafe

I added a simplified workspace with only the validations. This process is taking 17 min for 198k records.

Here: I uploaded the file I'm using to test to Google Drive because it's too heavy.

https://drive.google.com/open?id=0B44BAUsLmbR8ZVkxRGl6SGotNGc

Any help would be appreciated

Hi @rrdlpl, 4000 seconds is pretty slow! Instead of an AttributeValidator you could try using a DuplicateFilter. You might find it more performant, as it works in a different way. Would you be able to share the workspace and the original log file with us?

Using the DuplicateFilter may not give you exactly what you're after; however, it should give you a starting place and remove the blockage.
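The core idea behind the DuplicateFilter can be sketched in plain Python, outside FME: a single pass with a set of already-seen keys, which is O(n) instead of repeated lookups. This is a minimal illustration of the technique, not FME's actual implementation; the function name and record shape are hypothetical.

```python
def split_unique(records, key):
    """Split records into (uniques, duplicates) by a key function,
    keeping the first occurrence of each key -- the same one-pass,
    seen-set idea a DuplicateFilter applies per feature."""
    seen = set()
    uniques, duplicates = [], []
    for rec in records:
        k = key(rec)
        if k in seen:
            duplicates.append(rec)
        else:
            seen.add(k)
            uniques.append(rec)
    return uniques, duplicates
```

Because set membership is a constant-time check, this stays fast even at 200K records.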


Hello rrdlpl,

Have you tried concatenating the attributes that need to be unique into one temporary attribute and then validating on that new temporary attribute?

 

I think this alone can already enhance the performance of your FME workspace.
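The concatenation suggestion above can be sketched in plain Python. Note the separator: joining with an uncommon character avoids false duplicates that a bare concatenation would create. The function name and record shape are hypothetical illustrations, not part of the original workspace.

```python
def composite_key(record, fields, sep='\x1f'):
    """Join the attributes that must be unique into one temporary key.
    The ASCII unit-separator avoids false matches such as
    ('AB', 'C') vs ('A', 'BC'), which a plain '' join would conflate."""
    return sep.join(str(record.get(f, '')) for f in fields)
```

The resulting key can be written to a temporary attribute and fed to a single-attribute uniqueness check.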
Hi @rrdlpl great to hear that this is now taking less time. Are you now getting the result you expect?

17 min for simple validations on only that many records is too slow IMHO.

I did some testing to see what was taking so long in the PrepareFields (AttributeManager) transformer.

I used FME 2017.1 and profiled each of the following:

1- Disabled everything past the DuplicateFilter

2- Removed attribute trimming in PrepareFields

3- Added attribute trimming in the DataMapping PythonCaller using a set for the list of attributes to trim

At this point, I just tried to identify how much time was taken by specific tasks in PrepareFields.

4- Removed Regex in PrepareFields

5- Removed the two date related attribute creation in PrepareFields

6- Removed conditional date creation in PrepareFields

At this point, the only stuff remaining in PrepareFields is ReplaceString and UpperCase calls.

Step | Timer Context            | User CPU Time
1    | FileReader_PrepareFields | 284.499024
1    | f_4 (PythonCaller)       | 111.447114
2    | FileReader_PrepareFields | 223.611833
2    | f_4 (PythonCaller)       | 112.33632
3    | FileReader_PrepareFields | 222.379425
3    | f_4 (PythonCaller)       | 117.624754
4    | FileReader_PrepareFields | 220.866216
4    | f_4 (PythonCaller)       | 118.935162
5    | FileReader_PrepareFields | 99.87184
5    | f_4 (PythonCaller)       | 115.48754
6    | FileReader_PrepareFields | 40.029857
6    | f_4 (PythonCaller)       | 115.955543

Using specific transformers like the AttributeTrimmer could be a first step, but you seem comfortable with Python, and if top performance is your goal, it may be worth doing the date handling in Python as well.
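As a sketch of what "doing the date stuff in Python" could look like: a small helper inside the PythonCaller that converts a source date string to FME's YYYYMMDD form. The function name and the MM/DD/YYYY input format are assumptions here; adjust the format string to match the real source data.

```python
from datetime import datetime

def to_fme_date(value, in_fmt='%m/%d/%Y'):
    """Parse a source date string and return it as YYYYMMDD;
    return None when the value is empty or malformed.
    The input format is a hypothetical example, not the
    original workspace's actual format."""
    if not value:
        return None
    try:
        return datetime.strptime(value.strip(), in_fmt).strftime('%Y%m%d')
    except ValueError:
        return None
```

Doing this inside the existing PythonCaller avoids an extra transformer pass over all 198k features.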

Latest version of the Python code:

import fme
import fmeobjects
import json

# Template Function interface:
# When using this function, make sure its name is set as the value of
# the 'Class or Function to Process Features' transformer parameter
def processFeature(feature):
    pass


# Template Class Interface:
# When using this class, make sure its name is set as the value of
# the 'Class or Function to Process Features' transformer parameter
class FeatureProcessor(object):
    def __init__(self):
        self.dataMapping = json.loads(fme.macroValues['DataMappingJSON'])
        self.loadFields()

    def loadFields(self):
        self.customFieldsAttributes = {}
        self.sourceAttributes = {}
        for field in self.dataMapping:
            source = field['sourceColumn']
            destiny = field['id']
            if field['isCustom']:
                self.customFieldsAttributes[source] = destiny
            else:
                self.sourceAttributes[source] = destiny
        # A set gives O(1) membership tests when deciding which fields to trim
        self.attributesToTrim = {'ATTENDING_SCHOOL_CODE', 'ZONED_SCHOOL_CODE', 'ENROLLED_SCHOOL_CODE', 'SUMMER_SCHOOL_CODE', 'STUDENT_ID', 'GRADE_CODE', 'GENDER', 'ETHNICITY_CODE', 'DATE_OF_BIRTH', 'STREET', 'SPECIAL_ED', 'WHEEL_CHAIR', 'SAFETY_VEST_HARNESS', 'SEATBELT', 'CAR_SEAT', 'BOOSTER_SEAT', 'WALKER', 'TINTED_WINDOWS', 'AIR_COND', 'AIDE_REQ', 'NURSE', 'OXYGEN', 'SERVICE_ANIMAL'}

    def input(self, feature):
        newFeature = fmeobjects.FMEFeature()
        customs = []
        # Read all column names of the incoming feature
        incomingColumnNames = feature.getAllAttributeNames()
        for sourceField in incomingColumnNames:
            # Trim fields (guard against missing/None values)
            if sourceField in self.attributesToTrim:
                value = feature.getAttribute(sourceField)
                if value is not None:
                    feature.setAttribute(sourceField, value.strip())
            cellValue = feature.getAttribute(sourceField)
            if sourceField in self.customFieldsAttributes:
                # Custom fields are collected in a list to be serialized as JSON later
                destiny = self.customFieldsAttributes[sourceField]  # map the sourceColumn to id
                customs.append({'key': destiny, 'value': cellValue})
            elif sourceField in self.sourceAttributes:
                destiny = self.sourceAttributes[sourceField]  # map the sourceColumn to id
                newFeature.setAttribute(destiny, cellValue)
        newFeature.setAttribute('custom_fields_json', json.dumps(customs))
        newFeature.setAttribute('csv_line_number', feature.getAttribute('csv_line_number'))  # csv_line_number comes with each feature after reading the file
        self.pyoutput(newFeature)

    def close(self):
        pass


@larry This is great! Very interesting to see the effect of removing the Date stuff.

 

