Solved

Generate list of all unique characters contained in (huge free text) field.



Hi,

Can anyone advise on the most efficient way to explode a huge free text field (or fields) into its individual characters and retain a single instance of each? I am essentially trying to complete a pre-flight check to understand whether there are any 'odd' or 'unexpected' characters in an ever-expanding data set over which I have no control.

I have created the process below, which completes the task; however, it is very inefficient and will become too slow as the number of records increases.

1. Derive the string length of the free text field.
2. Clone each feature by the number derived in step 1 (the clone number is created in the process).
3. Extract a substring using the clone number to obtain the character at that position.
4. Use a duplicate remover to create my list.
5. Expose the character code.
6. Output the list.

Thanks in advance,

Rob


5 replies

ebygomm
Influencer
  • June 26, 2017

You could create your list initially by using a StringSearcher with the regular expression "." and setting a list name for all matches, then use a ListDuplicateRemover to get a list of unique characters.

I have no idea how that would compare performance-wise.
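
For readers who want to see the idea outside FME, here is a minimal plain-Python sketch of the same two steps: match every character with the regular expression ".", then drop duplicates. The function name unique_chars_regex is made up for illustration, and this is only an analogue of the transformer chain, not how StringSearcher or ListDuplicateRemover are implemented. It assumes Python 3.7 or later, where dict preserves insertion order.

```python
import re


def unique_chars_regex(text):
    """Return the distinct characters of text in first-seen order."""
    # "." matches any single character; re.DOTALL lets it match newlines too.
    matches = re.findall(r".", text, flags=re.DOTALL)
    # dict.fromkeys keeps the first occurrence of each key, in order.
    return list(dict.fromkeys(matches))
```

Without re.DOTALL, newline characters would be skipped, which matters if line breaks are among the characters being checked.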


takashi
Influencer
  • June 26, 2017

Hi @rob14, I think a Python script could be more efficient. Assuming that an attribute called "_text" stores the text string, a PythonCaller with this script creates a list containing the unique characters.

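# PythonCaller Script Example
def processFeature(feature):
    text = feature.getAttribute('_text')
    if text is None:  # attribute missing: leave the feature unchanged
        return
    # A set keeps one instance of each character in the string
    feature.setAttribute('_char{}', list(set(text)))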
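
Since the original process also exposes the character code, here is a small standalone sketch (plain Python 3, no FME objects) of how the unique characters could be paired with their code points and Unicode names. The function name describe_chars and the "<unnamed>" placeholder are assumptions made for illustration.

```python
import unicodedata


def describe_chars(text):
    """Return (char, code point, Unicode name) per distinct character."""
    rows = []
    # sorted() on single characters orders them by code point.
    for ch in sorted(set(text)):
        # Control characters have no Unicode name, so supply a default.
        rows.append((ch, ord(ch), unicodedata.name(ch, "<unnamed>")))
    return rows
```

Sorting by code point tends to push unusual characters to either end of the list, which may make a visual check easier.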


  • Author
  • June 26, 2017

Hi @takashi,

Thanks very much, nearly there, but I have two questions:

1. The script worked and I can see the unique characters; however, how do I expose and explode the list "_char"? When I tried to use a ListExploder, the list was not visible. Do I need to complete additional configuration in the PythonCaller transformer?

2. If I also needed to do this globally, to find a unique list across all records (instead of, or as well as, unique to a given record), is there a quick way to do that too, rather than PythonCaller -> ListExploder -> duplicate remover?

I am interested in being able to do both.

Thanks,

Rob


takashi
Influencer
  • Best Answer
  • June 26, 2017

1. You can expose the list name "_char{}" with the Attributes to Expose parameter in the PythonCaller parameters dialog.

2. This script builds the list from all the input features, then outputs a single feature carrying the list once the last feature has been read.

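# PythonCaller Script Example 2
import fmeobjects
class FeatureProcessor(object):
    def __init__(self):
        # Unique characters collected from every input feature
        self.chars = set()

    def input(self, feature):
        text = feature.getAttribute('_text')
        if text is not None:  # skip features that lack the attribute
            self.chars |= set(text)

    def close(self):
        # Runs once after the last feature: output a single feature with the list
        feature = fmeobjects.FMEFeature()
        feature.setAttribute('_char{}', list(self.chars))
        self.pyoutput(feature)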

In addition, if you ultimately need one feature per list element, the close method can be modified like this instead of using a ListExploder afterwards.

    def close(self):
        for i, c in enumerate(self.chars):
            feature = fmeobjects.FMEFeature()
            feature.setAttribute('_char', c)
            feature.setAttribute('_element_index', i)
            self.pyoutput(feature)
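
To try the accumulate-then-report pattern outside FME, here is a plain-Python sketch in which a list of strings stands in for the "_text" values of the input features. The names global_unique_chars, unexpected_chars and EXPECTED are hypothetical, and treating printable ASCII as the "expected" set is only an assumption; adjust it to whatever your data should contain.

```python
import string

# Hypothetical allow-list: printable ASCII. Change this to suit your data.
EXPECTED = set(string.printable)


def global_unique_chars(records):
    """Union of the characters in every record; None records are skipped."""
    chars = set()
    for text in records:
        if text is not None:
            chars |= set(text)
    return chars


def unexpected_chars(records, expected=EXPECTED):
    """Distinct characters outside the allow-list, sorted by code point."""
    return sorted(global_unique_chars(records) - expected)
```

For example, unexpected_chars(["abc", None, "caf\u00e9"]) returns only the accented character, because everything else is printable ASCII.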

  • Author
  • June 26, 2017

Hi @takashi

You are a Star.

Thanks very much, that was lightning quick to run!

Regards,

Rob

