Solved

Generate list of all unique characters contained in (huge free text) field.

  • 26 June 2017
  • 5 replies
  • 19 views

Hi,

Can anyone advise on the most efficient way to explode a huge free text field (or fields) into its individual characters and retain a single instance of each? I am essentially trying to complete a pre-flight check to find out whether there are any 'odd' or 'unexpected' characters in an ever-expanding data set over which I have no control.

I have created the process below, which completes the task; however, it is very inefficient, and as the number of records increases it will become too slow.

1. Derive the string length of the free text field.
2. Clone each feature by the number derived in step 1 (the clone number is created in the process).
3. Extract a substring using the clone number to obtain the character at that position.
4. Use a duplicate remover to create my list.
5. Expose the character code.
6. Output the list.

Thanks in advance,

Rob

Best answer by takashi 26 June 2017, 14:42

5 replies

You could create your list initially by using a StringSearcher with the regular expression . and a list name for all matches, then using a ListDuplicateRemover to get a list of unique characters.

I have no idea how that would compare performance-wise.
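
For comparison, here is a rough plain-Python sketch of the same "match every character, then drop duplicates" idea. It uses Python's own `re` module, not the StringSearcher, so it is only an analogy; the function name `unique_chars_regex` is made up for illustration. One thing it shows: in Python, `.` does not match a newline unless `re.DOTALL` is set, so it may be worth checking how the transformer treats line breaks if those count as 'odd' characters for you.

```python
import re

def unique_chars_regex(text):
    # re.DOTALL makes "." match newlines too; without it they are skipped.
    matches = re.findall(r".", text, flags=re.DOTALL)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(matches))

sample = "abca\nb"
print(unique_chars_regex(sample))
```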

Hi @rob14, I think using a Python script could be more efficient. Assuming that an attribute called "_text" stores a text string, a PythonCaller with this script creates a list containing the unique characters.

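# PythonCaller Script Example
def processFeature(feature):
    # getAttribute returns None when '_text' is missing, so guard before set().
    text = feature.getAttribute('_text')
    if text:
        # A set keeps a single instance of each character.
        feature.setAttribute('_char{}', list(set(text)))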

Hi @takashi,

Thanks very much, nearly there, but I have two questions:

1. The script has worked and I can see the unique characters; however, how do I expose and explode the list "_char"? When I tried to use the ListExploder, the list was not visible. Do I need to complete additional configuration in the PythonCaller transformer?

2. If I also needed to do this globally, to find a unique list across all records (instead of, or as well as, unique to a given record), is there a quick way to do that too, rather than PythonCaller -> ListExploder -> DuplicateRemover?

I am interested in being able to do both.

Thanks,

Rob

1. You can expose the list name "_char{}" with the Attributes to Expose parameter in the PythonCaller parameters dialog.

2. This script builds one list from all the input features, then outputs a single feature carrying that list once every feature has been read.

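# PythonCaller Script Example 2
import fmeobjects

class FeatureProcessor(object):
    def __init__(self):
        # Unique characters collected from every input feature.
        self.chars = set()

    def input(self, feature):
        # Skip features where '_text' is missing (getAttribute returns None).
        text = feature.getAttribute('_text')
        if text:
            self.chars |= set(text)

    def close(self):
        # Called once after the last feature: output one feature with the list.
        feature = fmeobjects.FMEFeature()
        feature.setAttribute('_char{}', list(self.chars))
        self.pyoutput(feature)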
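
If you want to try the accumulation logic outside FME first, the following is a minimal plain-Python sketch of the same idea. It does not use `fmeobjects`; the function name `unique_chars_across` and the sample records are hypothetical, and `None` stands in for a feature whose text attribute is missing.

```python
def unique_chars_across(records):
    """Return the set of characters seen in any record; empty or None records are skipped."""
    chars = set()
    for text in records:
        if text:
            # Union the characters of this record into the running set.
            chars |= set(text)
    return chars

records = ["abc", None, "cab!", ""]
print(sorted(unique_chars_across(records)))
```

The set grows only with the number of distinct characters, not with the number of records, which is why this approach tends to stay cheap as the data set expands.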

In addition, if you ultimately need to explode the feature on the list, the close method can be modified like this instead of using a ListExploder afterwards.

    def close(self):
        for i, c in enumerate(self.chars):
            feature = fmeobjects.FMEFeature()
            feature.setAttribute('_char', c)
            feature.setAttribute('_element_index', i)
            self.pyoutput(feature)
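
The original question also asked for the character code of each unique character. As a rough sketch in plain Python (not FME-specific), the standard `ord` function and `unicodedata` module can report the code point and Unicode name, and a set difference can flag anything outside an expected alphabet. The helper names `describe_chars` and `unexpected_chars` and the `allowed` alphabet are assumptions for illustration only.

```python
import string
import unicodedata

def describe_chars(chars):
    """Return (character, code point, Unicode name) tuples, sorted by character."""
    rows = []
    for c in sorted(chars):
        # Control characters such as tab have no Unicode name, so supply a default.
        name = unicodedata.name(c, "<unnamed>")
        rows.append((c, "U+{:04X}".format(ord(c)), name))
    return rows

def unexpected_chars(chars, allowed):
    """Return the characters that are not in the allowed alphabet, sorted."""
    return sorted(set(chars) - set(allowed))

# Hypothetical definition of "expected": ASCII letters, digits, punctuation, space.
allowed = string.ascii_letters + string.digits + string.punctuation + " "
found = set("Caf\u00e9 menu\t2017")
for row in describe_chars(unexpected_chars(found, allowed)):
    print(row)
```

Inside a PythonCaller, the same `ord(c)` call could be used to set an extra attribute next to "_char" in the close method, if that suits your workflow.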

Hi @takashi

You are a star.

Thanks very much, that was lightning quick to run!

Regards,

Rob
