Solved

Generate list of all unique characters contained in (huge free text) field.

26 June 2017
5 replies
19 views

rob14
27 replies

Hi,

Can anyone advise on the most efficient way to explode a huge free text field (or fields) into its individual characters and retain a single instance of each? I am essentially trying to complete a pre-flight check to find out whether there are any ‘odd’ or ‘unexpected’ characters in an ever-expanding data set over which I have no control.

I have created the process below, which completes the task; however, it is very inefficient, and as the number of records increases it will become too slow.

1. Derive the string length of the free text field.

2. Clone each feature by the number derived in step 1 (a clone number is created in the process).

3. Extract a substring using the clone number to obtain the character at that position.

4. Use a duplicate remover to create my list.

5. Expose the character code.

6. Output the list.

Thanks in advance,

Rob

Best answer by takashi 26 June 2017, 14:42

ebygomm
Contributor
3079 replies
26 June 2017

You could create your list initially by using a StringSearcher with the regular expression "." (which matches any single character) and setting a list name for all matches, then using a ListDuplicateRemover to get a list of unique characters.

I have no idea how that would compare performance-wise.
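
For anyone who wants to try the same idea outside FME, here is a minimal plain-Python sketch of "match every character with a regex, then remove duplicates". It is only an analogue of the transformer approach, not how the StringSearcher is implemented, and the function name `unique_chars_regex` is made up for illustration. One thing it highlights: in Python's `re` module, "." skips newlines unless `re.DOTALL` is set, so it may be worth checking how line breaks are handled in whichever regex engine you use.

```python
import re

def unique_chars_regex(text):
    """Return the distinct characters of text in first-seen order."""
    # re.DOTALL makes "." match newlines too; without it they would be skipped.
    matches = re.findall(r".", text, flags=re.DOTALL)
    # dict.fromkeys keeps the first occurrence of each key and preserves order.
    return list(dict.fromkeys(matches))

result = unique_chars_regex("hello\nworld")
```

Here `result` holds each character once, including the newline, in the order it first appeared.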

takashi
Contributor
7538 replies
26 June 2017

Hi @rob14, I think using a Python script could be more efficient. Assuming that an attribute called "_text" stores the text string, a PythonCaller with this script creates a list containing the unique characters.

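# PythonCaller Script Example
def processFeature(feature):
    # A missing attribute is treated as empty text so that set() does not fail on None.
    s = set(feature.getAttribute('_text') or '')
    # The set holds one instance of each character; store it as the list attribute _char{}.
    feature.setAttribute('_char{}', list(s))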
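
To see what the set-based approach yields without running a workspace, here is a small standalone sketch in plain Python. It does not use the FME API; the function name `chars_with_codes` is hypothetical, and pairing each character with its code point via `ord` is my own addition to mirror the "expose character code" step in the original process.

```python
def chars_with_codes(text):
    """Return sorted (character, code point) pairs for the distinct characters."""
    # A missing value (None) is treated as empty text.
    unique = set(text or "")
    # Sorting gives a stable order; a bare set has no guaranteed order.
    return [(c, ord(c)) for c in sorted(unique)]

pairs = chars_with_codes("a\u00e9a\tb")
```

The repeated "a" appears once in `pairs`, and the tab and the accented character show up with their numeric codes, which makes non-printing characters easy to spot.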

rob14
Author
27 replies
26 June 2017

Hi @takashi,

Thanks very much. Nearly there, but I have two questions:

1. The script worked and I can see the unique characters. However, how do I expose and explode the list "_char"? When I tried to use the ListExploder, the list was not visible. Do I need to complete additional configuration in the PythonCaller transformer?

2. If I also needed to do this globally, to find a single unique list across all records (instead of, or as well as, a list unique to each record), is there a quick way to do that too, rather than PythonCaller -> ListExploder -> duplicate remover?

I am interested in being able to do both.

Thanks,

Rob

takashi
Contributor
7538 replies
26 June 2017
Best Answer

1. You can expose the list name "_char{}" with the Attributes to Expose parameter in the PythonCaller parameters dialog.

2. This script builds a single list from all the input features, then outputs one feature carrying that list after the last feature has been read.

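# PythonCaller Script Example 2
import fmeobjects
class FeatureProcessor(object):
    def __init__(self):
        # Characters seen so far across all input features.
        self.chars = set()

    def input(self, feature):
        # Add this feature's characters; a missing attribute counts as empty text.
        self.chars |= set(feature.getAttribute('_text') or '')

    def close(self):
        # Runs once after the last feature: output one feature carrying the list.
        feature = fmeobjects.FMEFeature()
        feature.setAttribute('_char{}', list(self.chars))
        self.pyoutput(feature)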
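
The accumulation logic can also be tried on its own, outside FME. The sketch below is a plain-Python analogue of the class above, not part of the FME API; `unique_chars_across` and the sample `records` are invented for illustration.

```python
def unique_chars_across(records):
    """Collect the distinct characters found in any record."""
    chars = set()
    for text in records:
        # Skip missing or empty values; set.update adds each character of the string.
        if text:
            chars.update(text)
    return chars

# Sample data: includes a missing value, an empty string and a non-breaking space.
records = ["abc", None, "", "cab!", "d\u00a0e"]
all_chars = unique_chars_across(records)
```

Because the set only ever holds one entry per distinct character, its size is bounded by the number of different characters in the data, not by the volume of text, which is a large part of why this approach scales better than cloning one feature per character.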

In addition, if you ultimately need one feature per character, the close method can be modified like this instead of using the ListExploder afterwards.

    def close(self):
        for i, c in enumerate(self.chars):
            feature = fmeobjects.FMEFeature()
            feature.setAttribute('_char', c)
            feature.setAttribute('_element_index', i)
            self.pyoutput(feature)
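
Once there is one row per character, the pre-flight check itself comes down to comparing against the characters you expect. Here is a hedged plain-Python sketch of that last step. The choice of `string.printable` as the expected set is purely an assumption for illustration, and `unexpected_chars` and the `code` key are hypothetical names; substitute whatever your data is supposed to contain.

```python
import string

# Assumption for illustration: printable ASCII is what the data should contain.
EXPECTED = set(string.printable)

def unexpected_chars(chars, expected=EXPECTED):
    """Return one row per character outside the expected set, sorted by code point."""
    odd = sorted(set(chars) - expected)
    return [{"_element_index": i, "_char": c, "code": ord(c)}
            for i, c in enumerate(odd)]

rows = unexpected_chars("Caf\u00e9 \u2018quoted\u2019\x00")
```

Sorting before enumerating gives the rows a repeatable order, since iterating over a set directly does not guarantee one.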

rob14
Author
27 replies
26 June 2017

Hi @takashi

You are a Star.

Thanks very much, that was lightning quick to run!

Regards,

Rob
