Skip to main content

I have been given a huge spreadsheet containing countless series of identical columns that are duplicated many times, with a prefix in front of each that indicates some kind of property of the data. I now want to reformat these data so that the duplicate columns are replaced with unique columns and the attributes that were used to separate the columns become attributes in the data instead. Please see simplified attachment (and screenshot, below) which shows the "Current format" and "Desired format" sheets.

 

Current

Current dataDesired

Desired data 

I've tried several things, none of which are giving me the results I want:

  1. Adding an AttributeCreator to create a separate, duplicated pipeline for each of the types (Apple, Banana, Orange, etc.) and then BulkAttributeRenamer on each to remove the prefix of the column based on the attribute I created. In theory, this should create a single "Tree Height", etc. column per pipeline that can then be merged back together and the remaining attributes discarded but this doesn't work because it doesn't rename the columns correctly using the new attribute.
  2. Using an AttributeExploder to write the attributenames to attributes, then using an AttributeCreator to create a separate, duplicated pipeline for each of the types (Apple, Banana, Orange, etc.). I then merged these back together into a string replacer to remove the "type" from the attributes and used an AttributeKeeper the cleaned attributes, e.g. "Tree Height", etc. This takes too long to run and quickly reaches tens of millions of rows. It is also quite complicated to recreate the table in a sensible structure at the end
  3. Adding an AttributeCreator to create a separate, duplicated pipeline for each of the types (Apple, Banana, Orange, etc.) and then using a BulkAttributeRemover to only keep the columns that contain the value of the attribute I created. Unfortunately, the REGEX does not allow an attribute value to be used in the search criteria so this would require hard coding for every Apple, Banana, Orange, etc.

 

Ideally, I'd like to be able to read in a table of all of the prefixes (e.g. "Apple", "Banana", "Orange" in this case) and reformat the table automatically by using their values. Whatever the solution, it needs to be relatively scalable because the spreadsheet is huge! Any help gratefully received, cheers

Hi @bi​ ,

here is a workflow I created for you. I also try the AttributeExploder, maybe it help you:

 

Screenshot_1Result:

Screenshot_2


Hi @bi​ , interesting question! I was only able to solve it using the python called, but my solution might help someone else make an FME native solution!

 

*Python Caller*

 

import fme

import fmeobjects

 

from collections import defaultdict

 

cols = (

    'Tree Branches',

    'Tree Height',

    'Tree Leaves',

    'Tree Width'

)

 

groups_headers = (

    'Tree Species',

)

 

class FeatureProcessor(object):

    def __init__(self):

        self.features_list = list()

    def input(self,feature):

        self.features_list.append(feature)

    def close(self):

        for f in self.features_list:

            names = f.getAttribute('_attr_list{}._attr_name')

            values = f.getAttribute('_attr_list{}._attr_value')

 

            groups = defaultdict(dict)

            for name, val in zip(names, values):

                for col in cols:

                    if col in name:

                        gs = name.replace(col, '')

                        gl = gs.split(' ')

                        groups tuple(gl)]Âcol] = val

 

            for group, attrs in groups.items():

                for col_name, val in zip(groups_headers, group):

                    f.setAttribute(col_name, val)

                for col_name, val in attrs.items():

                    f.setAttribute(col_name, val)

                self.pyoutput(f)

 

MicrosoftTeams-image (1)


Hi @bi​ ,

here is a workflow I created for you. I also try the AttributeExploder, maybe it help you:

 

Screenshot_1Result:

Screenshot_2

Thanks for this, it does struggle at the quantity of data I have unfortunately. I really appreciate the response though


Hi @bi​ , interesting question! I was only able to solve it using the python called, but my solution might help someone else make an FME native solution!

 

*Python Caller*

 

import fme

import fmeobjects

 

from collections import defaultdict

 

cols = (

    'Tree Branches',

    'Tree Height',

    'Tree Leaves',

    'Tree Width'

)

 

groups_headers = (

    'Tree Species',

)

 

class FeatureProcessor(object):

    def __init__(self):

        self.features_list = list()

    def input(self,feature):

        self.features_list.append(feature)

    def close(self):

        for f in self.features_list:

            names = f.getAttribute('_attr_list{}._attr_name')

            values = f.getAttribute('_attr_list{}._attr_value')

 

            groups = defaultdict(dict)

            for name, val in zip(names, values):

                for col in cols:

                    if col in name:

                        gs = name.replace(col, '')

                        gl = gs.split(' ')

                        groups tuple(gl)]Âcol] = val

 

            for group, attrs in groups.items():

                for col_name, val in zip(groups_headers, group):

                    f.setAttribute(col_name, val)

                for col_name, val in attrs.items():

                    f.setAttribute(col_name, val)

                self.pyoutput(f)

 

MicrosoftTeams-image (1)

This is fantastic. Thank you​, Comelio!


Reply