Reformat data columns in Excel spreadsheet

Question

I have been given a huge spreadsheet in which the same set of columns is repeated many times, each repeat carrying a prefix that indicates some property of the data. I now want to reformat the data so that the repeated columns collapse into one set of unique columns and the prefixes that were used to separate the columns become attribute values in the data instead. Please see the simplified attachment (and the screenshots below), which shows the "Current format" and "Desired format" sheets.

Current: [screenshot: Current data]

Desired: [screenshot: Desired data]

I've tried several things, none of which gives me the results I want:

Adding an AttributeCreator to create a separate, duplicated pipeline for each of the types (Apple, Banana, Orange, etc.) and then a BulkAttributeRenamer on each to remove the column prefix based on the attribute I created. In theory, this should create a single "Tree Height", etc. column per pipeline, which can then be merged back together with the remaining attributes discarded. In practice it doesn't work, because the columns are not renamed correctly using the new attribute.
Using an AttributeExploder to write the attribute names to attributes, then using an AttributeCreator to create a separate, duplicated pipeline for each of the types (Apple, Banana, Orange, etc.). I then merged these back together into a StringReplacer to remove the "type" from the attributes and used an AttributeKeeper to keep only the cleaned attributes, e.g. "Tree Height", etc. This takes too long to run and quickly reaches tens of millions of rows. It is also quite complicated to recreate the table in a sensible structure at the end.
Adding an AttributeCreator to create a separate, duplicated pipeline for each of the types (Apple, Banana, Orange, etc.) and then using a BulkAttributeRemover to keep only the columns that contain the value of the attribute I created. Unfortunately, the regular expression does not allow an attribute value to be used in the search criteria, so this would require hard-coding every Apple, Banana, Orange, etc.
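For comparison, the limitation in the third attempt does not apply in plain Python, where a pattern can be built from a runtime value. The following is only a minimal sketch, assuming headers of the form "<prefix> <column name>" (e.g. "Apple Tree Height"); the function name `columns_for_prefix` and the sample headers, including "ID", are made up for illustration.

```python
import re

def columns_for_prefix(headers, prefix):
    """Return the headers that start with `prefix` followed by whitespace."""
    # re.escape guards against prefixes that contain regex metacharacters.
    pattern = re.compile(r"^" + re.escape(prefix) + r"\s+")
    return [h for h in headers if pattern.match(h)]

# Hypothetical sample headers, assumed layout "<prefix> <column name>".
headers = ["ID", "Apple Tree Height", "Apple Tree Width", "Banana Tree Height"]
apple_cols = columns_for_prefix(headers, "Apple")
```

Because the prefix is passed in as a value, the same function can be called once per entry of a prefix table instead of hard-coding one pattern per type.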

Ideally, I'd like to be able to read in a table of all of the prefixes (e.g. "Apple", "Banana", "Orange" in this case) and reformat the table automatically by using their values. Whatever the solution, it needs to be relatively scalable because the spreadsheet is huge! Any help gratefully received, cheers
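The prefix-driven reshape described here might look roughly like the sketch below. It is written in plain Python on dictionaries rather than against the FME API, and it assumes that each header is "<prefix> <column name>" with a single space; `reshape_row`, the "Type" output column and the "ID" column are hypothetical names chosen for the example.

```python
def reshape_row(row, prefixes, type_column="Type"):
    """Split one wide row (a dict) into one narrow row per prefix."""
    # Columns that carry no known prefix are copied into every output row.
    shared = {k: v for k, v in row.items()
              if not any(k.startswith(p + " ") for p in prefixes)}
    out = []
    for prefix in prefixes:
        start = prefix + " "
        # Strip the prefix from the header to get the unique column name.
        values = {k[len(start):]: v for k, v in row.items()
                  if k.startswith(start)}
        if values:  # skip prefixes that have no columns in this row
            out.append({**shared, type_column: prefix, **values})
    return out

# Hypothetical sample row; the prefix list would come from the prefix table.
wide = {"ID": 1, "Apple Tree Height": 3.0, "Apple Tree Width": 1.5,
        "Banana Tree Height": 4.2, "Banana Tree Width": 2.0}
narrow = reshape_row(wide, ["Apple", "Banana", "Orange"])
```

Each input row produces at most one output row per prefix, so the row count grows with the number of prefixes rather than with the number of cells, which is what makes the exploder-based attempt so expensive.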

Best answer by comelio 7 October 2021, 18:53

comelio · Accepted Answer

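Hi @bi, interesting question! I was only able to solve it using the PythonCaller, but my solution might help someone else build an FME-native solution!

Python Caller:

    import fme
    import fmeobjects
    from collections import defaultdict

    # Whitespace inside the string literals was lost when the post was
    # flattened; the column names assume the "Tree Height" style used in
    # the question.
    cols = ('Tree Branches', 'Tree Height', 'Tree Leaves', 'Tree Width')
    # Name(s) of the new attribute(s) that receive the prefix value(s).
    groups_headers = ('TreeSpecies',)

    class FeatureProcessor(object):
        def __init__(self):
            self.features_list = list()

        def input(self, feature):
            # Buffer every incoming feature until close() is called.
            self.features_list.append(feature)

        def close(self):
            for f in self.features_list:
                names = f.getAttribute('_attr_list{}._attr_name')
                values = f.getAttribute('_attr_list{}._attr_value')
                # Map each prefix (as a tuple) to its {column: value} dict.
                groups = defaultdict(dict)
                for name, val in zip(names, values):
                    for col in cols:
                        if col in name:
                            # What remains after removing the column name
                            # is the prefix.
                            gs = name.replace(col, '')
                            # Separator assumed to be a single space; the
                            # flattened post showed an empty string, which
                            # str.split() rejects.
                            gl = gs.split(' ')
                            groups[tuple(gl)][col] = val
                # Output the feature once per prefix group.
                for group, attrs in groups.items():
                    for col_name, val in zip(groups_headers, group):
                        f.setAttribute(col_name, val)
                    for col_name, val in attrs.items():
                        f.setAttribute(col_name, val)
                    self.pyoutput(f)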
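The grouping step in the answer above can be tried outside FME. The sketch below is a simplified stand-in, not the answer's exact code: it assumes column names with spaces ("Tree Height" etc.), uses the stripped prefix string as the key instead of a tuple, and `group_by_prefix` and the sample data are invented for the illustration.

```python
from collections import defaultdict

# Assumed base column names, following the "Tree Height" style of the question.
cols = ("Tree Branches", "Tree Height", "Tree Leaves", "Tree Width")

def group_by_prefix(names, values):
    """Group attribute values by the prefix left over once the base
    column name is removed from the full attribute name."""
    groups = defaultdict(dict)
    for name, val in zip(names, values):
        for col in cols:
            if col in name:
                prefix = name.replace(col, "").strip()
                groups[prefix][col] = val
    return dict(groups)

# Hypothetical exploded attribute names and values for one feature.
names = ["Apple Tree Height", "Apple Tree Width", "Banana Tree Height"]
values = [3.0, 1.5, 4.2]
grouped = group_by_prefix(names, values)
```

Each key of `grouped` corresponds to one output feature in the PythonCaller, with the inner dictionary holding the de-prefixed columns for that group.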

connecter · Answer

Hi @bi, here is a workflow I created for you. I also tried the AttributeExploder; maybe it helps you. Result: [screenshot]
