I'm having a hard time even coming up with a title for the problem I'm seeking help on, but I'll summarize and provide as much context as I can.

We are trying to parse publicly available medical data. This data is provided as a concatenation of multiple substrings (e.g. 100 or more fields) in one long string. An accompanying document describes the start and end position of each substring that defines the value of a field. I was able to mock up a quick workflow to dynamically supply a single-row Excel or CSV file with this data (see excerpt below that I used as a proof of concept) and then feed in the long string text data file:
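In Python terms, the layout is roughly this (the field names and positions here are made up for illustration):

schema = [
    ("PROVIDER_ID", 1, 8),   # documented as positions 1-8 (1-based, inclusive)
    ("STATE_CODE", 9, 10),   # documented as positions 9-10
]

line = "12345678TX"  # one long fixed-width record, truncated for illustration

for name, start, end in schema:
    # the 1-based inclusive positions map to a zero-based Python slice
    print(name, "=", line[start - 1:end])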

The workspace reads the above and uses an AttributeExploder to "transpose" this table. I then clean out unwanted features and do some calculations to create appropriate StartIndex and EndIndex values (which I use later to extract the substrings). After reading each line in the text file, I then use a FeatureMerger to stitch this together with the above:
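In plain Python terms, that explode-and-calculate step does roughly this (the schema row values are made up):

row = {"PROVIDER_ID": "1-8", "STATE_CODE": "9-10", "ZIP": "11-15"}  # one made-up schema row

records = []
for attr_name, positions in row.items():
    start_s, _, end_s = positions.partition("-")
    end_s = end_s or start_s  # single-position fields like "16" have no dash
    records.append((attr_name, int(start_s), int(end_s)))

print(records)  # [('PROVIDER_ID', 1, 8), ('STATE_CODE', 9, 10), ('ZIP', 11, 15)]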

With some formulas in the last AttributeManager in the above, I then extract the needed substrings. This gets me very close to what I need, but not exactly. Since these datasets can have hundreds of fields, my goal is to make this as generic as possible (I don't want to define the exact attribute names in an AttributeManager). The user would simply need to get the "substring mapping" data from the supporting documentation, enter it into Excel (we'll call this the "Schema Table"), and use a workspace to reformat the data. The problem is that I'm now ending up with data where some of the rows actually define the fields (e.g. the 6 highlighted features are really one feature with 6 attributes whose names are the values of "_attr_name"). Notice how these repeat every 6 features in this example:

I am looking for help with restructuring this so I can get the data formatted dynamically, regardless of the number of attributes, etc. Any help would be greatly appreciated. I suspect this cannot be done directly with transformers (I hope that is not the case!), so if Python is needed to solve this, I'm open to that suggestion as well. It's almost like I need something to organize the features in groups of "X" (6 in this example) based on the number of attributes in the Schema Table, then transpose each group (see the sketch below). Thanks.
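For what it's worth, the "groups of X, then transpose" idea would look something like this in Python (the pairs and X are made up to keep it short):

pairs = [("f1", "a"), ("f2", "b"), ("f3", "c"),
         ("f1", "d"), ("f2", "e"), ("f3", "f")]  # exploded (name, value) rows
X = 3  # attributes per logical feature (6 in my real example)

features = [dict(pairs[i:i + X]) for i in range(0, len(pairs), X)]
print(features)  # [{'f1': 'a', 'f2': 'b', 'f3': 'c'}, {'f1': 'd', 'f2': 'e', 'f3': 'f'}]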

Are you able to provide an example of the text string data and the accompanying document?


Sure @ebygomm, I'm only attaching one page of the long document (just to keep focus on the issue at hand). I used just the first six fields from this page for my example above. Also attaching my "schema" file so you can see the relation to the document. Thanks.



So this takes your input data and the schema, creates the attributes that are listed in the schema Excel, and then writes out an Excel file using that schema.

dynamic_parse.fmwt


I think you got it! I have a hard time using dynamic schemas...I was experimenting along a similar route this morning, but it was going to be long-winded. I'll study your solution and post back shortly. Kudos and thanks!


One of the key things I learned from your workspace is that you can create an attribute based on the value of an attribute; that was a crucial step that made the solution compact. In my previous experiment, I was fanning out to CSV by Line Number, then re-reading, transposing and writing back out to a single file, but needless to say, it was inelegant and convoluted 🙂 In my dynamic writing experiments, I now notice I was failing to type an input for "Schema Definition Name". Will continue experimenting and hopefully it'll sink in eventually. Big thanks again for your help.


You could replace the ListExploder, AttributeCreator and Aggregator with a PythonCaller, which will give better performance, but if I were going to start using Python I'd probably approach it in a different way.

e.g.

import fme
import fmeobjects


def processFeature(feature):
    # Pair each field name with its position string (e.g. "1-8" or "16")
    # taken from the exploded schema list attributes.
    mylist = list(zip(feature.getAttribute('_attr_list{}._attr_name'),
                      feature.getAttribute('_attr_list{}._attr_value')))
    for name, pos in mylist:
        if '-' in pos:
            split_list = pos.split('-')
            start_pos = int(split_list[0])
            end_pos = int(split_list[1])
        else:
            start_pos = int(pos)
            end_pos = int(pos)
        # Schema positions are 1-based and inclusive; Python slices are
        # zero-based with an exclusive end, so only the start shifts down by 1.
        start_idx = start_pos - 1
        end_idx = end_pos
        value = feature.getAttribute('text_line_data')[start_idx:end_idx]
        feature.setAttribute(name, value)


Just noticed that the parsing is slightly incorrect in your example. I'll post an updated workspace later today. Basically, because the substring function uses a zero-based index, you have to account for that, as it won't match up with the input in the schema file. So when it shows position 1-8 for the first field, for example, the StartIndex should be 0 and the EndIndex should be 7. Everything is shifted down by 1.
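To spell it out with that first field (the record string is made up):

pos_start, pos_end = 1, 8                            # positions as documented in the schema
start_index, end_index = pos_start - 1, pos_end - 1  # 0 and 7 for FME's inclusive substring
line = "ABCDEFGHIJ"                                  # made-up record
print(line[pos_start - 1:pos_end])                   # Python slice equivalent [0:8] -> ABCDEFGH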

Also thanks to your post, I am now noticing a nuanced difference between the AttributeCreator and the AttributeManager (at least up to 2019.1). I tend to use the latter almost exclusively and this option is NOT there! No wonder I missed out...why is that @daleatsafe? :)



This is how I've done something similar previously, going down the Python route. The PythonCaller builds the schema and divides up the text from a CSV file structured as per your table.

dynamic_parse_python.fmwt
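In outline, the approach does something like the sketch below; the column names "Field" and "Position" here are stand-ins for whatever your schema CSV uses, and the real logic is in the attached workspace.

import csv

def load_schema(path):
    # Read the schema table into (name, start, end) tuples (1-based, inclusive).
    schema = []
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            start_s, _, end_s = row['Position'].partition('-')
            schema.append((row['Field'], int(start_s), int(end_s or start_s)))
    return schema

def parse_line(line, schema):
    # Slice one fixed-width record into a {field: value} dict,
    # shifting the 1-based start down by 1 for Python's zero-based slicing.
    return {name: line[start - 1:end] for name, start, end in schema}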



Nice, thanks a lot for sharing. Will definitely study this...I'm not fluent in Python at all, but if performance becomes an issue, I'll have to implement something like this and seek help from other internal resources for Python expertise.


Posting my solution based on the great feedback by @ebygomm. There are some variations in how I parsed the Start & End substring indices (I did not use regex in my case), but on the whole it's the same concept. The biggest lesson that contributed to the solution was that the AttributeCreator and the AttributeManager are NOT the same in terms of features! And obviously using a dynamic writer is the cherry on top. Thanks a lot for all the help.
