
Last week a colleague asked me to process a text-based dataset that reminded me of something I had seen in the past for Fortran-based datasets.

 

In particular, the text-based dataset looks somewhat like a CSV dataset, except that the columns are defined by a fixed width instead of a delimiter. See below for an example of how it looked:

[image: example of the fixed-width dataset]

I was wondering whether there is already some type of reader that can deal with this kind of data.

 

Of course, if you get such a dedicated dataset, you can use a text_line reader, cut/split each text line into separate fields based on the known widths (e.g. using a '#s#s#s' format string in an AttributeSplitter; see the third example on its documentation page), and then use e.g. an AttributeTrimmer to remove excess whitespace.
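Outside of FME, the same split-then-trim idea can be sketched in a few lines of Python. This is only an illustration of the approach, not what the transformers do internally; the function name `split_fixed_width`, the sample line, and the widths 10, 6 and 8 are all made up for the example.

```python
def split_fixed_width(line, widths):
    """Cut a text line into fields of the given widths and trim each field."""
    fields = []
    start = 0
    for width in widths:
        # Slice out one column, then strip the padding whitespace.
        fields.append(line[start:start + width].strip())
        start += width
    return fields


# Hypothetical sample line, padded to assumed column widths 10, 6 and 8.
sample = "Alice".ljust(10) + "42".ljust(6) + "Delft".ljust(8)
print(split_fixed_width(sample, [10, 6, 8]))
```

As with the AttributeSplitter approach, the widths have to be known up front and supplied per dataset.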

 

However, this method needs to be configured for each individual dataset, so I was wondering whether there is already a tailor-made reader that can do this (by first automatically detecting the column widths).

 

Out of curiosity I created a workspace myself that can deal with files like these a bit more dynamically, using the header line to detect the widths of the columns. It's not that clean, though, and it requires manually exposing the attributes at the end. It also assumes that the header names don't contain (white)space characters, and in this case it requires manually removing the line that separates the header from the data (such a line was not present in the dataset I encountered earlier, so I treated this as a manual step). Nevertheless, see the attached workspace.
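For readers without the workspace at hand, the header-based detection can be sketched in Python roughly as follows. This is not the workspace itself, just the same idea under the same assumptions: header names contain no whitespace, every column starts where its header name starts, and any header/data separator line has already been removed. The function names and sample data are hypothetical.

```python
import re


def detect_column_spans(header):
    """Return (start, end) offsets per column, taken from where each header name starts."""
    starts = [m.start() for m in re.finditer(r"\S+", header)]
    # Each column runs up to the start of the next one; the last runs to end of line.
    ends = starts[1:] + [None]
    return list(zip(starts, ends))


def parse_fixed_width(lines):
    """Parse a header line plus data lines into a list of dicts keyed by header name."""
    header, *rows = lines
    spans = detect_column_spans(header)
    names = [header[start:end].strip() for start, end in spans]
    return [
        {name: row[start:end].strip() for name, (start, end) in zip(names, spans)}
        for row in rows
    ]


# Hypothetical data: three left-aligned columns of width 18, 5 and "the rest".
lines = [
    "name".ljust(18) + "age".ljust(5) + "city",
    "Daniel Radcliffe".ljust(18) + "34".ljust(5) + "London",
]
print(parse_fixed_width(lines))
```

Because the split is purely positional, a value that contains spaces stays intact as long as it fits inside its column.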

 

 

One alternative solution is to use a StringReplacer to replace multiple occurrences of a whitespace character with a new character, which you can then use as the separator character for the AttributeSplitter.
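As a rough Python equivalent of that StringReplacer plus AttributeSplitter combination (a sketch only; the function name, the `|` separator, and the sample line are assumptions for illustration):

```python
import re


def split_on_whitespace_runs(line, sep="|"):
    """Replace every run of two or more whitespace characters with sep, then split on sep."""
    replaced = re.sub(r"\s{2,}", sep, line.strip())
    return replaced.split(sep)


# Hypothetical sample line; single spaces inside a value are left alone.
sample = "Daniel Radcliffe   34     London"
print(split_on_whitespace_runs(sample))
```

This only works if no value contains two or more consecutive whitespace characters, no column is empty, and the separator character does not occur in the data.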


The Column Aligned Text (CAT) reader will read fixed widths, but requires input to specify the widths. I don't think you can set the width dynamically.



Hi @Hans van der Maarel​ ,

 

Thanks for your input. My main question was to find out if there is already a reader that can deal with such datasets, and I see that @ebygomm​ just responded below that the Column Aligned Text (CAT) reader is probably what I am looking for.

 

I understand that there are many alternative approaches, and it's nice to hear about your alternative solution. However, it would introduce the assumption that the data in a column never contains several consecutive whitespace characters (e.g. 'Daniel RadCliffe'), which probably won't happen, but yeah... you know what they say about 'when you assume ...' 😉

Therefore I personally prefer to impose as few additional assumptions on the data as possible. The main known fact here is that the columns are defined by a fixed width, so I think it is most robust to split the data on a fixed width.



Thanks, that's exactly what I was looking for!

 

I vaguely remember having heard of that reader/format before, but too vaguely to recognize it as an option here 🙂.

Too bad that the CAT reader requires user input to specify the widths and can't determine them automatically. That said, from an architectural point of view I can understand such a choice.

