Solved

Shape File Reader encoding Numeric fields as String.

  • 11 August 2017
  • 9 replies
  • 5 views

Hello,

I have some data in Shape File format, and when it is read into FME the field type changes from Double to an encoded attribute. See the screen shots below from both ArcCatalog and FME's reader feature type, and the printout from the log file below that. My end goal is to use the "getAttributeType()" method to perform some attribute renaming based on the attribute type. Is there something I'm missing with the Shape File reader or the Shape File specification in FME? If I import the data into a File Geodatabase it works fine, and FME maintains the field types.

Attribute(encoded: fme-system): `Length__fe' has value `10'

Attribute(encoded: fme-system): `Width__fee' has value `3

FME Build information below:

FME Database Edition

FME 2017.1.0.0 Build 17488 - WIN32.

Any help or information as to why FME does this would be very helpful!


Best answer by takashi 13 August 2017, 06:44


9 replies

Badge +6

Hello @j87foster,

At first, my thought was that it is merely defaulting from 'double' to 'number' because the precision/scale is too large. For example, if FME defined a double data type with precision 2 and width 3, and the features coming out of the shapefile had values with precision 6 and width 10, it may fall back to the number type to accommodate the shapefile's values.

 

 

With that in mind though, would you mind sharing the shapefile you are seeing this behavior with? If you don't feel comfortable posting it here, you can submit a case at https://www.safe.com/support/

 

 

Sharing the data will allow us to take a closer look to see exactly what is occurring.
Userlevel 2
Badge +17

Hi @j87foster, I don't know why FME interprets the "number" type (defined in the source dataset) as the "fme_system" type (an internal data type), but I don't think it causes any problem.

The internal data types shown in FME Data Inspector (or the log) often do not match the data types defined in the source dataset, as you observed. However, it won't cause any problem, since the internal data types can be changed automatically (implicitly) as needed while translating, e.g. from numeric to character string and vice versa. That is, the internal data type of an attribute may vary depending on the situation.

FME reads and preserves the schema of the source dataset separately from the features, and the source schema will be used to configure the destination schema if you use the Dynamic Schema option. Individual features don't carry the information about the data types defined in the source dataset.

I guess "getAttributeType()" is a method from the Python FME Objects API. The method returns the data type of the specified attribute, but that is not the data type defined in the source dataset; it just indicates the internal data type at the time the method is called. If you are going to do something with the "getAttributeType()" method, you will have to understand that the internal data type returned from the method could be different from the data type defined in the source dataset.
@trentatsafe here's the data: bridge.zip

 

It seems there's no consistency in how FME encodes fields from different file formats. Sometimes I'd like to populate an attribute with a certain value depending on the field type, so it would be nice if FME had a consistent rule for all the formats it reads.

 

 

Userlevel 2
Badge +17

If you need to rename some attributes based on the data types defined in the source dataset, you will have to refer to the schema feature in the current FME.

The following is an example that adds a common prefix to every numeric-type attribute name, according to the Shapefile native data types.

Note: According to the native format specification, Shapefile only has the "number(w,p)" type for numeric data fields. I don't know why the current FME (2016 and 2017) Shapefile reader treats "short" and "long" as native data types of the Esri Shapefile format. See also this thread:

short datatypes understood as long with new Shapefile reader (FME2016)

 


Example:

(screenshot of the example workspace)

# PythonCaller_1: Modify Attribute Names
# e.g. add a common prefix "n_" to every numeric-type attribute.
# And create a global variable storing comma-separated [newName,srcName]+,
# which will be used as a parameter of the @RenameAttributes function.
# Assume that the source format is Esri Shapefile.
def modifySchema(feature):
    names = feature.getAttribute('attribute{}.name')
    types = feature.getAttribute('attribute{}.native_data_type')
    renames = []
    for i, (srcName, type) in enumerate(zip(names, types)):
        if type.startswith('number') or type in ['short', 'long']:
            newName = 'n_%s' % srcName
            # Overwrite the name of the i-th schema attribute with the new name.
            feature.setAttribute('attribute{%d}.name' % i, newName)
            renames += [newName, srcName]
    global g_renameParam
    g_renameParam = ','.join(renames)

# PythonCaller_2
def renameAttributes(feature):
    if g_renameParam:
        feature.performFunction('@RenameAttributes(%s)' % g_renameParam)

Userlevel 3
Badge +13

To answer this question, I need to explain some FME history. In the very beginning, our goal was to completely hide the internal representation of data from end users. We would look after any conversions that were ever going to be needed and do them automatically. And our original FME implementation was that we would handle any attribute we read as a string internally. This provided a very easy and consistent way of working. And all the early formats of FME were written accordingly -- they all created internal FME representations as strings. Shape is one of those, and this makes good sense because all attribute data in a shape file is actually stored as a string ultimately.

 

 

With time, we expanded the power of FME, and for efficiency's sake we added the ability to store attributes in the native type we read them in. So if a format could give us a number as a float or double or int, we'd just store that, and by so doing avoid converting things into strings if we didn't need to.

 

 

For the most part we didn't go back and retrofit existing format code, because FME will do whatever conversions are necessary anyway and all of this is hidden from the user. (Except that at some point the Python call was made available, which is what you have stumbled upon.)

 

 

 

Userlevel 3
Badge +13
Note that we have no intention of updating our shape reader. Why? Well, in the case of Shape, the attribute data is actually all stored in the DBF file as...you guessed it...strings. Even if the DBF says it is a number. And so we don't want to take on the expense of converting those strings into numbers to pass through FME because there is an excellent chance we may never need to during the whole translation. If we go shape->shape for example, it is just going to get written out as a string ultimately anyway.
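To illustrate that point, here is a rough standalone sketch (not FME code; the file name bridge.dbf is hypothetical). In a shapefile's DBF sidecar, every field of every record is stored as fixed-width ASCII text, even when the field descriptor declares a numeric ('N') type:

# Rough sketch: peek at a shapefile's .dbf and show that a "numeric" field
# is stored on disk as fixed-width ASCII text.  'bridge.dbf' is hypothetical.
import struct

with open('bridge.dbf', 'rb') as f:
    header = f.read(32)
    header_len, record_len = struct.unpack('<HH', header[8:12])

    # Field descriptors: 32 bytes each, terminated by a 0x0D byte.
    fields = []
    while True:
        desc = f.read(32)
        if desc[0:1] == b'\r':
            break
        name = desc[:11].split(b'\x00')[0].decode('ascii')
        ftype = desc[11:12].decode('ascii')              # 'N' = numeric, 'C' = character
        length = struct.unpack('<B', desc[16:17])[0]
        fields.append((name, ftype, length))

    # First record: one deletion-flag byte, then fixed-width text values.
    f.seek(header_len)
    record = f.read(record_len)[1:]
    pos = 0
    for name, ftype, length in fields:
        raw = record[pos:pos + length]
        pos += length
        print(name, ftype, repr(raw))   # a numeric field prints as bytes like b'        10'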

 

 

So Takashi's technique of inspecting/using the actual schema feature is truly the only way to reliably know what the original format intended the attribute's type to be. The storage format in FME is not guaranteed to be the same. I apologize for the confusion the Python call may have caused, and hope we can find a way to get you the results you need without undue trouble.

 

 


Thanks a lot @takashi I really appreciate all the help!!

 

Userlevel 3
Badge +13

Okay, @takashi inspired me to try a bit of Python to see if I could craft a workaround.  (This old guy doesn't get to do much coding anymore).  Had some fun.

Based on Takashi's Python above, I used a similar technique. The FeatureReader always kicks out the schema feature first.  The first Python sniffs through the schema feature and makes us a list of attributes (and their types) that we want to coerce into being stored as per their intended types and not as strings.

    def input(self, feature):
        # Make a global variable of the attribute names that are numeric.
        # We'll use this in another PythonCaller to coerce the actual values
        # into ones stored numerically internally.
        # Assume that the source format is Esri Shapefile.
        names = feature.getAttribute('attribute{}.name')
        types = feature.getAttribute('attribute{}.native_data_type')

        # This global list will contain the names of attributes that should be
        # coerced into internal storage as numeric types.
        global g_numerics
        g_numerics = []
        for srcName, type in zip(names, types):
            if type.startswith('number') or type in ['short', 'long', 'double', 'float']:
                g_numerics += [[srcName, type]]
        self.pyoutput(feature)

Then the data features get pumped through a second PythonCaller that uses that list, retrieves each attribute that should be coerced, coerces it, and then sets it back.

    def input(self, feature):
        # Coerce the actual values stored in the feature to reflect their
        # schema types.
        #
        # Previously the g_numerics global list was set up to have
        # its elements be (name, type) pairs.  We will use these to
        # retrieve, cast, and store back the attributes.
        #
        # The Python FME API accepts these types on the setAttribute call:
        #   PyInt   ==> FME_Int32
        #   PyFloat ==> FME_Real64
        #   PyLong  ==> FME_Int64
        global g_numerics

        for (srcName, type) in g_numerics:
            value = feature.getAttribute(srcName)
            if value != None:
                if type == 'long':
                    longVal = long(value)
                    feature.setAttribute(srcName, longVal)
                elif type == 'short':
                    shortVal = int(value)
                    feature.setAttribute(srcName, shortVal)
                else:
                    # Everything else we'll treat as float
                    floatVal = float(value)
                    feature.setAttribute(srcName, floatVal)

        self.pyoutput(feature)

A workspace which logs the features before and after "the treatment" is attached also.

Thanks for the fun diversion.

Attached workspace: typefixer.fmw
