Question

Universal Reader (FME 2015) vs. csv module (Python 2.7) to read & process data stored in big *.csv files.


Badge

Dear community,

I have a hypothesis which states that using the csv module and Python 2.7 to perform a simple processing task using the data stored into a big *.csv is going to have a significant positive impact in the time needed to perform the task in contrast to using  the FME tools to perform the same task. I have assigned both software the following simple task into a "startup python script" in an empty workspace with one published parameter ('OSG_SDTF') which is used to acquire the path of the *.csv to be used in the task.

(A) Read a big *.csv file (i.e. 17,977,400 records with the values found in 19 columns which are stored within a single 2 GB comma-delimited *.csv file).

(B) Return a set (i.e. not duplicates are allowed) of the values found within one column of the big *.csv ('col0').

(C) Print the result into the 'Translation Log' in the FME workbench.

The code that I used is following:

### USING FME TOOLS ###
import fmeobjects, fme

def param_value(parameter_name):
    return fme.macroValues

def reader_params():
    return fmeobjects.FMEDialog().sourcePrompt('','',[])[2]

def universal_reader(reader_name,parameters):
    return fmeobjects.FMEUniversalReader(reader_name,False,parameters)

def csv_value(csv_reader,csv_parameters,csv_filepath):

    result = []
    o = csv_reader.open(csv_filepath,csv_parameters)
    feature = csv_reader.read()

    while feature != None:
        result.append(feature.getAttribute('col0'))

    return set(result)

CsvFilepath = param_value('OSG_SDTF')
CsvRParams = reader_params()
CsvReader = universal_reader('CSV',CsvRParams)
CsvValue = csv_value(CsvReader,CsvRParams,CsvFilepath)
print CsvValue

### USING PYTHON 2.7 & THE CSV MODULE ###
import fme
import csv

def param_value(parameter_name):
    return fme.macroValues[parameter_name]

def csv_reader(file_path_r):
    
    open_file = open(file_path_r,'rb')
    return csv.reader(open_file,delimiter = ',')
    
def record_ids(csv_reader,value):
    
    result = []
    
    for row in csv_reader:
        result.append(row[value])
        
    return set(result)

CsvFilepath = param_value('OSG_SDTF')
CsvReader = csv_reader(CsvFilepath)
RecordIds = record_ids(CsvReader,0)
print RecordIds

Results:

Using the Python 2.7 & the csv module prints a result in 1 minute and 19 seconds, and using the FME tools the startup script kept going for 19 minutes when I just stopped it because I had proved my hypothesis on the efficiency of the csv module. NOTE: I also tried repeating the process in the FME workspace using an AttributeFilter and using the same *.csv and the result was rather disappointing because a result hadn't been produced after leaving the machine running overnight and the temporary folder was full with 26 GB of data!

Insights?

I use FME in every day tasks to support data quality, data consolidation, and data processing jobs for different formats however efficiency is also important to what I do. Is there a bench-marking team in Safe that could provide some insights on the efficiency of different readers so I could look for open-source alternatives to support data cleaning before importing the data into an FME workspace?


12 replies

Badge +16

It should come as no surprise that CSV handling with Python will beat FME, but it's like tests between programming languages for which is faster at a task, the tests become contrived when very limited.

Badge

It should come as no surprise that CSV handling with Python will beat FME, but it's like tests between programming languages for which is faster at a task, the tests become contrived when very limited.

Hi bruceharold, Thanks for your reply. I am not familiar with any reasons which justify the csv module being so much more efficient than the FME csv reader. I would appreciate your insights to any reasons since I am keen acquiring an understanding of what I do and not just scratching the surface. I suppose that a significant percentage of people who use FME also use *.csv files to read and process data; although limited in scope the *.csv reader is of high importance because of that. That is why I am asking for insights from safe so I can use a combination of open source & FME to support my organisation's work.

Userlevel 2
Badge +17

This block may cause an infinite loop...

    while feature != None:
        result.append(feature.getAttribute('col0'))
Badge

This block may cause an infinite loop...

    while feature != None:
        result.append(feature.getAttribute('col0'))

Hi takashi,
Thanks for your contribution to the code.
How would you rephrase that snippet in a more pythonic syntax?

Userlevel 4
Badge +13

A fairer test would be to run the Python CSV module against a workspace that does what you want. In general, we strongly discourage using FME Objects instead of just workspaces.

If you did that, you could use the CSV reader, turn on the SORT option in its settings, and have it sort the fields by whatever the column was you wanted to remove duplicates by. Then, in FME 2016, route the results of that read into a DuplicateRemover and indicate that your input is ordered. That will perform drastically better.

I think the lesson is to work with FME's strengths as you approach problem solving with it. Watch out for things that block up features -- that will cause performance slowdowns. And try to thin out and remove data from the stream as early as possible.

Now, having said all that, I am still sure that the python raw CSV reading will beat us. For now. We are working on some revolutionary technology which will help us with this nasty-bad-boy-mega-CSV files and I so look forward to unveiling that. On a stage. With FME under a black tablecloth which I'll pull off with a flourish. Wearing a mock-turtleneck.

But that will have to wait for a while yet...

Userlevel 2
Badge +17

Hi takashi,
Thanks for your contribution to the code.
How would you rephrase that snippet in a more pythonic syntax?

The "read" method returns None when the reader has finished reading every feature from the source dataset. You have to read the next feature repeatedly, while the "read" method returns an instance of feature.

e.g. replace #17-20 lines of the original script with

    feature = csv_reader.read() # Read the first feature

    while feature != None:
        result.append(feature.getAttribute('col0'))
        feature = csv_reader.read() # Read the next feature

or

    while True:
        feature = csv_reader.read()
        if feature == None:
            break
        result.append(feature.getAttribute('col0'))
Userlevel 4

I've posted this before, but you could consider a middle way using a PythonCreator to read the CSV while still using a regular workbench:

import fmeobjects
import csv
class FeatureCreator(object):
    def __init__(self):
        self.inputfilename = FME_MacroValues['INPUT_CSV_FILE']
        self.csvdelimiter = ',' # Modify as needed
        self.csvquotechar = '"' # Modify as needed
        self.log = fmeobjects.FMELogFile()
        self.fieldnames = []
    def close(self):
        with open(self.inputfilename, 'rb') as csvfile:
            csvreader = csv.reader(csvfile, \
                                   delimiter=self.csvdelimiter, \
                                   quotechar=self.csvquotechar)
            for n, row in enumerate(csvreader):
                if n == 0:
                    self.fieldnames = row
                    self.log.logMessageString("CSV field names to expose in the PythonCreator:", fmeobjects.FME_WARN)
                    for field in row:
                        self.log.logMessageString(" "+field, fmeobjects.FME_WARN)
                else:
                    feature = fmeobjects.FMEFeature()
                    for m, value in enumerate(row):
                        feature.setAttribute(self.fieldnames[m], value)
                    self.pyoutput(feature)

This should get you the best of both worlds.

David

Badge

The "read" method returns None when the reader has finished reading every feature from the source dataset. You have to read the next feature repeatedly, while the "read" method returns an instance of feature.

e.g. replace #17-20 lines of the original script with

    feature = csv_reader.read() # Read the first feature

    while feature != None:
        result.append(feature.getAttribute('col0'))
        feature = csv_reader.read() # Read the next feature

or

    while True:
        feature = csv_reader.read()
        if feature == None:
            break
        result.append(feature.getAttribute('col0'))

Hi takashi,

Thanks for this! I think I will try the second option that you are kindly providing.

Badge

I've posted this before, but you could consider a middle way using a PythonCreator to read the CSV while still using a regular workbench:

import fmeobjects
import csv
class FeatureCreator(object):
    def __init__(self):
        self.inputfilename = FME_MacroValues['INPUT_CSV_FILE']
        self.csvdelimiter = ',' # Modify as needed
        self.csvquotechar = '"' # Modify as needed
        self.log = fmeobjects.FMELogFile()
        self.fieldnames = []
    def close(self):
        with open(self.inputfilename, 'rb') as csvfile:
            csvreader = csv.reader(csvfile, \
                                   delimiter=self.csvdelimiter, \
                                   quotechar=self.csvquotechar)
            for n, row in enumerate(csvreader):
                if n == 0:
                    self.fieldnames = row
                    self.log.logMessageString("CSV field names to expose in the PythonCreator:", fmeobjects.FME_WARN)
                    for field in row:
                        self.log.logMessageString(" "+field, fmeobjects.FME_WARN)
                else:
                    feature = fmeobjects.FMEFeature()
                    for m, value in enumerate(row):
                        feature.setAttribute(self.fieldnames[m], value)
                    self.pyoutput(feature)

This should get you the best of both worlds.

David

Cheers David,

I will certainly give this a try!

Userlevel 2
Badge +17

I've posted this before, but you could consider a middle way using a PythonCreator to read the CSV while still using a regular workbench:

import fmeobjects
import csv
class FeatureCreator(object):
    def __init__(self):
        self.inputfilename = FME_MacroValues['INPUT_CSV_FILE']
        self.csvdelimiter = ',' # Modify as needed
        self.csvquotechar = '"' # Modify as needed
        self.log = fmeobjects.FMELogFile()
        self.fieldnames = []
    def close(self):
        with open(self.inputfilename, 'rb') as csvfile:
            csvreader = csv.reader(csvfile, \
                                   delimiter=self.csvdelimiter, \
                                   quotechar=self.csvquotechar)
            for n, row in enumerate(csvreader):
                if n == 0:
                    self.fieldnames = row
                    self.log.logMessageString("CSV field names to expose in the PythonCreator:", fmeobjects.FME_WARN)
                    for field in row:
                        self.log.logMessageString(" "+field, fmeobjects.FME_WARN)
                else:
                    feature = fmeobjects.FMEFeature()
                    for m, value in enumerate(row):
                        feature.setAttribute(self.fieldnames[m], value)
                    self.pyoutput(feature)

This should get you the best of both worlds.

David

Hi @david_r, thanks for sharing the elegant script. Please check the indentation of line#14 -. The code block usage in this site is troublesome ;)

Badge

A fairer test would be to run the Python CSV module against a workspace that does what you want. In general, we strongly discourage using FME Objects instead of just workspaces.

If you did that, you could use the CSV reader, turn on the SORT option in its settings, and have it sort the fields by whatever the column was you wanted to remove duplicates by. Then, in FME 2016, route the results of that read into a DuplicateRemover and indicate that your input is ordered. That will perform drastically better.

I think the lesson is to work with FME's strengths as you approach problem solving with it. Watch out for things that block up features -- that will cause performance slowdowns. And try to thin out and remove data from the stream as early as possible.

Now, having said all that, I am still sure that the python raw CSV reading will beat us. For now. We are working on some revolutionary technology which will help us with this nasty-bad-boy-mega-CSV files and I so look forward to unveiling that. On a stage. With FME under a black tablecloth which I'll pull off with a flourish. Wearing a mock-turtleneck.

But that will have to wait for a while yet...

Hi Dale, Thanks for your comprehensive comment and suggestions; I hadn't used the "sort" tick box in the CSV reader before and I believe this will make a difference for my work in the future.

I agree that the test is **not** optimum or even scientifically sound (I am stressing this because I am using the work "significant" in my first post). This test is founded on making a **preliminary** comparison between a Python method that I know a lot of people use to process *.csv files and using FME to perform the same task (I think that you will agree with me that FME has multiple ways of achieving the same thing). That is why I posted this in the first instance; to gain insights on this preliminary observation and thank you for providing some.

I anticipate that your new CSV reader will be very efficient so I am looking forward to that day!

Note: Dale's suggested workspace in FME 2015 produces a result in 37 minutes 40.2 seconds.

Userlevel 4

Hello,

I have to connect to a WebSocket API (never even knew that excisted) which has an endpoint.

wss://<IP>/api/v2/data/main/metrics

Then I can send

{
"Reference": "Get_reference",
"Command": {
"Get": {
"Select": "Identifier",
"Filter": "parameter"
}
}
}

And I should get a JSON back

{
  "Reference": "Get_reference",
  "Data": [
    {
      // ... and so on
   }
}

I tried with WebSocketReceiver to connect to the API, I got the handshake and it keeps running. 

Then I created a workspace with a WebSocketSender to send the JSON but it only sends and I get no output. 

Is this even possible or should I use a PythonCaller?

 

TIA

 Edit: I'm testing this in FME Workbench - is this someting the FME server should do? I found this: http://docs.safe.com/fme/2018.1/html/FME_Server_Documentation/Content/ReferenceManual/WebSocket-Server.htm

Reply