Question

Universal Reader (FME 2015) vs. csv module (Python 2.7) to read & process data stored in big *.csv files.

9 years ago
December 15, 2015
12 replies
18 views

dimmihel
17 replies

Dear community,

I have a hypothesis which states that using the csv module and Python 2.7 to perform a simple processing task using the data stored into a big *.csv is going to have a significant positive impact in the time needed to perform the task in contrast to using the FME tools to perform the same task. I have assigned both software the following simple task into a "startup python script" in an empty workspace with one published parameter ('OSG_SDTF') which is used to acquire the path of the *.csv to be used in the task.

(A) Read a big *.csv file (i.e. 17,977,400 records with the values found in 19 columns which are stored within a single 2 GB comma-delimited *.csv file).

(B) Return a set (i.e. not duplicates are allowed) of the values found within one column of the big *.csv ('col0').

The code that I used is following:

### USING FME TOOLS ###
import fmeobjects, fme

def param_value(parameter_name):
    return fme.macroValues

def reader_params():
    return fmeobjects.FMEDialog().sourcePrompt('','',[])[2]

def universal_reader(reader_name,parameters):
    return fmeobjects.FMEUniversalReader(reader_name,False,parameters)

def csv_value(csv_reader,csv_parameters,csv_filepath):

    result = []
    o = csv_reader.open(csv_filepath,csv_parameters)
    feature = csv_reader.read()

    while feature != None:
        result.append(feature.getAttribute('col0'))

    return set(result)

CsvFilepath = param_value('OSG_SDTF')
CsvRParams = reader_params()
CsvReader = universal_reader('CSV',CsvRParams)
CsvValue = csv_value(CsvReader,CsvRParams,CsvFilepath)
print CsvValue

### USING PYTHON 2.7 & THE CSV MODULE ###
import fme
import csv

def param_value(parameter_name):
    return fme.macroValues[parameter_name]

def csv_reader(file_path_r):
    
    open_file = open(file_path_r,'rb')
    return csv.reader(open_file,delimiter = ',')
    
def record_ids(csv_reader,value):
    
    result = []
    
    for row in csv_reader:
        result.append(row[value])
        
    return set(result)

CsvFilepath = param_value('OSG_SDTF')
CsvReader = csv_reader(CsvFilepath)
RecordIds = record_ids(CsvReader,0)
print RecordIds

Results:

Using the Python 2.7 & the csv module prints a result in 1 minute and 19 seconds, and using the FME tools the startup script kept going for 19 minutes when I just stopped it because I had proved my hypothesis on the efficiency of the csv module. NOTE: I also tried repeating the process in the FME workspace using an AttributeFilter and using the same *.csv and the result was rather disappointing because a result hadn't been produced after leaving the machine running overnight and the temporary folder was full with 26 GB of data!

Insights?

I use FME in every day tasks to support data quality, data consolidation, and data processing jobs for different formats however efficiency is also important to what I do. Is there a bench-marking team in Safe that could provide some insights on the efficiency of different readers so I could look for open-source alternatives to support data cleaning before importing the data into an FME workspace?

+17

bruceharold
Contributor
338 replies
9 years ago
December 15, 2015

It should come as no surprise that CSV handling with Python will beat FME, but it's like tests between programming languages for which is faster at a task, the tests become contrived when very limited.

dimmihel
Author
17 replies
9 years ago
December 15, 2015

bruceharold wrote:

Hi bruceharold, Thanks for your reply. I am not familiar with any reasons which justify the csv module being so much more efficient than the FME csv reader. I would appreciate your insights to any reasons since I am keen acquiring an understanding of what I do and not just scratching the surface. I suppose that a significant percentage of people who use FME also use *.csv files to read and process data; although limited in scope the *.csv reader is of high importance because of that. That is why I am asking for insights from safe so I can use a combination of open source & FME to support my organisation's work.

takashi
7703 replies
9 years ago
December 15, 2015

This block may cause an infinite loop...

    while feature != None:
        result.append(feature.getAttribute('col0'))

dimmihel
Author
17 replies
9 years ago
December 15, 2015

takashi wrote:

This block may cause an infinite loop...

    while feature != None:
        result.append(feature.getAttribute('col0'))

Hi takashi,
Thanks for your contribution to the code.
How would you rephrase that snippet in a more pythonic syntax?

+19

fmelizard
Safer
3725 replies
9 years ago
December 15, 2015

A fairer test would be to run the Python CSV module against a workspace that does what you want. In general, we strongly discourage using FME Objects instead of just workspaces.

If you did that, you could use the CSV reader, turn on the SORT option in its settings, and have it sort the fields by whatever the column was you wanted to remove duplicates by. Then, in FME 2016, route the results of that read into a DuplicateRemover and indicate that your input is ordered. That will perform drastically better.

I think the lesson is to work with FME's strengths as you approach problem solving with it. Watch out for things that block up features -- that will cause performance slowdowns. And try to thin out and remove data from the stream as early as possible.

Now, having said all that, I am still sure that the python raw CSV reading will beat us. For now. We are working on some revolutionary technology which will help us with this nasty-bad-boy-mega-CSV files and I so look forward to unveiling that. On a stage. With FME under a black tablecloth which I'll pull off with a flourish. Wearing a mock-turtleneck.

But that will have to wait for a while yet...

takashi
7703 replies
9 years ago
December 15, 2015

dimmihel wrote:

Hi takashi,
Thanks for your contribution to the code.
How would you rephrase that snippet in a more pythonic syntax?

The "read" method returns None when the reader has finished reading every feature from the source dataset. You have to read the next feature repeatedly, while the "read" method returns an instance of feature.

e.g. replace #17-20 lines of the original script with

    feature = csv_reader.read() # Read the first feature

    while feature != None:
        result.append(feature.getAttribute('col0'))
        feature = csv_reader.read() # Read the next feature

    while True:
        feature = csv_reader.read()
        if feature == None:
            break
        result.append(feature.getAttribute('col0'))

david_r
8354 replies
9 years ago
December 16, 2015

I've posted this before, but you could consider a middle way using a PythonCreator to read the CSV while still using a regular workbench:

import fmeobjects
import csv
class FeatureCreator(object):
    def __init__(self):
        self.inputfilename = FME_MacroValues['INPUT_CSV_FILE']
        self.csvdelimiter = ',' # Modify as needed
        self.csvquotechar = '"' # Modify as needed
        self.log = fmeobjects.FMELogFile()
        self.fieldnames = []
    def close(self):
        with open(self.inputfilename, 'rb') as csvfile:
            csvreader = csv.reader(csvfile, \
                                   delimiter=self.csvdelimiter, \
                                   quotechar=self.csvquotechar)
            for n, row in enumerate(csvreader):
                if n == 0:
                    self.fieldnames = row
                    self.log.logMessageString("CSV field names to expose in the PythonCreator:", fmeobjects.FME_WARN)
                    for field in row:
                        self.log.logMessageString(" "+field, fmeobjects.FME_WARN)
                else:
                    feature = fmeobjects.FMEFeature()
                    for m, value in enumerate(row):
                        feature.setAttribute(self.fieldnames[m], value)
                    self.pyoutput(feature)

This should get you the best of both worlds.

David

dimmihel
Author
17 replies
9 years ago
December 16, 2015

takashi wrote:

e.g. replace #17-20 lines of the original script with

    feature = csv_reader.read() # Read the first feature

    while feature != None:
        result.append(feature.getAttribute('col0'))
        feature = csv_reader.read() # Read the next feature

    while True:
        feature = csv_reader.read()
        if feature == None:
            break
        result.append(feature.getAttribute('col0'))

Hi takashi,

Thanks for this! I think I will try the second option that you are kindly providing.

dimmihel
Author
17 replies
9 years ago
December 16, 2015

david_r wrote:

I've posted this before, but you could consider a middle way using a PythonCreator to read the CSV while still using a regular workbench:

import fmeobjects
import csv
class FeatureCreator(object):
    def __init__(self):
        self.inputfilename = FME_MacroValues['INPUT_CSV_FILE']
        self.csvdelimiter = ',' # Modify as needed
        self.csvquotechar = '"' # Modify as needed
        self.log = fmeobjects.FMELogFile()
        self.fieldnames = []
    def close(self):
        with open(self.inputfilename, 'rb') as csvfile:
            csvreader = csv.reader(csvfile, \
                                   delimiter=self.csvdelimiter, \
                                   quotechar=self.csvquotechar)
            for n, row in enumerate(csvreader):
                if n == 0:
                    self.fieldnames = row
                    self.log.logMessageString("CSV field names to expose in the PythonCreator:", fmeobjects.FME_WARN)
                    for field in row:
                        self.log.logMessageString(" "+field, fmeobjects.FME_WARN)
                else:
                    feature = fmeobjects.FMEFeature()
                    for m, value in enumerate(row):
                        feature.setAttribute(self.fieldnames[m], value)
                    self.pyoutput(feature)

This should get you the best of both worlds.

David

Cheers David,

I will certainly give this a try!

takashi
7703 replies
9 years ago
December 16, 2015

david_r wrote:

I've posted this before, but you could consider a middle way using a PythonCreator to read the CSV while still using a regular workbench:

import fmeobjects
import csv
class FeatureCreator(object):
    def __init__(self):
        self.inputfilename = FME_MacroValues['INPUT_CSV_FILE']
        self.csvdelimiter = ',' # Modify as needed
        self.csvquotechar = '"' # Modify as needed
        self.log = fmeobjects.FMELogFile()
        self.fieldnames = []
    def close(self):
        with open(self.inputfilename, 'rb') as csvfile:
            csvreader = csv.reader(csvfile, \
                                   delimiter=self.csvdelimiter, \
                                   quotechar=self.csvquotechar)
            for n, row in enumerate(csvreader):
                if n == 0:
                    self.fieldnames = row
                    self.log.logMessageString("CSV field names to expose in the PythonCreator:", fmeobjects.FME_WARN)
                    for field in row:
                        self.log.logMessageString(" "+field, fmeobjects.FME_WARN)
                else:
                    feature = fmeobjects.FMEFeature()
                    for m, value in enumerate(row):
                        feature.setAttribute(self.fieldnames[m], value)
                    self.pyoutput(feature)

This should get you the best of both worlds.

David

Hi @david_r, thanks for sharing the elegant script. Please check the indentation of line#14 -. The code block usage in this site is troublesome ;)

dimmihel
Author
17 replies
9 years ago
December 16, 2015

fmelizard wrote:

A fairer test would be to run the Python CSV module against a workspace that does what you want. In general, we strongly discourage using FME Objects instead of just workspaces.

But that will have to wait for a while yet...

Hi Dale, Thanks for your comprehensive comment and suggestions; I hadn't used the "sort" tick box in the CSV reader before and I believe this will make a difference for my work in the future.

I agree that the test is **not** optimum or even scientifically sound (I am stressing this because I am using the work "significant" in my first post). This test is founded on making a **preliminary** comparison between a Python method that I know a lot of people use to process *.csv files and using FME to perform the same task (I think that you will agree with me that FME has multiple ways of achieving the same thing). That is why I posted this in the first instance; to gain insights on this preliminary observation and thank you for providing some.

I anticipate that your new CSV reader will be very efficient so I am looking forward to that day!

Note: Dale's suggested workspace in FME 2015 produces a result in 37 minutes 40.2 seconds.

david_r
8354 replies
9 years ago
December 16, 2015

Hello,

I have to connect to a WebSocket API (never even knew that excisted) which has an endpoint.

wss://<IP>/api/v2/data/main/metrics

Then I can send

{
	"Reference": "Get_reference",
	"Command": {
		"Get": {
			"Select": "Identifier",
			"Filter": "parameter"
		}
	}
}

And I should get a JSON back

{
  "Reference": "Get_reference",
  "Data": [
    {
      // ... and so on
   }
}

I tried with WebSocketReceiver to connect to the API, I got the handshake and it keeps running.

Then I created a workspace with a WebSocketSender to send the JSON but it only sends and I get no output.

Is this even possible or should I use a PythonCaller?

TIA

Edit: I'm testing this in FME Workbench - is this someting the FME server should do? I found this: http://docs.safe.com/fme/2018.1/html/FME_Server_Documentation/Content/ReferenceManual/WebSocket-Server.htm

Reply

Rich Text Editor, editor1

Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

Cookie settings

We use 3 different kinds of cookies. You can choose which cookies you want to accept. We need basic cookies to make this site work, therefore these are the minimum you can select. Learn more about our cookies.

Basic
Functional

Normal
Functional + analytics

Complete
Functional + analytics + social media + embedded videos + marketing

Universal Reader (FME 2015) vs. csv module (Python 2.7) to read & process data stored in big *.csv files.

12 replies

Reply

Helpful Members This Week

Recently Solved Questions

FME Flow version control how to use different branch

Parameters within group parameters not available in a webhook?

How to restart a REST Server in ArcGIS Server?

Remove last CR/LF from a CSV

1019 error with change detector and polygons

Community Stats

Latest FME

Cookie policy

Cookie settings

Reply

Related Topics

Using Tester Transformer to query Yesterday (Datetime field)icon

What's Up Wednesday: FME New for the Week Ending 4-June-2021icon

Datetime formatting for SQL to SQL Translation - DatetimeParse or DateTimeConverter ?icon

Filter data using a date fieldicon

Query A Date Field To Only Return <=90 Days From Current Dateicon

Helpful Members This Week

Recently Solved Questions

FME Flow version control how to use different branch

Parameters within group parameters not available in a webhook?

How to restart a REST Server in ArcGIS Server?

Remove last CR/LF from a CSV

1019 error with change detector and polygons

Popular Tags

Community Stats

Latest FME

Sign up

An FME Account is required to contribute

Login to the community

An FME Account is required to contribute

Scanning file for viruses.

This file cannot be downloaded

Cookie policy

Cookie settings