I want to read from a csv file and put the contents in an SQL database table. The issue is that there are duplicate lines in the CSV file, so I want to select distinct into the SQL table to remove duplicate rows. Is there a slick way to do this, or do I need to read the csv into a temporary SQL Table holding the duplicates, and re-read the data out of that table using perform sql executor transform into its final destination? .
You can use the DuplicateFilter to remove duplicate rows.
The DuplicateFilter that @geomancer suggests is a great way. There's some alternatives:
- A Sampler, grouping on the attribute(s) you want to get the unique values for, a sampling rate of 1 and set to let the first feature through
- An InlineQuerier, in which you can do a SELECT DISTINCT (what it does is build an internal database of your input data against which you can perform SQL queries and then it outputs FME features again)
I haven't benchmarked this myself but a gut feeling says that the Sampler is probably not going to be the fastest option.
Another idea to throw into the mix!:
You can also use a python startup script using pandas and perform the equivalent to SQL's select distinct and then just write out a copy and then read the csv copy into your workspace:
import pandas as pd
input = "your in csv path.csv here"
output = "your out copy csv path.csv here"
## read csv; index_col=False will remove the index to prevent a read of index col
df = pd.read_csv(input,index_col=False)
## df.drop_duplicates() = Selecting distinct values in a column of a SQL table, translated to Python's pandas.
data = df.drop_duplicates()
## write back out as copy so we don't disturb original raw data; don't write index col
data.to_csv(output, index=False)
### after this python startup script runs, just use a fme csv reader to read into the workspace the copy we just wrote out
To import distinct values from a CSV file into an SQL table while eliminating duplicates, create a staging table, bulk insert the CSV data, and use INSERT INTO ... SELECT DISTINCT to transfer unique records to the target table. Drop the staging table if no longer needed for a streamlined process.