Skip to main content

CSVJoiner2.fmw

I have many CSV files in a folder that I want to join together based on attribute in first column of every file. Each file has about 300 records and 50 attributes. I want my output excel sheet to have 300 rows with all attributes from all the input files, joined by the code in "Col0"

I've tried DatabaseJoiner and my output has all the input column names, but only the values from the first input table. See attached two CSV files and workspace.

If all CSV files have the same attribute schema, you could use 2 text readers.

The first reader reads all lines and write them to a new text file.

The second reader reads all other text files, line by line, and skips the first line (header), writing the lines to the same text output file.

At the end you will have one big text (csv) file.

Hope this helps.


@dmeaney I split the CSV file you attached into smaller parts so I could simulate multiple source CSV files you're trying to process.

It would be simple if you had unique column names in all your CSV data. If you did, then FeatureMerger would solve it for you.

@erik_jan is on the right track with using text_line reader. But in addition, I think you need to join the records and concatenate them so I've chosen Aggregator to do that. The workspace looks like this:

I've attached the example workspace (2019.2) along with the sample data I created: csvjoiner.fmwt


@dmeaney I split the CSV file you attached into smaller parts so I could simulate multiple source CSV files you're trying to process.

It would be simple if you had unique column names in all your CSV data. If you did, then FeatureMerger would solve it for you.

@erik_jan is on the right track with using text_line reader. But in addition, I think you need to join the records and concatenate them so I've chosen Aggregator to do that. The workspace looks like this:

I've attached the example workspace (2019.2) along with the sample data I created: csvjoiner.fmwt

@dmeaney I notice that the last two lines in your file are what appears to be a header record and then a report line. If you could get the header record to the header of your file then the FeatureMerger would be all you'd need.

"GEO_ID","NAME","B01001_001E","B01001_001M","B01001_002E","B01001_002M","B01001_003E", ...

"id","Geographic Area Name","Estimate!!Total","Margin of Error!!Total", ....


@dmeaney I split the CSV file you attached into smaller parts so I could simulate multiple source CSV files you're trying to process.

It would be simple if you had unique column names in all your CSV data. If you did, then FeatureMerger would solve it for you.

@erik_jan is on the right track with using text_line reader. But in addition, I think you need to join the records and concatenate them so I've chosen Aggregator to do that. The workspace looks like this:

I've attached the example workspace (2019.2) along with the sample data I created: csvjoiner.fmwt

Thanks. I'll give it a shot.

FYI - I had sorted my CSVs as part of another approach. We weren't certain that they were all in the same order by ID so we sorted. Unfortunately, the 2 header rows ended up at the bottom. Looks like I'll want to put them back manually (20 times), then try FeatureMerger.


@dmeaney I notice that the last two lines in your file are what appears to be a header record and then a report line. If you could get the header record to the header of your file then the FeatureMerger would be all you'd need.

"GEO_ID","NAME","B01001_001E","B01001_001M","B01001_002E","B01001_002M","B01001_003E", ...

"id","Geographic Area Name","Estimate!!Total","Margin of Error!!Total", ....

geography.csvACSDT5Y2013.B01003_data_with_overlays_2020-01-06T114007.csvACSDT5Y2013.B01001_data_with_overlays_2020-01-06T113401.csvACSDT5Y2013.B03002_data_with_overlays_2020-01-06T115027.csvACSDT5Y2013.B08141_data_with_overlays_2020-01-06T125604.csv

ACSDT5Y2013.B03002_data_with_overlays_2020-01-06T115027.csv

csvjoiner2.fmw

I moved my column headers to the top of each CSV file and then tried the FeatureMerger tool. I used a simplified geography.csv as the Supplier (just a few fields, including the GEO_ID used for joining), and then another reader for all the other CSVs as the Requestor. I've attached three of those here

In the output, I am getting all the field names and values correctly outputting, but it is not merging on the GEO_ID field, so I'm getting 8,484 lines instead of 303.


Reply