This question is related to this earlier post on handling several CSV files output to the server shared data directory: https://knowledge.safe.com/questions/79707/fme-cloud-workflow-for-moving-media-zip-files-betw.html

@GerhardAtSafe has greatly helped me with several aspects and I'm continuing to research and learn what I can, but I'm having trouble figuring out the best way to handle multiple CSV files output to a single directory under the shared data directory on Server. Not all of the CSV files will be written on every scheduled run, and anywhere from 0 to 23 separate CSVs could be written. These are output from a series of jobs that run on a weekly schedule to update tables in a PostGIS database.

I want to use these CSV files to trigger a job which parses out the media IDs, pulls that media from an S3 bucket, zips it up, and pushes it to another S3 bucket or UNC path. The problem I'm struggling with is how to read each of these CSV files individually in order to select the correct columns for parsing the media IDs. I'm not sure whether reading them in dynamically with a CSV reader or with a FeatureReader transformer is the best solution. I've attached 2 screenshots below of the workspaces, which work perfectly on individual CSV files for parsing media IDs and passing that output to an FMEServerJobSubmitter for the next workspace, which pulls that media from the S3 bucket and uploads it to a different one:

- FeatureReader reads in the CSV file > parse and concatenate the media strings > write to a "photo_ids.csv" file > FMEServerJobSubmitter kicks off the next workspace below.

- FeatureReader reads in the "photo_ids.csv" file > S3Downloader pulls the media to the server temp directory > S3Uploader pushes it to the new bucket.

I could set this up as up to 23 separate workspaces, each triggered by its own directory watch on its own compartmentalized output directory. But I believe there is a better and more efficient way to perform this workflow, so I'm posting here in hopes that someone can recommend one.

I'm just having trouble thinking it through clearly, since I'm fairly new to Server notification services and to parsing the generated log files for completed jobs in order to trigger downstream workspaces. Thanks in advance for any help with this rather long and complex question.

Hi @tjpollard

Just to make sure I'm clear on the pain points you're running into with this, I see the StringConcatenator in there is marked invalid due to the fields that aren't present on all CSV files. Does the workspace run without error in that state?

I think you should be able to modify the workspace in the first screenshot to work with any of the files by making use of the Dynamic reader/writer options within the FeatureReader/Writer. Here are some ideas for changes that could help:

  1. In the FeatureReader, set the Output Ports option to Single Output Port (so the data exits through the <Generic> port). Also make sure it's set to read both Schema and Data features (under Schema/Data Features in the parameters).
  2. If the StringConcatenator is set up the way you need to cover all possible attribute names, leave that as it is. Otherwise, try using an AttributeExposer to manually expose the attributes you want to work with in the workspace.
  3. On the FeatureWriter, first connect the <Schema> output port from the FeatureReader directly to the FeatureWriter's input. This will ensure the original columns are written out.
    1. In the Parameters for that transformer, check the box for Dynamic Schema Definition and set the source to "Schema from Schema Feature".
    2. Under User Attributes in there, change Attribute Definition to Dynamic.
    3. If you need to add new attributes into the output that didn't exist in the original file, you can manually add them in the User Attributes table. Just click the + button in there to add them in. Those will be added to the output in addition to the original columns.

No, you definitely don't want to have 23 different workspaces. I can say that with absolute certainty.

Dynamic workspaces are really only for when you're writing data, and when the attribute names are different with each CSV file.

So if you have CSV1 (Attributes = A,B,C) and CSV2 (Attributes = C,D,E) then that's when you would use dynamic. If you have CSV1 (Attributes = A,B,C) and CSV2 (Attributes = A,B,C), then you don't need to use dynamic translations.

I can't tell from your screenshots which is the case here, but hopefully this helps you to decide whether dynamic is required or not.

The other thing I would do is to create a different file name for the photos created by each CSV. For example, add a TimeStamper transformer to get the current time and write to photo_ids_<time>.csv - then pass that as a parameter to the workspace being run.

The reason I suggest that is because if you have multiple notifications happen at once, and you run the same workspace multiple times, I imagine it could create a conflict (two or more workspaces could try to read or write photo_ids.csv at the same time). So by giving it a unique name, you avoid that issue. It doesn't have to be time-based, but I do find that is one of the simplest ways.
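If it helps to picture it, here is a rough sketch of the same idea done in a function-mode PythonCaller instead of a TimeStamper; the attribute name _photo_csv_name is only an example, and the TimeStamper would do the job equally well:

```python
# Rough PythonCaller (function mode) sketch: stamp each run's output name so
# parallel jobs never fight over the same photo_ids.csv.
# The attribute name "_photo_csv_name" is only an example.
import time

def add_unique_csv_name(feature):
    stamp = time.strftime('%Y%m%d_%H%M%S')
    feature.setAttribute('_photo_csv_name', 'photo_ids_{}.csv'.format(stamp))
```

The point is simply that the unique name travels with the feature, so both the FeatureWriter and the parameter passed to the FMEServerJobSubmitter can pick it up.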

I hope this helps some. Please do let us know if you have more questions.


@Mark2AtSafe, thanks for the response and for confirming that I should not set up 23 different workspaces. Although that could work, it would be a maintenance nightmare which I don't want to impose on myself if at all possible. Also, your note about dynamic workspaces really only being for writing data is noteworthy for me, as it helps me understand some workflows much better.

So, the idea is to inject a timestamp (or, I'm thinking, the fme_basename or fme_feature_type) into the filename to make photo_id<someuniqueparameter>.csv a unique file for downstream processing.

What is really throwing me is how to control the notification(s) you mention ("two or more workspaces could try to read or write photo_ids.csv at the same time") when multiple notifications happen because more than one CSV file is uploaded to the watched directory. (I've experimented with a few files to see what the behavior is if I upload one CSV, wait 5 minutes, upload a different, second CSV, etc.) The directory watch behaves as expected, triggering the workspace, but unless I can "tell" the workspace to act on the CSV that was just uploaded and triggered it, I'm afraid it will instead start at the "top", using the same or only the first CSV file that was uploaded to the directory.

This is not the behavior I want from this workflow. I need to "compartmentalize" the different CSVs that enter that directory and trigger the directory watch topic, then run the workspace on each of them sequentially. I can see the timestamps and feature_type_name being involved in that somehow; I'm just not advanced enough yet in my FME Server savvy to get it done, but I believe after this week I will be.

Yes, I'm convinced I need to provide unique names to the generated "photo_ids.csv" as it stands now. Thank you!

@LauraAtSafe, to answer your first question: yes, I have successfully run the workspace with several of the transformer fields registering as invalid, and it still completes.

For your #2 above: the StringConcatenator only needs to operate on the fields that contain media (photo or video) UUIDs. All the other attribute names can simply be ignored in the final output CSV file.

For #3, I actually don't want the original attribute column names to be carried forward. What the workspace does is take the values from those attribute fields, which may contain multiple values each, and process all of those columns through the AttributeTrimmer and StringReplacers to create a single-column list of photo_ids. This single-column list is what I use as input to the S3Downloader to pull the media files from the bucket and upload them to the alternate bucket.
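In plain Python terms, the parsing boils down to something like this (an illustration only; the column names and the brace/comma formatting are made up):

```python
# Illustration of what the AttributeTrimmer / StringReplacer chain is doing:
# collapse several media columns, each possibly holding multiple IDs, into one
# flat list of photo_ids. Column names and value formatting are hypothetical.
def extract_photo_ids(row, media_columns):
    ids = []
    for col in media_columns:
        raw = (row.get(col) or '').strip('{}" ')
        ids.extend(part.strip() for part in raw.split(',') if part.strip())
    return ids

print(extract_photo_ids(
    {'inspection_photos': '{1a2b3c, 4d5e6f}', 'site_video': '7a8b9c'},
    ['inspection_photos', 'site_video']))
# -> ['1a2b3c', '4d5e6f', '7a8b9c']
```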

 

 

So, the FeatureWriter in the workspace right now is what's writing this file to be used in the downstream S3 workspace I've set up in the FMEServerJobSubmitter.

I'm not the expert on Server that @LauraAtSafe is, but I believe notifications work like this...

 

When a notification occurs, FME sends an alert, which triggers your workspace. But that alert includes topic info containing the name of the file that caused the alert. So your workspace should take that info from the topic, extract the file name, then use that as the file to read.
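As a rough illustration of that step, assuming the Directory Watch message carries the path under a key such as dirwatch_file_name (the sample payload below is made up):

```python
# Sketch only: pull the triggering file's path out of the notification JSON.
# Assumes the Directory Watch message uses a key like "dirwatch_file_name".
import json

message = '{"dirwatch_file_name": "//myserver/fmedata/output/roads_update.csv"}'
watched_file = json.loads(message).get('dirwatch_file_name')
print(watched_file)  # hand this to the FeatureReader as the dataset to read
```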

 

A file should only trigger that notification when an action occurs to it (Create, Modify, Delete). If the first file is still passively sitting there when a second file is created, then the first file is ignored because no new action has occurred to it.

 

Also, FME notifications can be set up to monitor for separate Create, Modify, and Delete actions on an S3 watch publisher. So if you set it up to watch for file creation only, then it should only ever choose a file when it is newly created. If that file is modified in some way, it won't trigger the notification (depends on if you want that or not).

 

But, to be really sure, you could set up the notifications to ignore Delete operations, and have the workspace delete the file after running. Then you know that it no longer exists to cause confusion.
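A sketch of what that clean-up could look like as a shutdown Python script, where SOURCE_CSV is a hypothetical published parameter holding the path of the file that was just processed:

```python
# Shutdown-script sketch: delete the CSV once the translation finishes so it
# can't confuse a later run. "SOURCE_CSV" is a hypothetical published parameter.
import os
import fme  # fme.macroValues and fme.status are available in shutdown scripts

csv_path = fme.macroValues.get('SOURCE_CSV', '')
if fme.status and csv_path and os.path.exists(csv_path):
    os.remove(csv_path)
```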

 

Does that help? In short, a notification should only trigger on a new action. Existing files won't have any effect - unless your workspace is reading the entire folder, which it shouldn't do because it should get the file name from the notification topic.

 


Oh, and yes, you can certainly use fme_basename or fme_feature_type instead of time as the file name to write. That would work just fine.

 


@Mark2AtSafe, @LauraAtSafe, or anyone else,

Is there a way to "delay" the directory watch trigger until all files are completed and uploaded?

Say, if I set the polling parameter to "1 day" or "6 hours", left all possible CSV file name variations in the directory, and set the directory watch publication parameters to a "modify" filter only, that could work as a delay action.

Or is there a way to say "don't trigger the workspace until after 5 minutes", or "once all previously running jobs are completed, then let the directory watch execute the next workspace"?

This seems a bit counter-intuitive, but I thought I'd ask anyway. I'm also becoming more aware that this may ultimately need to be handled by calling some Python scripts to handle the variables, either as an alternative to the directory watch or in conjunction with it. I'll have to seek out fellow developers smart on Python to help me figure that out.

Anyway, as always, thank you all for all the help. I'm much obliged!

Thanks! I'm taking a much clearer look at these actions now.

 

 


@LauraAtSafe, @Mark2AtSafe ,

I believe that I have found the solution to this workflow challenge by using the Directory Watch Reader and Notification Service.

I've tested it with a sample JSON from the topic notification message produced by the Directory Watch, and the dirwatch_file_name attribute output by the FilenamePartReader is what allows the sequential processing of the CSV files as they enter the directory. Thanks @LauraAtSafe.

I proved this out by checking the log files written for the server jobs after loading 3 different CSV files into the shared directory at the same time to trigger the notification for this workspace: three separate jobs completed, one for each of the CSV files that were uploaded.

So, instead of a Creator initiating the FeatureReader, the Directory Watch reader is the initiator, capturing the JSON from the notification message that the workspace is subscribed to when published to Server. (The workspace must be subscribed to the Notification Service and the Directory Watch topic.) I have not run a complete workspace through the full parsing of the photo fields yet, but the Logger results indicate that this will produce the correct output to create the photo_ids.csv file. It's now a matter of capturing the unique name to use as part of the file name, which I think can also come from the FilenamePartReader output and be applied in the text editor.
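Roughly what I have in mind for that naming step, just as an illustration (the path and names are examples):

```python
# Illustration only: derive a unique photo-ids name from the watched CSV's
# own file name, as reported in the notification JSON.
import os

dirwatch_file_name = '//myserver/fmedata/output/roads_update.csv'  # example value
basename = os.path.splitext(os.path.basename(dirwatch_file_name))[0]
photo_csv = 'photo_ids_{}.csv'.format(basename)
print(photo_csv)  # photo_ids_roads_update.csv
```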

A bit more experimenting to do, but I wanted to provide you both with my resolution and another round of thanks!

Todd


Excellent. Sounds like a good solution. Glad you have got it working.

 

 

