Question

FME Cloud Workflow for moving media ZIP files between S3 buckets.

  • 3 October 2018
  • 7 replies
  • 3 views


Hi Community,

After reading this article here, I felt I should tailor my question to my particular challenge and ask for help from @GerhardAtSafe and @stewartharper (and if you could point me to the YouTube videos and GitHub repo mentioned there, I'd be so grateful!). My question pertains to creating a process similar to the "FME Cloud API return ZIP file" one, but with a few variations in the workflow.

What I want to do is read a varying number of CSV files, created as output from another set of FME workspaces executed on FME Cloud/Server, into a workspace (or workspaces) that parses media UUIDs from a couple of columns in the CSVs into a single line-item list. This list of media UUIDs is then used to pull the media from an S3 bucket and push it to another S3 bucket. Now, the next step is where it gets tricky in the workflow.
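For reference, the pull-and-push step above is, outside of FME, essentially the following (a rough boto3 sketch only; the bucket names, key layout and UUID values are placeholders, not my real setup):

    import boto3

    # Placeholder names; the real buckets and key layout come from the workspace parameters
    SOURCE_BUCKET = "source-media-bucket"
    DEST_BUCKET = "client-media-bucket"
    MEDIA_PREFIX = "media/"

    # UUIDs parsed out of the CSV columns by the upstream workspace
    media_uuids = ["uuid-0001", "uuid-0002"]

    s3 = boto3.client("s3")
    for media_uuid in media_uuids:
        key = MEDIA_PREFIX + media_uuid + ".jpg"  # assumed key naming
        # Server-side copy between buckets, no local download needed
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=key,
            CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
        )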

How can I ZIP up all the media into one zipped file, give it a date_timestamp file name such as update_03OCT2018_0830utc.zip (or whatever may come from a system parameter), and then move it into the destination S3 bucket? There are a couple of specific variables that may influence how I develop this workflow:

  • The number of CSV files may vary between 0 and 23 when they are written to (FME_SHAREDRESOURCE_DATA/mycsvdirectory). This is because the workspaces that execute on a schedule to update a database may not always have record updates to add to the database.
  • These CSVs need to be moved out of this shared directory so the next batch is written to a clean directory and no duplicate CSVs are processed again.
  • I currently have a directory watch topic and notification set up to watch this directory, as a trigger to notify the next workspace (the one which generates the media parsing list) to run on these CSV files once they are created in the directory.
  • All output used in creating the final ZIP file needs to be purged and reset for the next scheduled iteration of updates.

I hope this makes sense. I realize it's a bit lengthy, but if I could get some direction on how to start this workflow, it would help me immensely.

Thanks, Todd


7 replies


Hi @tjpollard,

First of all, here are the links to the video and GitHub repository mentioned in the other post:

https://www.youtube.com/watch?v=_9VxQg6A7YU

https://github.com/safesoftware/codeless-api-demo

Regarding creating ZIP files with a timestamp in the name, I would check out this custom transformer on the FME Hub and the DateTimeStamper.
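If you end up scripting that step yourself (e.g. in a PythonCaller or a shutdown script) rather than using the Hub transformer, the naming and zipping part is roughly this; a sketch only, with the staging directory and naming pattern as assumptions:

    import os
    import zipfile
    from datetime import datetime, timezone

    # Assumed staging directory; in practice this would come from a published parameter
    staging_dir = "/data/fmeserver/resources/data/mycsvdirectory"

    # Produces e.g. update_03OCT2018_0830utc.zip
    stamp = "update_" + datetime.now(timezone.utc).strftime("%d%b%Y_%H%M").upper() + "utc"
    zip_path = os.path.join(staging_dir, stamp + ".zip")

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in os.listdir(staging_dir):
            full = os.path.join(staging_dir, name)
            # Skip the archive itself and anything that is not a regular file
            if full != zip_path and os.path.isfile(full):
                zf.write(full, arcname=name)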

In general, I would avoid making a workflow dependent on an empty directory and would rather use unique identifiers to make sure no duplicates are processed. The directory watcher is a good solution. Another option could be an FMEServerNotifier posting the filenames that should be processed to a topic that triggers the next workspace. This could create tighter connections in your chain of jobs and help with troubleshooting when necessary. But it's just an idea; I'm not sure if it's best for your workflow.

Regarding cleaning up files and resources, there are several options. One would be to set up a cleanup task on FME Server to make sure files don't pile up, or to use the Temporary Disk on FME Cloud, which is purged on every restart. The temp disk could bring some significant performance benefits if you are moving huge amounts of data. I covered some details about this in the link above.

If you want to control and trigger file deletion in a workspace, I would recommend using the FME Server REST API and the HTTPCaller to delete resources. Another option could be a Python shutdown script taking care of file deletion after a job has finished.
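As a rough illustration of the shutdown-script route (a sketch only; the directory is a placeholder, and in a real workspace it would come from a published parameter, e.g. via fme.macroValues):

    import glob
    import os

    # Placeholder path; in a real shutdown script this would be read from a
    # published parameter rather than hard-coded
    processed_dir = "/data/fmeserver/resources/data/mycsvdirectory"

    # Remove the CSVs that were just processed so the next run starts clean
    for csv_path in glob.glob(os.path.join(processed_dir, "*.csv")):
        os.remove(csv_path)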

 

 

I am sure there are a lot of other parts to think about in this project, but I hope this gives you some pointers and ideas!

Let us know how it goes!


Thanks @GerhardAtSafe. I have a couple of follow-up questions.

  1. If all 23 CSV files are uploaded to the shared directory I mention above on a scheduled day, as part of a workspace's output, how do I read all of them as one CSV with a dynamic schema, so I can select only the columns holding the media UUIDs (they go by about 5-6 different field names but hold the same kind of values) that I want to parse out in the follow-on workspace in order to pull from an S3 bucket? (There is a rough sketch of what I mean below.)
  2. Is this possible as one workspace workflow, or do I need to break the process down into smaller segments in order to better automate it?
  3. Ultimately, I want to ZIP all of the CSVs and media into one ZIP file with a date and time in the file name, and once it is complete, push it to a client S3 bucket.
What I'm ultimately trying to do is automate a full process of accounting for any records that are updated in the database on a weekly schedule, and push only those records to other client databases that require either all of them or a filtered subset. I want to be able to control the workflow at certain points, since there is media that needs to accompany the records, and it may all potentially be pushed to different S3 buckets.
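To make question 1 concrete, outside of FME the column-picking step I mean is roughly the following (just a sketch; the directory and the candidate column names are made up for illustration):

    import csv
    import glob
    import os

    # Placeholder directory and column names; the real names vary per CSV
    csv_dir = "/data/fmeserver/resources/data/mycsvdirectory"
    uuid_columns = {"media_uuid", "photo_uuid", "survey_media_uuid"}  # hypothetical

    media_uuids = set()
    for path in glob.glob(os.path.join(csv_dir, "*.csv")):
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            # Keep whichever of the candidate UUID columns this file actually has
            present = uuid_columns.intersection(reader.fieldnames or [])
            for row in reader:
                for col in present:
                    if row[col]:
                        media_uuids.add(row[col])

    # media_uuids is now the single de-duplicated list that drives the S3 pull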

 

 

Thanks a bunch as always, Todd

@GerhardAtSafe, I think there's one piece that I either don't understand how to do, or that isn't possible:

 

  • I would like the workflow to be dynamic enough to allow any combination and number of the 23 CSV files, which may or may not be updated each week with new records. The workflow needs to be flexible enough to process all 23 CSV files in the workspace even if only 1 was processed the week before. So I can't necessarily specify a single CSV file through the "Select File from Web" browse menu; I need the file to be a variable from a list, or a parameter that's passed when the workspace is executed.
  • Can I leverage the output generated by the topic (CSVProdUpdates), such as this message: { "dirwatch_publisher_path": "/data/fmeserver/resources/data/production_record_updates/cemetery_survey_updates.csv", "dirwatch_publisher_content": "ENTRY_CREATE /data/fmeserver/resources/data/production_record_updates/cemetery_survey_updates.csv", "dirwatch_publisher_action": "CREATE", "ws_topic": "CSVProdUpdates", "fns_type": "dirwatch_publisher" }
I'm sure I could achieve this by creating 20 separate workspaces to run each CSV update individually when it is created in the watched directory. I'm just wondering whether these can be merged into a smaller set of workspaces, or a single workspace, to be more efficient.

Hi @tjpollard,

 

1. I am not sure about the best way to do this. Please feel free to post a separate Q&A for this question; I am certain that other community members have had to solve similar tasks before.

2. Spreading a complex workflow out over several workspaces can be very handy when something breaks and you need to troubleshoot. On the other hand, maintenance can become more difficult because you need to be aware of dependencies between workspaces (e.g. which workspaces need changing when you change one file path). Given that you will write out files multiple times, I would probably tend toward separate workspaces just for the ease of troubleshooting.

3. If you write all your files to the same folder, zipping them with a timestamp in the filename shouldn't be a problem (see the ZipArchiver and DateTimeStamper mentioned above). To upload to S3, take a look at the S3Uploader or the S3 subscriber.
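If you script the upload instead of using the S3Uploader, it is roughly this with boto3 (a sketch; the bucket, key and file path are placeholders):

    import boto3

    # Placeholder values for illustration
    zip_path = "/data/fmeserver/resources/data/mycsvdirectory/update_03OCT2018_0830utc.zip"
    dest_bucket = "client-delivery-bucket"
    dest_key = "updates/update_03OCT2018_0830utc.zip"

    s3 = boto3.client("s3")
    # Upload the finished archive to the client bucket
    s3.upload_file(zip_path, dest_bucket, dest_key)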

 

 

Cheers,

 

Gerhard

 

Thanks @GerhardAtSafe. I really do appreciate all your assistance in providing direction on how I might implement my workflow. I'm just now learning more about FME Cloud/Server publications, subscriptions and Topics. These seem to be the method of choice when automating consistent server-side tasks and cloud data migrations.

Cheers,

 

Todd

 

You mentioned that the CSV files are created by a job that runs on a schedule. If this is the case, the only info you get from the directory watch is which files need to be processed, not exactly when you need to process them. This would allow you to trigger a Logger subscription with the messages from the directory watch, and then run a scheduled job some time after the job that creates the CSVs, reading the paths of all newly created CSV files from the log file used by the Logger subscription. You parse the log file by timestamp to make sure you only read the new paths you want to process.
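A minimal sketch of that log-parsing step (the log path, line layout and cut-off handling are assumptions; the Logger subscription's actual format may differ):

    import json
    from datetime import datetime, timedelta

    # Assumed location of the log file written by the Logger subscription
    log_path = "/data/fmeserver/resources/logs/csv_prod_updates.log"

    # Only consider entries newer than the last scheduled run (assumed weekly here)
    cutoff = datetime.utcnow() - timedelta(days=7)

    new_paths = []
    with open(log_path) as log:
        for line in log:
            # Assumed line layout: "<ISO timestamp> <JSON notification message>"
            try:
                timestamp_str, payload = line.split(" ", 1)
                entry_time = datetime.fromisoformat(timestamp_str)
                message = json.loads(payload)
            except ValueError:
                continue  # skip lines that do not match the expected layout
            if entry_time >= cutoff and message.get("dirwatch_publisher_action") == "CREATE":
                new_paths.append(message["dirwatch_publisher_path"])

    # new_paths now holds the CSV files created since the last run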

 

 

This could be a lightweight solution to collect the relevant paths first and then run the processing job on a schedule.

 

No problem @tjpollard. The Notification framework is definitely what you want to look at for this project. It can really help a lot in automating workflows that are reliable, consistent & easy to maintain.

 

 

Please also keep an eye out for this new feature coming soon: Automation Workflows will give you a visual user interface to design workflows for FME Server based on the Notification framework. Once this is fully released, it will be huge for scenarios like yours!

 
