I’ve been experimenting with this for several months using Form and Flow. The dataset I’m working with has about 25 million records per day, and I process 14 days’ worth of it to insert into our database; the final insert ends up being less than 10M records. I started with an m5.xlarge, but on these extremely large datasets I noticed that memory usage was maxing out in Form and that Flow was taking a very long time.
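For a sense of scale, a quick back-of-the-envelope from those numbers (rounded):

    25M records/day × 14 days ≈ 350M records read per run
    → reduced to an insert of < 10M records (roughly a 35:1 reduction)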
Using a memory-optimized instance was the best solution for me. For Form I moved to an r6i.2xlarge (64 GB memory), and I no longer hit memory issues. For Flow I started with an r6i.xlarge (32 GB memory), where the process was taking a little over 24 hours; it ran at 100% CPU but didn’t use all 32 GB of memory in this case. We decided to try an r7i.xlarge, which is essentially the same instance but with a newer-generation CPU, and ran the update again. This time the process took 19.5 hours, roughly a 20% increase in processing speed.
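For anyone checking the ~20% figure, the math from the wall-clock times above works out like this:

    24 h / 19.5 h ≈ 1.23      → ~23% higher throughput
    (24 − 19.5) / 24 ≈ 0.19   → ~19% shorter run time

Either way you slice it, it lands in the ballpark of a 20% gain just from the newer CPU generation.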
While every use case has its own needs, super-sized datasets definitely need the memory headroom. I recommend the memory-optimized instance types, since memory turned out to be my biggest bottleneck.