For training we use a t3.xlarge (or t3a.xlarge). It's sufficient for that purpose, but I would want a more powerful instance for large data processing.
One other thing to consider is the IOPS of your storage: the higher, the faster your reads and writes.
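For example, if you're on EBS, gp3 volumes let you provision IOPS and throughput independently of volume size. A minimal boto3 sketch, assuming the region, AZ, size, and provisioning numbers are placeholders you'd tune for your own workload:

```python
import boto3

# Hypothetical values for illustration; size the IOPS/throughput to your workload.
ec2 = boto3.client("ec2", region_name="us-east-1")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=500,           # GiB
    VolumeType="gp3",   # gp3 decouples IOPS from volume size (unlike gp2)
    Iops=6000,          # gp3 baseline is 3000; provision more for faster I/O
    Throughput=500,     # MiB/s; gp3 baseline is 125
)
print(volume["VolumeId"])
```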
It really depends on what you want to do with it. Higher specs mean shorter processing times (more or less; there are a lot of factors involved, of course), but it's up to you to decide whether the extra cost is worth it.
I've been experimenting with this for several months now using Form and Flow. The dataset I'm using has about 25 million records per day, and I process 14 days' worth of it (roughly 350 million records) to insert into our database; the final insert ends up being fewer than 10M records. I started on an m5.xlarge, but with datasets this large I noticed that memory usage was maxing out in Form and that Flow was taking a very long time.
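As a rough sanity check (the per-record byte size below is an assumption for illustration, not a measurement), the back-of-envelope math already points past a 16 GiB instance:

```python
# Why 350M records blows past an m5.xlarge (16 GiB of memory).
records_per_day = 25_000_000
days = 14
bytes_per_record = 100  # assumed average in-memory row size; measure your own

total_records = records_per_day * days                       # 350,000,000
est_memory_gib = total_records * bytes_per_record / 2**30
print(f"{total_records:,} records ~ {est_memory_gib:.0f} GiB if held in memory")
# ~33 GiB at 100 B/record -- about double what an m5.xlarge offers
```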
Using a memory-optimized instance was the best solution for me. For Form I moved to an r6i.2xlarge (64 GB memory), and I no longer hit memory issues. For Flow I was using an r6i.xlarge (32 GB memory), where the process was taking a little over 24 hours; it pegs the CPU at 100% but doesn't use all 32 GB of memory in this case. We decided to try the r7i.xlarge, which is essentially the same instance but with a newer-generation CPU, and ran the update again. This time the process took 19.5 hours, roughly a 20% processing speed increase.
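For what it's worth, here's the quick arithmetic behind that ~20% figure (24.2 below is just an approximation of "a little over 24 hours"):

```python
# Quick check of the r6i.xlarge -> r7i.xlarge speedup from the runs above.
old_hours = 24.2   # approximation of the r6i run time
new_hours = 19.5   # measured r7i run time

time_saved = (old_hours - new_hours) / old_hours   # fraction of wall-clock saved
throughput_gain = old_hours / new_hours - 1        # how much faster it processes
print(f"{time_saved:.0%} shorter run, {throughput_gain:.0%} higher throughput")
# ~19% shorter run, ~24% higher throughput -- i.e. roughly 20% faster
```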
While each use case has its own needs, very large datasets definitely need the memory. I recommend the memory-optimized instance types, since memory turned out to be my biggest bottleneck.