
I want to extract a sample from a table in which each distinct value of certain fields is represented x times. For example, see below, where x is 1 and the fields I want to sample from are House, Priority and Injury.

 

Input:

ID  Name    House       Priority  Injury
1   Draco   Slytherin   Medium    Head
2   Harry   Gryffindor  Low       Hand
3   Ron     Gryffindor  Low       Head
4   Cedric  Hufflepuff  High      Chest

 

Output:

ID  Name    House       Priority  Injury
1   Draco   Slytherin   Medium    Head
2   Harry   Gryffindor  Low       Hand
4   Cedric  Hufflepuff  High      Chest

Hey there, perhaps try the Sampler with "Group By" set to the fields you want and a sampling rate of x.

 

 



This tends to generate too many samples, particularly for datasets with many fields. Ideally, the number of samples should be n * 10, where n is the number of distinct values in the field that has the most distinct values.


Hi @me.aelmo, it looks like you're on the right track. If the issue is that your solution generates too many samples, then maybe sampling a second time could produce a smaller sample? You could do this with an additional Sampler, or maybe use a PythonCaller with a script that handles this scenario all in one transformer. I won't be able to advise you on how to write this script, sorry to not be of more help. Let us know if you are able to solve this :)
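For anyone looking for a starting point on the script idea, here is a minimal sketch in plain Python of one possible approach: a greedy pass that keeps picking the row that covers the most still-missing values until every distinct value of every chosen field appears x times (or as often as it exists in the data). The names `covering_sample`, `rows` and `fields` are made up for illustration and are not part of FME. To use it in a PythonCaller you would presumably need to collect the features first and read the attribute values from them. Being greedy, it is a heuristic and may not always give the smallest possible sample.

```python
from collections import Counter


def covering_sample(rows, fields, x=1):
    """Greedily pick rows until every distinct value of every field in
    `fields` appears at least x times (capped at how often it occurs)."""
    # Count how often each (field, value) pair occurs in the data.
    available = Counter((f, row[f]) for row in rows for f in fields)
    # Picks still needed per pair; cannot exceed what the data contains.
    needed = {key: min(x, count) for key, count in available.items()}

    remaining = list(rows)
    sample = []
    while any(n > 0 for n in needed.values()):
        # Score a row by how many unmet (field, value) needs it satisfies.
        def gain(row):
            return sum(1 for f in fields if needed[(f, row[f])] > 0)

        best = max(remaining, key=gain)  # ties go to the earliest row
        remaining.remove(best)
        sample.append(best)
        for f in fields:
            key = (f, best[f])
            if needed[key] > 0:
                needed[key] -= 1
    return sample


# The example table from the question.
rows = [
    {"ID": 1, "Name": "Draco", "House": "Slytherin", "Priority": "Medium", "Injury": "Head"},
    {"ID": 2, "Name": "Harry", "House": "Gryffindor", "Priority": "Low", "Injury": "Hand"},
    {"ID": 3, "Name": "Ron", "House": "Gryffindor", "Priority": "Low", "Injury": "Head"},
    {"ID": 4, "Name": "Cedric", "House": "Hufflepuff", "Priority": "High", "Injury": "Chest"},
]
fields = ["House", "Priority", "Injury"]

print([row["ID"] for row in covering_sample(rows, fields, x=1)])  # [1, 2, 4]
```

With x = 1 this reproduces the output table from the question. Because the sample only grows while some value is still under-represented, its size is driven by the field with the most distinct values rather than by the number of field combinations.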

