Question

Word Frequency Challenge

  • 29 September 2017
  • 6 replies

Following on from this idea...

It was suggested that I throw this out as a challenge, and so here it is.

We would like to see your solutions for providing a word count for a set of text. The count should provide a list of words and their frequency.

To make it fair, everyone should use this Wikipedia page on the Inuit language as their source. I chose this because it includes a number of non-standard words and characters that you will probably all be equally unfamiliar with. For example: ᖃᖓᑕᓲᒃᑯᕕᒻᒨᕆᐊᖃᓛᖅᑐᖓ or qangatasuukkuvimmuuriaqalaaqtunga (meaning "I'll have to go to the airport").

 

You obviously don't have to translate the words, but they should be included in the frequency count. I wouldn't worry about numbers or tables either. You can ignore those if you wish.
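
For anyone who wants a baseline to compare against, here is a minimal sketch of the counting step in Python 3. It is not one of the submitted solutions. It assumes a "word" is a run of Unicode letters (so syllabics count and numbers are skipped), and the names `WORD_RE`, `word_frequency` and `sample` are made up for illustration.

```python
import re
from collections import Counter

# A word is a run of Unicode letters: \w minus digits and underscore.
# An apostrophe between letters (as in "I'll") is kept inside the word.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def word_frequency(text):
    """Return a Counter mapping each lower-cased word to its count."""
    return Counter(match.group().lower() for match in WORD_RE.finditer(text))

sample = "qangatasuukkuvimmuuriaqalaaqtunga means I'll have to go to the airport, gate 12. The airport!"
for word, count in word_frequency(sample).most_common():
    print(word, count)
```

Numbers such as "12" never match the pattern, so they drop out without any special handling.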

Bonus points to be awarded for:

  • A solution that can be published to the FME Hub
  • A solution that reads directly from the web site (otherwise just paste the content into a text file)
  • A solution that is generic and doesn't hard code words specific to that web page
  • A solution that produces the correct result (which is going to be a bit subjective)
  • Anything else that I think of later!

I thought about this task for a while and a few (maybe obvious) things occurred to me:

  • A word is probably the same if it is plural (so elephant, elephants, and elephant's should all be counted as the same word really)
  • Case is important sometimes (eg Deer Lake is a place, but "deer" and "lake" are different) but sometimes not important (eg "The elephant was the biggest" - both "The" and "the" are the same word)
  • Speaking of Deer Lake, that's really one word - ie sometimes a sequence of words like that is a single phrase and really ought to be counted as such
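
To make the first two points concrete, here is a rough Python 3 sketch of one way to fold plurals, possessives and case together. It is deliberately naive and English-only (it would mangle a word like "analysis"), and the `normalise` helper is hypothetical, not part of any toolkit mentioned here.

```python
from collections import Counter

def normalise(word):
    """Fold case and strip a possessive or plural ending (naive, English only)."""
    word = word.lower()
    for suffix in ("'s", "’s"):
        if word.endswith(suffix):
            word = word[: -len(suffix)]
    # Drop a trailing "s", but leave short words and "-ss" endings alone.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word

words = ["The", "elephant", "elephants", "elephant's", "was", "the", "biggest", "glass"]
counts = Counter(normalise(w) for w in words)
print(counts.most_common(2))
```

Folding case unconditionally loses the Deer Lake distinction, so this only illustrates the trade-off rather than resolving it.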

I don't know if we can overcome any of these issues, or how, but they are things to consider. This toolkit has been suggested as a potential help.

So, have at it. Post your solutions here and we'll upload the best one to the hub for everyone to use.

Mark

ps - I thought one thing that would help would be an HTMLTagStripper, to get rid of tags that aren't necessary. But it appears there's already one in the FME Hub! So that can get us off to a good start, compliments of its creator, Jan Bliki.


6 replies

Badge +22

Lol, I put up a basic custom transformer on that idea half an hour before you posted this challenge. I considered HTML, but decided that resolving HTML entities (such as &eacute; for é) would take more time than I had on my coffee break.

I'll try to incorporate more of your requirements on Monday.
Badge +2

@Mark2AtSafe So what happened to the FME Challenge: Transformer Naming Part II ?

https://knowledge.safe.com/questions/47408/fme-challenge-transformer-naming.html

I was sure I had found one others hadn't ;) (and probably missed a few as well!)

Badge

Hi @Mark2AtSafe

my two cents:

  • re elephant, elephants, and elephant's - this is definitely the same word, but if we simply assume that s makes a plural and 's means possession, we limit our word count functionality to English only. It is also much harder with irregular verbs - is write/wrote/written one word or three different words? That said, we might want to simplify the task by treating word forms as different words.
  • re Deer Lake - afaik, this is usually handled with dictionaries. Is it OK to have a dictionary for this challenge?
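
If a dictionary is allowed, the phrase lookup could be sketched in Python 3 roughly like this. The `PHRASES` set and the `merge_phrases` helper are hypothetical. A real dictionary would be loaded from a file, and matching here ignores case, which may or may not be what the challenge wants.

```python
from collections import Counter

# Hypothetical phrase dictionary; each phrase is a tuple of lower-cased words.
PHRASES = {("deer", "lake"), ("british", "columbia")}
MAX_LEN = max(len(p) for p in PHRASES)

def merge_phrases(tokens):
    """Yield tokens, joining any run of tokens that matches a known phrase."""
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer phrases win.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in PHRASES:
                yield " ".join(tokens[i:i + n])
                i += n
                break
        else:
            # No phrase starts here, so emit the single token.
            yield tokens[i]
            i += 1

tokens = "A deer swam across Deer Lake to the other lake".split()
counts = Counter(merge_phrases(tokens))
print(counts)
```

Here "Deer Lake" is counted once as a phrase, while the standalone "deer" and "lake" keep their own counts.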
Userlevel 4
Badge +30

It would be very interesting to create this custom transformer. I'll try to do it this weekend.

Thanks,

Badge +22

OK, version 2. It can accept either a URL or plain text.

The HTML processing is a little naive. It extracts just the body, resolves HTML entities, removes any scripting and drops all tags. It assumes the input is properly formatted HTML.
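
For readers without the .fmx to hand, those four steps can be approximated with the Python 3 standard library as below. This is only a sketch of the same idea, not the code inside the transformer, and `BodyTextExtractor` and `html_to_text` are names invented for the example.

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect text inside <body>, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True resolves entities such as &eacute; in text.
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Tags themselves never reach this method, so they are dropped for free.
        if self.in_body and not self.skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Return the visible body text with whitespace collapsed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps adjacent paragraphs apart, at the cost of
    # splitting a word that has a tag in the middle of it.
    return " ".join(" ".join(parser.parts).split())

page = "<html><head><title>T</title></head><body><p>caf&eacute; <b>Inuit</b></p><script>var x = 1;</script></body></html>"
print(html_to_text(page))
```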

 

 

For the actual text processing, it is case insensitive, and different word forms are counted separately.

Although it uses Python 2.7, it does not rely on any external modules.

wordfrequencycounter.fmx

Badge +22

And here is the corresponding 'Pure FME' version. Also added a parameter to make case sensitivity a choice.
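
The case-sensitivity option is easy to picture outside FME too. A minimal Python 3 sketch, assuming a hypothetical `count_words` function with a `case_sensitive` flag (not the parameter name used in the transformer):

```python
import re
from collections import Counter

def count_words(text, case_sensitive=False):
    """Count runs of letters, optionally treating 'The' and 'the' as one word."""
    words = re.findall(r"[^\W\d_]+", text)
    if not case_sensitive:
        # casefold() is a more aggressive lower() intended for caseless matching.
        words = [w.casefold() for w in words]
    return Counter(words)

text = "The elephant was the biggest at Deer Lake, where deer drink."
print(count_words(text)["the"])
print(count_words(text, case_sensitive=True)["the"])
```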

wordfrequencycounter2.fmx
