Following on from this idea...
It was suggested that I throw this out as a challenge, and so here it is.
We would like to see your solutions for providing a word count for a set of text. The count should provide a list of words and their frequency.
To make it fair, everyone should use this Wikipedia page on the Inuit language as their source. I chose this because it includes a number of non-standard words and characters that you will probably all be equally unfamiliar with. For example: ᖃᖓᑕᓲᒃᑯᕕᒻᒨᕆᐊᖃᓛᖅᑐᖓ or qangatasuukkuvimmuuriaqalaaqtunga (meaning "I'll have to go to the airport").
You obviously don't have to translate the words, but they should be included in the frequency count. I wouldn't worry about numbers or tables either. You can ignore those if you wish.
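For anyone who wants a baseline to compare against (for example inside a PythonCaller), here is a minimal Python sketch of the basic count. It is only one possible approach, not a reference answer: the function name `word_frequencies` and the regular expression are my own assumptions. The pattern keeps runs of letters in any script, so syllabic words are counted, and it skips digits so plain numbers are ignored.

```python
import re
from collections import Counter

# In Python 3, \w matches Unicode word characters (letters, digits, underscore).
# [^\W\d_] removes digits and underscore from that set, leaving essentially letters.
# An apostrophe is allowed inside a word so "I'll" stays in one piece.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def word_frequencies(text):
    """Return a Counter mapping each word (case preserved) to its count."""
    return Counter(WORD_RE.findall(text))


if __name__ == "__main__":
    # Made-up sample text, used only to exercise the function.
    sample = "I'll have to go to the airport. The airport opened in 1999."
    for word, count in word_frequencies(sample).most_common():
        print(word, count)
```

Note that this treats "The" and "the" as different words and does nothing about plurals, so it is a starting point rather than a "correct" result.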
Bonus points to be awarded for:
- A solution that can be published to the FME Hub
- A solution that reads directly from the web site (otherwise just paste the content into a text file)
- A solution that is generic and doesn't hard code words specific to that web page
- A solution that produces the correct result (which is going to be a bit subjective)
- Anything else that I think of later!
I thought about this task for a while and a few (maybe obvious) things occurred to me:
- A word is probably the same if it is plural (so elephant, elephants, and elephant's should all be counted as the same word really)
- Case is sometimes important (e.g. Deer Lake is a place, but "deer" and "lake" are different words) and sometimes not (e.g. "The elephant was the biggest": "The" and "the" are the same word)
- Speaking of Deer Lake, that's really one word, i.e. sometimes a sequence of words like that is a single phrase and really ought to be counted as such
I don't know if we can overcome any of these issues, or how, but they are things to consider. This toolkit has been suggested as a potential help.
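To make the three points above concrete, here is a rough Python sketch of one way they might be approximated. Everything in it is an assumption for illustration: the names `normalise` and `count_words`, the crude English-only plural rule, and the idea of treating a run of capitalised words separated by single spaces as one phrase. It will get plenty of cases wrong (a capitalised word at the start of a sentence followed by a proper noun is merged into the phrase, irregular plurals are missed, and the phrase pattern only knows ASCII letters), so treat it as a conversation starter.

```python
import re
from collections import Counter

# Letters in any script, with optional internal apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")
# Two or more capitalised ASCII words separated by single spaces, e.g. "Deer Lake".
PHRASE_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def normalise(word):
    """Fold case, then strip a possessive 's and a simple plural s (English only)."""
    w = word.lower()
    w = re.sub(r"['’]s$", "", w)  # elephant's -> elephant
    if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us", "is")):
        w = w[:-1]  # elephants -> elephant
    return w


def count_words(text):
    """Count capitalised phrases as single entries and normalise everything else."""
    counts = Counter(PHRASE_RE.findall(text))
    # Blank out the phrases so their parts are not counted a second time.
    remainder = PHRASE_RE.sub(" ", text)
    counts.update(normalise(token) for token in TOKEN_RE.findall(remainder))
    return counts


if __name__ == "__main__":
    text = ("We drove to Deer Lake. The elephant saw two elephants, "
            "and the elephant's trunk by the lake.")
    for word, count in count_words(text).most_common():
        print(word, count)
```

The plural rule is deliberately naive; a proper stemmer or lemmatiser would do better for English and would still not help with the Inuit words.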
So, have at it. Post your solutions here and we'll upload the best one to the hub for everyone to use.
Mark
P.S. I thought one thing that would help would be an HTMLTagStripper, to get rid of tags that aren't necessary. But it appears there's already one in the FME Hub! So that can get us off to a good start, compliments of its creator, Jan Bliki.
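If you would rather strip the tags in Python than in a transformer, the sketch below shows the general idea using the standard library's `html.parser`. To be clear, this is not how the HTMLTagStripper on the Hub works (I have not looked inside it); the class `TextExtractor` and the function `strip_tags` are hypothetical names for this example only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &eacute; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_tags("<p>Deer <b>Lake</b></p><script>var x = 1;</script>"))
```

One caveat: because the text pieces are joined with spaces, a word that is split by an inline tag (say, half of it in bold) would come out as two words, so check the output before counting.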