Question

Word Frequency Challenge

7 years ago
September 29, 2017
6 replies
87 views

+44

mark2atsafe
Safer
2520 replies

Following on from this idea...

It was suggested that I throw this out as a challenge, and so here it is.

We would like to see your solutions for providing a word count for a set of text. The count should provide a list of words and their frequency.

To make it fair, everyone should use this Wikipedia page on the Inuit language as their source. I chose this because it includes a number of non-standard words and characters that you will probably all be equally unfamiliar with. For example: ???????????????? or qangatasuukkuvimmuuriaqalaaqtunga (meaning "I'll have to go to the airport").

You obviously don't have to translate the words, but they should be included in the frequency count. I wouldn't worry about numbers or tables either. You can ignore those if you wish.
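As a starting point outside FME, the basic count could be sketched in Python roughly as follows. This is only an illustration, not one of the submitted solutions: the function name `word_frequencies` and the sample sentence are made up here, and "a word" is assumed to mean a run of Unicode letters (optionally joined by apostrophes), so numbers drop out and non-Latin scripts are kept.

```python
import re
from collections import Counter

# A run of Unicode letters, optionally joined by apostrophes so "I'll" stays whole.
# [^\W\d_] reads as "a word character that is not a digit or underscore".
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def word_frequencies(text):
    """Return a Counter mapping each word in `text` to its number of occurrences."""
    return Counter(WORD_RE.findall(text))

# Made-up sample text, not taken from the Wikipedia page.
sample = "The elephant saw the airport in 2017. qangatasuukkuvimmuuriaqalaaqtunga!"
for word, count in word_frequencies(sample).most_common():
    print(word, count)
```

Case is preserved here, so "The" and "the" are counted separately; whether that is the right call is one of the open questions of the challenge.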

Bonus points to be awarded for:

A solution that can be published to the FME Hub
A solution that reads directly from the web site (otherwise just paste the content into a text file)
A solution that is generic and doesn't hard code words specific to that web page
A solution that produces the correct result (which is going to be a bit subjective)
Anything else that I think of later!

I thought about this task for a while and a few (maybe obvious) things occurred to me:

A word is probably the same if it is plural (so elephant, elephants, and elephant's should all be counted as the same word really)
Case is important sometimes (eg Deer Lake is a place, but "deer" and "lake" are different) but sometimes not important (eg "The elephant was the biggest" - both "The" and "the" are the same word)
Speaking of Deer Lake, that's really one word - ie sometimes a sequence of words like that is a single phrase and really ought to be counted as such
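To make the plural point concrete, here is a deliberately naive Python sketch that folds "elephants" and "elephant's" into "elephant". The helper names are hypothetical, and the rule is an English-only heuristic (strip a possessive 's, then a single trailing s), not a real stemmer.

```python
import re
from collections import Counter

def fold_english_suffix(word):
    """Strip a trailing possessive or simple plural "s" (English-only heuristic)."""
    word = re.sub(r"['’]s$", "", word)            # elephant's -> elephant
    # Leave short words and common non-plural endings (glass, bus, this) alone.
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        word = word[:-1]                           # elephants -> elephant
    return word

def folded_counts(words):
    """Count words after folding simple plural and possessive forms together."""
    return Counter(fold_english_suffix(w) for w in words)

print(folded_counts(["elephant", "elephants", "elephant's", "glass"]))
```

A rule this simple still misfires on words such as "always", and it does nothing sensible for other languages, so it is best treated as an optional step.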

I don't know if we can overcome any of these issues, or how, but they are things to consider. This toolkit has been suggested as a potential help.

So, have at it. Post your solutions here and we'll upload the best one to the hub for everyone to use.

Mark

PS - I thought one thing that would help would be an HTMLTagStripper - to get rid of tags that aren't necessary. But it appears there's already one in the FME Hub! So that can get us off to a good start, compliments of its creator, Jan Bliki.

+28

jdh
Contributor
1981 replies
7 years ago
September 29, 2017

Lol, I put up a basic custom transformer on that idea half an hour before you posted this challenge. I considered html, but decided that resolving html entities (such as &eacute;) would take more time than I had on my coffee break.

I'll try to incorporate more of your requirements on Monday.

mark_f
325 replies
7 years ago
September 29, 2017

@Mark2AtSafe So what happened to the FME Challenge: Transformer Naming Part II ?

https://knowledge.safe.com/questions/47408/fme-challenge-transformer-naming.html

I was sure I had found one others hadn't ;) (and probably missed a few as well!)

lenaatsafe
275 replies
7 years ago
September 29, 2017

Hi @Mark2AtSafe

my two cents:

re elephant, elephants, and elephant's - this is definitely the same word, but if we simply assume that s makes plural and 's means possession, we limit our word count functionality to English only. Also, it is much harder with irregular verbs - is write/wrote/written one word or three different words? Saying this, we might want to simplify the task by treating word forms as different words.
re Deer Lake - afaik, this is usually handled with dictionaries. Is it OK to have a dictionary for this challenge?
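A dictionary-based approach to phrases like Deer Lake might look something like this in Python. The phrase set `PHRASES` is a hypothetical stand-in (a real one would come from a gazetteer or similar source), and the greedy longest-match strategy is just one possible choice.

```python
from collections import Counter

# Hypothetical phrase dictionary, stored as tuples of tokens.
PHRASES = {("Deer", "Lake")}
MAX_LEN = max(len(p) for p in PHRASES)

def merge_phrases(tokens):
    """Greedily join token runs that appear in PHRASES into single entries."""
    merged, i = [], 0
    while i < len(tokens):
        for n in range(MAX_LEN, 1, -1):           # try the longest phrase first
            chunk = tokens[i:i + n]
            if tuple(chunk) in PHRASES:
                merged.append(" ".join(chunk))
                i += len(chunk)
                break
        else:                                      # no phrase starts at this token
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "We drove to Deer Lake and saw a deer by the lake".split()
print(Counter(merge_phrases(tokens)))
```

Because the lookup is case sensitive, "Deer Lake" is merged while the plain nouns "deer" and "lake" are still counted on their own.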

+45

danilo_fme
Evangelist
2056 replies
7 years ago
September 30, 2017

It would be very interesting to create this custom transformer. I'll try to do it this weekend.

Thanks,

+28

jdh
Contributor
1981 replies
7 years ago
October 3, 2017

Ok. Version 2. It can accept either a URL or text.

It's a little naive in the html processing. It extracts just the body, resolves html entities, removes any scripting and drops all tags. The assumption is that it is a properly formatted html.
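For readers who want to see what those steps amount to, here is a rough Python 3 sketch using only the standard library. It is not the code inside the transformer; the class and function names are invented for illustration, and it makes the same assumption of reasonably well-formed HTML.

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect text inside <body>, skipping <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True means entities such as &eacute; arrive already decoded.
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Tags themselves never reach this method, so they are dropped implicitly.
        if self.in_body and not self.skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Return the visible body text of `html` with whitespace collapsed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

page = "<html><body><p>Caf&eacute; <b>elephant</b></p><script>var x = 1;</script></body></html>"
print(html_to_text(page))
```

Text fragments are joined with a space, which keeps adjacent paragraphs apart but would also split a word that has inline markup in the middle of it.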

For the actual text processing, it is case insensitive, and different word forms are counted separately.

Although Python 2.7 is used, it does not rely on any external modules.

wordfrequencycounter.fmx

+28

jdh
Contributor
1981 replies
7 years ago
October 3, 2017

And here is the corresponding 'Pure FME' version. Also added a parameter to make case sensitivity a choice.
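The case sensitivity switch is easy to picture in script form. The Python sketch below is only an analogy for the transformer parameter, with a made-up function name and parameter; it assumes words are runs of Unicode letters and uses `str.casefold()` for the case-insensitive mode.

```python
import re
from collections import Counter

def count_words(text, case_sensitive=False):
    """Count words; fold case with str.casefold() unless case_sensitive is True."""
    words = re.findall(r"[^\W\d_]+", text)        # runs of Unicode letters
    if not case_sensitive:
        words = [w.casefold() for w in words]
    return Counter(words)

sentence = "The elephant was the biggest"
print(count_words(sentence))                       # "The" and "the" counted together
print(count_words(sentence, case_sensitive=True))  # "The" and "the" kept apart
```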

wordfrequencycounter2.fmx
