Make sure that your code is indented consistently, because Python uses indentation to delimit code blocks. Each line in a code block must begin at the same column as the first line in that block.
Use your text editor to view hidden characters and do not mix tabs and spaces. A Python-aware editor like PyScripter (Windows) or TextWrangler (Mac) can help you spot these errors.
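To see what these two mistakes look like, here is a small sketch that compiles snippets from strings and reports the error type. It assumes Python 3 (older Python 2 interpreters are more lenient about mixed tabs and spaces), and the helper name indentation_problem is made up for this illustration.

```python
# Snippets are kept in strings so the errors can be caught and reported.
misaligned = "if True:\n    x = 1\n      y = 2\n"   # third line starts at a different column
mixed = "if True:\n\tx = 1\n        y = 2\n"        # a tab on one line, eight spaces on the next
consistent = "if True:\n    x = 1\n    y = 2\n"     # every line in the block starts at column 4


def indentation_problem(source):
    """Return the name of the indentation error raised when compiling source, or None."""
    try:
        compile(source, "<example>", "exec")
    except IndentationError as err:  # TabError is a subclass of IndentationError
        return type(err).__name__
    return None


for name, snippet in [("misaligned", misaligned), ("mixed", mixed), ("consistent", consistent)]:
    print(name, indentation_problem(snippet))
```

The mixed snippet looks fine in an editor whose tab width is eight, which is why turning on hidden characters helps.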
Hi,
That error could occur when the pattern (the first argument of the re.search function) contains parentheses ( ). Check whether any MATCH string contains them. The pattern string must be a valid regular expression, so special characters in the source string (i.e. MATCH in this case) have to be escaped beforehand. For example: ( --> \( and ) --> \)
Takashi
If MATCH may contain special characters (regular expression metacharacters), the re.escape function could be useful. This function returns a string in which every character except letters and digits is escaped.
-----
if re.search(re.escape(MATCH), SOURCE, re.IGNORECASE):
-----
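A minimal sketch of the difference, using a sector name with parentheses as MATCH. The article text is made up for illustration.

```python
import re

source = "The new plant in S-24(AJ) opened in Japan."  # made-up article text
match = "S-24(AJ)"                                     # a name that contains metacharacters

# Unescaped, the parentheses form a regex group, so the pattern means "S-24AJ".
unescaped_hit = re.search(match, source, re.IGNORECASE)

# Escaped, the parentheses are matched literally.
escaped_hit = re.search(re.escape(match), source, re.IGNORECASE)

# An unbalanced parenthesis is not a valid pattern at all and raises re.error.
try:
    re.search("Blok (5", source)
    unbalanced_error = None
except re.error as err:
    unbalanced_error = err

print(unescaped_hit)
print(escaped_hit.group(0))
print(unbalanced_error)
```

So an unescaped name can either raise an error or silently fail to match, depending on which characters it contains.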
The script is from my example?
> Add country attribute by searching for words
Sorry, I didn't notice that a country name may contain metacharacters.
Hi Takashi,
Thanks again for your help, I got the script to work but I did not yet reach my final goal.
My final goal would be to load news articles (in spreadsheets) into FME (as the 'SOURCE' attribute), which would then be automatically linked to a spatial location (the 'MATCH'). These locations are not always countries but can be, e.g., sectors (which apparently have some metacharacters in them).
The problem I have now is that many different countries have sectors with the same name, for example Sector 5 in Japan and Sector 5 in Belgium. As a result, a news article that mentions Sector 5 in its body (the SOURCE) is often linked to the wrong Sector 5. What would be the best way to link sector names only if there is also a match between country names? Do I implement it in the script (I have no Python knowledge whatsoever), or do I use FME transformers to link the countries before linking the sectors?
Kind regards,
dB
If possible, could you please show us concrete schemas of the spreadsheet and shapefile (related field columns and some sample contents) ?
Especially I'd like to know schema of the shape file. I guess the table contains 2 attributes - country name and sector name. Does the table look like this?
-----
country | sector
Belgium | Sector 1
Belgium | Sector 2
...
Japan | Sector 1
Japan | Sector 2
...
-----
Hi Takashi,
The shape files are indeed something like this:
SECTOR    | COUNTRY | FUNCTION   | .....
Blok-99   | Japan   | commercial |
S-24(AJ)  | Japan   | Industry   |
Blok-99   | Iran    | Nuclear    |
Blok 5    | Belgium | commercial |
Sector 99 | Belgium | Industry   |
This is a bit different from the first exercise you helped me with (where I had to retrieve the country name from a news article).
The source data now already has tables with the sector and the country listed in them, so I think it would be fairly easy to match them based on these two criteria.
Example:
ARTICLE | SECTOR| COUNTRY|
Nuclear powerplant 1st birthday| Blok-99 | Iran |
50% of on hello kitty | Blok-99 | Japan |
I would, however, like to build a script that could also take an article (one that mentions the country and the sector) as input.
For example:
"In Iran, the Nuclear powerplant in Blok-99 has celebrated its first birthday"
First I would like to match SOURCE to a MATCH based on the country, in this case Iran. In Iran, no two sectors have the same name, so I would then like the source that has been matched to Iran to be checked against all the sectors in Iran to see if there is a match.
What I have so far is your suggestion, which successfully matches the article with a country if its name is stated in the article.
Hope this is a bit clear :)
It's clear now. See my first post in the previous thread.
> Add country attribute by searching for words

Before FeatureMerger
You can see that the ListBuilder creates these lists, since the input table has two fields named SECTOR and COUNTRY.
_list{}.SECTOR
_list{}.COUNTRY

After FeatureMerger
The features from the ListExploder will have SECTOR and COUNTRY. After this, there is a StringSearcher which searches for the country name in the article. You don't need to change the workflow before this StringSearcher. Add a second StringSearcher to search for the sector name in the same article. Then you get the articles that contain a correct pair of COUNTRY and SECTOR. Even if you search for SECTOR before COUNTRY, the result will be the same.

A Python example would look like this.
-----
import fmeobjects, re

class LocationFinder(object):
    def __init__(self):
        pass

    def input(self, feature):
        source = feature.getAttribute('SOURCE')
        countries = feature.getAttribute('_list{}.COUNTRY')
        sectors = feature.getAttribute('_list{}.SECTOR')
        if not source or not countries or not sectors:
            return
        feature.removeAttrsWithPrefix('_list')
        # Output one feature per (country, sector) pair found in the article.
        for country, sector in zip(countries, sectors):
            if re.search(re.escape(country), str(source), re.IGNORECASE) \
                and re.search(re.escape(sector), str(source), re.IGNORECASE):
                newFeature = feature.cloneAttributes()
                newFeature.setAttribute('COUNTRY', country)
                newFeature.setAttribute('SECTOR', sector)
                self.pyoutput(newFeature)

    def close(self):
        pass
-----
Tip: If the shape table has many attributes other than SECTOR and COUNTRY, consider removing them with the AttributeRemover or the AttributeKeeper before the ListBuilder; this may improve efficiency.
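If you want to check the matching logic outside FME first, the same pairwise test can be written as a plain function. This is only a sketch: find_locations is a made-up name, and the two lists stand in for what the ListBuilder would provide.

```python
import re


def find_locations(source, countries, sectors):
    """Return the (country, sector) pairs whose two names both occur in source."""
    text = str(source)
    matches = []
    for country, sector in zip(countries, sectors):
        if (re.search(re.escape(country), text, re.IGNORECASE)
                and re.search(re.escape(sector), text, re.IGNORECASE)):
            matches.append((country, sector))
    return matches


# Sample rows from the shape table shown earlier in the thread.
countries = ["Japan", "Japan", "Iran", "Belgium", "Belgium"]
sectors = ["Blok-99", "S-24(AJ)", "Blok-99", "Blok 5", "Sector 99"]
article = "In Iran, the Nuclear powerplant in Blok-99 has celebrated its first birthday"

print(find_locations(article, countries, sectors))
```

With the sample data, only the Iran / Blok-99 pair is returned, even though Japan also has a Blok-99.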
Another approach just came to mind. Maybe the InlineQuerier can be used effectively. I'm home now, so I will try it tomorrow.
Thank you Takashi, you are too kind!
I got the script to work (with the 2x StringSearcher)! Because there are 33,000 different sectors, it does take a while to run, about 7 minutes. I will try running it with a Python script tomorrow to see if it's faster.
Thank you again!
Hi,
I think efficiency is important in this case. A more efficient Python script could also be considered, but it would be a little complicated, which I think is not preferable from the viewpoints of understandability and maintainability. I expect the InlineQuerier will be simpler and also more efficient.

InlineQuerier Settings Example
Assume the shape features have attributes named SECTOR and COUNTRY, and the spreadsheet features have attributes named ARTICLE and SOURCE.

Inputs
Table: Location
Columns:
SECTOR text
COUNTRY text
Table: NewsArticle
Columns:
ARTICLE text
SOURCE text

Outputs
Output Port: Matched
SQL Query:
-----
select a.fme_feature_content, b.ARTICLE, b.SOURCE
from Location as a cross join NewsArticle as b
where b.SOURCE like '%'||a.SECTOR||'%'
and b.SOURCE like '%'||a.COUNTRY||'%'
-----
Geometry: First Feature

With the settings above, "Location" and "NewsArticle" will be created as input ports of the InlineQuerier. Send the shape features to "Location" and the spreadsheet features to "NewsArticle". You will get shape features that have the associated ARTICLE and SOURCE as attributes.
The "cross join" and the where clause in the SQL statement are the key points. "a.fme_feature_content" selects all content (including geometry) from table "a", i.e. Location, so the InlineQuerier treats the shape features as something like the REQUESTOR of the FeatureMerger.
"a.fme_feature_content" can be replaced with "a.*". The InlineQuerier uses SQLite internally, so its limitations basically depend on the SQLite specification.
If possible, let us know for future reference which solution is more efficient on the actual data.
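The cross join and where clause can also be tried outside FME with Python's built-in sqlite3 module. This is only a sketch under assumptions: fme_feature_content exists only inside the InlineQuerier, so plain columns are selected instead, and the second article is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Location (SECTOR text, COUNTRY text)")
conn.execute("create table NewsArticle (ARTICLE text, SOURCE text)")

conn.executemany("insert into Location values (?, ?)", [
    ("Blok-99", "Japan"),
    ("S-24(AJ)", "Japan"),
    ("Blok-99", "Iran"),
    ("Blok 5", "Belgium"),
    ("Sector 99", "Belgium"),
])
conn.executemany("insert into NewsArticle values (?, ?)", [
    ("Nuclear powerplant 1st birthday",
     "In Iran, the Nuclear powerplant in Blok-99 has celebrated its first birthday"),
    ("Shop sale",  # made-up article for illustration
     "A shop in Blok-99, Japan, has announced a sale"),
])

# Same join and where clause as the InlineQuerier example, with plain columns.
rows = conn.execute("""
    select a.SECTOR, a.COUNTRY, b.ARTICLE
    from Location as a cross join NewsArticle as b
    where b.SOURCE like '%'||a.SECTOR||'%'
      and b.SOURCE like '%'||a.COUNTRY||'%'
""").fetchall()

for row in sorted(rows):
    print(row)
```

Each article ends up with the Blok-99 of its own country only. Note that % and _ are wildcards in LIKE, so a sector name containing one of them would match more loosely than a literal search.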
Takashi
Hi Takashi,
I got the InlineQuerier to work, and it's a lot faster and simpler than the ListExploder and StringSearchers.
Whereas the StringSearchers use 3'45" of computing time, the InlineQuerier does it in 9". I did not get the Python script to work.
The percentage of matches is similar, but slightly different. I will now polish the scripts and try to improve the recognition capacity, to see which one works best. I will definitely let you know the outcome. Thank you once more, it is so nice to be helped by someone halfway around the world.
Very kind regards
Hi,
just a quick heads-up regarding the InlineQuerier: The LIKE operator used for the matching is only case-insensitive for characters inside the ASCII range. For all others it is case sensitive, which may influence the matching.
Example:
- "AFGHANISTAN" LIKE "afghanistan" = TRUE
- "FØROYAR" LIKE "føroyar" = FALSE
Hopefully this will not be an issue for you.
David
David, thanks for the caution. Yes, case sensitivity could be an issue depending on the requirement. Unfortunately, there seems to be no way to control case sensitivity in the SQL syntax for the InlineQuerier. I also hope it will not be an issue in your project.
Maybe I could use the StringCaseChanger at the beginning of the script? Or does this also not affect the non-ASCII characters?
Hi,
the StringCaseChanger works for extended characters as well, so that might be a good solution.
David
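A small sketch of the idea, using Python's sqlite3 module to stand in for the InlineQuerier and str.lower() to stand in for the StringCaseChanger. The article text is made up, and the result of the raw non-ASCII comparison depends on how SQLite was built, so it is only printed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# ASCII letters: LIKE ignores case by default.
ascii_match = conn.execute("select 'AFGHANISTAN' like 'afghanistan'").fetchone()[0]

# Non-ASCII letters: default SQLite builds compare these case-sensitively.
raw_match = conn.execute("select 'FØROYAR' like 'føroyar'").fetchone()[0]

# Lower-casing both sides beforehand removes the dependency on LIKE's case handling.
source = "Ferry traffic to FØROYAR has resumed"  # made-up article text
country = "Føroyar"
normalized_match = conn.execute(
    "select ? like '%'||?||'%'",
    (source.lower(), country.lower()),
).fetchone()[0]

print(ascii_match, raw_match, normalized_match)
```

If the original spelling is needed in the output, the lower-cased text can be kept in a separate attribute that is used only for matching.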
Hi all
A short update on my script so far. I have two readers:
a. NewsArticle
ID-NA | Text | Country_S | Sector_S
1. ...
... ...
30 ...
b. Shapefiles of all different sectors
ID-S | Sector_dB | Country_dB |
1. ....
... ....
30000 .....
The main goal is to assign the different news articles to the correct sector. I am doing this with the InlineQuerier query that Takashi suggested. This query's output is all the articles which are matched both by country and by sector (let's say 14 out of the 30 articles).
I would now like to link the rest of the articles (the ones that have not been assigned to a specific sector) at the country level. I have a shapefile of all the countries.
So far I have come up with a query that links all the articles to a country, but I only need the ones that are not yet linked to a sector.
Is there an easy filter method or even better a query to efficiently do this?
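One possible approach, sketched here with Python's sqlite3 module rather than in FME itself, is to add a "not exists" condition that excludes every article that already has a sector match. The table and column names (ID_NA, ArticleText, Sector, Country) and the second and third articles are assumptions made for this illustration, and I have not verified this query inside the InlineQuerier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table NewsArticle (ID_NA integer, ArticleText text)")
conn.execute("create table Sector (Sector_dB text, Country_dB text)")
conn.execute("create table Country (Country_dB text)")

conn.executemany("insert into NewsArticle values (?, ?)", [
    (1, "In Iran, the Nuclear powerplant in Blok-99 has celebrated its first birthday"),
    (2, "Elections were held in Belgium last week"),   # made up: country only
    (3, "Nothing location-specific in this article"),  # made up: no match at all
])
conn.executemany("insert into Sector values (?, ?)", [
    ("Blok-99", "Iran"),
    ("Blok 5", "Belgium"),
])
conn.executemany("insert into Country values (?)", [("Iran",), ("Belgium",), ("Japan",)])

# Country-level matches, minus articles that already match a sector and its country.
country_only = conn.execute("""
    select n.ID_NA, c.Country_dB
    from NewsArticle as n cross join Country as c
    where n.ArticleText like '%'||c.Country_dB||'%'
      and not exists (
          select 1 from Sector as s
          where n.ArticleText like '%'||s.Sector_dB||'%'
            and n.ArticleText like '%'||s.Country_dB||'%'
      )
    order by n.ID_NA
""").fetchall()

print(country_only)
```

Article 1 is left out because it already matches Blok-99 in Iran, and article 3 is left out because it mentions no country, so only article 2 is linked at the country level.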