Solved

Text Diff


dbaldacchino1
Enthusiast

So this is a problem that keeps coming up...we need to report text differences between two versions of reports. I can access the before and after fields and read them into FME, but I am searching for ways to analyze the strings and report the differences (additions and deletions).

The ultimate goal is to format a report with additions (green) and deletions (red) clearly marked. As a start though, I have to find some method to analyze the two strings and perhaps break things down enough to be able to re-assemble a properly formatted string, perhaps with HTML. Nothing is jumping out right now as a simple answer, so I suspect it is not as easy :)

I have looked into the custom transformers "FuzzyStringComparer" and "FuzzyStringCompareFrom2Datasets", but I don't think they will help me much with the above. I thought about chopping a string into individual words and looping over them with regular expressions to determine which chunks existed before, and so identify additions and deletions, but that is starting to look more like a thesis project than something easily achieved. So I thought I'd ask here to see if anyone has other ideas that might jog my brain and set it on a potential path to success! Thanks in advance for your insight.

 

PS: I know of several online text diff tools and even found a very good PDF compare tool that retains the original formatting (which is desirable, but not crucial for this task). What I am looking for is a way to report data differences visually while keeping some control over the layout. BeyondCompare does a very good job too, but it lacks the control needed to create a single variance report with all the differences.

Best answer by gerhardatsafe

Hi @dbaldacchino1,

I would check out this python module and use it in a PythonCaller: https://docs.python.org/3.5/library/difflib.html Here is a usage example.

Hope this helps!
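
For readers who have not used difflib before, here is a minimal sketch of how the suggestion above could produce the red/green markup described in the question. It compares the two strings word by word with `difflib.SequenceMatcher` and wraps deletions and additions in HTML tags. The function name `diff_to_html` and the inline colour styles are assumptions made for illustration; they are not part of difflib or FME. Reading the before/after strings from feature attributes inside a PythonCaller is left out.

```python
import difflib
import html


def diff_to_html(before: str, after: str) -> str:
    """Return an HTML fragment with deleted words in red and added words in green."""
    old_words = before.split()
    new_words = after.split()
    # Compare lists of words rather than characters, so changes are reported per word.
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)

    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_chunk = html.escape(" ".join(old_words[i1:i2]))
        new_chunk = html.escape(" ".join(new_words[j1:j2]))
        if tag == "equal":
            parts.append(old_chunk)
        # A "replace" is emitted as a deletion followed by an addition.
        if tag in ("delete", "replace"):
            parts.append(f'<del style="color:red">{old_chunk}</del>')
        if tag in ("insert", "replace"):
            parts.append(f'<ins style="color:green">{new_chunk}</ins>')
    return " ".join(parts)


if __name__ == "__main__":
    print(diff_to_html("The pump was replaced in May",
                       "The pump was repaired in June 2018"))
```

Splitting on whitespace discards the original spacing and line breaks, which is usually acceptable for a change report but would need revisiting if layout matters.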


12 replies

Hi @dbaldacchino1,

 

The ChangeDetector might be useful to you. You can input both the Original and Revised report, and there are three possible outputs: Unchanged, Added and Deleted.

 

What file format is your report - PDF?

 


dbaldacchino1
Enthusiast
  • Author
  • Enthusiast
  • July 31, 2018
Thanks @hollyatsafe, I have actually already explored that but it's a dead end because it just detects feature changes. If I pursue a custom transformer with looping etc., I could use that transformer on a word by word basis but that still won't properly solve this problem, hence why I thought about regular expressions instead (using the words themselves from the "after" features to see if I can infer what was added and deleted when compared to the "before" features).

 

 

The reports are in PDF, but I can access the raw data in a Postgres database via an ODBC connector, or output XML data instead. That's why I'm exploring the use of FME to go around the original PDFs themselves and just create a new reporting mechanism to show changes within a date range.

Hi @dbaldacchino1,

 

Are you able to share a snippet of sample data (before and after) so I can better understand what you are trying to achieve? I do not think regex will work, as it looks for an exact match to the expressions you have specified, and I assume you do not know in advance how the string might change?

 



Hi @dbaldacchino1,

I would check out this python module and use it in a PythonCaller: https://docs.python.org/3.5/library/difflib.html Here is a usage example.

Hope this helps!


gerhardatsafe wrote:

Hi @dbaldacchino1,

I would check out this python module and use it in a PythonCaller: https://docs.python.org/3.5/library/difflib.html Here is a usage example.

Hope this helps!

Ahhhh Python...not my area of expertise (yet!). Thanks for the tip though; it might help me with perhaps customizing one of the above mentioned custom transformers. Would this be something that makes sense to put in as an idea for a new transformer? I would think it's a useful thing to have without requiring custom code.

 

 


dbaldacchino wrote:
Ahhhh Python...not my area of expertise (yet!). Thanks for the tip though; it might help me with perhaps customizing one of the above mentioned custom transformers. Would this be something that makes sense to put in as an idea for a new transformer? I would think it's a useful thing to have without requiring custom code.

 

 

Yes, I think those custom transformers use the same library; it's just a different use case. For the diff itself, I don't think it's worth reinventing the wheel. The challenge will be parsing the result you get back from the module, but I think that's achievable in FME. Maybe a new custom transformer for the Hub?

 

That said I think it's a great idea and I would definitely recommend to post it here to let people vote and add requirements for it.
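
On the point about parsing what the module returns: the sketch below is one possible way to do it, not how the existing custom transformers work. `difflib.ndiff` yields strings prefixed with a two-character code (`"+ "` added, `"- "` removed, `"  "` unchanged, `"? "` hint lines), so the result can be sorted into plain lists that would map naturally onto FME list attributes. The function name `split_changes` and the dictionary keys are assumptions chosen for illustration.

```python
import difflib


def split_changes(before: str, after: str) -> dict:
    """Sort a word-level ndiff result into added, deleted and unchanged words."""
    changes = {"added": [], "deleted": [], "unchanged": []}
    for line in difflib.ndiff(before.split(), after.split()):
        code, word = line[:2], line[2:]
        if code == "+ ":
            changes["added"].append(word)
        elif code == "- ":
            changes["deleted"].append(word)
        elif code == "  ":
            changes["unchanged"].append(word)
        # Lines starting with "? " are intraline hints and are skipped here.
    return changes


if __name__ == "__main__":
    print(split_changes("the quick brown fox", "the slow brown fox jumps"))
```

Keeping the three lists separate loses the position of each change; if the report needs the words in reading order, iterate over the `ndiff` output directly instead of the lists.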

 


gerhardatsafe wrote:
Yes, I think those custom transformers use the same library; it's just a different use case. For the diff itself, I don't think it's worth reinventing the wheel. The challenge will be parsing the result you get back from the module, but I think that's achievable in FME. Maybe a new custom transformer for the Hub?

 

That said I think it's a great idea and I would definitely recommend to post it here to let people vote and add requirements for it.

 

Thanks, will post. I will definitely take a look at this library and see if I can spin off a custom transformer based on them. At least this gives me comfort that I wasn't missing any other available techniques and I surely don't have time to reinvent the wheel...I like the wheel! Have a great day and thanks to all.

 

 


paalped
Contributor
  • Contributor
  • August 3, 2018

Sounds like you want to use git.

 

Have you tried any git tools?


paalped wrote:

Sounds like you want to use git.

 

Have you tried any git tools?

Hi @paalped no I have not. I need to keep everything within FME (each string would be a feature that I'd need to analyse the before & after before formatting one single report for all features). I'm also looking into possibilities with JScript and a couple other variants as I might be able to call a function within the report application directly and bypass FME altogether since in this case FME was "a way out" of the limited report system (XSLT used by Ecrion rendering server). I was seeing FME as a possible way to make a custom diff reporting system, but since it looks like it'll take coding, I'm taking a second look at the original report authoring application (Ecrion Publisher) since it allows scripting directly there.

 


paalped
Contributor
  • Contributor
  • August 4, 2018

@dbaldacchino1 I built this transformer for you: TextDifferenceReportGenerator


dbaldacchino1
Enthusiast
  • Author
  • Enthusiast
  • August 6, 2018
paalped wrote:

@dbaldacchino1 I built this transformer for you: TextDifferenceReportGenerator

Thanks @paalped! I'll give it a try in the morning and will post back.

 


dbaldacchino1
Enthusiast
  • Author
  • Enthusiast
  • August 6, 2018
paalped wrote:

@dbaldacchino1 I built this transformer for you: TextDifferenceReportGenerator

Hi @paalped, I found there are some other options that can be set so it looks at each word and exports an HTML file instead of a string (it looks like you're taking the string output from difflib and formatting it into HTML yourself). Here's an example I found last week: https://www.youtube.com/watch?v=a1x6h19M9j0&t=7s

 

 

Would you be willing to share your password or remove protection so I/we can further improve upon your work? Thanks again!
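
For reference, difflib can also build a complete HTML page itself through its `HtmlDiff` class. The sketch below passes one word per "line" so the side-by-side table is word-based; whether this matches what the linked video does is an assumption, and the function name `word_diff_html_page` and the column labels are made up for illustration.

```python
import difflib


def word_diff_html_page(before: str, after: str) -> str:
    """Return a full HTML page with a side-by-side, word-level comparison table."""
    differ = difflib.HtmlDiff(wrapcolumn=60)
    # Each word is treated as one "line", so every table row holds a single word.
    return differ.make_file(
        before.split(),
        after.split(),
        fromdesc="Before",
        todesc="After",
    )


if __name__ == "__main__":
    page = word_diff_html_page("the quick brown fox", "the slow brown fox jumps")
    # Write the page to disk so it can be opened in a browser.
    with open("diff_report.html", "w", encoding="utf-8") as handle:
        handle.write(page)
```

The layout and colours come from the CSS that `HtmlDiff` embeds in the page, so this route gives less control over formatting than assembling the HTML string yourself.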
