
Any ideas on how the power of FME could be used to convert a 'decapitated' CAD file back to text?

I have a PDF file which used to be a CAD drawing (not sure if Autodesk or Bentley) and it contains only a 'dropped/exploded' line segment geometry. I have managed to read it into FME as line geometries, rotate the page so that it's 'level' and use LineCombiner to join segments back to 'character shapes'.

Now I have geometries as shown below (some are perfect characters, some are in 2-3 parts, like 'd' or 'b') and have no idea how to turn them back into text. I tried exporting to DGN, but Microstation doesn't seem to have a function like that either (i.e. once you drop/explode text to lines, there's only 'Undo' to help; no function to 'characterise' it again, that I could find).
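For anyone attempting the same segment-joining step outside FME, here's a rough pure-Python sketch of the idea. LineCombiner does this for you inside FME; the tolerance snapping and union-find below are just my own illustration, not how the transformer is actually implemented:

```python
from collections import defaultdict

def group_segments(segments, tol=1e-6):
    """Group line segments into connected 'character shapes' by shared endpoints.

    segments: list of ((x1, y1), (x2, y2)) tuples.
    Returns a list of groups, each a list of the original segments.
    """
    def key(pt):
        # Snap coordinates to a tolerance grid so nearly-equal endpoints match.
        return (round(pt[0] / tol), round(pt[1] / tol))

    # Union-find over segment indices.
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Index segments by their (snapped) endpoints, then union everything
    # that touches the same point.
    by_endpoint = defaultdict(list)
    for i, (a, b) in enumerate(segments):
        by_endpoint[key(a)].append(i)
        by_endpoint[key(b)].append(i)
    for idxs in by_endpoint.values():
        for i in idxs[1:]:
            union(idxs[0], i)

    groups = defaultdict(list)
    for i, seg in enumerate(segments):
        groups[find(i)].append(seg)
    return list(groups.values())
```

Two segments meeting at (0, 0) end up in one group; an isolated stroke stays in its own.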

Also tried rasterising it with ImageRasterizer and then calling Tesseract to attempt OCR, but it stubbornly returns 'No text found' regardless of resolution, colour, pixel size, background colour or language.

So any further ideas how to resurrect this 'almost there' file?

Is it all the same font?

If so, you might be able to distinguish the characters by properties such as:

  • number of vertices
  • angle between first and last vertex
  • bounding box height/width ratio
  • etc.

It would be a slow, painstaking process to get it working. Might do the trick though
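To illustrate, a rough sketch of extracting such properties from one joined shape. This is pure Python and the feature names are my own invention, not anything built into FME; you'd then match these feature vectors against ones computed from a reference font:

```python
import math

def shape_features(points):
    """Compute simple features for one joined 'character shape'.

    points: list of (x, y) vertices in drawing order.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    (x1, y1), (x2, y2) = points[0], points[-1]
    return {
        "n_vertices": len(points),
        # Height/width ratio is scale-invariant, so one rule can
        # cover every font size in the drawing.
        "aspect": height / width if width else float("inf"),
        # Angle of the chord from first to last vertex, in degrees.
        "chord_angle": math.degrees(math.atan2(y2 - y1, x2 - x1)),
        # Whether the outline closes on itself (e.g. 'O' vs 'C').
        "closed": points[0] == points[-1],
    }
```

Because the aspect ratio and chord angle are independent of absolute size, the rule set wouldn't need separate entries per font size, only per character (and per rotation, for the vertical text).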



@jakemolnar yes it is, but as with any CAD schematic, some text is vertical, some horizontal, and the sizes differ. I'm not sure I'm determined enough to create a set of rules for every character and scale them by font size. See below.


Do you get any improvement if you buffer the lines prior to rasterisation?

And I'm not sure if the GoogleVisionConnector is an option for the OCR. I saw someone mention using it (albeit unsuccessfully) for a word-find problem with last week's quiz...
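For what it's worth, here's a pure-Python sketch of why buffering can help: one-pixel hairline strokes often defeat OCR engines, and thickening them before rasterisation yields solid glyphs. In FME you'd put a Bufferer before the ImageRasterizer; the grid-and-dilation code below is only an illustration of the effect, not what those transformers actually do:

```python
def rasterize(segments, size):
    """Rasterise line segments onto a size x size binary grid (1 = ink)."""
    grid = [[0] * size for _ in range(size)]
    for (x1, y1), (x2, y2) in segments:
        # Simple parametric stepping; fine for an illustration.
        steps = max(abs(x2 - x1), abs(y2 - y1), 1)
        for s in range(steps + 1):
            x = round(x1 + (x2 - x1) * s / steps)
            y = round(y1 + (y2 - y1) * s / steps)
            grid[y][x] = 1
    return grid

def dilate(grid):
    """One-pixel dilation: thickens hairline strokes into solid glyphs."""
    size = len(grid)
    out = [row[:] for row in grid]
    for y in range(size):
        for x in range(size):
            if grid[y][x]:
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < size and 0 <= nx < size:
                            out[ny][nx] = 1
    return out
```

A horizontal hairline of 5 pixels becomes a 7x3 solid bar after one dilation pass, which is much closer to what OCR training data looks like.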



@lindsay I have failed at attempting to set up a Cloud authentication. The built-in one does not work for me. @gerhardatsafe ??



Hi @kkrajewski,

We just released a new version of the Google AI package including the GoogleVisionConnector that allows you to use a Service Account key directly in the transformer (https://cloud.google.com/docs/authentication/end-user#creating_your_client_credentials).

You can still use your own OAuth 2.0 credentials as well. Here's a good resource on how to create and use your own OAuth 2.0 client: https://cdn.safe.com/resources/ebook/Creating-Web-Connections.pdf

You can upgrade to the new package in FME Desktop under FME Options -> Packages or download the new version via FME Hub and drag & drop it onto your canvas.

Let us know how this goes!


In that case, I'd say @lindsay has the best idea: try buffering before raster OCR. Otherwise you are dealing with the hard problem of rolling your own character recognition. https://imgs.xkcd.com/comics/tasks.png

