Question

Regex parse ticket information from web scrape

6 years ago
January 17, 2019
4 replies
10 views

gisbradokla_t
24 replies

OK S00224 11-18_1-19.csv

I have a list that I need to parse. I am not very well versed in regex. I can build my attribute names manually (and in fact will have to change some of them as they are). I need to know how to:

1. handle the different number of dashes between the attribute name and the attr value.

2. retrieve multiple attributes from one line.

3. remove the brackets and write the value to the attribute with varying length strings

4. handle LF

5. manage dashes inside of brackets (keep)

Here is some of the data structure of the list...

RKORT 1712 OKOCS 11/01/18 11:33:07 18110111301739 UPDATE

TICKET NUMBER--[18110111301739]

OLD TICKET NUM-[18102215173653]

MESSAGE TYPE--[UPDATE] LEAD TIME--[48]

PREPARED------[11/01/18] AT [11:30] BY [IMTHEBESTMAN@TRUCKERCONST.COM]

CONTRACTOR--[MTRUCKER CONSTRUCTION] CALLER--[JARVIS BESTMAN]

ADDRESS-----[707 S CREEK COUNTRY ROAD]

COUNTY-[PINE] PLACE--[CUTHING]ADDRESS-----[] STREET--[E][DEEROCK][RD][]

NEARBY MAJOR INTERSECTION-[N LITTLE AVE AND E DEEROCK RD]

LATITUDE--[43.014899] LONGITUDE--[-166.761132]

SECONDARY LATITUDE--[43.017155] SECONDARY LONGITUDE--[-166.758085]ADDITIONAL ADDRESSES IN LOCATION--[N]

LOCATION INFORMATION--[7:34:18 - PIPELINE - FROM THE INT OF N LITTLE AVE AND E DEEROCK RD, EAST]

[ON E DEEROCK RD 0.41 MI, NORTH 0.10 MI ONTO UNMARKED ROAD -- LOCATE 120 FT]

[EAST, 211 FT SOUTH, 126 FT WEST, 200 FT NORTH AND EVERYTHING WITHIN -- MAY]

[BE A GATED ENTRANCE]

markatsafe
1891 replies
6 years ago
January 23, 2019

The reason this is so tricky is that:

- some lines have only one attribute and value, e.g. OLD TICKET NUM-[18102215173653]

- some lines have several attributes/values, e.g. MESSAGE TYPE--[UPDATE] LEAD TIME--[48], but note that the attribute name "MESSAGE TYPE" isn't quoted. The separator between attribute/value pairs is two or more <space> characters, e.g. MESSAGE<sp>TYPE--[UPDATE]<sp><sp><sp>LEAD TIME--[48]

- some of the attribute/value pairs are multi-line, e.g. the LOCATION INFORMATION attribute above.

If we strip out the special cases (like the multi-line attribute values) and handle them separately, then all the 'simple' attributes can be reduced to name / value pairs and created in the AttributeCreator.
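For readers working outside FME, the same reduction to name/value pairs can be sketched in Python with a single pattern. This is only an illustration, not what the workspace does: the names PAIR and simple_pairs are made up for the example, and it assumes attribute names consist of uppercase letters and spaces only. Values without a dashed name in front of them (such as AT [11:30] or the extra brackets after STREET) are skipped.

```python
import re

# NAME, one or more dashes, then [VALUE]. Dashes inside the brackets are kept
# because the value is "anything up to the closing bracket".
PAIR = re.compile(r"([A-Z][A-Z ]*?)-+\[([^\]]*)\]")

def simple_pairs(line):
    """Return (name, value) tuples for every NAME--[VALUE] found in a line."""
    return [(m.group(1).strip(), m.group(2)) for m in PAIR.finditer(line)]

if __name__ == "__main__":
    print(simple_pairs("LATITUDE--[43.014899] LONGITUDE--[-166.761132]"))
```

Because the pattern anchors on the brackets rather than on the spacing, it does not depend on how many spaces separate two pairs on the same line.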

The multi-line attributes can be handled using a rarely used feature of the AttributeCreator: Advanced Attribute Handling - Enable Adjacent Feature Attributes. This allows the AttributeCreator to work with several lines (features) at the same time (see AttributeCreator_2).
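A rough analogue of that idea outside FME, as a Python sketch: treat any line that starts with "[" as a continuation of the previous line's value. The function name merge_continuations is hypothetical, and the rule "a continuation line starts with an opening bracket" is an assumption based on the sample data in the question, not something taken from the workspace.

```python
def merge_continuations(lines):
    """Fold bracket-only lines into the value of the preceding attribute line."""
    merged = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore the blank lines between records
        if line.startswith("[") and merged:
            previous = merged[-1]
            if previous.endswith("]"):
                previous = previous[:-1]  # reopen the previous value
            # Drop this line's opening bracket and join with a single space.
            merged[-1] = previous + " " + line[1:]
        else:
            merged.append(line)
    return merged
```

After this step the LOCATION INFORMATION value sits on one line inside a single pair of brackets, so it can be treated like any other simple attribute.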

Along the way, regular expressions are used in StringReplacer transformers to clean up the text.

Example workspace (2018.1): onecallticketparser V1.fmw

There's nothing wrong with the original workspace, but this approach just makes things a little less sensitive to the changes in the source data.

gisbradokla_t
Author
24 replies
6 years ago
January 24, 2019

Thank you Mark.

I uploaded my latest workspace. I had continued working in the original direction. Without regex, I just removed spaces and characters until I had it down to one attribute per line. Then I added LF and I just process every other line. Not pretty, but...

I am inspecting your workspace to see if (1) it helps me get a little better at regex and (2) I can carry it into a final output.

Again, thank you.

gisbradokla_t
Author
24 replies
6 years ago
January 24, 2019

Can you explain what you are doing in AttributeSplitter_4? I can't determine what that character is and why. I see it also in StringReplacer_5

markatsafe
1891 replies
6 years ago
January 24, 2019

gisbradokla_t wrote:

Can you explain what you are doing in AttributeSplitter_4? I can't determine what that character is and why. I see it also in StringReplacer_5

@kidsmake6until2 In StringReplacer_4 I've used a regular expression, "\s{2,}", to replace two or more spaces with the pipe (|) character, for those records that have two sets of attribute/value pairs on the same row, like:

MESSAGE<sp>TYPE--[UPDATE]<sp><sp><sp>LEAD TIME--[48]

"|" is usually a safe bet when creating a delimiter, as it doesn't generally appear in text (you could use a TAB or some other character). You can't just use <sp> as a delimiter in the AttributeSplitter because some attributes have a single space in the name.

Similarly, in StringReplacer_5 I'm using the regex "[-]+\[|\]" to remove the attribute value "wrapper" (i.e. --[UPDATE]). Again, you can't use "-[" in the AttributeSplitter because there can be any number of hyphens "-" as part of the wrapper. So replace the -[ pattern with a more consistent | and split on that.
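The two replace-then-split steps can be tried outside FME with a small Python sketch. It roughly mirrors the description above rather than reproducing the workspace: parse_line is a hypothetical name, and the sketch assumes the runs of two or more spaces between pairs are still present in the input (pasting the text into a web page can collapse them to a single space).

```python
import re

def parse_line(line):
    """Mimic the two replace-then-split steps on one line of simple pairs."""
    attributes = {}
    # Step 1: runs of 2+ whitespace separate pairs; swap them for a pipe.
    for pair in re.sub(r"\s{2,}", "|", line.strip()).split("|"):
        # Step 2: swap the '--[' and ']' wrapper for pipes, then split.
        # A dash inside the brackets is not followed by '[', so it is kept.
        parts = re.sub(r"-+\[|\]", "|", pair).split("|")
        if len(parts) >= 2:
            attributes[parts[0].strip()] = parts[1]
    return attributes

if __name__ == "__main__":
    print(parse_line("LATITUDE--[43.014899]   LONGITUDE--[-166.761132]"))
```

Lines with no bracketed value, such as the ticket header row, produce an empty dictionary and can be filtered out or handled separately.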

It's a tough document to parse!
