Question

Regex parse ticket information from web scrape



OK S00224 11-18_1-19.csv

I have a list that I need to parse. I am not very well versed in regex. I can manually build my attribute names (and in fact will have to change some of these as they are). I need to know how to:

1. handle the different number of dashes between the attribute name and the attr value.

2. retrieve multiple attributes from one line.

3. remove the brackets and write the value to the attribute with varying length strings

4. handle LF

5. manage dashes inside of brackets (keep)

Here is some of the data structure of the list...

RKORT 1712 OKOCS 11/01/18 11:33:07 18110111301739 UPDATE

TICKET NUMBER--[18110111301739]

OLD TICKET NUM-[18102215173653]

MESSAGE TYPE--[UPDATE] LEAD TIME--[48]

PREPARED------[11/01/18] AT [11:30] BY [IMTHEBESTMAN@TRUCKERCONST.COM]

CONTRACTOR--[MTRUCKER CONSTRUCTION] CALLER--[JARVIS BESTMAN]

ADDRESS-----[707 S CREEK COUNTRY ROAD]

COUNTY-[PINE] PLACE--[CUTHING]ADDRESS-----[] STREET--[E][DEEROCK][RD][]

NEARBY MAJOR INTERSECTION-[N LITTLE AVE AND E DEEROCK RD]

LATITUDE--[43.014899] LONGITUDE--[-166.761132]

SECONDARY LATITUDE--[43.017155] SECONDARY LONGITUDE--[-166.758085]ADDITIONAL ADDRESSES IN LOCATION--[N]

LOCATION INFORMATION--[7:34:18 - PIPELINE - FROM THE INT OF N LITTLE AVE AND E DEEROCK RD, EAST]

[ON E DEEROCK RD 0.41 MI, NORTH 0.10 MI ONTO UNMARKED ROAD -- LOCATE 120 FT]

[EAST, 211 FT SOUTH, 126 FT WEST, 200 FT NORTH AND EVERYTHING WITHIN -- MAY]

[BE A GATED ENTRANCE]

4 replies

  • January 23, 2019

The reason this is so tricky is that:

- some lines have only one attribute and value, e.g. OLD TICKET NUM-[18102215173653]

- some lines have several attributes/values, e.g. MESSAGE TYPE--[UPDATE] LEAD TIME--[48], but note that the attribute name "MESSAGE TYPE" isn't quoted. The separator between attribute/value pairs is two or more <space> characters, e.g. MESSAGE<sp>TYPE--[UPDATE]<sp><sp><sp>LEAD TIME--[48]

- some of the attribute/value pairs are multi-line, e.g. the LOCATION INFORMATION attribute above.

If we strip out the special cases (like the multi-line attribute values) and handle them separately, then all the 'simple' attributes can be reduced to name / value pairs and created in the AttributeCreator.
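
For readers who want to test the idea outside FME, here is a minimal Python sketch of reducing a "simple" line to name/value pairs. It is not what the workspace does internally; the pattern and the function name `parse_pairs` are assumptions made for illustration. It assumes attribute names consist only of capital letters and single spaces, and it will not pick up bare bracketed values such as the AT [..] and BY [..] parts of the PREPARED line.

```python
import re

# Attribute name (capital letters and spaces), one or more hyphens,
# then a bracketed value. Hyphens inside the brackets are kept.
PAIR = re.compile(r"([A-Z][A-Z ]*?)-+\[([^\]]*)\]")

def parse_pairs(line):
    """Return (name, value) tuples for every NAME--[VALUE] pair on one line."""
    return [(name.strip(), value) for name, value in PAIR.findall(line)]

print(parse_pairs("MESSAGE TYPE--[UPDATE]   LEAD TIME--[48]"))
# [('MESSAGE TYPE', 'UPDATE'), ('LEAD TIME', '48')]
print(parse_pairs("LONGITUDE--[-166.761132]"))
# [('LONGITUDE', '-166.761132')]
```

Because the pattern accepts any number of hyphens before the opening bracket, it covers both OLD TICKET NUM-[...] and PREPARED------[...] without special cases.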

The multi-line attributes can be handled using a rarely used feature of the AttributeCreator: Advanced Attribute Handling - Enable Adjacent Feature Attributes. This allows the AttributeCreator to work with several lines (features) at the same time (see AttributeCreator_2).
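
As a rough stand-in for the adjacent-feature approach, the following Python sketch folds continuation lines into the line before them. It is only an illustration, not the AttributeCreator logic itself; the function name `merge_continuations` is hypothetical, and it assumes that every continuation line starts with "[" (as in the LOCATION INFORMATION sample) and that no new attribute line does.

```python
def merge_continuations(lines):
    """Fold lines that start with '[' into the value of the previous line."""
    merged = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip the blank lines between records
        if line.startswith("[") and merged:
            prev = merged[-1]
            if prev.endswith("]"):
                prev = prev[:-1]  # drop the closing bracket of the previous part
            # Drop this line's opening bracket and join with a single space.
            merged[-1] = prev + " " + line[1:]
        else:
            merged.append(line)
    return merged

sample = [
    "LOCATION INFORMATION--[FROM THE INT OF N LITTLE AVE AND E DEEROCK RD, EAST]",
    "",
    "[ON E DEEROCK RD 0.41 MI -- LOCATE 120 FT]",
    "",
    "[BE A GATED ENTRANCE]",
]
print(merge_continuations(sample))
```

After merging, the multi-line attribute is a single NAME--[VALUE] line and can go through the same handling as the simple attributes.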

Along the way, regular expressions are used in StringReplacer transformers to clean up the text.

Example workspace (2018.1): onecallticketparser V1.fmw

There's nothing wrong with the original workspace, but this approach just makes things a little less sensitive to the changes in the source data.


In reply to markatsafe:

Thank you Mark.

I uploaded my latest workspace. I had continued working in the original direction. Without regex, I just removed spaces and characters until I had it down to 1 attribute per line. Then I added LF and I just process every other line. Not pretty, but...

I am inspecting your workspace to see whether (1) it helps me get a little better at regex and (2) I can carry it into a final output.

Again, thank you.


In reply to markatsafe:

Can you explain what you are doing in AttributeSplitter_4? I can't determine what that character is or why it is used. I also see it in StringReplacer_5.


  • January 24, 2019
In reply to gisbradokla_t:

@kidsmake6until2 In StringReplacer_4 I've used a regular expression to replace two or more spaces "\s{2,}" with the pipe (|) character - for those records that have two sets of attribute/value pairs on the same row, like:

MESSAGE<sp>TYPE--[UPDATE]<sp><sp><sp>LEAD TIME--[48]

"|" is usually a safe bet when creating a delimiter as they don't generally appear in text (you could use a TAB or some other character). You can't just use <sp> as a delimiter in the AttributeSplitter because some attributes have a single space in the name.

Similarly, in StringReplacer_5 I'm using the regex "[-]+\[|\]" to remove the attribute value "wrapper" (i.e. --[UPDATE]). Again, you can't use "-[" in the AttributeSplitter because there can be any number of hyphens "-" as part of the wrapper. So replace the -[ pattern with a more consistent | and split on that.
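
To see the two replacements working together outside FME, here is a small Python sketch that mimics the replace-then-split idea with `re.sub` and `str.split`. It is an approximation, not the transformers themselves; the function name `split_line` is made up for the example, and it assumes that values never contain a pipe or a run of two or more spaces.

```python
import re

def split_line(line):
    """Return [(name, value), ...] using the pipe-delimiter approach."""
    # Step 1 (like StringReplacer_4): two or more whitespace characters -> "|",
    # then split into one chunk per attribute/value pair.
    chunks = re.sub(r"\s{2,}", "|", line.strip()).split("|")
    result = []
    for chunk in chunks:
        # Step 2 (like StringReplacer_5): any run of hyphens followed by "[",
        # or a lone "]", becomes "|". Hyphens inside the value are untouched
        # because they are not followed by "[".
        parts = re.sub(r"[-]+\[|\]", "|", chunk).split("|")
        if len(parts) >= 2:
            result.append((parts[0].strip(), parts[1]))
    return result

print(split_line("MESSAGE TYPE--[UPDATE]   LEAD TIME--[48]"))
# [('MESSAGE TYPE', 'UPDATE'), ('LEAD TIME', '48')]
print(split_line("LONGITUDE--[-166.761132]"))
# [('LONGITUDE', '-166.761132')]
```

Note that "\s" matches tabs and line breaks as well as spaces, so in other data a stricter pattern such as two or more literal spaces may be safer.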

It's a tough document to parse!

