the sequence comments are returned as string tuples. The next step
is to take these results, convert them to integers, and verify that
they occur in the expected linear order.
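A minimal sketch of that conversion and ordering check, assuming the
comments parse as ("start", "end") string pairs (the exact shape may
differ):

    def parse_sequence_numbers(raw_pairs):
        """Convert ("start", "end") string tuples to ints and check that
        each pair picks up right after the previous one."""
        pairs = [(int(start), int(end)) for start, end in raw_pairs]
        for (_, prev_end), (next_start, _) in zip(pairs, pairs[1:]):
            if next_start != prev_end + 1:
                raise ValueError("sequence break: %d does not follow %d"
                                 % (next_start, prev_end))
        return pairs

    # parse_sequence_numbers([("1", "9"), ("10", "19"), ("20", "29")])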
There's an issue parsing p1220 at line 2570. Making the parser ignore
full-width lines during parsing might fix the problem, if there's some
way to measure the length of a row while only counting single-spaced words.
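One possible heuristic, assuming rows arrive as plain strings: columnar
rows separate their cells with runs of two or more spaces, so a row that
splits into a single long cell of single-spaced words is probably
full-width prose (the word-count threshold here is arbitrary):

    import re

    def is_full_width(row, min_words=8):
        # columnar rows have 2+ space gaps between cells; prose rows
        # are one long run of single-spaced words
        cells = [c for c in re.split(r"\s{2,}", row.strip()) if c]
        return len(cells) == 1 and len(cells[0].split()) >= min_words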
backwards to determine the record name. We also added the ability to "break" from
reading a series of field definitions at certain break points such as
"Record Layout". There is currently an error in p1220 at line 2704, caused
by the column data starting in the 4th column, "Description and Remarks".
If ColumnCollectors started from the field titles, and were aware of the column
positions derived from them, it might be possible to at least read the following
record fields without auto-adjusting them.
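ColumnCollector's real interface lives in the project; this standalone
sketch just shows the idea of seeding column spans from the title row:

    import re

    def title_column_spans(title_row):
        # character spans of each title in a header row such as
        # "Field Position   Field Title   Length   Description and Remarks"
        return [m.span() for m in re.finditer(r"\S+(?: \S+)*", title_row)]

    def slice_by_spans(row, spans):
        # cut a data row at the title start positions; each cell runs
        # from its title's start to the next title's start
        starts = [start for start, _ in spans] + [len(row)]
        return [row[a:b].strip() for a, b in zip(starts, starts[1:])]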
proper information prior to getting passed into the ColumnCollector.
It seems like some things are getting stripped out due to blank lines,
or perhaps the annoying "Record Layout" pages. If we could extract the
"Record Layout" sections first, things might be simpler.
are overlapping. I'm assuming this is due to a missing continue
or something similar inside the ColumnCollector. I added a couple of new
IsNextRecord exceptions in response to blank rows, but these may be causing
more problems than expected. The next step is probably to check the records
returned and verify that nothing is being duplicated. Some of the duplicates
may be filtered out by the RecordBuilder class, or during the fields filtering
in the pyaccuwage-pdfparse script (see: fields).
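A quick way to run that check; keying on repr here is just a stand-in
for whatever record identity actually makes sense:

    from collections import Counter

    def find_duplicates(records, key=repr):
        # report anything the extractor returned more than once
        counts = Counter(key(rec) for rec in records)
        return [k for k, n in counts.items() if n > 1]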
to track the beginning/ending points for each record and append
continuation records onto the previous one. There's some issue in
the pyaccuwage-pdfparse script that causes problems reading the
last record field in a record group. Maybe the record extractor
needs to dump the last failed ColumnCollector rather than return it,
if it's determined to hold junk data?
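Something like this at the end of the extraction loop, with is_junk
standing in for whatever test makes sense (too few columns, no parsable
range, and so on):

    def finish_extraction(collectors, is_junk):
        # drop a trailing collector that holds junk instead of returning it
        if collectors and is_junk(collectors[-1]):
            collectors = collectors[:-1]
        return collectors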
The record builder seems to handle everything just fine.
Added a function to the field name parsing that replaces ampersands
with the string "and" so they don't cause problems in variable names.
an issue where full-page-width blocks are being interpreted as a
single large column, and subsequent field definition columns are
then being folded into it as truncated subcolumns.
The current problematic line in p1220 is 1598.
Maybe add some functionality that lets us specify the number of
columns we're most interested in, or automatically discard 1-column
ColumnCollectors?
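The filter could be as simple as this, assuming a collector exposes
its columns (that attribute name is a guess):

    MIN_COLUMNS = 2  # a 1-column collector is almost certainly full-width text

    def useful_collectors(collectors):
        return [c for c in collectors if len(c.columns) >= MIN_COLUMNS]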
very confusing to keep track of, due to global iterators being passed around
and iterated over in chunks.
I've added a located_heading_rows method that scans the entire document
for row numbers that look like record definition headings. I think we
can use these number spans to feed into the row columnizer stuff.
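The scan itself is presumably just a pattern match over every row;
something like the following, where the heading pattern is a guess at
how p1220 labels its record definitions:

    import re

    HEADING_RE = re.compile(r"Record Name:")  # guessed heading marker

    def located_heading_rows(rows):
        # row numbers that look like record definition headings; pairs
        # of consecutive hits give the spans to feed the row columnizer
        return [i for i, row in enumerate(rows) if HEADING_RE.search(row)]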
range matches the next expected range. The only way to get around
this seems to be to manually remove the range value from the input.
One idea is to iterate through the entire token set looking for range
tokens. Right now, when a range token correctly continues the sequence,
it is assumed to start a new record. Instead, we could scan the whole
list of tokens, look for out-of-order ranges, and exclude those as
possible field identifiers. In the sequence of range starts below, the
starred values are the out-of-order ones (sketch after the list):
1
10*
10
20
30
90*
40
10*
50
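One way to realize that whole-list scan: treat it as a
longest-increasing-subsequence problem over the range start values.
This is a sketch; the real tokens presumably carry more than bare
numbers, and ties between equal values are picked arbitrarily:

    def in_order_ranges(values):
        # longest strictly increasing subsequence, O(n^2); everything
        # off that subsequence is flagged out of order and excluded
        if not values:
            return [], []
        best = [1] * len(values)
        prev = [-1] * len(values)
        for i, v in enumerate(values):
            for j in range(i):
                if values[j] < v and best[j] + 1 > best[i]:
                    best[i], prev[i] = best[j] + 1, j
        i = max(range(len(values)), key=best.__getitem__)
        keep = set()
        while i != -1:
            keep.add(i)
            i = prev[i]
        accepted = [v for k, v in enumerate(values) if k in keep]
        excluded = [v for k, v in enumerate(values) if k not in keep]
        return accepted, excluded

    # in_order_ranges([1, 10, 10, 20, 30, 90, 40, 10, 50])
    # -> ([1, 10, 20, 30, 40, 50], [10, 90, 10])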