backwards to determine the record name. We also added the ability to "break" from
reading a series of field definitions based on certain break points such as
"Record Layout". There is currently an error in p1220 line 2704 which is caused
by the column data starting on the 4th column "Description and Remarks".
If ColumnCollectors started with the field titles, and had awareness of the column
positions starting with those, it may be possible to at least read the following
record fields without auto-adjusting them.
proper information prior to getting passed into the ColumnCollector.
It seems like some things are getting stripped out due to blank lines
or perhaps the annoying "Record Layout" pages. If we could extract the
"Record Layout" sections, things may be simpler.
are overlapping. I'm assuming this is due to missing a continue
or something inside the ColumnCollector. I added a couple new IsNextRecord
exceptions in response to blank rows, but this may be causing more problems
than expected. Next step is probably to check the records returned, and verify
that nothing is being duplicated. Some of the duplicates may be filtered out
by the RecordBuilder class, or during the fields filtering in the pyaccuwage-pdfparse
script (see: fields).
to track the beginning/ending points for each record and append
continuation records onto the previous. There's some issue in
the pyaccuwage-pdfparse script causing it to have problems reading
the last record field in a record group. Maybe the record extractor
needs to dump the last failed ColumnCollector rather than return it
if it's determined to hold junk data?
The record builder seems to handle everything just fine.
Added a function to the field name parsing to replace ampersands
with an "and" string so as not to cause problems with variable names.
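A minimal sketch of the ampersand replacement, for reference. The function
name field_name_to_identifier is hypothetical, and collapsing other
punctuation into underscores is an assumption that goes beyond what the note
above describes; only the "&" to "and" step comes from the note.

```python
import re


def field_name_to_identifier(title):
    """Turn a field title into something usable as a variable name.

    Hypothetical helper: only the ampersand replacement is taken from the
    note above; the rest is an assumed cleanup step.
    """
    # Replace ampersands with the word "and", padded so words stay separate.
    name = title.replace("&", " and ")
    # Assumption: collapse every run of non-alphanumeric characters to "_".
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()
```

Padding the replacement with spaces keeps titles such as "Tips & Wages" and
"Tips&Wages" mapping to the same identifier.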
an issue where full-page width blocks are being interpreted as a
single large column, and then subsequent field definition columns
are being truncated in as subcolumns.
The current problematic line in p1220 is 1598.
Maybe add some functionality which lets us specify the number of
columns we're most interested in? Automatically discard 1-column
ColumnCollectors maybe?
very confusing to keep track of, due to global iterators being passed around
and iterated over in chunks.
I've added a located_heading_rows method which scans the entire document
for row numbers that look like record definition headings. I think we
can use these number spans to feed into the row columnizer stuff.
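A rough sketch of how heading rows could be turned into row spans. This is
not the actual method; the names find_heading_rows and heading_spans are
hypothetical, and the heading pattern ("Record Name" at the start of a row)
is only an assumption about what a record definition heading looks like.

```python
import re

# Assumption: a heading row starts with the words "Record Name".
HEADING_RE = re.compile(r"^\s*record name\b", re.IGNORECASE)


def find_heading_rows(rows, pattern=HEADING_RE):
    """Return the row numbers (0-based) of rows that look like headings."""
    return [i for i, row in enumerate(rows) if pattern.search(row)]


def heading_spans(rows, pattern=HEADING_RE):
    """Return (start, end) row spans, one per heading, end exclusive.

    Each span runs from a heading row up to the next heading row, or to the
    end of the document for the last heading.
    """
    starts = find_heading_rows(rows, pattern)
    ends = starts[1:] + [len(rows)]
    return list(zip(starts, ends))
```

Each span could then be sliced out of the row list and handed to the
columnizer on its own, which avoids passing a shared iterator around.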
range matches the next expected range. The only way to get around
this seems to be to manually remove the range value from the input.
One idea is to iterate through the entire token set and look for
range tokens. When a range token correctly continues the sequence,
it is assumed to be a new record. Alternatively, we could scan the whole
list of tokens, look for out-of-order ranges, and exclude them as possible
field identifiers.
1
10*
10
20
30
90*
40
10*
50
We encountered a problem with the parser where a description contained
a range value and the parser thought it was the beginning of a new field
definition. We should be able to exclude the incorrect range values
by looking at our last good range, and if the range does not continue
the previous range, then it is probably incorrect and can be discarded.
These changes can probably be performed in the tokenize section of the
parser.
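The "last good range" check could look roughly like the sketch below. It
assumes tokens are plain strings, that a range is written as "10-19" or as a
single position such as "7", and that a range continues the sequence when it
starts one position past the end of the last accepted range. The names
parse_range and mark_field_starts are hypothetical.

```python
import re

RANGE_RE = re.compile(r"^(\d+)(?:-(\d+))?$")


def parse_range(token):
    """Return (start, end) for tokens like '10-19' or '7', otherwise None."""
    match = RANGE_RE.match(token.strip())
    if not match:
        return None
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return (start, end)


def mark_field_starts(tokens, first_position=1):
    """Pair each token with a flag saying whether it starts a new field.

    A range token only counts as a field start when it begins exactly where
    the last accepted range left off; any other range is treated as part of
    a description and is flagged False.
    """
    expected = first_position
    marked = []
    for token in tokens:
        rng = parse_range(token)
        is_start = rng is not None and rng[0] == expected and rng[1] >= rng[0]
        if is_start:
            expected = rng[1] + 1
        marked.append((token, is_start))
    return marked
```

With this check a stray "90-99" inside a description is ignored while the
parser is still waiting for a range that starts at position 20.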
Another idea for defining the fields in records
would be to create a class method that would instantiate
the individual fields at instance creation rather than
during class definition. This would use less memory when
there are no Record objects being used.
Storing each Field after it's instantiated into a List, as
well as a Dict would remove the necessity for counting the
Field instantiation order, since the List would hold them in
their proper order.
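A sketch of that idea, under the assumption that each Field knows its own
name. Field, Record, build_fields, field_list and field_map are hypothetical
names used only for illustration, not the existing classes.

```python
class Field:
    """Hypothetical stand-in for a real field class."""

    def __init__(self, name, max_length):
        self.name = name
        self.max_length = max_length
        self.value = None


class Record:
    @classmethod
    def build_fields(cls):
        """Return freshly created Field objects in output order."""
        return []

    def __init__(self):
        # Fields are created here, per instance, not at class definition.
        self.field_list = self.build_fields()
        # The dict gives lookup by name; the list keeps the ordering.
        self.field_map = {field.name: field for field in self.field_list}


class SubmitterRecord(Record):
    @classmethod
    def build_fields(cls):
        return [Field("record_identifier", 2), Field("ein", 9)]
```

Because the list already holds the fields in order, no instantiation counter
is needed, and no Field objects exist until a Record is actually created.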
Fixed problem where fields contained shared values by
performing a shallow copy on all fields during Record instantiation.
That way, each record has its own copy of the field instances, rather
than the shared class-wide instance provided by the definition.
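Roughly, the fix looks like the sketch below. The Field and Record classes
here are simplified, hypothetical stand-ins, and storing the definitions in a
class-level dict named fields is an assumption made for the example.

```python
import copy


class Field:
    """Hypothetical stand-in for a real field class."""

    def __init__(self, max_length):
        self.max_length = max_length
        self.value = None


class Record:
    # Class-wide field definitions, shared by every instance.
    fields = {}

    def __init__(self):
        # Shallow-copy each definition so values are stored per record.
        self.fields = {
            name: copy.copy(field)
            for name, field in type(self).fields.items()
        }


class EmployeeRecord(Record):
    fields = {"first_name": Field(15), "last_name": Field(20)}
```

A shallow copy is enough as long as a Field only holds simple values; a
Field that keeps a mutable object such as a list would still share it
between records.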
Found a fairly difficult bug involving Field instances
being shared across Records. The issue is that Field instances
are static. I either need to implement a way to instantiate
copies of all the Fields per-record, or write a wrapping
interface which provides a unique value store on a per-Record
basis.
we should go over them once more to make sure we didn't miss anything, but
testing validation should probably be done after that. Verify that the
record ordering enforcement code is correct, then start thinking of how
to get data from external sources into the record generator.
for all required records in a set. Added a controller class but
decided to put stuff in __init__ instead, at least for now.
Added a DateField which converts datetime.date into the proper
string format for EFW2 files (hopefully). This should still be
tested next week.
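For reference, a minimal sketch of such a field. The MMDDYYYY layout
("%m%d%Y") is an assumption that should be checked against the EFW2
specification, and the class below (including the get_data method name) is a
hypothetical simplification, not the actual DateField.

```python
import datetime


class DateField:
    """Hypothetical date field that renders a datetime.date as text."""

    def __init__(self, fmt="%m%d%Y"):
        # Assumption: dates are written as MMDDYYYY.
        self.fmt = fmt
        self.value = None

    def get_data(self):
        if not isinstance(self.value, datetime.date):
            raise ValueError("DateField requires a datetime.date value")
        return self.value.strftime(self.fmt)
```

Keeping the format string as a parameter makes it easy to adjust once the
output has been tested against real files.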