to track the beginning/ending points for each record and append
continuation records onto the previous one. There's an issue in the
pyaccuwage-pdfparse script that causes problems reading the last
record field in a record group. Maybe the record extractor needs to
discard the last failed ColumnCollector rather than return it, if
it's determined to hold junk data?
The record builder seems to handle everything just fine.
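A rough sketch of that idea, where looks_like_junk() and the
collector's columns attribute are guesses at the actual
pyaccuwage-pdfparse internals:

    def looks_like_junk(collector):
        # Hypothetical heuristic: a collector whose cells are all
        # empty or whitespace-only is leftover noise at the end of a
        # record group.
        cells = [cell for column in collector.columns for cell in column]
        return not any(cell.strip() for cell in cells)

    def finish_extraction(collectors, last):
        # Append the final collector only if it holds real data,
        # instead of returning it unconditionally.
        if not looks_like_junk(last):
            collectors.append(last)
        return collectors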
Added a function to the field name parsing that replaces ampersands
with the string "and" so they don't cause problems with generated
variable names.
There's an issue where full-page-width blocks are being interpreted
as a single large column, and subsequent field definition columns are
then being truncated and absorbed as subcolumns.
The current problematic line in p1220 is 1598.
Maybe add some functionality that lets us specify the number of
columns we're most interested in, and automatically discard 1-column
ColumnCollectors?
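A possible filter, assuming collectors expose a columns list:

    def keep_likely_field_tables(collectors, min_columns=2):
        # Full-page-width text blocks tend to come through as
        # single-column collectors, so drop anything narrower than
        # the column count we care about.
        return [c for c in collectors if len(c.columns) >= min_columns]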
All of this is very confusing to keep track of, due to global
iterators being passed around and iterated over in chunks.
I've added a located_heading_rows method which scans the entire document
for row numbers that look like record definition headings. I think we
can use these number spans to feed into the row columnizer stuff.
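Roughly what I have in mind; the heading regex is only a guess at
what p1220 record definition headings look like:

    import re

    HEADING_RE = re.compile(r'Record\s+Name', re.IGNORECASE)

    def located_heading_rows(rows):
        # Row numbers whose text looks like a record definition
        # heading.
        return [i for i, row in enumerate(rows) if HEADING_RE.search(row)]

    def heading_spans(rows):
        # Pair consecutive heading rows into (start, end) spans to
        # feed the row columnizer.
        hits = located_heading_rows(rows)
        return list(zip(hits, hits[1:] + [len(rows)]))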
range matches the next expected range. The only way to get around
this seems to be to manually remove the range value from the input.
One idea is to iterate through the entire token set and look for
range tokens: when a range token correctly continues the sequence, it
is assumed to be a new record. Alternatively, we could scan the whole
list of tokens, look for out-of-order ranges, and exclude them as
possible field identifiers.
For example, in the token sequence below, the starred range values
are the out-of-order ones that should be excluded:

    1
    10*
    10
    20
    30
    90*
    40
    10*
    50
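A minimal sketch of the exclusion pass. It assumes range tokens are
full 'start-end' spans like '10-18' (the example above lists only the
starting positions) and that a genuine field range starts where the
previous one ended:

    import re

    RANGE_RE = re.compile(r'^(\d+)-(\d+)$')

    def classify_range_tokens(tokens):
        # Yield (token, is_field_start) pairs. A range token only
        # counts as a new field definition when it continues the
        # running position sequence; anything else gets treated as
        # description text.
        last_end = 0
        for token in tokens:
            match = RANGE_RE.match(token)
            if match and int(match.group(1)) == last_end + 1:
                last_end = int(match.group(2))
                yield token, True
            else:
                yield token, False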
We encountered a problem with the parser where a description
contained a range value and the parser thought it was the beginning
of a new field definition. We should be able to exclude the incorrect
range values by looking at our last good range: if a new range does
not continue the previous one, it is probably incorrect and can be
discarded. These changes can probably be made in the tokenize section
of the parser.
Another idea for defining the fields in records would be to create a
class method that instantiates the individual fields at instance
creation rather than during class definition. This would use less
memory when no Record objects are in use.
Storing each Field, once instantiated, in a list as well as a dict
would remove the need to count Field instantiation order, since the
list would hold them in their proper order.
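A minimal sketch of that layout, with Field and field_defs standing
in for the real pyaccuwage classes:

    import copy

    class Field(object):
        def __init__(self, name):
            self.name = name
            self.value = None

    class Record(object):
        field_defs = []  # Field instances in definition order

        def __init__(self):
            # Shallow-copy each class-level Field so records don't
            # share state; the list preserves definition order, the
            # dict gives name lookup.
            self.field_list = [copy.copy(f) for f in self.field_defs]
            self.field_dict = dict((f.name, f) for f in self.field_list)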
Fixed a problem where fields contained shared values by performing a
shallow copy on all fields during Record instantiation. That way,
each record has its own copy of the field instances, rather than the
shared class-wide instances provided by the definition.
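With the copy in place, records stop leaking values into each other
(SubmitterRecord and the 'RA' identifier here are just illustrative):

    class SubmitterRecord(Record):
        field_defs = [Field('record_identifier')]

    a, b = SubmitterRecord(), SubmitterRecord()
    a.field_dict['record_identifier'].value = 'RA'
    assert b.field_dict['record_identifier'].value is None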
Found a fairly difficult bug involving Field instances being shared
across Records. The issue is that Field instances are effectively
static, since they live on the class definition. I either need to
implement a way to instantiate copies of all the Fields per record,
or write a wrapping interface that provides a unique value store on
a per-Record basis.
We should go over them once more to make sure we didn't miss
anything, but validation testing should probably be done after that.
Verify that the record-ordering enforcement code is correct, then
start thinking about how to get data from external sources into the
record generator.
for all required records in a set. Added a controller class but
decided to put stuff in __init__ instead, at least for now.
Added a DateField which converts datetime.date into the proper string
format for EFW2 files (hopefully); this still needs to be tested next
week.
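Roughly how the DateField sketch might look; the MMDDYYYY layout is
my reading of the EFW2 spec and is exactly what still needs to be
verified:

    import datetime

    class DateField(object):
        def format_value(self, value):
            # EFW2 date fields appear to be fixed-width MMDDYYYY
            # strings; confirm against the spec before trusting this.
            if isinstance(value, datetime.date):
                return value.strftime('%m%d%Y')
            return str(value).rjust(8, '0')

    # DateField().format_value(datetime.date(2013, 1, 31)) -> '01312013'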