Commit graph

108 commits

Author SHA1 Message Date
6e4a975cfb Changed the way records are found: we now search for field headers and then work
backwards to determine the record name. We also added the ability to "break" from
reading a series of field definitions at certain break points, such as
"Record Layout". There is currently an error in p1220 line 2704, caused by
the column data starting on the 4th column, "Description and Remarks".

If ColumnCollectors started with the field titles and knew the column
positions from those, it might be possible to at least read the following
record fields without auto-adjusting them.
2012-12-04 16:04:08 -06:00
8995f142e5 Merge branch 'master' of brimstone.klowner.com:pyaccuwage
Conflicts:
	pyaccuwage/pdfextract.py
2012-12-04 14:57:20 -06:00
6e1d02db8d trying new header location method 2012-12-04 14:54:10 -06:00
e9a6dc981f Refer to the previous log, but also verify that records are returning
proper information prior to being passed into the ColumnCollector.
It seems like some things are getting stripped out due to blank lines,
or perhaps the annoying "Record Layout" pages. If we could extract the
"Record Layout" sections, things might be simpler.
2012-11-27 16:01:00 -06:00
31ff97db8a Almost have things working. It seems like some of the record results
are overlapping. I'm assuming this is due to missing a continue
or something inside the ColumnCollector. I added a couple new IsNextRecord
exceptions in response to blank rows, but this may be causing more problems
than expected. Next step is probably to check the records returned, and verify
that nothing is being duplicated. Some of the duplicates may be filtered out
by the RecordBuilder class, or during the fields filtering in the pyaccuwage-pdfparse
script (see: fields).
2012-11-20 16:05:36 -06:00
1c7533973a Parsing all the way through the pdf appears to work. Next we need
to track the beginning/ending points for each record and append
continuation records onto the previous. There's some issue in
the pyaccuwage-pdfparse script causing it to have problems reading
the last record field in a record group. Maybe the record extractor
needs to dump the last failed ColumnCollector rather than return it
if it's determined to hold junk data?

The record builder seems to handle everything just fine.

Added a function to the field name parsing to replace ampersands
with an "and" string so as not to cause problems with variable names.
2012-11-13 15:53:41 -06:00
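The ampersand replacement mentioned above might look something like this (a sketch only; `clean_field_name` and the exact normalization rules are assumptions, not the project's actual code):

```python
import re

def clean_field_name(raw):
    # Replace ampersands with "and" so the generated field name
    # stays a valid Python identifier.
    name = raw.replace("&", " and ")
    # Collapse any remaining non-alphanumeric runs into underscores.
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
```

For example, `clean_field_name("Description & Remarks")` yields `"description_and_remarks"`.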
fe4bd20bad Record detection seems to be working much better. We currently have
an issue where full-page-width blocks are being interpreted as a
single large column, and subsequent field definition columns are
then being truncated into subcolumns.

The current problematic line in p1220 is 1598.

Maybe add some functionality which lets us specify the number of
columns we're most interested in? Automatically discard 1-column
ColumnCollectors maybe?
2012-11-06 15:34:35 -06:00
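The idea floated above, discarding 1-column ColumnCollectors, could be a simple filter (a sketch; the `columns` attribute is an assumption about the real ColumnCollector class):

```python
def useful_collectors(collectors, min_columns=2):
    # Drop collectors that only ever saw one column: full-page-width
    # text blocks tend to be misread as a single large column.
    return [c for c in collectors if len(c.columns) >= min_columns]
```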
46755dd90d updated VERSION 2012-10-16 13:22:44 -05:00
820f71b3f5 Merge branch 'master' of brimstone.klowner.com:pyaccuwage 2012-10-09 15:36:11 -05:00
6abfa5b345 fixed missing field, updated for 2012 2012-10-09 15:35:13 -05:00
30376a54f3 fixed missing field, updated for 2012 2012-10-09 15:31:35 -05:00
717f929015 updated records to match 2012 definitions 2012-09-25 15:45:00 -05:00
40fcbdc8b8 getting closer, added a FIXME to one of the fields. Having issues with columns in description fields 2012-07-17 15:44:28 -05:00
5dde3be536 forgot to convert tuple to list for the missing description field fix, derrrp 2012-07-17 14:16:28 -05:00
0dc55ab3dd fixed reading fields that don't have descriptions 2012-07-17 14:10:34 -05:00
b3aed20388 fixed rangetoken issue with single byte values 2012-07-10 15:41:47 -05:00
e8145c5616 adding new pdf extract capability 2012-07-10 15:24:13 -05:00
b77b80e485 We need to remove some of the yield statements because they make iteration
very confusing to keep track of, with global iterators being passed around
and iterated over in chunks.

I've added a located_heading_rows method which scans the entire document
for row numbers that look like record definition headings. I think we
can use these number spans to feed into the row columnizer stuff.
2012-06-30 15:21:05 -05:00
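The heading-row scan described above could be sketched like this (the regex and the function shape are assumptions; the commit's actual `located_heading_rows` implementation may differ):

```python
import re

# Assumed pattern for record-definition headings such as
# "RECORD NAME: Code RA" in the IRS p1220 text.
HEADING_RE = re.compile(r"record\s+name\s*:", re.IGNORECASE)

def located_heading_rows(lines):
    # Return the row numbers whose text looks like a record heading;
    # the spans between consecutive hits bound each record definition
    # and can be fed into the row columnizer.
    return [i for i, line in enumerate(lines) if HEADING_RE.search(line)]
```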
6b5eb30f34 added ColumnCollector, fixed column parsing by scanning for whitespace before separating 2012-06-26 15:55:18 -05:00
fecd14db59 adding pdfextract for column extraction 2012-06-19 15:37:17 -05:00
770aeb0d2b Ranges in descriptions are ignored, except in cases where the
range matches the next expected range. The only way to get around
this seems to be to manually remove the range value from the input.

One idea is to iterate through the entire token set and look for
range tokens. When a range token correctly continues the sequence,
it is assumed to begin a new record. Instead, we could scan the whole
list of tokens, look for out-of-order ranges, and exclude them as
possible field identifiers.

For example (asterisks mark the out-of-order ranges to exclude):

1
10*
10
20
30
90*
40
10*
50
2012-06-06 14:46:17 -05:00
04b3c3f273 Added pyaccuwage-parse script.
We encountered a problem with the parser where a description contained
a range value and the parser thought it was the beginning of a new field
definition. We should be able to exclude the incorrect range values
by looking at our last good range, and if the range does not continue
the previous range, then it is probably incorrect and can be discarded.

These changes can probably be performed in the tokenize section of the
parser.
2012-06-02 15:16:13 -05:00
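The last-good-range check proposed above might be sketched as follows (assumptions: tokens are (kind, value) pairs, and a "range" value is a (start, end) byte span; the real tokenizer's representation may differ):

```python
def filter_out_of_order_ranges(tokens):
    """Keep only range tokens that continue the previous good range."""
    last_end = 0
    kept = []
    for kind, value in tokens:
        if kind == "range":
            start, end = value
            if start == last_end + 1:
                # Continues the sequence: treat as a real field definition.
                last_end = end
                kept.append((kind, value))
            # Otherwise it's a range quoted inside a description; drop it.
        else:
            kept.append((kind, value))
    return kept
```

A range like 5-7 appearing after a field ending at byte 9 does not continue the sequence, so it is discarded rather than starting a bogus field.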
69da154e59 attempting to add a commandline script 2012-06-02 14:18:48 -05:00
ad5262e37e added length checking to field matching criteria for parser 2012-05-08 14:08:39 -05:00
2c9551f677 Fixed issue with last item not being insert into tokens. Now able to convert PDF text into record field definitions pretty reliably. Need to add additional field type detection rules. 2012-04-18 14:51:59 -05:00
027b44b65c Parser is mostly working, there's an issue with the last grouping of tokens
not being parsed. This can probably be fixed by yielding an end-marker from the
tokenizer generator so the compiler knows to clear out the last item.
2012-04-13 14:39:02 -05:00
6e9b8041b9 adding a simple parser for reading stuff from pdfs 2012-04-05 15:19:00 -05:00
97a74c09f9 fixed some field types, misc 2011-11-12 15:26:17 -06:00
7772ec679f Renamed "verify" functions to "validate".
Another idea for defining the fields in records
would be to create a class method that would instantiate
the individual fields at instance creation rather than
during class definition. This would use less memory when
there are no Record objects being used.

Storing each Field, after it's instantiated, in a List as well as a
Dict would remove the need to count the Field instantiation order,
since the List would hold them in their proper order.
2011-11-12 13:50:14 -06:00
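The List-plus-Dict idea above might look like the following (a sketch under assumed names, not the actual pyaccuwage classes):

```python
class Field:
    def __init__(self, name):
        self.name = name
        self.value = None

class Record:
    @classmethod
    def build_fields(cls):
        # Instantiate fields at instance creation, not at class
        # definition, so no memory is used until a Record exists.
        return [Field("record_type"), Field("tax_year")]

    def __init__(self):
        self.field_list = self.build_fields()                     # keeps order
        self.field_dict = {f.name: f for f in self.field_list}    # lookup by name
```

The List preserves the definition order for serialization, while the Dict gives by-name access without counting instantiation order.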
ea492c2f56 renamed NumericField to IntegerField 2011-11-05 14:12:47 -05:00
a3f89e3790 fixed a couple field types being wrong, improved validation, auto-truncate over-length fields 2011-11-05 14:11:37 -05:00
076efd4036 0.0.6, fixed field types 2011-10-29 14:58:59 -05:00
7cb8bed61e Bumped version to 0.0.5
Fixed problem where fields contained shared values by
performing a shallow copy on all fields during Record instantiation.
That way, each record has its own copy of the field instances, rather
than the shared class-wide instance provided by the definition.
2011-10-29 14:03:03 -05:00
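The shallow-copy fix described above can be sketched like this (assumed class shapes, not the project's actual definitions):

```python
import copy

class Field:
    def __init__(self, name):
        self.name = name
        self.value = None

class Record:
    # Class-level definition: a single shared Field instance...
    fields = {"record_type": Field("record_type")}

    def __init__(self):
        # ...so shallow-copy every Field at instantiation, giving each
        # Record its own value store instead of the class-wide one.
        self.fields = {name: copy.copy(field)
                       for name, field in type(self).fields.items()}
```

With the copy in place, setting a value on one Record no longer leaks into every other Record of the same class.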
4023d46b4a Changed a few fields to be optional.
Found a fairly difficult bug involving Field instances
being shared across Records. The issue is that Field instances
are static. I either need to implement a way to instantiate
copies of all the Fields per-record, or write a wrapping
interface which provides a unique value store on a per-Record
basis.
2011-10-25 14:54:22 -05:00
775d3d3700 bump to v0.0.3 2011-09-24 15:40:06 -05:00
3dfcf030e7 I can't type 2011-09-24 15:27:55 -05:00
c8965afab5 changing to version 0.0.2 2011-09-24 13:32:31 -05:00
93d7465e1a promoting to v0.2 2011-09-24 13:29:42 -05:00
1a0f4183e7 Everything works, or seems to. The package is now installable as
a regular python module through pip or whatever. Now our apps can
assemble data objects to be converted into accuwage files.
2011-09-17 11:22:04 -05:00
6f5d29faab moved everything into pyaccuwage subdir 2011-06-25 15:08:38 -05:00
5eb8925032 added __init__ to setup 2011-06-25 15:02:06 -05:00
78f8b845fe fixed set>setup 2011-06-25 14:59:18 -05:00
3d6a64db1d added test setup.py 2011-06-25 14:57:30 -05:00
ab16399e19 made enum names consistent 2011-06-25 14:33:28 -05:00
5f9211f30a fixed two silly syntax errors 2011-06-25 14:31:52 -05:00
0646bf7b9b Added record validation functions for everything (that we saw in the PDF),
we should go over them once more to make sure we didn't miss anything, but
testing validation should probably be done after that. Verify that the
record ordering enforcement code is correct, then start thinking of how
to get data from external sources into the record generator.
2011-06-11 14:45:12 -05:00
7dcbd6305b Added country code list to enums 2011-06-04 15:52:48 -05:00
a0014ca451 Added a MonthYear field, fixed some field required values and fixed
validation functions. Added numeric state abbreviation capability.
So far everything appears to be working well.
2011-06-04 15:46:41 -05:00
5781cbf335 Finished up most of the record order validation and also checking
for all required records in a set. Added a controller class but
decided to put stuff in __init__ instead, at least for now.
Added a DateField which converts datetime.date into the proper
string format for EFW2 files (hopefully), this should still be
tested next week.
2011-05-07 15:19:48 -05:00
f30237a90d added custom field-record validator support, not using it yet though 2011-04-23 14:57:18 -05:00