Commit graph

42 commits

SHA1 Message Date
e8145c5616 adding new pdf extract capability 2012-07-10 15:24:13 -05:00
b77b80e485 We need to remove some of the yield statements because they make iteration
very confusing to keep track of, with global iterators being passed around
and iterated over in chunks.

I've added a located_heading_rows method which scans the entire document
for row numbers that look like record definition headings. I think we
can feed these number spans into the row columnizer.
2012-06-30 15:21:05 -05:00
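A hedged sketch of the located_heading_rows idea described in the commit above; the heading pattern and the span-pairing helper are assumptions for illustration, not the actual pyaccuwage code.

```python
import re

# Assumed shape of a record definition heading in the extracted PDF text,
# e.g. "RA Record - Submitter"; the real heuristic may differ.
HEADING_RE = re.compile(r'^\s*R[A-Z]\s+Record\b')

def located_heading_rows(rows):
    """Return the row numbers whose text looks like a record definition heading."""
    return [i for i, row in enumerate(rows) if HEADING_RE.match(row)]

def heading_spans(rows):
    """Pair each heading row number with the next to get (start, end) spans
    that could be fed into the row columnizer."""
    starts = located_heading_rows(rows)
    return list(zip(starts, starts[1:] + [len(rows)]))
```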
6b5eb30f34 added ColumnCollector, fixed column parsing by scanning for whitespace before separating 2012-06-26 15:55:18 -05:00
fecd14db59 adding pdfextract for column extraction 2012-06-19 15:37:17 -05:00
770aeb0d2b Ranges in descriptions are ignored, except in cases where the
range matches the next expected range. The only way to get around
this seems to be to manually remove the range value from the input.

One idea is to iterate through the entire token set and look for
range tokens. When a range token correctly continues the sequence,
it is assumed to mark a new record. Alternatively, we could scan the
whole list of tokens, look for out-of-order ranges, and exclude them
as possible field identifiers.

1
10*
10
20
30
90*
40
10*
50
2012-06-06 14:46:17 -05:00
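A hedged sketch of the exclusion idea from the commit above: scan the whole list of range tokens once and keep only the ones that correctly continue the sequence, treating the rest as ranges that merely appeared in descriptions. The (start, end) token shape and the exact continuation rule are assumptions.

```python
# Illustrative only: assumes each candidate is a (start, end) position range
# and that a legitimate field range begins right where the previous one ended.
def filter_field_ranges(range_tokens):
    kept, last_end = [], 0
    for start, end in range_tokens:
        if start == last_end + 1:          # correctly continues the sequence
            kept.append((start, end))
            last_end = end
        # otherwise: an out-of-order range from a description, not a field identifier
    return kept

# filter_field_ranges([(1, 9), (5, 7), (10, 19)]) -> [(1, 9), (10, 19)]
```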
04b3c3f273 Added pyaccuwage-parse script.
We encountered a problem with the parser where a description contained
a range value and the parser thought it was the beginning of a new field
definition. We should be able to exclude the incorrect range values
by looking at our last good range, and if the range does not continue
the previous range, then it is probably incorrect and can be discarded.

These changes can probably be performed in the tokenize section of the
parser.
2012-06-02 15:16:13 -05:00
69da154e59 attempting to add a commandline script 2012-06-02 14:18:48 -05:00
ad5262e37e added length checking to field matching criteria for parser 2012-05-08 14:08:39 -05:00
2c9551f677 Fixed issue with the last item not being inserted into tokens. Now able to convert PDF text into record field definitions pretty reliably. Need to add additional field type detection rules. 2012-04-18 14:51:59 -05:00
027b44b65c Parser is mostly working, there's an issue with the last grouping of tokens
not being parsed. This can probably be fixed by yielding an end-marker from the
tokenizer generator so the compiler knows to clear out the last item.
2012-04-13 14:39:02 -05:00
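A small sketch of the end-marker idea proposed in this commit; the sentinel object and function names are hypothetical stand-ins for the tokenizer and compiler mentioned in the message.

```python
# Hypothetical sketch: the tokenizer yields a sentinel after the last real
# token so the consumer knows to flush its final group.
END = object()

def tokenize(lines):
    for line in lines:
        for tok in line.split():
            yield tok
    yield END                              # end-marker closes the stream

def group_tokens(tokens, starts_new_group):
    group = []
    for tok in tokens:
        if (tok is END or starts_new_group(tok)) and group:
            yield group                    # flush the completed (or final) group
            group = []
        if tok is not END:
            group.append(tok)
```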
6e9b8041b9 adding a simple parser for reading stuff from pdfs 2012-04-05 15:19:00 -05:00
97a74c09f9 fixed some field types, misc 2011-11-12 15:26:17 -06:00
7772ec679f Renamed "verify" functions to "validate".
Another idea for defining the fields in records
would be to create a class method that would instantiate
the individual fields at instance creation rather than
during class definition. This would use less memory when
there are no Record objects being used.

Storing each Field into a List as well as a Dict after it's instantiated
would remove the need to count the Field instantiation order, since the
List would hold them in their proper order.
2011-11-12 13:50:14 -06:00
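A rough sketch of the List-plus-Dict idea floated in this commit, with hypothetical names: fields are instantiated when the Record is created and stored in a list (which preserves definition order) and a dict (for lookup by name), so no instantiation counter is needed.

```python
class TextField(object):
    """Stand-in field type for the sketch."""
    def __init__(self):
        self.value = None

class Record(object):
    field_defs = []                        # subclasses supply (name, field class) pairs

    def __init__(self):
        # Instantiate fields at instance creation rather than at class
        # definition, keeping them in both a list and a dict.
        self.field_list = []               # preserves definition order
        self.field_dict = {}               # name -> Field instance
        for name, field_class in self.field_defs:
            field = field_class()
            self.field_list.append(field)
            self.field_dict[name] = field

class EmployerRecord(Record):
    field_defs = [('ein', TextField), ('name', TextField)]

rec = EmployerRecord()
rec.field_dict['ein'].value = '123456789'
assert [f.value for f in rec.field_list] == ['123456789', None]
```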
ea492c2f56 renamed NumericField to IntegerField 2011-11-05 14:12:47 -05:00
a3f89e3790 fixed a couple field types being wrong, improved validation, auto-truncate over-length fields 2011-11-05 14:11:37 -05:00
076efd4036 0.0.6, fixed field types 2011-10-29 14:58:59 -05:00
7cb8bed61e Bumped version to 0.0.5
Fixed problem where fields contained shared values by
performing a shallow copy on all fields during Record instantiation.
That way, each record has its own copy of the field instances, rather
than the shared class-wide instance provided by the definition.
2011-10-29 14:03:03 -05:00
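A hedged sketch of the shallow-copy fix this commit describes; the Field stand-in and attribute discovery are simplified assumptions, but the mechanism (copy.copy each class-level field at Record instantiation) follows the message.

```python
import copy

class Field(object):
    """Minimal stand-in for a pyaccuwage field; holds one value."""
    def __init__(self):
        self.value = None

class Record(object):
    def __init__(self):
        # Shallow-copy every class-level Field so each Record instance owns
        # its fields instead of sharing the class-wide definitions.
        for name, field in list(vars(type(self)).items()):
            if isinstance(field, Field):
                setattr(self, name, copy.copy(field))

class EmployeeRecord(Record):
    ssn = Field()
    wages = Field()

a, b = EmployeeRecord(), EmployeeRecord()
a.ssn.value = '123456789'
assert b.ssn.value is None                 # values are no longer shared
```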
4023d46b4a Changed a few fields to be optional.
Found a fairly difficult bug involved with Field instances
being shared across Records. The issue is that Field instances
are static. I either need to implement a way to instantiate
copies of all the Fields per-record, or write a wrapping
interface which provides a unique value store on a per-Record
basis.
2011-10-25 14:54:22 -05:00
775d3d3700 bump to v0.0.3 2011-09-24 15:40:06 -05:00
3dfcf030e7 I can't type 2011-09-24 15:27:55 -05:00
c8965afab5 changing to version 0.0.2 2011-09-24 13:32:31 -05:00
93d7465e1a promoting to v0.2 2011-09-24 13:29:42 -05:00
1a0f4183e7 Everything works, or seems to. The package is now installable as
a regular Python module through pip or whatever. Now our apps can
assemble data objects to be converted into AccuWage files.
2011-09-17 11:22:04 -05:00
6f5d29faab moved everything into pyaccuwage subdir 2011-06-25 15:08:38 -05:00
5eb8925032 added __init__ to setup 2011-06-25 15:02:06 -05:00
78f8b845fe fixed set>setup 2011-06-25 14:59:18 -05:00
3d6a64db1d added test setup.py 2011-06-25 14:57:30 -05:00
ab16399e19 made enum names consistent 2011-06-25 14:33:28 -05:00
5f9211f30a fixed two silly syntax errors 2011-06-25 14:31:52 -05:00
0646bf7b9b Added record validation functions for everything (that we saw in the PDF).
We should go over them once more to make sure we didn't miss anything, but
testing validation should probably be done after that. Verify that the
record ordering enforcement code is correct, then start thinking of how
to get data from external sources into the record generator.
2011-06-11 14:45:12 -05:00
7dcbd6305b Added country code list to enums 2011-06-04 15:52:48 -05:00
a0014ca451 Added a MonthYear field, fixed some fields' required values, and fixed
validation functions. Added numeric state abbreviation capability.
So far everything appears to be working well.
2011-06-04 15:46:41 -05:00
5781cbf335 Finished up most of the record order validation and also checking
for all required records in a set. Added a controller class but
decided to put stuff in __init__ instead, at least for now.
Added a DateField which converts datetime.date into the proper
string format for EFW2 files (hopefully); this should still be
tested next week.
2011-05-07 15:19:48 -05:00
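A minimal sketch of what the DateField conversion could look like; EFW2 date fields are generally 8-character MMDDYYYY strings, but that format and this class interface are assumptions rather than the code from this commit.

```python
import datetime

class DateField(object):
    """Illustrative only: render a datetime.date as an assumed
    EFW2-style fixed-width MMDDYYYY string (blank-filled when empty)."""
    length = 8

    def __init__(self, value=None):
        self.value = value

    def render(self):
        if self.value is None:
            return ' ' * self.length
        return self.value.strftime('%m%d%Y')

# DateField(datetime.date(2011, 5, 7)).render() -> '05072011'
```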
f30237a90d added custom field-record validator support, not using it yet though 2011-04-23 14:57:18 -05:00
edb8e90340 fixed missing stuff in state wage record, yay 2011-04-09 15:22:44 -05:00
179f67bac9 added state wage record, but it isn't quite right 2011-04-09 15:17:32 -05:00
068f1bbae4 Added load/dump methods which work similarly to those found in
simplejson. Tests seem to work so far. Still need to figure out
how to get data into the records in some easy way.
2011-04-02 15:28:38 -05:00
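The commit compares the new methods to simplejson's, so usage presumably looks something like the hedged sketch below; the exact signatures, module layout, and filename are assumptions.

```python
# Hypothetical usage, mirroring the simplejson-style interface described
# in the commit; real argument names may differ.
import pyaccuwage

records = build_records()                  # assumed helper returning record objects

with open('efw2_output.txt', 'w') as fp:
    pyaccuwage.dump(records, fp)           # serialize the records to a file

with open('efw2_output.txt') as fp:
    loaded = pyaccuwage.load(fp)           # read them back into record objects
```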
a32feb79ed added a basic test thing; so far it only tests for record size, and
that shouldn't change after the records are in place anyway, so I'll likely
remove it once things are deemed fairly functional
2011-03-31 14:07:48 -05:00
bdcaaf1230 added more records, yay 2011-03-31 13:35:01 -05:00
3baf64e1ad added more record types 2011-03-30 21:40:20 -05:00
83e2a0cda9 added the first 3 record definitions 2011-03-30 21:17:48 -05:00
e12557db2d initial checkin 2011-03-26 14:56:00 -05:00