to track the beginning/ending points for each record and append
continuation records onto the previous one. There's an issue in the
pyaccuwage-pdfparse script that causes problems reading the last
record field in a record group. Maybe the record extractor needs to
discard the last failed ColumnCollector rather than return it, if
it's determined to hold junk data?
The record builder seems to handle everything just fine.
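A rough sketch of that idea, where looks_like_junk() and the
collector's columns attribute are guesses at the actual
pyaccuwage-pdfparse internals:

    def looks_like_junk(collector):
        # Hypothetical heuristic: a collector whose cells are all
        # empty or whitespace-only is leftover noise at the end of a
        # record group.
        cells = [cell for column in collector.columns for cell in column]
        return not any(cell.strip() for cell in cells)

    def finish_extraction(collectors, last):
        # Append the final collector only if it holds real data,
        # instead of returning it unconditionally.
        if not looks_like_junk(last):
            collectors.append(last)
        return collectors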
Added a function to the field name parsing that replaces ampersands
with the string "and" so they don't cause problems with generated
variable names.
There's an issue where full-page-width blocks are being interpreted
as a single large column, and subsequent field definition columns are
then being truncated and absorbed as subcolumns.
The current problematic line in p1220 is 1598.
Maybe add some functionality that lets us specify the number of
columns we're most interested in, and automatically discard 1-column
ColumnCollectors?
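A possible filter, assuming collectors expose a columns list:

    def keep_likely_field_tables(collectors, min_columns=2):
        # Full-page-width text blocks tend to come through as
        # single-column collectors, so drop anything narrower than
        # the column count we care about.
        return [c for c in collectors if len(c.columns) >= min_columns]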
All of this is very confusing to keep track of, due to global
iterators being passed around and iterated over in chunks.
I've added a located_heading_rows method which scans the entire document
for row numbers that look like record definition headings. I think we
can use these number spans to feed into the row columnizer stuff.
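Roughly what I have in mind; the heading regex is only a guess at
what p1220 record definition headings look like:

    import re

    HEADING_RE = re.compile(r'Record\s+Name', re.IGNORECASE)

    def located_heading_rows(rows):
        # Row numbers whose text looks like a record definition
        # heading.
        return [i for i, row in enumerate(rows) if HEADING_RE.search(row)]

    def heading_spans(rows):
        # Pair consecutive heading rows into (start, end) spans to
        # feed the row columnizer.
        hits = located_heading_rows(rows)
        return list(zip(hits, hits[1:] + [len(rows)]))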
range matches the next expected range. The only way to get around
this seems to be to manually remove the range value from the input.
One idea is to iterate through the entire token set and look for
range tokens: when a range token correctly continues the sequence, it
is assumed to be a new record. Alternatively, we could scan the whole
list of tokens, look for out-of-order ranges, and exclude them as
possible field identifiers.
For example, in the token sequence below, the starred range values
are the out-of-order ones that should be excluded:

    1
    10*
    10
    20
    30
    90*
    40
    10*
    50
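A minimal sketch of the exclusion pass. It assumes range tokens are
full 'start-end' spans like '10-18' (the example above lists only the
starting positions) and that a genuine field range starts where the
previous one ended:

    import re

    RANGE_RE = re.compile(r'^(\d+)-(\d+)$')

    def classify_range_tokens(tokens):
        # Yield (token, is_field_start) pairs. A range token only
        # counts as a new field definition when it continues the
        # running position sequence; anything else gets treated as
        # description text.
        last_end = 0
        for token in tokens:
            match = RANGE_RE.match(token)
            if match and int(match.group(1)) == last_end + 1:
                last_end = int(match.group(2))
                yield token, True
            else:
                yield token, False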
We encountered a problem with the parser where a description
contained a range value and the parser thought it was the beginning
of a new field definition. We should be able to exclude the incorrect
range values by looking at our last good range: if a new range does
not continue the previous one, it is probably incorrect and can be
discarded. These changes can probably be made in the tokenize section
of the parser.
Another idea for defining the fields in records would be to create a
class method that instantiates the individual fields at instance
creation rather than during class definition. This would use less
memory when no Record objects are in use.
Storing each Field, once instantiated, in a list as well as a dict
would remove the need to count Field instantiation order, since the
list would hold them in their proper order.
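A minimal sketch of that layout, with Field and field_defs standing
in for the real pyaccuwage classes:

    import copy

    class Field(object):
        def __init__(self, name):
            self.name = name
            self.value = None

    class Record(object):
        field_defs = []  # Field instances in definition order

        def __init__(self):
            # Shallow-copy each class-level Field so records don't
            # share state; the list preserves definition order, the
            # dict gives name lookup.
            self.field_list = [copy.copy(f) for f in self.field_defs]
            self.field_dict = dict((f.name, f) for f in self.field_list)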
Fixed a problem where fields contained shared values by performing a
shallow copy on all fields during Record instantiation. That way,
each record has its own copy of the field instances, rather than the
shared class-wide instances provided by the definition.
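With the copy in place, records stop leaking values into each other
(SubmitterRecord and the 'RA' identifier here are just illustrative):

    class SubmitterRecord(Record):
        field_defs = [Field('record_identifier')]

    a, b = SubmitterRecord(), SubmitterRecord()
    a.field_dict['record_identifier'].value = 'RA'
    assert b.field_dict['record_identifier'].value is None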
Found a fairly difficult bug involving Field instances being shared
across Records. The issue is that Field instances are effectively
static, since they live on the class definition. I either need to
implement a way to instantiate copies of all the Fields per record,
or write a wrapping interface that provides a unique value store on
a per-Record basis.
We should go over them once more to make sure we didn't miss
anything, but validation testing should probably be done after that.
Verify that the record-ordering enforcement code is correct, then
start thinking about how to get data from external sources into the
record generator.
for all required records in a set. Added a controller class but
decided to put stuff in __init__ instead, at least for now.
Added a DateField which converts datetime.date into the proper string
format for EFW2 files (hopefully); this still needs to be tested next
week.
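Roughly how the DateField sketch might look; the MMDDYYYY layout is
my reading of the EFW2 spec and is exactly what still needs to be
verified:

    import datetime

    class DateField(object):
        def format_value(self, value):
            # EFW2 date fields appear to be fixed-width MMDDYYYY
            # strings; confirm against the spec before trusting this.
            if isinstance(value, datetime.date):
                return value.strftime('%m%d%Y')
            return str(value).rjust(8, '0')

    # DateField().format_value(datetime.date(2013, 1, 31)) -> '01312013'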