When you say "contains a table" what exactly do you mean?
Do you mean that on one page is an area of text, arranged in columns
with lines drawn, which you choose to see as a table? Or something
By "contains a table" I meant to say that .....
The text in the PDF document is displayed in the form of a table with seven columns and 10 rows.
Note: the PDF document contains other text too, but I am concerned only with the table's text.
Do you have the exact co-ordinates of the table in advance? (E.g. the
left, right, top and bottom X,Y co-ordinates?)
No ... I do not have any details of the table...
What is my Aim:
I have a PDF document, and the user will give me a unique row Id (say )...
Using this Row Id, I am supposed to retrieve the data for the relevant row and then validate the retrieved data.
So internally I have to write logic in VBS/JS to read the PDF document data and identify the table, then access the data from the table and validate it.
The big challenge at the moment seems to be not finding information
inside the table, but finding the table. If you don't know where it
is, how will the program identify what text belongs to the table?
Very true... that is the major challenge, and I'm looking for a solution...
Maybe I can extract the data into an Excel file and then read the text...
Is that possible?
Thanks, Aandi Inston
When the document is tagged, you can save it as XML and parse the XML file.
You can programmatically get all of the text as words. And you can get
the location of each word. And that is all you have.
If you can figure out an algorithm to find where the table is from
that information (think of a list of words and x,y co-ordinates) you
are set. Otherwise, you seem stuck.
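To make the "list of words and x,y co-ordinates" idea concrete, here is a minimal sketch of one possible algorithm, not a definitive solution. It assumes that each table cell is a single word, that words on the same table row share nearly the same y value, and that a line with exactly the expected number of words is a table row. The function names, the sample data and the tolerance value are all made up for illustration. In Acrobat JavaScript the word list could probably be built with getPageNthWord and getPageNthWordQuads, but the sketch uses plain data so it can run anywhere.

```javascript
// Hypothetical input: one entry per word with the x,y position of its box.
const sampleWords = [
  { text: "Report", x: 50, y: 700 },
  { text: "summary", x: 100, y: 700 },
  { text: "A1", x: 50, y: 650 },
  { text: "Widget", x: 150, y: 650.5 },
  { text: "10", x: 250, y: 650 },
  { text: "A2", x: 50, y: 630 },
  { text: "Gadget", x: 150, y: 630 },
  { text: "20", x: 250, y: 629.8 },
];

// Group words whose y values differ by at most `tolerance` into one line,
// then order each line left to right and keep only the text.
function groupIntoRows(words, tolerance) {
  // PDF y co-ordinates grow upward, so sort descending for top-to-bottom order.
  const sorted = [...words].sort((a, b) => b.y - a.y);
  const rows = [];
  for (const word of sorted) {
    const last = rows[rows.length - 1];
    if (last && Math.abs(last.y - word.y) <= tolerance) {
      last.words.push(word);
    } else {
      rows.push({ y: word.y, words: [word] });
    }
  }
  for (const row of rows) {
    row.words.sort((a, b) => a.x - b.x);
  }
  return rows.map((row) => row.words.map((w) => w.text));
}

// Assumption: a line belongs to the table when it has exactly columnCount words.
function findTableRows(words, columnCount, tolerance) {
  return groupIntoRows(words, tolerance).filter((r) => r.length === columnCount);
}

// Return the row whose first cell equals rowId, or null if there is none.
function findRowById(tableRows, rowId) {
  return tableRows.find((r) => r[0] === rowId) || null;
}

const tableRows = findTableRows(sampleWords, 3, 2);
console.log(findRowById(tableRows, "A2"));
```

For the real document the column count would be seven rather than three. If cells can contain several words, counting words per line no longer works, and you would have to cluster on the x co-ordinates as well.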
If you add a constant piece of text before and after the table ("TableBegin", "TableEnd"), it would be much easier to find.
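As a rough sketch of the marker idea: once the words are available in reading order, the table cells are simply everything between the two markers. This assumes that the markers come out as single words, that each cell is a single word, and that the number of columns is known. The function names and sample data are hypothetical.

```javascript
// Hypothetical input: the words of the page in reading order.
const pageWords = [
  "Intro", "TableBegin",
  "A1", "Widget", "10",
  "A2", "Gadget", "20",
  "TableEnd", "Footer",
];

// Return the words strictly between the two markers, or null if either
// marker is missing.
function wordsBetweenMarkers(words, beginMarker, endMarker) {
  const begin = words.indexOf(beginMarker);
  if (begin === -1) return null;
  const end = words.indexOf(endMarker, begin + 1);
  if (end === -1) return null;
  return words.slice(begin + 1, end);
}

// Split a flat list of cells into rows of columnCount cells each.
function chunkIntoRows(cells, columnCount) {
  const rows = [];
  for (let i = 0; i < cells.length; i += columnCount) {
    rows.push(cells.slice(i, i + columnCount));
  }
  return rows;
}

const cells = wordsBetweenMarkers(pageWords, "TableBegin", "TableEnd");
const markedRows = cells ? chunkIntoRows(cells, 3) : [];
console.log(markedRows);
```

The row for a given Id can then be looked up by comparing the first cell of each row with the Id.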