Hi,
We are solution developers using Acrobat. We have a requirement to extract text from PDF files using C#, so we downloaded and installed the Adobe SDK. We found only four C# examples, and they only cover viewing a PDF in a Windows application. Can you please guide us on how to extract text from a PDF using the SDK in C#?
Thank you for your help.
Regards
kiranmai
Try page 135 of this document
More than likely, you'll need to write a plug-in to handle this, because IAC doesn't seem to support it through COM.
If you are creative in using the JS interface, you can extract all the words from a document. You would need to use a loop and put everything into an array or a List.
Take a look at page 311 of
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/js_api_reference.pdf
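As a rough illustration of the loop described above, here is a minimal sketch in JavaScript. It assumes an object that exposes numPages, getPageNumWords and getPageNthWord the way the Acrobat Doc object does; the function name extractWords and the fakeDoc stand-in are made up for illustration, so that the sketch can be run outside Acrobat.

```javascript
// Sketch: collect every word of a document through the Doc-style word API.
// Inside Acrobat, `doc` would be the Doc object (for example `this`).
function extractWords(doc) {
  var words = [];
  for (var p = 0; p < doc.numPages; p++) {
    var count = doc.getPageNumWords(p);
    // Word indices are zero-based, so the last valid index is count - 1.
    for (var w = 0; w < count; w++) {
      // Third argument (bStrip) = true asks for the word without
      // surrounding punctuation and whitespace.
      words.push(doc.getPageNthWord(p, w, true));
    }
  }
  return words;
}

// Hypothetical stand-in for a Doc object, only so the sketch runs outside Acrobat.
var fakeDoc = {
  pages: [["Hello", "world"], ["second", "page"]],
  numPages: 2,
  getPageNumWords: function (p) { return this.pages[p].length; },
  getPageNthWord: function (p, w, strip) { return this.pages[p][w]; }
};

console.log(extractWords(fakeDoc).join(" "));
```

From C#, the same calls would go through the object returned by GetJSObject rather than being run directly.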
Okay, so I went ahead and added text extraction to my own C# application. The client had requested it, and although we were originally told to skip it if it wasn't "cut and dry", it turned out not to be bad, so I gave them the feature they wanted. I decided to post the source code here for you. It returns the text of the entire document as a single string.
// Requires: using System; using System.Reflection; using System.Text; using Acrobat;
private static string GetText(AcroPDDoc pdDoc)
{
    StringBuilder text = new StringBuilder();
    // The JavaScript bridge object belongs to the document, so fetch it once.
    object jso = pdDoc.GetJSObject();
    if (jso == null)
    {
        return "";
    }
    Type jsoType = jso.GetType();
    int pages = pdDoc.GetNumPages();
    for (int i = 0; i < pages; i++)
    {
        try
        {
            object jsNumWords = jsoType.InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, new object[] { i }, null);
            int numWords = Convert.ToInt32(jsNumWords);
            // Word indices are zero-based, so stop before numWords.
            for (int j = 0; j < numWords; j++)
            {
                // bStrip = false keeps trailing punctuation and whitespace,
                // so the words can be concatenated directly.
                object jsWord = jsoType.InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, new object[] { i, j, false }, null);
                text.Append(jsWord as string);
            }
        }
        catch (Exception ex)
        {
            // Skip a page that fails rather than losing the whole document.
            System.Diagnostics.Debug.WriteLine("Page " + i + ": " + ex.Message);
        }
    }
    return text.ToString();
}
Hi, I can direct you to a programming guide that explains how to extract text and images from a PDF using C#.NET. Give it a try.
Hi,
Thank you for your guidance on extracting text from a PDF.
You showed how to extract text from a PDF using JavaScript objects. I checked the documents you pointed to, and they only contain code for extracting text. I also have a requirement to extract images, but those documents do not contain any code for that. Can you please guide me on how to extract images from a PDF?
Hi,
Thank you for your reply. Yes, I know that there are no methods to extract images from a PDF using C#, and I have also learned that it can be done with a C/C++ plug-in. However, the samples provided with the SDK contain only a text extraction plug-in, not image extraction. Since I develop our products in C#, I am not good enough at C/C++ to create a plug-in. Can you please guide me on how to create a plug-in that extracts images from a PDF using the Adobe SDK?
Hi,
I am using getPageNthWord and getPageNthWordQuads to extract words and their positions from a PDF. Now I have a requirement to get the font properties of each word as well, such as size, font name, italic or bold, etc. Is there a function like 'getPageNthWordQuads' that returns the font properties of an extracted word? Thanks, Kiranmai
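For reference, the word-and-position extraction described here can be sketched roughly as follows. This assumes getPageNthWordQuads returns a list of quads, each an array of eight numbers (four x, y corner pairs); the helper name wordsWithBoxes and the fakeDoc stand-in are hypothetical and exist only so the sketch runs outside Acrobat.

```javascript
// Sketch: pair each word on a page with a bounding box derived from its quads.
function wordsWithBoxes(doc, page) {
  var result = [];
  var count = doc.getPageNumWords(page);
  for (var w = 0; w < count; w++) {
    var quads = doc.getPageNthWordQuads(page, w);
    var xs = [];
    var ys = [];
    // A word can have more than one quad (for example when it is hyphenated
    // across lines), so gather the corners of all of them.
    for (var q = 0; q < quads.length; q++) {
      for (var k = 0; k < quads[q].length; k += 2) {
        xs.push(quads[q][k]);
        ys.push(quads[q][k + 1]);
      }
    }
    result.push({
      word: doc.getPageNthWord(page, w, true),
      left: Math.min.apply(null, xs),
      right: Math.max.apply(null, xs),
      bottom: Math.min.apply(null, ys),
      top: Math.max.apply(null, ys)
    });
  }
  return result;
}

// Hypothetical stand-in for a Doc object with a single one-word page.
var fakeDoc = {
  getPageNumWords: function (p) { return 1; },
  getPageNthWord: function (p, w, strip) { return "Hello"; },
  getPageNthWordQuads: function (p, w) { return [[10, 20, 50, 20, 10, 8, 50, 8]]; }
};

console.log(JSON.stringify(wordsWithBoxes(fakeDoc, 0)));
```

The box gives position and extent only; it carries no font information.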
Not with JavaScript. With a plug-in, yes, but you need to understand PDF internals better e.g. to realise why italic, bold, and size are not simple concepts. See http://forums.adobe.com/thread/1166866?start=0&tstart=0
Hi,
Thank you for your reply. I am now trying to create a sample plug-in in C++, but I am getting an error. Even when I try to build the Starter sample plug-in, I get the following error, and as I am new to C++ I am not able to solve it.
I have defined our environment as win_env in environ.h, and I am getting the error in ACROASSERT.h.
Please suggest how to solve this so that I can create a plug-in that meets my requirement.
Thanks
Kiranmai
It is possible to create a new project, but the requirements for setting it up are very complex. It is not worth wasting your time trying to solve the many problems you will get. For this reason, I recommend starting with one of the existing project files, or using the Wizard to create a new project. The project will build a file of type *.API like all of the other plug-ins.
I hope you have considered how you will communicate from your application to the plug-in. This is a challenging project in itself.
No, I don't have any idea how to make the plug-in communicate with my application. Based on your suggestion to use a plug-in to get word properties, I have only just started creating one. I am trying to build all the sample plug-ins provided with the SDK, but none of them are working. Basically, my requirement is to find the properties of words extracted from a PDF, such as font size, font weight and style. From my research I learned that words can be extracted from a PDF using pdfwordfinder, but I did not find that plug-in, so I am trying to create it with the help of the documentation and have not been able to. Can you please tell me whether a pdfwordfinder plug-in already exists, or whether I have to create it?
PDWordFinder is not a plug-in. It is a collection of methods you can use in your own plug-in.
A plug-in installs a series of routines to Acrobat which are called on certain events.
For example, a plug-in can add a menu item to Acrobat. When the user clicks your menu item, your routine is called. Then it can do the required task and report results. (Usually by popping up a message on screen if required, but it is not limited to this.) Are you happy to have your function called from a menu by the user of Acrobat?
OK. Can we use plug-ins from C#? I am developing the application in C#. I will try to create a plug-in, but I am not able to solve the error that I get when building the sample APIs. If I cannot solve that error, I can neither work with the existing sample APIs nor create a new one. Please suggest why I am getting that error and how to solve it.
"i am developing application in c#."
I recommend moving the whole application into a plug-in, where it will be convenient for the end user of Acrobat to run your PDF functionality in one place.
"please suggest why i am getting that error and how to solve it."
I have done so. Twice. Is my advice confusing, or does it seem not to apply to your problem?