kiranmai mora
Extract Text from pdf using C#

Mar 29, 2012 11:52 PM

Tags: #pdf #text #c# #extract

Hi,

 

We are solution developers using Acrobat. We have a requirement to extract text from PDFs using C#, so we downloaded and installed the Adobe SDK. We found only four examples in C#, and those only show how to view a PDF in a Windows application. Can you please guide us on how to extract text from a PDF using the SDK in C#?

 

 

Thank you for your help.

 

Regards

kiranmai

 
Replies
  • Mar 30, 2012 12:21 AM   in reply to kiranmai mora

    Have you read the documentation? Both for the native COM/.NET calls and for those available via the JSBridge?

     
  • Apr 4, 2012 7:22 AM   in reply to kiranmai mora

    Try page 135 of this document

     

    http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/plugin_apps_developer_guide.pdf

     

    More than likely, you'll need to write a plugin to handle this because IAC doesn't seem to support it through COM

     

     

    If you are creative in using the JS interface, you can extract all the words from a document. You would need to use a loop and put everything into an array or a List.
    Take a look at page 311 of

     

    http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/js_api_reference.pdf

     
  • Apr 4, 2012 8:33 AM   in reply to kiranmai mora

    Okay, so I went ahead and added text extraction to my own C# application, since the client had requested the feature anyway. We had originally been told to skip it if it wasn't "cut and dried", but it wasn't bad, so I gave the client the text extraction they wanted. I decided to post the source code here for you. It returns the text of the entire document as a string.

     

           // Requires "using System;", "using System.Reflection;" and "using System.Text;",
           // plus a project reference to the Acrobat type library.
           private static string GetText(AcroPDDoc pdDoc)
           {
               StringBuilder text = new StringBuilder();
               int pages = pdDoc.GetNumPages();

               // The JavaScript bridge object exposes the Acrobat JS Doc methods.
               object jso = pdDoc.GetJSObject();
               if (jso == null)
               {
                   return "";
               }

               for (int i = 0; i < pages; i++)
               {
                   try
                   {
                       object[] args = new object[] { i };
                       object jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
                       int numWords = Int32.Parse(jsNumWords.ToString());

                       // Word indices are zero-based, so the last valid index is numWords - 1.
                       for (int j = 0; j < numWords; j++)
                       {
                           // The third argument (false) asks Acrobat not to strip
                           // punctuation and white space from the word.
                           object[] argsj = new object[] { i, j, false };
                           object jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
                           text.Append(jsWord as string);
                       }
                   }
                   catch (Exception)
                   {
                       // A page that cannot be read is skipped; log the exception
                       // here if you need to know why.
                   }
               }

               return text.ToString();
           }

     
  • May 30, 2012 2:13 PM   in reply to Eldrarak82

    The code sample is very helpful.

    It might be even better if BindingFlags were written with its full qualifier, System.Reflection.BindingFlags.

    That way a beginner, or a less alert .NET user, would not have to search for the cause when the class does not have

    using System.Reflection;
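
    To illustrate the suggestion, here is a minimal sketch of a late-bound call that spells out System.Reflection.BindingFlags in full. The JsBridge class and its Call method are hypothetical helpers, not part of the Acrobat SDK, and FakeDoc is only a stand-in for the object returned by pdDoc.GetJSObject() so that the sketch can run without Acrobat installed.

```csharp
using System;

public static class JsBridge
{
    // Hypothetical helper (not part of the Acrobat SDK): calls a method by name
    // on a late-bound object such as the one returned by pdDoc.GetJSObject().
    public static object Call(object target, string method, params object[] args)
    {
        if (target == null) throw new ArgumentNullException("target");

        // BindingFlags is fully qualified, so this file compiles without
        // a "using System.Reflection;" directive.
        return target.GetType().InvokeMember(
            method,
            System.Reflection.BindingFlags.InvokeMethod,
            null, target, args, null);
    }
}

// Stand-in for the JavaScript object; it only mimics the two method names
// used in this thread and holds hard-coded sample words.
public class FakeDoc
{
    private readonly string[][] pages =
    {
        new[] { "Hello ", "world. " },
        new[] { "Bye." }
    };

    public int getPageNumWords(int page)
    {
        return pages[page].Length;
    }

    public string getPageNthWord(int page, int word, bool strip)
    {
        string w = pages[page][word];
        // In this stand-in, strip == true only trims surrounding white space.
        return strip ? w.Trim() : w;
    }
}
```

    With the real bridge you would pass the object from pdDoc.GetJSObject() in place of a FakeDoc instance, for example JsBridge.Call(jso, "getPageNumWords", 0).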

     
  • Jun 26, 2012 8:40 AM   in reply to kiranmai mora

    There are no APIs exposed from Acrobat to C# for extracting images or for OCR.

     
  • Jul 2, 2012 5:45 AM   in reply to kiranmai mora

    You cannot use .NET by itself to extract images from a PDF using the Acrobat SDK. You would have to write a plug-in in C/C++ and then call the plug-in from .NET.

     
  • Jul 3, 2012 12:14 AM   in reply to kiranmai mora

    The samples are only there to illustrate some points of the SDK. There are thousands of possible tasks with plug-ins, perhaps millions. Writing the plug-in is _your_ job.

     
  • Jul 4, 2012 12:13 AM   in reply to Test Screen Name

    Hi, I can direct you to a programming guide that explains how to extract text and images from a PDF using C#.NET. Give it a try.

     
  • Jul 5, 2012 5:29 AM   in reply to kiranmai mora

    There are no methods for extracting images using C# with the Acrobat SDK.

     
  • Jul 5, 2012 6:38 AM   in reply to kiranmai mora

    The easiest thing to do would be to simply run the “Extract All Images” command using the AVCommand APIs.   That will handle all the complexities for you.

     
  • Mar 13, 2013 12:44 AM   in reply to kiranmai mora

    Not with JavaScript. With a plug-in, yes, but you need to understand PDF internals better e.g. to realise why italic, bold, and size are not simple concepts. See http://forums.adobe.com/thread/1166866?start=0&tstart=0

     
  • Mar 14, 2013 1:36 AM   in reply to kiranmai mora

    What do you mean "defined our environment as WIN_ENV"? How did you do this, and why did you have to?

     

    Are you using the pre-made project file for the sample plug-in - not trying to create a new project?

     
  • Mar 14, 2013 1:52 AM   in reply to Test Screen Name

    By the way, one possible cause of problems compiling is trying to use plug-in code in your own EXE. You cannot, it is only made to be plugged in to the Acrobat EXE (hence the name).

     
  • Mar 14, 2013 2:55 AM   in reply to kiranmai mora

    It is possible to create a new project, but the requirements for setting it up are very complex. It is not worth wasting your time trying to solve the many problems you will get. For this reason, I recommend starting with one of the existing project files, or using the Wizard to create a new project. The project will build a file of type *.API like all of the other plug-ins.

     

    I hope you have considered how you will communicate from your application to the plug-in. This is a challenging project in itself.

     
  • Mar 14, 2013 3:25 AM   in reply to kiranmai mora

    PDWordFinder is not a plug-in. It is a collection of methods you can use in your own plug-in.

    A plug-in installs a series of routines to Acrobat which are called on certain events.

     

    For example, a plug-in can add a menu item to Acrobat. When the user clicks your menu item, your routine is called. Then it can do the required task and report results. (Usually by popping up a message on screen if required, but it is not limited to this.)  Are you happy to have your function called from a menu by the user of Acrobat?

     
  • Mar 14, 2013 3:46 AM   in reply to kiranmai mora

    "i am developing application in c#."

    I recommend moving the whole application into a plug-in, where it will be convenient for the end user of Acrobat to run your PDF functionality in one place.

     

    "please suggest why i am getting that error and how to solve it."

    I have done so. Twice. Is my advice confusing, or does it seem not to apply to your problem?

     