28 Replies Latest reply: Mar 14, 2013 3:46 AM by Test Screen Name

    Extract Text from pdf using C#

    kiranmai mora

      Hi,

      We are solution developers working with Acrobat. We need to extract text from PDF files using C#, so we downloaded and installed the Adobe SDK. We found only four C# examples, and they only cover viewing a PDF in a Windows application. Can you please guide us on how to extract text from a PDF using the SDK in C#?

      Thank you for your help.

      Regards

      kiranmai

        • 1. Re: Extract Text from pdf using C#
          lrosenth Adobe Employee

          Have you read the documentation, both for the native COM/.NET calls and for those available via the JSBridge?

          • 2. Re: Extract Text from pdf using C#
            kiranmai mora Community Member

            Thank you for your quick reply. Can you please suggest the right document to help me? Most of the documentation seems to be meant for C/C++ developers.

             

            Regards

            Kiranmai

            • 3. Re: Extract Text from pdf using C#
              Eldrarak82 Community Member

              Try page 135 of this document

               

              http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/plugin_apps_developer_guide.pdf

               

              More than likely, you'll need to write a plug-in to handle this, because IAC doesn't seem to support it through COM.

               

               

              If you are creative in using the JS interface, you can extract all the words from a document. You would need to use a loop and put everything into an array or a List.
              Take a look at page 311 of

               

              http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/js_api_reference.pdf

              • 4. Re: Extract Text from pdf using C#
                Eldrarak82 Community Member

                Okay, so I went ahead and added text extraction to my own C# application. The client had requested this feature anyway; we were originally told to skip it if it wasn't "cut and dried", but it turned out not to be bad, so I gave the client the text extraction they wanted. I decided to post the source code here for you. It returns the text of the entire document as a single string.

                 

                        // Requires: using System; using System.Reflection; using System.Text;
                        // and a project reference to the Acrobat type library (for AcroPDDoc).
                        private static string GetText(AcroPDDoc pdDoc)
                        {
                            StringBuilder text = new StringBuilder();

                            // The JavaScript bridge object belongs to the document, so fetch it once.
                            object jso = pdDoc.GetJSObject();
                            if (jso == null)
                            {
                                return "";
                            }

                            int pages = pdDoc.GetNumPages();
                            for (int i = 0; i < pages; i++)
                            {
                                try
                                {
                                    object[] args = new object[] { i };
                                    object jsNumWords = jso.GetType().InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, args, null);
                                    int numWords = Convert.ToInt32(jsNumWords);

                                    // Word indexes are zero-based, so stop before numWords.
                                    for (int j = 0; j < numWords; j++)
                                    {
                                        // bStrip = false keeps each word's trailing punctuation and
                                        // white space, so the words can be appended as they are.
                                        object[] argsj = new object[] { i, j, false };
                                        object jsWord = jso.GetType().InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, argsj, null);
                                        text.Append(jsWord as string);
                                    }
                                }
                                catch (Exception)
                                {
                                    // Best effort: skip a page whose words cannot be read.
                                }
                            }

                            return text.ToString();
                        }

                • 5. Re: Extract Text from pdf using C#
                  gg.fs

                  The code sample is very helpful.

                  It might be even better if BindingFlags were written with its full qualifier, System.Reflection.BindingFlags, so that a beginner or a less alert .NET user would not have to search for it when the class does not have

                  using System.Reflection
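                  A minimal sketch of that suggestion, assuming the late-bound calls from the reply above. JsBridge and Call are hypothetical names used only for illustration; they are not part of the Acrobat SDK. The sketch wraps InvokeMember in one place and spells out System.Reflection.BindingFlags in full, so the file compiles without a using System.Reflection directive.

```csharp
using System;

// Hypothetical helper (not part of the Acrobat SDK): wraps the late-bound
// call so the reflection flags are written out in exactly one place.
public static class JsBridge
{
    // Invokes a public instance method by name on a late-bound object, such
    // as the object returned by pdDoc.GetJSObject(). BindingFlags is fully
    // qualified, so no "using System.Reflection;" directive is needed.
    public static object Call(object target, string method, params object[] args)
    {
        if (target == null) throw new ArgumentNullException(nameof(target));

        return target.GetType().InvokeMember(
            method,
            System.Reflection.BindingFlags.InvokeMethod
                | System.Reflection.BindingFlags.Public
                | System.Reflection.BindingFlags.Instance,
            null,      // use the default binder
            target,
            args);
    }
}
```

                  With a helper like this, the word count for the first page would be read as JsBridge.Call(jso, "getPageNumWords", 0), where jso is the object obtained from GetJSObject() in the earlier sample.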

                  • 6. Re: Extract Text from pdf using C#
                    kiranmai mora Community Member

                    Thank you for your support; it helped me a lot in extracting text from PDF files.

                    Can you please suggest how to extract images from a PDF, and also how to extract text from an image-based PDF, in C#?

                    • 7. Re: Extract Text from pdf using C#
                      lrosenth Adobe Employee

                      There are no APIs exposed from Acrobat to C# for extracting images or for OCR.

                      • 8. Re: Extract Text from pdf using C#
                        kiranmai mora Community Member

                        Thank you for your reply. Can you please suggest how to extract images from a PDF with the Adobe SDK using any other .NET-supported language (other than C#)?

                        • 9. Re: Extract Text from pdf using C#
                          lrosenth Adobe Employee

                          You cannot use .NET by itself to extract images from a PDF using the Acrobat SDK. You would have to write a plug-in in C/C++ and then call the plug-in from .NET.

                          • 10. Re: Extract Text from pdf using C#
                            kiranmai mora Community Member

                            Thank you for your reply. The samples provided with the SDK do not include one that extracts images from a PDF. Can you please provide a plug-in in C/C++ to extract images from a PDF?

                            • 11. Re: Extract Text from pdf using C#
                              Test Screen Name MVP

                              The samples are only there to illustrate some points of the SDK. There are thousands of possible tasks with plug-ins, perhaps millions. Writing the plug-in is _your_ job.

                              • 12. Re: Extract Text from pdf using C#
                                dev_peter

                                Hi, I can direct you to a programming guide that explains how to extract text and images from a PDF using C#.NET. Have a try.

                                • 13. Re: Extract Text from pdf using C#
                                  kiranmai mora Community Member

                                  Hi,

                                  Thank you for your guidance on extracting text from a PDF.

                                  You guided me to extract text using JavaScript objects. I checked the documents you pointed to, and they contain code only for extracting text. I also need to extract images, but those documents do not contain code for that. Can you please guide me on extracting images from a PDF?

                                  • 14. Re: Extract Text from pdf using C#
                                    lrosenth Adobe Employee

                                    There are no methods for extracting images using C# with the Acrobat SDK.

                                    • 15. Re: Extract Text from pdf using C#
                                      kiranmai mora Community Member

                                      Hi,

                                      Thank you for your reply. Yes, I know that there are no methods for extracting images from a PDF using C#, and I have also learned that it can be done with a C/C++ plug-in. However, the samples provided with the SDK include only a text extraction plug-in, not image extraction. As I develop our products in C#, I am not good enough at C/C++ to create a plug-in. Can you please guide me on how to create a plug-in that extracts images from a PDF using the Adobe SDK?

                                      • 16. Re: Extract Text from pdf using C#
                                        lrosenth Adobe Employee

                                        The easiest thing to do would be to simply run the “Extract All Images” command using the AVCommand APIs.   That will handle all the complexities for you.

                                        • 17. Re: Extract Text from pdf using C#
                                          kiranmai mora Community Member

                                          Hi,

                                          I am using getPageNthWord and getPageNthWordQuads to extract words and their positions from a PDF. Now I also need each word's font properties, such as size, font name, and italic or bold. Is there a function like 'getPageNthWordQuads' that returns the font properties of a word extracted from a PDF?

                                          Thanks

                                          Kiranmai

                                          • 18. Re: Extract Text from pdf using C#
                                            Test Screen Name MVP

                                            Not with JavaScript. With a plug-in, yes, but you need to understand PDF internals better e.g. to realise why italic, bold, and size are not simple concepts. See http://forums.adobe.com/thread/1166866?start=0&tstart=0

                                            • 19. Re: Extract Text from pdf using C#
                                              kiranmai mora Community Member

                                              Hi,

                                              Thank you for your reply. I am now trying to create a sample plug-in in C++, but I am getting an error. Even when I try to build the Starter sample plug-in I get the following error, and as I am new to C++ I am not able to solve it.

                                              I have defined our environment as win_env in environ.h, and I am getting the error in ACROASSERT.h.

                                              Please suggest how to solve this so that I can create a plug-in that meets my requirement.

                                              Error.png

                                              Thanks

                                              Kiranmai

                                              • 20. Re: Extract Text from pdf using C#
                                                Test Screen Name MVP

                                                What do you mean "defined our environment as WIN_ENV"? How did you do this, and why did you have to?

                                                 

                                                Are you using the pre-made project file for the sample plug-in - not trying to create a new project?

                                                • 21. Re: Extract Text from pdf using C#
                                                  Test Screen Name MVP

                                                  By the way, one possible cause of problems compiling is trying to use plug-in code in your own EXE. You cannot, it is only made to be plugged in to the Acrobat EXE (hence the name).

                                                  • 22. Re: Extract Text from pdf using C#
                                                    kiranmai mora Community Member

                                                    No, I started with a new project, and in that new project I added the environ.h header file, defining the environment as Windows.

                                                    I am only using the pre-made project file as a reference, as I am new to C++.

                                                    • 23. Re: Extract Text from pdf using C#
                                                      kiranmai mora Community Member

                                                      Sorry, I am confused. I get the error when I open the Starter sample plug-in in Visual Studio and debug it. Do you mean that a plug-in cannot be debugged in Visual Studio? If so, can't we create a plug-in using Visual Studio?

                                                      • 24. Re: Extract Text from pdf using C#
                                                        Test Screen Name MVP

                                                        It is possible to create a new project, but the requirements for setting it up are very complex. It is not worth wasting your time trying to solve the many problems you will get. For this reason, I recommend starting with one of the existing project files, or using the Wizard to create a new project. The project will build a file of type *.API like all of the other plug-ins.

                                                         

                                                        I hope you have considered how you will communicate from your application to the plug-in. This is a challenging project in itself.

                                                        • 25. Re: Extract Text from pdf using C#
                                                          kiranmai mora Community Member

                                                          No, I have no idea how to make the plug-in communicate with my application. I only started on a plug-in because you suggested using one to get word properties. I have tried to build all the sample plug-ins provided with the SDK, but none of them works. Basically, my requirement is to find properties of the words extracted from a PDF, such as font size, font weight, and style. From my research I learned that pdfwordfinder can extract words from a PDF, but I could not find that plug-in, so I am trying to create it with the help of the documentation, without success. Can you please tell me whether a pdfwordfinder plug-in already exists or whether I have to create it?

                                                          • 26. Re: Extract Text from pdf using C#
                                                            Test Screen Name MVP

                                                            PDWordFinder is not a plug-in. It is a collection of methods you can use in your own plug-in.

                                                            A plug-in installs a series of routines to Acrobat which are called on certain events.

                                                             

                                                            For example, a plug-in can add a menu item to Acrobat. When the user clicks your menu item, your routine is called. Then it can do the required task and report results. (Usually by popping up a message on screen if required, but it is not limited to this.)  Are you happy to have your function called from a menu by the user of Acrobat?

                                                            • 27. Re: Extract Text from pdf using C#
                                                              kiranmai mora Community Member

                                                              OK, can we use plug-ins in C#? I am developing my application in C#. I will try to create a plug-in, but I am not able to solve the error I get when building the sample plug-ins. If I cannot solve that error, I can neither work with the existing samples nor create a new plug-in. Please suggest why I am getting that error and how to solve it.

                                                              • 28. Re: Extract Text from pdf using C#
                                                                Test Screen Name MVP

                                                                "i am developing application in c#."

                                                                I recommend moving the whole application into a plug-in, where it will be convenient for the end user of Acrobat to run your PDF functionality in one place.

                                                                 

                                                                "please suggest why i am getting that error and how to solve it."

                                                                I have done so. Twice. Is my advice confusing, or does it seem not to apply to your problem?