How can I get special characters and output them to a txt file in Acrobat with JS?
Example: ® ™ è
I can get them,
but why is the ™ lost when I save these characters to a txt file?
function getword2(doc) {
    var i, j, ckWord, numWords, aWords = [];
    for (i = 0; i < doc.numPages; i++) {
        var bWords = [];
        numWords = doc.getPageNumWords(i);
        for (j = 0; j < numWords; j++) {
            ckWord = doc.getPageNthWord(i, j, false);
            if (ckWord) {
                bWords.push(ckWord); // Add word to array
            }
        }
        aWords.push(bWords);
    }
    return aWords.join('\r\n');
}
function output_csv2() {
    var outputString = getword2(this);
    this.createDataObject("output.txt", outputString);
    this.exportDataObject({ cName: "output.txt", nLaunch: "2" });
}
JavaScript strings are Unicode, stored as 16-bit (UTF-16) code units, which can represent just about any character in existence. A plain text file written with an 8-bit encoding such as ASCII or ANSI, on the other hand, only provides for Western European characters plus punctuation and a few control characters left over from early teletype machines. So, if the text scraped from the PDF page does not have an easy translation to that 8-bit encoding, it will be replaced with garbage or dropped.
Note: by "easy translation" I mean that the upper 8 bits of the 16-bit Unicode value are 00.
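To make the "upper 8 bits are 00" test concrete, here is a minimal sketch in plain JavaScript. The helper name fitsInEightBits is made up for this illustration and is not part of the Acrobat API; it only checks whether every character code in a string is 0xFF or lower.

```javascript
// Hypothetical helper (not an Acrobat API): returns true when every
// UTF-16 code unit in the text fits in one byte, i.e. its upper 8 bits are 00.
function fitsInEightBits(text) {
    for (var i = 0; i < text.length; i++) {
        if (text.charCodeAt(i) > 0xFF) {
            return false;
        }
    }
    return true;
}

var registered = fitsInEightBits("\u00AE"); // ® is 0x00AE -> true
var eGrave = fitsInEightBits("\u00E8");     // è is 0x00E8 -> true
var trademark = fitsInEightBits("\u2122");  // ™ is 0x2122 -> false
```

This matches the symptom in the question: ® and è pass the check, while ™ does not.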
What should I do?
Can I save to a different file format?
Thanks!
UP
Sorry about the late reply. I actually wrote this a week ago and it didn't get posted.
*********************************************************************************************
That's a really good question. Not one I've thought about before. The first thing to do is some more testing to determine your exact situation. Find out the exact character codes that are causing this issue. It's entirely possible that the problem is in the text file viewer, and not with the character codes.
Modify your script to list the words and word indexes for a single page you know has this issue. Once you know the index of a word with problem characters, you can use this script in the console window to find the Unicode code of the problem character (nPage is the page number, nIndex is the word index on that page, and n is the position of the character within the word):
var cWord = this.getPageNthWord(nPage, nIndex, false); // false keeps punctuation and symbols
var cCode = cWord.charCodeAt(n).toString(16); // character code as a hex string
For example, the character codes for the 3 characters you've listed in the post are:
® = 0xAE, ANSI code
™ = 0x2122 in Unicode, also 0x99 in ANSI
è = 0xE8, ANSI code
Except for the trademark, these characters have the same code in ANSI as in Unicode. ANSI is an 8-bit extension of the 7-bit ASCII codes that covers special symbols. A good plain text viewer should display these symbols since they are still 8 bit. Maybe if you view the text in something different you'll see them.
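If checking one character at a time in the console gets tedious, something along these lines lists the code of every character in a word at once. This is only a sketch: listCharCodes is a hypothetical helper name, and in Acrobat you would pass it a word returned by getPageNthWord instead of the literal string used here.

```javascript
// Hypothetical helper: returns one "char = 0xNN" entry per character in the word.
function listCharCodes(word) {
    var codes = [];
    for (var i = 0; i < word.length; i++) {
        codes.push(word.charAt(i) + " = 0x" + word.charCodeAt(i).toString(16));
    }
    return codes;
}

// Sample word containing the three characters from the question.
var codeList = listCharCodes("\u00AE\u2122\u00E8").join(", ");
// codeList is "® = 0xae, ™ = 0x2122, è = 0xe8"
```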
The only other alternative is to create a different kind of file format, which is outside the scope of what we can do on the forum.
Thanks very much.
Maybe I need to find other solutions.
What exactly is it that you are trying to do? Perhaps we can suggest another approach.
I have a PDF file and an XLS file.
The PDF file, which has hundreds of pages, is going to be printed,
and I must check the content of every page against the XLS file.
If the special characters on every page can be read,
it will be easy to compare them.
You might need to encode the string you're writing to the data file as UTF-8 for it to work. In order to do that you should use the setDataObjectContents method and a UTF-8 encoded stream, instead of setting the contents in the createDataObject method directly.
Can you give an example?
Please!
Something like this:
this.createDataObject("output.txt", "");
this.setDataObjectContents("output.txt", util.streamFromString(outputString, "utf-8"));
this.exportDataObject({ cName:"output.txt", nLaunch: "2"});
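To see why writing the file as UTF-8 keeps the ™, it may help to look at the bytes a UTF-8 encoder produces for it. The sketch below is a hand-written encoder in plain JavaScript for illustration only; utf8Bytes and toHex are hypothetical helper names, and this is not necessarily how util.streamFromString works internally.

```javascript
// Hypothetical helper: encodes a JavaScript string as an array of UTF-8 byte values.
function utf8Bytes(text) {
    var bytes = [];
    for (var i = 0; i < text.length; i++) {
        var code = text.charCodeAt(i);
        // Combine a surrogate pair into a single code point above 0xFFFF.
        if (code >= 0xD800 && code <= 0xDBFF && i + 1 < text.length) {
            var low = text.charCodeAt(i + 1);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                code = 0x10000 + ((code - 0xD800) << 10) + (low - 0xDC00);
                i++;
            }
        }
        if (code < 0x80) {
            bytes.push(code); // plain ASCII: 1 byte
        } else if (code < 0x800) {
            bytes.push(0xC0 | (code >> 6), 0x80 | (code & 0x3F)); // 2 bytes
        } else if (code < 0x10000) {
            bytes.push(0xE0 | (code >> 12),
                       0x80 | ((code >> 6) & 0x3F),
                       0x80 | (code & 0x3F)); // 3 bytes
        } else {
            bytes.push(0xF0 | (code >> 18),
                       0x80 | ((code >> 12) & 0x3F),
                       0x80 | ((code >> 6) & 0x3F),
                       0x80 | (code & 0x3F)); // 4 bytes
        }
    }
    return bytes;
}

// Hypothetical helper: formats byte values as two-digit hex separated by spaces.
function toHex(bytes) {
    var parts = [];
    for (var i = 0; i < bytes.length; i++) {
        parts.push((bytes[i] < 16 ? "0" : "") + bytes[i].toString(16));
    }
    return parts.join(" ");
}

var tmHex = toHex(utf8Bytes("\u2122"));  // ™ -> "e2 84 a2"
var regHex = toHex(utf8Bytes("\u00AE")); // ® -> "c2 ae"
var eHex = toHex(utf8Bytes("\u00E8"));   // è -> "c3 a8"
```

Every character ends up as one or more bytes, so nothing has to be squeezed into a single 8-bit code. The text viewer does need to open the exported file as UTF-8 for the characters to display correctly.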