When I extract selected text, some special characters such as ≤ Ω Β ∞ ≠ ≥ are displayed as junk characters.
Please suggest how I can get all characters.
Thanks.
Can you post the actual code fragment you are using?
TextSelect = AVPageViewTrackText(pageView, xHit, yHit, NULL);
PDDoc pdDoc = AVDocGetPDDoc(AVAppGetActiveDoc());
PDPage pdPage = AVPageViewGetPage(pageView);
int iPage = PDPageGetNumber(pdPage);
BKMCRot = PDPageGetRotate(pdPage);
if (TextSelect != NULL)
{
    PDTextSelectEnumText(TextSelect, ASCallbackCreateProto(PDTextSelectEnumTextProc, BKMCTextEnumProc), NULL);
    PDTextSelectEnumQuads(TextSelect, ASCallbackCreateProto(PDTextSelectEnumQuadProc, BKMCTextEnumQuadProc), NULL);
    AVPageViewHighlightText(pageView, TextSelect);
    ASBool bselection = AVDocSetSelection(AVAppGetActiveDoc(), ASAtomFromString("BMCreatorText"), TextSelect, true);
}
ACCB1 ASBool ACCB2 BKMCTextEnumProc(void* procObj, PDFont pdFont, ASFixed size, PDColorValue Color, char *buff, ASInt32 asLen)
{
    //for getting Font Size
    int iFontSize = FixedRoundToInt16(size);
    csFontSize.Format(L"%d", iFontSize);
    //For Getting Text Color
    long Textcolor = CPDFLink::GetRGBFromPDColor(*Color);
    long lRValue, lGValue, lBValue;
    CString csRVal, csGVal, csBVal;
    CColor::COLORREFToRGB(Textcolor, lRValue, lGValue, lBValue);
    csRVal.Format(L"%d", lRValue);
    csGVal.Format(L"%d", lGValue);
    csBVal.Format(L"%d", lBValue);
    csColorValue = (L"R=") + csRVal + (" G=") + csGVal + (" B=") + csBVal;
    //for Getting Font name
    char fontNameBuf[PSNAMESIZE];
    PDFontGetName(pdFont, fontNameBuf, PSNAMESIZE);
    csFontname = (CString)fontNameBuf;
    //For multiple words we need to add each time.
    CString csChar;
    for (int iIndex = 0; iIndex < asLen; iIndex++)
    {
        char cBuff = buff[iIndex];
        if (cBuff != 13 && cBuff != 10)
        {
            csChar = cBuff;
            csBKMCKeyword += csChar;
        }
    }
    buff = "";
    return true;
}
Two things…
1 – Use PDTextSelectEnumTextUCS, as that returns UCS (aka Unicode) encoded text, so you can be sure to get all text in a standardized form.
2 – Be careful with CString as (IIRC) it's not great for arbitrary encodings.
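To illustrate the first point: the original callback appends one `char` at a time, which cannot work once each character occupies two bytes. Below is a minimal sketch of reassembling 16-bit code units from a raw byte buffer. It assumes the buffer is big-endian with no byte-order mark; that layout is an assumption to verify against the SDK documentation, and `decodeBigEndian16` is a hypothetical helper name, not an SDK call.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the Acrobat SDK).
// Assumption: `buf` holds `len` bytes of big-endian 16-bit code units.
// A trailing odd byte, if any, is ignored.
std::u16string decodeBigEndian16(const char* buf, std::size_t len) {
    std::u16string out;
    out.reserve(len / 2);
    for (std::size_t i = 0; i + 1 < len; i += 2) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned hi = static_cast<unsigned char>(buf[i]);
        const unsigned lo = static_cast<unsigned char>(buf[i + 1]);
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

With this, the byte pair 0x22 0x64 becomes the single code unit U+2264 (≤) instead of two unrelated narrow characters.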
Thanks for the reply.
How can I extract special characters using a PDWordFinder object?
When I extract "≤ Ω Β ∞ ≠ ≥", it is displayed as " = O . 8 . = ".
I am using the following code:
ACCB1 ASBool ACCB2 SearchTextBasedonSelectedFont(PDWordFinder wObj, PDWord wInfo, ASInt32 pgNum, void* clientData)
{
    CString csText;// = "";
    CString csColor, csBlueText;
    COLORREF wordColor = NULL;
    char buf[256];
    bool NonAlphaNum = false, LeadingPunc = false, LeadingSpace = false;
    ASInt32 liStyleIndex;
    bool bcolorValue = false;
    static int FirstOccurence = 0;
    PDStyle pdWordStyle;
    PDColorValueRec color;
    color.space = PDDeviceRGB;
    color.value[0] = color.value[1] = color.value[2] = color.value[3] = fixedZero;
    liStyleIndex = 0;
    ASFixedQuad quad;
    ASInt16 llAttr, liNumQuads;
    long llCurSequence;
    llCurSequence = ++(*(long*)clientData);
    try
    {
        //Get the word color
        if ((pdWordStyle = PDWordGetNthCharStyle(wObj, wInfo, liStyleIndex)) != NULL)
        {
            PDStyleGetColor(pdWordStyle, &color);
        }
        // To get the word in buffer
        PDWordGetString(wInfo, buf, 256);
        csText = buf;
        PDStyleGetFont(pdWordStyle);
        PDStyle aoPDWordStyle = PDWordGetNthCharStyle(wObj, wInfo, 0);
        liNumQuads = PDWordGetNumQuads(wInfo);
    }
}
Thanks..
If you use the UCS word finder you will get a Unicode word string. This must be treated as an array of WCHAR. You NEED to understand Unicode encoding.
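As a rough illustration of "treat it as an array of WCHAR": the sketch below converts UTF-16 code units into a `std::wstring`. On Windows, where `wchar_t` is 16 bits, the units are copied unchanged; on platforms with a 32-bit `wchar_t`, surrogate pairs are combined into one code point. `toWide` is a hypothetical helper name, and how the code units are obtained from the word finder is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: UTF-16 code units -> std::wstring.
std::wstring toWide(const std::u16string& in) {
    std::wstring out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        // Only combine surrogates when wchar_t can hold a full code point.
        if (sizeof(wchar_t) > 2 && isHigh && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        }
        out.push_back(static_cast<wchar_t>(cp));
    }
    return out;
}
```

The resulting wide string can then be handed to Unicode-aware APIs; assigning it to a narrow `char` buffer would lose the characters again.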
Thank you for the reply.
I will look into Unicode encoding concepts.
Hi Test Screen,
I am using PDDocCreateWordFinderUCS to create the word finder.
Then I get the ASText from each PDWord,
and then I encode the ASText to get all characters.
Is this the correct approach?
Does it give the expected answer?
If not, what encoding do you choose when you convert the ASText?
It shows some other format.
I used the following code to encode the ASText:
ASText nextWordASText = ASTextNew();
PDWordGetASText(nextWord, 0, nextWordASText);
CString TextObjectTxt = ASTextGetEncoded(nextWordASText, ASTextGetBestEncoding(nextWordASText, (ASHostEncoding)PDGetHostEncoding()));
Thanks..
Your problem is that you're trying to convert from what is most likely Unicode (UTF-8 or UTF-16) to the host encoding (probably ISO 8859-1), which may not include those characters. And even if it does, are you then trying to display them in a font that doesn't include those glyphs?
Instead of using the host encoding, use the UTF-8 encoding, as that will give you back an ASCII-compatible string which will show you where the extra characters are (they appear as multi-byte UTF-8 sequences).
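To make the UTF-8 suggestion concrete, here is a minimal, SDK-independent sketch of encoding UTF-16 code units as UTF-8. `toUtf8` is a hypothetical helper name; inside a plug-in the SDK's own conversion routines would normally be used instead. ASCII characters stay as single bytes, while a character such as ≤ (U+2264) becomes the three bytes E2 89 A4.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: UTF-16 code units -> UTF-8 bytes.
// Unpaired surrogates are not validated in this sketch.
std::string toUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {                      // 1 byte: plain ASCII
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {              // 2 bytes
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {            // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {                              // 4 bytes
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Dumping the resulting bytes in hex is a quick way to check whether the special characters survived extraction before any display code is involved.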
Hi Irosenth,
I have used UTF-8 encoding, but it still does not work.
This is the code I used:
PDWord nextWord = dwAllWords[Wordindex];
CString TextObjectTxt;
ASText nextWordASText = ASTextNew();
PDWordGetASText(nextWord, 0, nextWordASText);
ASTextFromUnicode((ASUTF16Val*)nextWordASText, kUTF8);
TextObjectTxt = ASTextGetEncoded(nextWordASText, ASTextGetBestEncoding(nextWordASText, kUTF8));
DO NOT use ASTextFromUnicode or ASTextGetEncoded. Use ASTextGetUnicodeCopy( nextWordASText, kUTF8)
I still do not get all characters.
I used the following code:
PDWord nextWord = dwAllWords[Wordindex];
CString TextObjectTxt;
ASText nextWordASText = ASTextNew();
PDWordGetASText(nextWord, 0, nextWordASText);
ASTextGetUnicodeCopy( nextWordASText, kUTF8)
TextObjectTxt = ASTextGetEncoded(nextWordASText, ASTextGetBestEncoding(nextWordASText, kUTF8));
So, what is your host encoding? Does your host encoding include the characters ≤ Ω Β ∞ ≠ ≥? Tip: mine does not. I would have to work in Unicode. I think you are assuming something impossible will work.
What is your aim for these characters: please list all the ways you need them to work? (For example: only in one message that is popped up using the Windows MessageBox function) Please be detailed.
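The host-encoding question can be checked mechanically. The sketch below uses ISO 8859-1 (Latin-1) as a stand-in for the host encoding, which is an assumption; the real host code page may differ. Latin-1 covers exactly U+0000 to U+00FF, so none of ≤ Ω Β ∞ ≠ ≥ fit. Both function names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if every code unit is representable in Latin-1.
bool fitsInLatin1(const std::u16string& text) {
    for (char16_t unit : text) {
        if (unit > 0xFF) return false;
    }
    return true;
}

// Hypothetical helper: narrow to Latin-1, replacing anything
// unrepresentable with '?'.
std::string toLatin1Lossy(const std::u16string& text) {
    std::string out;
    out.reserve(text.size());
    for (char16_t unit : text) {
        out.push_back(unit <= 0xFF ? static_cast<char>(unit) : '?');
    }
    return out;
}
```

Some platform converters substitute a look-alike character instead of `?`, which may be what produces output such as `=`, `O`, and `8` for ≤, Ω, and ∞. Either way, the original character is gone once the text has been narrowed.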
Please answer my question about intended use.
I want to extract words from the PDF page using the word finder and display them in a tree view control.
During extraction, all words come out correctly, except for special characters contained in a word.
In place of the special characters, junk characters are displayed, such as (. , 8 , O).
I want to know how to display words that contain special characters.
OK, which API are you using for the tree view control? Some of them are not Unicode-aware, so this is impossible for them. Others may accept a Unicode string, but you must change your code.
I extracted all the words through PDWordFinder and added them to a list,
then inserted the words from the list directly into the tree view control.
I have not used any API to insert words into the tree view control.
First I get the ASText object from the PDWord.
Then, when I get the text from the ASText object, the junk characters already appear at that point.
I think I need to do something while extracting the string from the ASText object.
I do not understand what you mean by no API. Please help us understand in detail what you are doing. I know the tree view control as something used through the .NET API and also via the MS Access API.
The following code is used to create the tree view control:
cBMCTreeCtrl m_BKMCTreeCtrl = new cBMCTreeCtrl();
UINT nTreeStyles = WS_CHILD | WS_VISIBLE | WS_TABSTOP | WS_BORDER | TVS_LINESATROOT| TVS_HASLINES| TVS_HASBUTTONS| TVS_EDITLABELS | TVS_EX_MULTISELECT;
m_BKMCTreeCtrl->Create(nTreeStyles, crTreeView, this, IDC_BKMCreatorTreeView);
m_BKMCTreeCtrl->ShowWindow(SW_SHOW);
But when I get the text from the ASText object, the junk characters already appear at that point.
So, that is an API you are using! cBMCTreeCtrl appears to be a custom C++ API. You must refer to the documentation or code of that API to see if you can pass a Unicode string. You will NOT be able to use the same interface you used to set single-byte text. I think you still have not studied Unicode or the concept of encodings. This is vital to you.
However, I do have two suggestions on your next step.
1. Try and add directly the constant string "≤ Ω Β ∞ ≠ ≥" to your tree control, without using Acrobat. You may find this complex or impossible to solve, but once solved you will be ready to work with strings from Acrobat.
2. Also, copy and paste text from this same PDF into Word. Examine the Word document and make sure the characters ≤ Ω Β ∞ ≠ ≥ appear as you wish. If they do not appear in Word, you certainly cannot extract them from the PDF.
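For the first suggestion, one practical detail: a literal such as "≤ Ω Β ∞ ≠ ≥" typed into a source file depends on the file's encoding and the compiler settings. A hedged sketch that sidesteps this by spelling the characters as universal character names in a wide literal is shown below. `specialCharsSample` is a hypothetical name, and the third character is assumed to be Greek capital beta (U+0392).

```cpp
#include <cassert>
#include <string>

// Hypothetical test fixture: the six characters from the thread, separated
// by spaces, written as escapes so the source file encoding does not matter.
std::wstring specialCharsSample() {
    return L"\u2264 \u03A9 \u0392 \u221E \u2260 \u2265";
}
```

If this string displays correctly when inserted into the tree control through its wide-character interface, the display path is Unicode-capable and any remaining problem lies in how the text is taken out of Acrobat.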
In both cases, the expected result was achieved.
I need to work on the Acrobat string.