
I'm trying to remove the HTML from my Solr collections.

New Here, Apr 19, 2013

I've tried HTMLStripCharFilterFactory and StandardTokenizerFactory in schema.xml, but it's not working.

LEGEND, Apr 22, 2013

What are you indexing: files, a database, etc.?

Do you mean to strip out HTML when making the collection?

New Here, Apr 22, 2013

I'm indexing a SQL database.

The specific field contains the HTML for the selected page.

(<p> <span style="font-family etc...)

I've tried several things to remove the HTML from the search results. I've tried various strip-HTML functions that haven't worked, along with trying to do it in the Solr schema. I recently converted all of my collections over to Solr, hoping for better search results. In Verity, the following code (which is still in place; I don't know why it won't work in Solr) worked perfectly for removing the HTML.

*************************************

example a.

<!--- Normalize the search term: decode spaces, drop parentheses, turn slashes into spaces. --->
<cfset searchterm = rereplace(searchterm, '%20', ' ', 'all')>
<cfset searchterm = rereplace(searchterm, "acute", "'", "all")>
<cfset searchterm = rereplace(searchterm, "\(", "", "all")>
<cfset searchterm = rereplace(searchterm, "\)", "", "all")>
<cfset searchterm = rereplace(searchterm, "\/", " ", "all")>
<cfset searchterm = rereplace(searchterm, "\\", " ", "all")>

<cftry>
    <cfsearch name = "getSearchResults2"
        collection = "s_mysamplepage"
        criteria = "#searchterm#"
        status = "info"
        ContextPassages = "10"
        ContextBytes = "500"
        suggestions = "Always"
        contextHighlightBegin = "<font color=red><strong>"
        contextHighlightEnd = "</strong></font>">
    <!--- Show a message instead of an error page when the criteria cannot be parsed. --->
    <cfcatch>
        <cfoutput>
            <p> Invalid Search Criteria.</p>
        </cfoutput>
    </cfcatch>
</cftry>

****************************************************

Also included in the output query....

*******************************

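<cfoutput query="getSearchResults2">
    <cftry>
        <!--- Remove complete tags, then any tag fragment cut off at either end of the passage. --->
        <cfset cleaned = rereplace(Context, "<.*?>", "", "all")>
        <cfset cleaned = rereplace(cleaned, "<.*?$", "", "all")>
        <cfset cleaned = rereplace(cleaned, "^.*?>", "", "all")>
        <!--- Re-highlight the search term in the cleaned passage. --->
        <cfset cleaned = rereplaceNoCase(cleaned, "#searchterm#", "<font color=red><b>#searchterm#</b></font>", "all")>
        <cfset currPage = replace(URL, '/', '0', 'all')>
        <!--- The empty cfcatch silently swallows any error raised in the block above. --->
        <cfcatch></cfcatch>
    </cftry>
    <!--- Result markup was not included in the original post. --->
</cfoutput>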

********************************

I've been pulling my hair out trying to get this to work from the getSearchResults2 query.

Is it possible to strip out the HTML when making the collection?

What about stripping it during indexing?

Any help is appreciated.

Thanks

LEGEND, Apr 22, 2013

Seems to me stripping as you index would be more efficient for searching. What your code shows, however, is more than just stripping out HTML: you're also removing parentheses, slashes, etc.

As far as the HTML is concerned, one simple RegEx should do it.

(Editor is not letting me paste the URL for you to look at... )

Try this one more time..

There we go.. check that link and see if that helps.

^_^
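
A minimal sketch of the strip-while-indexing approach discussed above, assuming the collection is built from a query with cfindex. Solr analysis components such as HTMLStripCharFilterFactory change only the indexed tokens, not the stored text that comes back in the results, which may be why the schema change appeared to have no effect. The datasource, table, and column names (`myDatasource`, `pages`, `page_id`, `page_title`, `page_html`) are hypothetical; only the collection name comes from the thread.

```cfml
<!--- Hypothetical datasource, table, and column names; adjust to the real schema. --->
<cfquery name="getPages" datasource="myDatasource">
    SELECT page_id, page_title, page_html
    FROM pages
</cfquery>

<!--- Strip markup from each row before the text reaches Solr. --->
<cfloop query="getPages">
    <!--- Replace each complete tag with a space so adjacent words do not run together. --->
    <cfset plainText = REReplace(getPages.page_html, "<[^>]*>", " ", "all")>
    <!--- Collapse the runs of whitespace the removed tags leave behind. --->
    <cfset plainText = Trim(REReplace(plainText, "\s+", " ", "all"))>
    <!--- Write the cleaned text back into the current row of the query. --->
    <cfset QuerySetCell(getPages, "page_html", plainText, getPages.currentRow)>
</cfloop>

<!--- Rebuild the collection from the cleaned query; the collection name is from the thread. --->
<cfindex collection="s_mysamplepage"
         action="refresh"
         type="custom"
         query="getPages"
         key="page_id"
         title="page_title"
         body="page_html">
```

The simple `<[^>]*>` pattern does not decode entities such as `&nbsp;` or cope with a `>` inside an attribute value, so treat it as a starting point rather than a complete HTML stripper.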
