-
Notifications
You must be signed in to change notification settings - Fork 254
Features_SearchImprovements_DocSearch
Docs in regard to spacewalk/satellite can exist in 2 ways.
- Link to www.redhat.com/manuals/satellite (Spacewalk)
- Locally residing jsp files (Satellite)
Our goal is to create a lucene index representing this info and allow it to be searchable through the spacewalk search-server.
We will use "nutch" to generate the lucene index. This index will be packaged into a RPM and will be part of the process for building the docs. The search-server will use this data as another regular lucene index.
-
First case: Docs reside at http://www.redhat.com/docs/manuals/satellite/
-
Will need to perform a web crawl with nutch
-
Download nightly build at: http://hudson.zones.apache.org/hudson/job/Nutch-trunk/
-
untar build to something like /tmp/nutch-nightly
-
create a urls directory: mkdir /tmp/nutch-nightly/urls
-
NOTE this needs to be a directory, the tutorial from nutch still shows this as a text file, I had problems with this failing until I created the directory
-
create a urls.txt file consisting of the below line: vim /tmp/nutch-nightly/urls/urls.txt
-
update "http.agent.name" the crawl will fail if this is not done
-
edit conf/nutch-site.xml add an entry for "http.agent.name"
http.agent.name Spacewalk HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization.NOTE: You should also check other related properties:
http.robots.agents http.agent.description http.agent.url http.agent.email http.agent.version
and set their values appropriately.
-
additional entries to conf/nutch-site.xml to make the crawl quicker (at the expense of being rude to our target host)
fetcher.server.delay 0 The number of seconds the fetcher will delay between successive requests to the same server. fetcher.threads.per.host 50 This number is the maximum number of threads that should be allowed to access a host at one time. -
Edit conf/crawl-urlfilter.txt. We are setting it to only accept links for redhat.com and we are telling it to skip pdf documents (we can add pdf support back in later, taking it out now to simplify things and to go quicker)
-.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|pdf)$
+^http://([a-z0-9]*.)*redhat.com/
-
Run the crawl:
-
/tmp/nutch-nightly/bin/nutch crawl urls -dir crawl.data -depth 5 -threads 50 &> crawl.log
-
this takes on the order of ~15 minutes to complete with the above changes, or about 90 minutes without the changes.
-
Working with the output from nutch
-
End goal is to take the output and feed it into the search-server and have that perform the search
-
For a first cut to see if the data is valid, we can use the search interface nutch provides.
-
From the above run copy crawl.data to /tmp/redhat.com-test-data
-
Copy the .war file from the nutch build tomcat's webapps directory
-
cp nutch-nightly.war $TOMCAT_HOME/webapps/nutch.war
-
Restart tomcat, go into $TOMCAT_HOME/webapps/nutch/WEB-INF/classes
-
Edit nutch-site.xml
searcher.dir /tmp/redhat.com-test-data/ -
It looks like the nightly builds have a problem with "search.jsp". Below is the error message I was receiving
ERROR [jsp] - Servlet.service() for servlet jsp threw exception org.apache.jasper.JasperException: /search.jsp(172,22) Attribute value language + "/include/header.html" is quoted with " which must be escaped when used within the value
-
This is the workaround, edit $TOMCAT_HOME/webapps/nutch/search.jsp and escape the quotes. Line 172 for my nightly build.
<jsp:include page="<%= language + "/include/header.html"%>"/>
-
This issue appears to have been around since September 2008, I've seen 2 posts relating to it, but as of 11/21 it still hasn't been fixed in the nightly build.
-
Restart tomcat once more
-
Now go to the nutch webapp, example: http://127.0.0.1:8080/nutch
-
You can search through the data to see what will be available to our search app.
-
Integrating output from nutch into our search-server. Copy the generated indexes into the search-server.
-
"mkdir /usr/share/rhn/search/indexes/docs"
-
"cp /tmp/nutch-nightly/crawl.data/index/* /usr/share/rhn/search/indexes/docs"
-
to test, go to : http://localhost/rhn/help/Search.do (assuming localhost is running a devel version of spacewalk)
-
Adding "Summary Content" for matches
-
"cp /tmp/nutch-nightly/crawl.data/segments /usr/share/rhn/search/indexes/docs"
-
"mkdir /usr/share/rhn/search/nutch"
-
nutch must be told where to find it's plugin directory
-
Edit /tmp/nutch-nightly/conf/nutch-site.xml, add in the below change
<property>
<name>plugin.folders</name>
<value>/usr/share/rhn/search/nutch/plugins</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
- "cp /tmp/nutch-nightly/conf /usr/share/rhn/search/nutch"
- "cp /tmp/nutch-nightly/plugins /usr/share/rhn/search/nutch"
- At this point if you do a search, the surrounding text from the match should be displayed in the webui.
Issue | Status | Notes |
---|---|---|
----- | ||
add content from a surrounding match to webui | done | requires extra "segments" directory to be included with indexes |
indexes from www.redhat.com/manuals/satellite need the URLs to be restricted more, i.e. the crawl is picking up entries we don't care about. | done | |
Need to crawl/index in multiple languages. First non-english language selected will be Japanese. | done | supports all languages present in translated docs |
Webui needs to recognize language and pass it through to search server request. | done | the locale is passed in the xmlrpc call as an argument |
Need to setup a nutch configuration that will parse JSP docs | done | each language has it's own set of configuration files |
----- |
- API docs are not currently being indexed/searched.
Estimate: ~3 days
Do you want to contribute to this wiki? See page WikiContribute for more info.