Sergey Nikolaev
January 27, 2011 ・ Sphinx
Using wget as a simple web crawler for Sphinx search engine
The following is just a proof of concept showing that you can build a search engine for a web site with minimal tools and knowledge:
- wget
- bash
- Sphinx: the cool directive sql_file_field is used
- mysql
To make the data from a site searchable you need to follow a few simple steps:
- crawl the whole site with wget, preserving its original structure
- prepare the data for Sphinx
- index the data with Sphinx
Crawling
Just use wget -r ... -o log to crawl the whole site. I've crawled www.ivinco.com:
[snikolaev@r27 ivinco]$ wget -r "http://www.ivinco.com" -o log
UPDATE: It turned out that wget -nv -r "http://www.ivinco.com" -o log works better (plain -r sometimes doesn't work, depending on the wget version etc.)
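As a side note, you can reduce the filtering needed later by telling wget not to fetch binary/asset files at all. This is just a sketch: wget's -R/--reject takes a comma-separated list of suffixes, its exact behaviour depends on the wget version, and the list below simply mirrors the filtering done further down:
wget -nv -r --reject "gif,jpg,png,zip,css,js,ico,xml" "http://www.ivinco.com" -o log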
Preparing data for Sphinx
Any search result usually consists of at least two main things: a snippet and a link to the original document. Now that you have crawled the whole site and have all the contents, making snippets is not a problem: you can just use Sphinx's BuildExcerpts() function. But you also need a map between each document and its original URL, and wget doesn't generate such a map, so you have to build it yourself. After the previous step you should have the file 'log', where you can find entries like:
...
--05:21:14-- /feed/
=> `www.ivinco.com/feed/index.html'
...
This is the map you need; it just has to be normalized a bit:
perl -i -pe 'undef $/; s/\n+\s+=\>\s+/=>/smg' log
Now you have:
...
--05:21:14-- /feed/=>`www.ivinco.com/feed/index.html'
...
Keep only the lines containing '=>' and replace '=>' with a space:
[snikolaev@r27 ivinco]$ grep "=>" log|awk '{print $2}'|sed 's/=>/ /g' > map
Now you should have the following in 'map':
[snikolaev@r27 ivinco]$ head -5 map
`www.ivinco.com/index.html'
/robots.txt `www.ivinco.com/robots.txt'
/feed/ `www.ivinco.com/feed/index.html'
/xmlrpc.php `www.ivinco.com/xmlrpc.php'
/wp-content/themes/ivincowp/style.css `www.ivinco.com/wp-content/themes/ivincowp/style.css'
Sometimes the above might not work due to a different wget version etc., and you need to use other commands to normalize the log, e.g.:
cat log|grep "\->"|awk '{print $3 " " $6}'|sed 's/URL://g'|sed 's/"//g' > map
That's enough for the map. You also have all the files crawled by wget in the www.ivinco.com directory:
[snikolaev@r27 ivinco]$ ls www.ivinco.com/
blog contact-us favicon.ico feed index.html robots.txt search services software wp-content wp-includes xmlrpc.php xmlrpc.php?rsd
Since you don't want Sphinx to index css/js/image etc. files, you might want to filter them out. You can use a script like this:
[snikolaev@r27 ivinco]$ cat prepare.sh
for n in `find ./ -type f|grep -v ".gif\|.jpg\|.png\|\.zip\|.css\|.js\|.ico\|.xml\|robots.txt\|xmlrpc.php\|feed"|grep "www.ivinco.com"|sed 's/^\(.\)\{2\}//g'`; do echo -n `grep $n map|awk '{print $1}'|sort -n|uniq`; echo -en "\t"; readlink -f $n; done;
The script outputs, for each crawled document, its original link from the 'map' file built earlier and the full path to the crawled file (obtained with readlink -f). As a result you will have:
[snikolaev@r27 ivinco]$ bash prepare.sh |head -5
/search/ /home/snikolaev/ivinco/www.ivinco.com/search/index.html
/services/custom-full-text-search-solutions/ /home/snikolaev/ivinco/www.ivinco.com/services/custom-full-text-search-solutions/index.html
/services/ /home/snikolaev/ivinco/www.ivinco.com/services/index.html
/services/performance-audit/ /home/snikolaev/ivinco/www.ivinco.com/services/performance-audit/index.html
/contact-us/careers/ /home/snikolaev/ivinco/www.ivinco.com/contact-us/careers/index.html
Save the output to /tmp/data and you're done preparing the data for Sphinx.
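One way to do that, assuming you run the script from the same directory as the crawl and the 'map' file:
bash prepare.sh > /tmp/data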
Indexing
Loading the data from /tmp/data into mysql is very simple:
mysql> create database se;
Query OK, 1 row affected (0.01 sec)
mysql> use se;
Database changed
mysql> create table data (id int primary key auto_increment, url varchar(1024), path varchar(1024));
Query OK, 0 rows affected (0.01 sec)
mysql> load data infile '/tmp/data' into table data (url, path);
Query OK, 93 rows affected (0.00 sec)
Records: 93 Deleted: 0 Skipped: 0 Warnings: 0
Now you have the following in mysql:
mysql> select * from data limit 5;
+----+-------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| id | url | path |
+----+-------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| 1 | /search/ | /home/snikolaev/ivinco/www.ivinco.com/search/index.html |
| 2 | /services/custom-full-text-search-solutions/ | /home/snikolaev/ivinco/www.ivinco.com/services/custom-full-text-search-solutions/index.html |
| 3 | /services/ | /home/snikolaev/ivinco/www.ivinco.com/services/index.html |
| 4 | /services/performance-audit/ | /home/snikolaev/ivinco/www.ivinco.com/services/performance-audit/index.html |
| 5 | /contact-us/careers/ | /home/snikolaev/ivinco/www.ivinco.com/contact-us/careers/index.html |
+----+-------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)
Use the following Sphinx config to index the data:
source se
{
    type            = mysql
    sql_host        = localhost
    sql_user        = root
    sql_pass        =
    sql_db          = se
    sql_query       = select id, url, path from data
    sql_file_field  = path
    sql_query_info  = select * from data where id=$id
}

index se
{
    path        = idx
    source      = se
    html_strip  = 1
}

searchd
{
    listen      = 9306:mysql41
    log         = sphinx.log
    pid_file    = sphinx.pid
}
Note that sql_file_field tells Sphinx where to find the files containing the data to index. Sphinx does the rest: it reads the files itself and indexes their contents; just give it the path.
Index the data (with the indexer tool, e.g. indexer -c sphinx_se.conf se):
Sphinx 1.11-id64-dev (r2650)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file 'sphinx_se.conf'...
indexing index 'se'...
WARNING: collect_hits: mem_limit=0 kb too low, increasing to 24576 kb
collected 58 docs, 0.3 MB
sorted 0.0 Mhits, 100.0% done
total 58 docs, 330606 bytes
total 0.047 sec, 7017149 bytes/sec, 1231.05 docs/sec
total 60 reads, 0.000 sec, 17.3 kb/call avg, 0.0 msec/call avg
total 6 writes, 0.000 sec, 51.1 kb/call avg, 0.1 msec/call avg
Voila! The data is now searchable and you can use any method you like (any API, SphinxQL etc.). I used the 'search' tool with the query 'happen with some latency', which should find the document /blog/sphinx-replication/. It works:
[snikolaev@r27 ~]$ ~/bin/search -c sphinx_se.conf -a "happen with some latency"
Sphinx 1.11-id64-dev (r2650)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file 'sphinx_se.conf'...
index 'se': query 'happen with some latency ': returned 1 matches of 1 total in 0.000 sec
displaying matches:
1. document=43, weight=4681
id=43
url=/blog/sphinx-replication/
path=/home/snikolaev/ivinco/www.ivinco.com/blog/sphinx-replication/index.html
words:
1. 'happen': 1 documents, 1 hits
2. 'with': 26 documents, 86 hits
3. 'some': 10 documents, 29 hits
4. 'latency': 1 documents, 2 hits
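Since searchd in the config above listens on port 9306 with the MySQL wire protocol, you can also query the index over SphinxQL with a regular mysql client and build snippets there as well (CALL SNIPPETS is the SphinxQL counterpart of the BuildExcerpts() API call mentioned earlier). A minimal sketch; the document text passed to CALL SNIPPETS is just a placeholder, and the exact syntax may vary with the Sphinx version:
mysql -h127.0.0.1 -P9306
mysql> SELECT * FROM se WHERE MATCH('happen with some latency');
mysql> CALL SNIPPETS('... text of the document to highlight ...', 'se', 'happen with some latency');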
Now you know how to delegate the work to existing tools to build a simple search engine: wget can crawl the web, and Sphinx can read the crawled data and let you search in it. Of course, the above approach isn't useful for any serious production use as is (it's just a PoC), but if you make an effort you can turn it into something good.