Using wget as a simple web crawler for Sphinx search engine

The following is just a proof of concept showing that it’s possible to build a search engine for a website with a minimal set of tools and knowledge:

  • wget
  • bash
  • Sphinx: the cool sql_file_field directive is used
  • mysql

To make a site’s data searchable you need just a few simple steps:

  • crawl the whole site with wget, preserving its original structure
  • prepare the data for Sphinx
  • index the data with Sphinx

Crawling

Just use wget -r … -o log to crawl the whole site. I’ve crawled www.ivinco.com:

[snikolaev@r27 ivinco]$ wget -r "http://www.ivinco.com" -o log

UPDATE: It turned out that wget -nv -r "http://www.ivinco.com" -o log works better (plain -r sometimes doesn’t produce a usable log; it depends on the wget version etc.)
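As a side note, wget can also skip static assets (images, CSS, JS, archives) already at crawl time, which reduces the filtering work done later in prepare.sh. A minimal sketch, not part of the original recipe, using wget’s standard --reject option:

# Skip common static assets while crawling; --reject (-R) filters
# downloads by file name suffix.
wget -nv -r --reject "gif,jpg,png,zip,css,js,ico" "http://www.ivinco.com" -o log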

Preparing data for Sphinx

Any search result usually consists of at least two things: a snippet and a link to the original document. Now that you have crawled the whole site and have all the contents, making snippets is not a problem: you can just use Sphinx’s BuildExcerpts() function. But you also need a map between each document and its original URL, and wget doesn’t generate such a map, so you have to build it yourself. After the previous step you should have the file ‘log’, where you can find entries like:

...
--05:21:14--  http://www.ivinco.com/feed/
           => `www.ivinco.com/feed/index.html'
...

This is the map you need; it just needs a bit of normalization, i.e. joining each ‘=>’ continuation line onto the URL line above it:

perl -i -pe 'undef $/; s/\n+\s+=\>\s+/=>/smg' log

Now you have:

...
--05:21:14--  http://www.ivinco.com/feed/=>`www.ivinco.com/feed/index.html'
...

Keep only the lines containing ‘=>’ and replace ‘=>’ with a space:

[snikolaev@r27 ivinco]$ grep "=>" log|awk '{print $2}'|sed 's/=>/ /g' > map

Now you should have the following in ‘map’:

[snikolaev@r27 ivinco]$ head -5 map
http://www.ivinco.com/ `www.ivinco.com/index.html'
http://www.ivinco.com/robots.txt `www.ivinco.com/robots.txt'
http://www.ivinco.com/feed/ `www.ivinco.com/feed/index.html'
http://www.ivinco.com/xmlrpc.php `www.ivinco.com/xmlrpc.php'
http://www.ivinco.com/wp-content/themes/ivincowp/style.css `www.ivinco.com/wp-content/themes/ivincowp/style.css'
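As a side note, the normalization and filtering above can be collapsed into a single pass. A sketch of an awk equivalent (my addition; it only handles this old-style log format, and it also strips the backticks and quotes, which the later steps don’t rely on anyway):

# For each '--HH:MM:SS--  <url>' header line remember the URL; for each
# '=>' line print the remembered URL and the cleaned file name.
awk '/^--/ { url = $2 }  /=>/ { gsub(/[`'\'']/, "", $2); print url " " $2 }' log > map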

Sometimes the above might not work due to a different wget version etc., and you need other commands to normalize the log, e.g.:

cat log|grep "\->"|awk '{print $3 " " $6}'|sed 's/URL://g'|sed 's/"//g' > map
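Which of the two commands applies depends on the marker your wget wrote into the log. A quick check (my sketch; old wget releases such as 1.10.x write ‘=>’, newer ones like 1.12+ write ‘->’):

# Inspect the log and report which normalization pipeline to use.
if grep -q '=>' log; then
    echo "old-style log: use the perl + grep pipeline"
elif grep -q -- '->' log; then
    echo "new-style log: use the alternative command above"
fi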

Either way, that’s enough. You also have all the files crawled by wget under www.ivinco.com:

[snikolaev@r27 ivinco]$ ls www.ivinco.com/
blog  contact-us  favicon.ico  feed  index.html  robots.txt  search  services  software  wp-content  wp-includes  xmlrpc.php  xmlrpc.php?rsd

Since you don’t want Sphinx to index CSS/JS/image files and the like, you might want to filter them out. You can use a script like this:

[snikolaev@r27 ivinco]$ cat prepare.sh 
for n in `find ./ -type f | grep -v "\.gif\|\.jpg\|\.png\|\.zip\|\.css\|\.js\|\.ico\|\.xml\|robots\.txt\|xmlrpc\.php\|feed" | grep "www.ivinco.com" | sed 's/^\(.\)\{2\}//g'`
do
    # look up the file's original URL in 'map', print it followed by
    # a tab and the file's absolute path
    echo -n `grep $n map | awk '{print $1}' | sort -n | uniq`
    echo -en "\t"
    readlink -f $n
done

This script outputs each crawled document’s full path (obtained with readlink -f) next to its original URL taken from the ‘map’ file built earlier. As a result you will have:

[snikolaev@r27 ivinco]$ bash prepare.sh |head -5
http://www.ivinco.com/search/	/home/snikolaev/ivinco/www.ivinco.com/search/index.html
http://www.ivinco.com/services/custom-full-text-search-solutions/	/home/snikolaev/ivinco/www.ivinco.com/services/custom-full-text-search-solutions/index.html
http://www.ivinco.com/services/	/home/snikolaev/ivinco/www.ivinco.com/services/index.html
http://www.ivinco.com/services/performance-audit/	/home/snikolaev/ivinco/www.ivinco.com/services/performance-audit/index.html
http://www.ivinco.com/contact-us/careers/	/home/snikolaev/ivinco/www.ivinco.com/contact-us/careers/index.html

Save the output to /tmp/data and you’re done preparing the data for Sphinx.
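For reference, the whole preparation stage fits in a few lines; this is just a recap of the steps above, assuming the old-style log format and the file names used throughout this post:

wget -r "http://www.ivinco.com" -o log                    # crawl the site
perl -i -pe 'undef $/; s/\n+\s+=\>\s+/=>/smg' log         # join the split log lines
grep "=>" log|awk '{print $2}'|sed 's/=>/ /g' > map       # build the url/file map
bash prepare.sh > /tmp/data                               # url <TAB> absolute path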

Indexing

Loading data from /tmp/data to mysql is very simple:

mysql> create database se;
Query OK, 1 row affected (0.01 sec)
mysql> use se;
Database changed
mysql> create table data (id int primary key auto_increment, url varchar(1024), path varchar(1024));
Query OK, 0 rows affected (0.01 sec)
mysql> load data infile '/tmp/data' into table data (url, path);
Query OK, 93 rows affected (0.00 sec)
Records: 93  Deleted: 0  Skipped: 0  Warnings: 0
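If LOAD DATA INFILE is rejected (for example when the server’s secure_file_priv setting restricts which directories it may read from), the client-side LOCAL variant is an alternative. A sketch (my addition; it requires local_infile to be enabled on both client and server):

mysql --local-infile=1 se -e "load data local infile '/tmp/data' into table data (url, path);"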

Now you have the following in mysql:

mysql> select * from data limit 5;
+----+-------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
| id | url                                                               | path                                                                                        |
+----+-------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
|  1 | http://www.ivinco.com/search/                                     | /home/snikolaev/ivinco/www.ivinco.com/search/index.html                                    |
|  2 | http://www.ivinco.com/services/custom-full-text-search-solutions/ | /home/snikolaev/ivinco/www.ivinco.com/services/custom-full-text-search-solutions/index.html |
|  3 | http://www.ivinco.com/services/                                   | /home/snikolaev/ivinco/www.ivinco.com/services/index.html                                  |
|  4 | http://www.ivinco.com/services/performance-audit/                 | /home/snikolaev/ivinco/www.ivinco.com/services/performance-audit/index.html                |
|  5 | http://www.ivinco.com/contact-us/careers/                         | /home/snikolaev/ivinco/www.ivinco.com/contact-us/careers/index.html                        |
+----+-------------------------------------------------------------------+---------------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)

Use the following Sphinx config to index the data:

source se
{
    type = mysql
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db = se
    sql_query = select id, url, path from data
    sql_file_field  = path
    sql_query_info = select * from data where id=$id
}

index se
{
    path = idx
    source = se
    html_strip = 1
}

searchd
{
    listen = 9306:mysql41
    log = sphinx.log
    pid_file = sphinx.pid
}

Note that sql_file_field is used to tell Sphinx where to find the files containing the data to index. Sphinx will do the rest: it reads the files itself and indexes their contents; you just give it the path.
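Since indexer opens those files itself, it is worth checking that every stored path actually exists and is readable by the user who will run it. A small sketch (my addition):

# Print any path from the table that is missing or unreadable;
# -N suppresses the column header in the mysql output.
mysql -N se -e "select path from data" | while read -r f; do
    [ -r "$f" ] || echo "missing or unreadable: $f"
done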

Index the data with the indexer tool.
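The exact command line isn’t shown here, but judging by the config file name in the output it was a standard indexer invocation, presumably:

indexer -c sphinx_se.conf se

The run produced: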

Sphinx 1.11-id64-dev (r2650)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file 'sphinx_se.conf'...
indexing index 'se'...
WARNING: collect_hits: mem_limit=0 kb too low, increasing to 24576 kb
collected 58 docs, 0.3 MB
sorted 0.0 Mhits, 100.0% done
total 58 docs, 330606 bytes
total 0.047 sec, 7017149 bytes/sec, 1231.05 docs/sec
total 60 reads, 0.000 sec, 17.3 kb/call avg, 0.0 msec/call avg
total 6 writes, 0.000 sec, 51.1 kb/call avg, 0.1 msec/call avg

Voila! The data is now searchable and you can use it any way you like (any API, SphinxQL, etc.). I used the ‘search’ tool with the query ‘happen with some latency’, which should find the document http://www.ivinco.com/blog/sphinx-replication/. It works:

[snikolaev@r27 ~]$ ~/bin/search -c sphinx_se.conf -a "happen with some latency"
Sphinx 1.11-id64-dev (r2650)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file 'sphinx_se.conf'...
index 'se': query 'happen with some latency ': returned 1 matches of 1 total in 0.000 sec

displaying matches:
1. document=43, weight=4681
	id=43
	url=http://www.ivinco.com/blog/sphinx-replication/
	path=/home/snikolaev/ivinco/www.ivinco.com/blog/sphinx-replication/index.html

words:
1. 'happen': 1 documents, 1 hits
2. 'with': 26 documents, 86 hits
3. 'some': 10 documents, 29 hits
4. 'latency': 1 documents, 2 hits
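Since the searchd section above listens on port 9306 with the MySQL wire protocol (listen = 9306:mysql41), the same search can also be issued over SphinxQL with a stock mysql client. A sketch (my addition; searchd must be running with this config):

# Start the daemon, then query the 'se' index over the MySQL protocol.
searchd -c sphinx_se.conf
mysql -h127.0.0.1 -P9306 --protocol=tcp -e "select * from se where match('happen with some latency');"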

Now you know how to delegate things to existing tools to build a simple search engine: wget can crawl the web, and Sphinx can read the crawled data and let you search it. Of course, the above algorithm isn’t useful for serious production use as is, it’s just a PoC, but with some effort you can make something good out of it.

Comments

ssnobben, June 6th, 2011 at 1:56 pm

Can you make a similar description of how to use it for a production site?

Would be interesting to see a production scenario using Sphinx..

rgds

Sergey Nikolaev, June 6th, 2011 at 2:01 pm

Do you mean crawling some real site using the above approach, or implementing crawling via wget+Sphinx and incorporating it into some site?

Andy Agarwal, July 21st, 2011 at 11:25 pm

Does the indexer pay attention to HTML tags and apply weightage to content based on the tags surrounding it, e.g. does content in heading tags get higher weightage than plain paragraph content?
If not, how would one do that?

Sergey Nikolaev, July 22nd, 2011 at 3:24 am

Hello Andy. Sphinx cannot give higher weight to content inside specific HTML tags.
But since 2.0.1 it’s possible to use http://sphinxsearch.com/docs/manual-2.0.1.html#conf-index-zones and http://sphinxsearch.com/docs/manual-2.0.1.html#extended-syntax to limit the search to specified HTML tags. For example, if you add the following to your index config:

index_zones = h*, th, title

and then issue the extended query ‘ZONE:(h3,h4) cat’, the keyword will be searched only inside H3 and H4 tags. This way you can make two sequential queries to Sphinx (the first searching only inside the needed zones, the second searching everywhere) and then merge the results in your application.
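(A sketch to illustrate the reply above, not part of the original comment: with index_zones configured as shown and searchd listening on the SphinxQL port, such a zone-limited query could be issued like this.)

mysql -h127.0.0.1 -P9306 --protocol=tcp -e "select * from se where match('ZONE:(h3,h4) cat');"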

Andy Agarwal, July 22nd, 2011 at 3:27 am

Thanks Sergey! I’ll try that.

armrha, October 13th, 2011 at 6:14 pm

My wget doesn’t seem to output the log in the:

--05:21:14--  http://www.ivinco.com/feed/
           => `www.ivinco.com/feed/index.html'

format, which seems to break everything. What version of wget is used here? Am I just doing something really wrong? I just get (for example):

--2011-10-13 11:04:46--  http://www.ivinco.com/feed/
Reusing existing connection to http://www.ivinco.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 55732 (54K) [text/xml]
Saving to: `www.ivinco.com/feed/index.html'

0K .......... .......... .......... .......... .......... 91% 300K 0s
50K .... 100% 14.5M=0.2s

2011-10-13 11:04:46 (326 KB/s) - `www.ivinco.com/feed/index.html' saved [55732/55732]

--2011-10-13 11:04:46--  http://www.ivinco.com/xmlrpc.php
Reusing existing connection to http://www.ivinco.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 42 [text/plain]
Saving to: `www.ivinco.com/xmlrpc.php'

0K 100% 8.03M=0s

without the links to map, which then messes me up on later steps. Not sure what to do.

Sergey Nikolaev, October 24th, 2011 at 1:16 pm

Hi armrha. Can you show the way you execute the wget command?
The version I was using is:

[snikolaev@r27 ~]$ wget -h|head -1
GNU Wget 1.10.2 (Red Hat modified), a non-interactive network retriever.

Mukko, February 6th, 2012 at 10:56 am

I get the same problem as armrha, using
GNU Wget 1.12, a non-interactive network retriever.
Any idea how to downgrade?

Sergey Nikolaev, February 7th, 2012 at 6:37 am

Hello Mukko

Can you show the way you execute the wget command? Just 'wget -r http://some_site.com -o log'? I want to reproduce it on my side to update the article.

Mukko, February 8th, 2012 at 10:56 am

Hi Sergey,
I just used the same command as you mentioned:
wget -r "http://www.yourdomain.com" -o log

Seems wget from version 1.12 has a different logging pattern…

Sergey Nikolaev, February 10th, 2012 at 12:13 pm

Hi Mukko

I’ve managed to reproduce the problem with wget 1.13. Yes, the issue is related to a different logging pattern. I’ll try to invent something to make the algo universal next week.

Sergey Nikolaev, February 20th, 2012 at 3:55 am

We’ve found out that 'wget -nv -r http://www.ivinco.com' might work better since its output is almost exactly what is needed:

[snikolaev@r27 ivinco]$ wget -nv -r http://www.ivinco.com
22:53:47 URL:http://www.ivinco.com/ [10370/10370] -> "www.ivinco.com/index.html" [1]
22:53:48 URL:http://www.ivinco.com/robots.txt [24/24] -> "www.ivinco.com/robots.txt" [1]
22:53:49 URL:http://www.ivinco.com/feed/ [36282/36282] -> "www.ivinco.com/feed/index.html" [1]
22:53:50 URL:http://www.ivinco.com/xmlrpc.php [42/42] -> "www.ivinco.com/xmlrpc.php" [1]
22:53:50 URL:http://www.ivinco.com/wp-content/themes/ivincowp/homepage.css [5042/5042] -> "www.ivinco.com/wp-content/themes/ivincowp/homepage.css" [1]

Jason, February 21st, 2013 at 5:38 pm

How do you use BuildExcerpts with this system?

Sergey Nikolaev, February 22nd, 2013 at 4:53 am

To use BuildExcerpts() you need a script that uses the Sphinx API instead of just
~/bin/search -c sphinx_se.conf -a "happen with some latency"

The script should:
1) fetch the documents from the DB/filesystem by the ids (or, in this case, paths) Sphinx returned
2) feed the texts along with the keywords to BuildExcerpts() to do the highlighting. You can also use SphinxQL’s CALL SNIPPETS command.

Here’s an example (using SphinxQL):
First you need to start searchd:
[snikolaev@dev01 tmp]$ searchd -c sphinx_se.conf
Sphinx 2.0.7-id64-dev (rel20-r3560)

Then you can fire a query like this:

[snikolaev@dev01 tmp]$ for n in `search -c sphinx.conf -a "happen with some latency"|grep path|awk -F= '{print $2}'`; do mysql -hlocalhost -P9306 --proto=tcp -e "call snippets('$n', 'se', 'happen with some latency', 1 as load_files, 'strip' as html_strip_mode)\G"; done;
*************************** 1. row ***************************
snippet:  ... 'search' tool and query 'happen with some latency' which should find me ... /search -c sphinx_se.conf -a "happen with some latency" Sphinx 1.11-id64-dev ...  'sphinx_se.conf'... index 'se': query 'happen with some latency ': returned 1 matches of 1 ... 
*************************** 1. row ***************************
snippet:  ...  some reason you want to combine the both of the above with ...  end, but it may happen with some latency, because some server is a bit more ...  way which would provide minimal latency: make indexing only in one ...  very short period of time, some few seconds. ChristianSeptember 23rd, 2011 ... 

This is an ugly solution and nobody should do anything like it in production; it’s just meant to demonstrate the abilities of Sphinx and other Linux tools.

sudjono af, April 16th, 2016 at 12:50 am

A question: can these tools be used on a WordPress site?
My site http://www.pesonawisataindonesia.com uses WordPress.
Thank you.

Sergey Nikolaev, April 18th, 2016 at 2:09 am

Yes, this technique can be used with any site, but for WordPress there are a lot of search extensions, e.g. our plugin: https://www.ivinco.com/software/wordpress-sphinx-search/

muchtar, November 27th, 2016 at 2:45 pm

How can I apply this to a basic Blogger site? Can I try it on my site http://www.khazzanahhaji.com?

Thanks for your attention.

Sergey Nikolaev, November 28th, 2016 at 1:33 am

Hello Muchtar

As I wrote, “you can use it with any method (any api, sphinxql etc.)”. I.e. if you have the index and even the ‘search’ tool works as you wish, all that is left is to integrate search functionality into your site. Sphinx supports the MySQL protocol (it’s called SphinxQL; you can find more in the Sphinx documentation), which enables integration of Sphinx with virtually any programming language or platform.


Sergey Nikolaev, March 30th, 2017 at 5:54 am

It even supports an HTTP interface now.
