U-Crawler

U-Crawler stands for Url-Crawler.

Concurrently crawls search-result URLs from baidu, yahoo, bing, and 360so. Google is not supported for now: its anti-crawling defenses are too strong, demanding verification before even a hundred results can be fetched, so it was dropped. Project address

Dependencies

gevent
requests
BeautifulSoup
lxml
urlparse
optparse
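
The dependency list sketches the architecture: gevent provides the concurrency, requests fetches the result pages, BeautifulSoup with the lxml parser pulls the links out, and urlparse/optparse handle URL and command-line parsing. Below is a minimal sketch of how the fetching pieces could fit together; the engine URL templates and the link selector are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch (not the project's code): crawl several engines
# concurrently with gevent and extract links with BeautifulSoup.
import gevent.monkey
gevent.monkey.patch_all()  # make requests' blocking sockets cooperative

import gevent
import requests
from bs4 import BeautifulSoup

# Hypothetical result-page templates; the real project builds these
# per engine, with paging parameters to honor the -l limit.
ENGINES = {
    'bing': 'https://www.bing.com/search?q=%s',
    'baidu': 'https://www.baidu.com/s?wd=%s',
}

def fetch_urls(engine, query):
    """Fetch one result page and return every link found on it."""
    resp = requests.get(ENGINES[engine] % query, timeout=10)
    soup = BeautifulSoup(resp.text, 'lxml')
    return [a['href'] for a in soup.find_all('a', href=True)]

# One greenlet per engine, so all engines are crawled concurrently.
jobs = [gevent.spawn(fetch_urls, name, 'inurl:login.php') for name in ENGINES]
gevent.joinall(jobs)
for job in jobs:
    print(job.value[:5])  # a few links from each engine
```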

Usage

```
Usage: U-Crawler.py [-q] query [--limit] number [-o] filename

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -q QUERY, --query=QUERY
                        The query of search engine.
  -l LIMIT, --limit=LIMIT
                        The limit of each search engine.
  -o NAME, --output=NAME
                        If not use -o, the filename of output is time string.
  -b, --baseurl         The url of writing in file, if it is set, the url will
                        remove path and param.
```

```
U-Crawler.py -q inurl:login.php -l 100
```

The -q option is the search query (search-engine syntax such as inurl: is supported), and -l is the number of results to fetch from each search engine.
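
The -q/-l handling, and the rest of the options, map directly onto optparse. A minimal sketch that would reproduce the help output above; the defaults and version string are assumptions, not taken from the project.

```python
# Sketch of an optparse setup matching the help output above.
# Defaults and the version string are assumptions, not the project's code.
from optparse import OptionParser

parser = OptionParser(
    usage='Usage: %prog [-q] query [--limit] number [-o] filename',
    version='%prog 1.0')  # assumed version string
parser.add_option('-q', '--query', dest='query',
                  help='The query of search engine.')
parser.add_option('-l', '--limit', dest='limit', type='int', default=100,
                  help='The limit of each search engine.')
parser.add_option('-o', '--output', dest='name',
                  help='If not use -o, the filename of output is time string.')
parser.add_option('-b', '--baseurl', action='store_true', default=False,
                  help='The url of writing in file, if it is set, the url '
                       'will remove path and param.')

options, args = parser.parse_args()
print(options.query, options.limit, options.name, options.baseurl)
```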

```
U-Crawler.py -q inurl:login.php -l 10 -b -o login.txt
```

The -b option strips the trailing path and parameters from each URL before it is written; by default nothing is stripped. The -o option sets the filename for saved results; by default the file is named after the time the run started.
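
What -b does can be illustrated with the urlparse module from the dependency list, under the assumption that removing "path and param" means reducing each URL to scheme://host/ (a sketch, not the project's code):

```python
# Sketch of the -b/--baseurl reduction: keep only scheme and host,
# dropping path, query string, and fragment. Uses the Python 2
# urlparse module named in the dependency list.
from urlparse import urlparse

def base_url(url):
    """Reduce a result URL to scheme://netloc/."""
    parts = urlparse(url)
    return '%s://%s/' % (parts.scheme, parts.netloc)

print(base_url('http://example.com/admin/login.php?next=/'))
# -> http://example.com/
```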