I installed Anemone, a static, stateless crawler for Ruby.
First, I installed Nokogiri.
https://normalblog.net/system/nokogiri-install/
Next, I installed Anemone.
[user@localhost ~]$ gem install anemone
Fetching: robotex-1.0.0.gem (100%)
Successfully installed robotex-1.0.0
Fetching: anemone-0.7.2.gem (100%)
Successfully installed anemone-0.7.2
Parsing documentation for robotex-1.0.0
Installing ri documentation for robotex-1.0.0
Parsing documentation for anemone-0.7.2
Installing ri documentation for anemone-0.7.2
Done installing documentation for robotex, anemone after 0 seconds
2 gems installed
That completed the installation.
Usage
Getting links
The following Ruby script prints the links found up to one level deep from a given URL on a site.
# -*- coding: utf-8 -*-
require 'anemone'

Anemone.crawl("http://test.test", :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end
Running it gives:
[user@localhost crawler]$ ruby test.rb
http://test.test/1/
http://test.test/2/
http://test.test/3/
The output looked like this. The URL (test.test) is just a placeholder I made up.
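To illustrate what a depth limit of 1 means, here is a minimal sketch that walks a small in-memory link graph breadth-first and stops following links once the limit is reached. The SITE_LINKS hash and the crawl_order method are hypothetical names for this example only, and the sketch assumes the start page counts as depth 0. It is not Anemone's actual implementation.

```ruby
require 'set'

# Hypothetical in-memory "site": each URL maps to the URLs it links to.
SITE_LINKS = {
  'http://test.test/'     => ['http://test.test/1/', 'http://test.test/2/'],
  'http://test.test/1/'   => ['http://test.test/1/a/'],
  'http://test.test/2/'   => ['http://test.test/'],
  'http://test.test/1/a/' => []
}.freeze

# Breadth-first walk; links on a page are followed only while the
# page's depth is below depth_limit (the start page is depth 0).
def crawl_order(start_url, depth_limit, links = SITE_LINKS)
  visited = Set.new([start_url])
  queue = [[start_url, 0]]
  order = []
  until queue.empty?
    url, depth = queue.shift
    order << url
    next if depth >= depth_limit

    links.fetch(url, []).each do |link|
      next if visited.include?(link) # skip pages already queued or visited

      visited << link
      queue << [link, depth + 1]
    end
  end
  order
end

puts crawl_order('http://test.test/', 1)
```

With a limit of 1 this visits the start page and the pages it links to directly. With a limit of 0 it visits only the start page.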
Getting img src
I accessed the top page of this blog and, from the list of articles, extracted the href of each a tag and the src of each thumbnail image.
# -*- coding: utf-8 -*-
require 'anemone'
require 'nokogiri'
require 'kconv'

urls = ["http://normalblog.net/system"]
# depth_limit 0: fetch only the start page, follow no links
Anemone.crawl(urls, :depth_limit => 0) do |anemone|
  anemone.on_every_page do |page|
    # convert the body to UTF-8 before parsing
    doc = Nokogiri::HTML.parse(page.body.toutf8)
    entries = doc.xpath("//*[@class='entry-thumb']")
    entries.each do |entry|
      # href of the a tag
      puts 'href = ' + entry.xpath("a").attribute("href").text
      # src of the img inside the a tag
      puts 'src = ' + entry.xpath("a/img").attribute("src").text
    end
  end
end
Output:
[user@localhost crawler]$ ruby scraping-img-src.rb
href = https://normalblog.net/system/ruby/crawler-anemone/
src = https://normalblog.net/system/wp-content/uploads/2016/10/ruby1-150x150.jpg
href = https://normalblog.net/system/ruby/nokogiri-install/
src = https://normalblog.net/system/wp-content/uploads/2016/10/ruby1-150x150.jpg
href = https://normalblog.net/system/linux/elinks/
src = https://normalblog.net/system/wp-content/uploads/2015/12/221.png
href = https://normalblog.net/system/wordpress/permission/
src = https://normalblog.net/system/wp-content/uploads/2015/12/WordPress-150x150.png
href = https://normalblog.net/system/mysql/binlog-innodblog/
src = https://normalblog.net/system/wp-content/uploads/2016/02/MySQLNewLogo1-150x150.png
href = https://normalblog.net/system/mysql/mysqlbinlog/
src = https://normalblog.net/system/wp-content/uploads/2016/02/MySQLNewLogo1-150x150.png
href = https://normalblog.net/system/amp/amazon-afiliate/
src = https://normalblog.net/system/wp-content/uploads/2016/11/amp.png
href = https://normalblog.net/system/amp/error-yen/
src = https://normalblog.net/system/wp-content/uploads/2016/11/amp.png
href = https://normalblog.net/system/%e6%ad%a3%e8%a6%8f%e8%a1%a8%e7%8f%be/regular-expression/
src = https://normalblog.net/system/wp-content/uploads/2016/11/seiki-150x125.png
href = https://normalblog.net/system/dos/dos-command/
src = https://normalblog.net/system/wp-content/uploads/2016/11/dos-147x150.png
href = https://normalblog.net/system/rbenv/cron-ruby-command-not-found/
src = https://normalblog.net/system/wp-content/uploads/2016/10/ruby1-150x150.jpg
href = https://normalblog.net/system/vim/command/
src = https://normalblog.net/system/wp-content/uploads/2016/11/Vim.png
href = https://normalblog.net/system/mysql/alter-table-engine/
src = https://normalblog.net/system/wp-content/uploads/2016/02/MySQLNewLogo1-150x150.png
href = https://normalblog.net/system/mysql/myisam-state-locked/
src = https://normalblog.net/system/wp-content/uploads/2016/02/MySQLNewLogo1-150x150.png
href = https://normalblog.net/system/google/map-photo-30000/
src = https://normalblog.net/system/wp-content/uploads/2016/11/google2-150x150.png
href = https://normalblog.net/system/google/map-photo/
src = https://normalblog.net/system/wp-content/uploads/2016/11/google-150x150.png
href = https://normalblog.net/system/vagrant/memory/
src = https://normalblog.net/system/wp-content/uploads/2015/11/vagrant-150x150.png
href = https://normalblog.net/system/wordpress/apache-setenv/
src = https://normalblog.net/system/wp-content/uploads/2015/12/WordPress-150x150.png
href = https://normalblog.net/system/seminar_conference/phpconf2016/
src = https://normalblog.net/system/wp-content/uploads/2016/11/CwTp-IiVIAE10DK-1-150x150.jpg
href = https://normalblog.net/system/wordpress/iroiro/
src = https://normalblog.net/system/wp-content/uploads/2015/12/WordPress-150x150.png
href = https://normalblog.net/system/ruby/mymemo/
src = https://normalblog.net/system/wp-content/uploads/2016/10/ruby1-150x150.jpg
href = https://normalblog.net/system/mydev/crawler/
src = https://normalblog.net/system/wp-content/uploads/2016/11/normalblog-150x150.png
href = https://normalblog.net/system/searchconsole/notfound/
src = https://normalblog.net/system/wp-content/uploads/2016/10/notfound-150x91.png
href = https://normalblog.net/system/rbenv/install/
src = https://normalblog.net/system/wp-content/uploads/2016/10/ruby1-150x150.jpg
href = https://normalblog.net/system/wget/comand/
src = https://normalblog.net/system/wp-content/uploads/2016/10/wget-150x113.png
It looks like a lot more can be done with it, so I'll keep experimenting.
