Today we will learn how to do web scraping in Ruby using Watir.
Web Scraping in Ruby: Complete Guide
The need to scrape data by companies and individuals has increased in recent years, and Ruby is one of the best programming languages for this purpose. Web scraping using Ruby is simply building a script that can automatically retrieve data from the web, and you can then use the extracted data however you like.
What Is Web Scraping?
Web scraping is the automated process of extracting data from websites. It involves writing a computer program or using specialized software to retrieve information from web pages and convert it into a structured format that can be analyzed and used for various purposes.
Web scraping allows you to gather data from multiple websites efficiently and quickly, without the need for manual copying and pasting. The scraped data can include text, images, links, tables, and other types of content present on the web pages.
Here’s a high-level overview of how web scraping typically works:
- Sending HTTP requests: The web scraper sends an HTTP request to the target website’s server, requesting the HTML content of a specific web page.
- Fetching the HTML content: The server responds to the request and sends back the HTML code of the web page.
- Parsing the HTML: The web scraper processes the received HTML code and extracts the desired data using various techniques such as DOM parsing or regular expressions.
- Cleaning and formatting: The extracted data may contain unwanted tags, formatting, or noise. The scraper cleans and formats the data to make it usable and consistent.
- Storing or analyzing the data: Once the data is extracted and cleaned, it can be stored in a database, exported to a spreadsheet, or used for further analysis and processing.
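The steps above can be sketched in plain Ruby. In this minimal illustration the fetching step is stubbed with a static HTML string (so no network request is made; in a real scraper the string would come from an HTTP response body, e.g. via Net::HTTP), and the parsing step uses a regex for brevity, where a real scraper would normally use a DOM parser:

```ruby
# Steps 1-2 (sending the request and fetching HTML) are stubbed here:
# in a real scraper this string would be an HTTP response body.
html = <<~HTML
  <ul class="products">
    <li class="product"><a href="/shop/bulbasaur">Bulbasaur</a> <span class="price">£63.00</span></li>
    <li class="product"><a href="/shop/ivysaur">Ivysaur</a> <span class="price">£87.00</span></li>
  </ul>
HTML

# Step 3 (parsing): extract each product's link, name, and price.
products = html.scan(%r{<a href="([^"]+)">([^<]+)</a>\s*<span class="price">([^<]+)</span>}).map do |href, name, price|
  # Step 4 (cleaning): strip whitespace and normalise the price to a number.
  { url: href, name: name.strip, price: price.delete("£").to_f }
end

# Step 5 (storing/analysing): here we simply print the structured records.
products.each { |p| puts "#{p[:name]} (#{p[:url]}): #{p[:price]}" }
```

The product names and prices above are made-up placeholders; only the overall request-parse-clean-store flow is the point.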
Web scraping has a wide range of applications. It is used by businesses to collect market research data, monitor competitors’ prices, track product reviews, or gather customer feedback. Researchers and journalists may utilize web scraping to gather data for analysis or to extract relevant information from a large number of web pages. However, it’s important to note that web scraping should be done responsibly, adhering to legal and ethical guidelines, and respecting website terms of service and the privacy of users.
In this step-by-step tutorial, you’ll learn how to do web scraping with Ruby using libraries like Watir.
Let’s first set up the environment for Ruby.
1) Installing Ruby
sudo apt-get update
sudo apt-get install git-core curl zlib1g-dev build-essential libssl-dev libreadline-dev libyaml-dev libxml2-dev libxslt1-dev libcurl4-openssl-dev libffi-dev
git clone https://github.com/rbenv/rbenv.git ~/.rbenv
echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(rbenv init -)"' >> ~/.bashrc
exec $SHELL
git clone https://github.com/rbenv/ruby-build.git ~/.rbenv/plugins/ruby-build
echo 'export PATH="$HOME/.rbenv/plugins/ruby-build/bin:$PATH"' >> ~/.bashrc
exec $SHELL
rbenv install 3.1.2
rbenv global 3.1.2
2) Installing Chrome browser
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install ./google-chrome-stable_current_amd64.deb
3) Installing ChromeDriver
# You can find the latest ChromeDriver on its official download page. The ChromeDriver version must match the Chrome browser version installed above.
# Now execute the commands below to configure ChromeDriver on your system.
sudo mv chromedriver /usr/bin/chromedriver
sudo chown root:root /usr/bin/chromedriver
sudo chmod +x /usr/bin/chromedriver
4) Installing Xvfb
# Xvfb provides a virtual display so the browser can run without a visible window.
sudo apt-get install xvfb
5) Ruby Gems
gem install watir
gem install headless
gem install csv
How to Scrape a Website in Ruby
Let’s use ScrapeMe as our target website. Our Ruby spider will visit each product page and retrieve its product data.
# -*- coding: utf-8 -*-
require 'watir'
require 'headless'
require 'csv'

# Start a virtual display so the browser runs without a visible window
@headless = Headless.new
@headless.start

@browser = Watir::Browser.new
@browser.goto "https://scrapeme.live/shop/"

# Collect the URL of every product link on the page
urls = @browser.links(css: 'ul.products li.product a.woocommerce-LoopProduct-link').collect(&:href)
puts urls

@browser.close
@headless.destroy
Now save the above script as scraping.rb. To run it, open the terminal and execute:
ruby scraping.rb
All product URLs will be displayed on the terminal after running the code.
Now let’s scrape the products one by one to get the product name, price, image, etc.
# -*- coding: utf-8 -*-
require 'watir'
require 'headless'
require 'csv'

@headless = Headless.new
@headless.start

@browser = Watir::Browser.new
@browser.goto "https://scrapeme.live/shop/"

# Collect the URL of every product link on the page
urls = @browser.links(css: 'ul.products li.product a.woocommerce-LoopProduct-link').collect(&:href)
puts urls

# Visit each product page and extract its details
urls.each do |link|
  @browser.goto link
  sleep 2  # give the page time to load
  title = @browser.h1(class: 'product_title').text
  price = @browser.p(class: 'price').text
  image = @browser.div(class: 'woocommerce-product-gallery__image').img.src
  puts link, title, price, image
end

@browser.close
@headless.destroy
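The scripts require the csv library but never use it. As a sketch of the final "storing" step, the scraped fields could be written to a CSV file as shown below. The records here are hard-coded placeholders; in the real script each row would be built from link, title, price, and image inside the urls.each loop:

```ruby
require 'csv'

# Placeholder records standing in for the values scraped by Watir above.
products = [
  { url: "https://scrapeme.live/shop/bulbasaur/", title: "Bulbasaur", price: "£63.00" },
  { url: "https://scrapeme.live/shop/ivysaur/",   title: "Ivysaur",   price: "£87.00" }
]

# Write a header row followed by one row per product.
CSV.open("products.csv", "w") do |csv|
  csv << ["url", "title", "price"]
  products.each { |p| csv << [p[:url], p[:title], p[:price]] }
end

puts File.read("products.csv")
```

Opening products.csv in a spreadsheet then gives one product per row with the columns url, title, and price.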