Web Scraping dynamic & static websites in Nodejs and Reactjs


Introduction

Web scraping is a technique for extracting data from websites, and a piece of code or software that does the extraction is known as a scraper. Approaches to web scraping include plain copy-and-paste, text pattern matching, HTTP programming, HTML parsing, DOM parsing, vertical aggregation, semantic annotation recognition, and computer-vision web-page analysis. Depending on the data to be scraped, different software uses different combinations of these methods. In this particular article, we will concentrate on HTML parsing.

We can create scrapers in languages like JavaScript, Python, Java, PHP, Golang, etc. In this article we will use JavaScript in both NodeJS and ReactJS.

Why Web Scraping?

In contrast to screen scraping, which only replicates what the pixels on a screen show, scraping bots retrieve the underlying HTML code and, with it, the content of the database behind a site. A business can learn which advertisement would be most appropriate for which online customers by scraping webpages; this earns hits that convert more often while saving marketing spend.

The scraped data is later used for analytics, comparisons, investment choices, hiring decisions, and other purposes. For instance, if you launch an online health portal and populate it with information about all the surrounding hospitals, pharmacies, nursing homes, and physicians, you will attract a lot of traffic to your website. Such websites must be kept updated with breaking news and whatever popular content users are currently searching for online.

The simplest way to satisfy your company’s need for data is to outsource your online data extraction requirements to a service provider. You don’t need separate staff to handle data problems when your provider assists with extraction and cleaning: you give them your specifications, then sit back while they deliver your data.

DIY web scraping programs may not be able to interpret more advanced rendering techniques and are frequently too slow for large-scale extraction, but those who lack the funds for an in-house web scraping team often have to make do with such simple DIY tools.

Caution When Web Scraping

A scraper can be built, and data extracted, using a variety of techniques and technologies, but you should take care of a few legal matters so that you do not run afoul of the law. Any website you intend to scrape will usually publish a robots.txt file describing what may be crawled; scraping in a way that violates these rules can lead to lawsuits and penalties. To avoid contributing to traffic spikes and server outages, it is also preferable to scrape data during off-peak hours.
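Before scraping a site, you can check its robots.txt programmatically. Here is a minimal sketch using axios (the checkRobots helper is hypothetical, not part of this article’s scrapers) –

const axios = require("axios");

// Sketch: fetch and print a site's robots.txt so you can review its rules
const checkRobots = (origin) => {
  axios
    .get(origin + "/robots.txt")
    .then((res) => console.log(res.data)) // inspect the Disallow entries
    .catch(() => console.log("No robots.txt found at " + origin));
};

checkRobots("https://akashmittal.com");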

Getting Started

Start by determining the type of information you need to address your problem. You must answer questions like, “Do you have an example of the sort of data?” or “Which websites would be most useful to you if they were scraped?” After that, you can decide how to complete the task.

For static websites, we will be using React, Nodejs, Cheerio, and Axios.

For dynamic websites, such as those built with React or Angular, we will use Nodejs, Cheerio, and another library called Nightmarejs. NightmareJS is a headless browser automation library used to automate browsing tasks for sites that don’t have APIs.

Web Scraping Static Websites Code Example

First, we need to install Axios and Cheerio in our node environment –

Installing Cheerio –

npm install cheerio

Installing Axios –

# Using npm:
npm install axios

# Using bower:
bower install axios

# Using yarn:
yarn add axios

Here is the complete code –

import { useState, useEffect } from "react";

const Scrapper = () => {
  const cheerio = require("cheerio");
  const axios = require("axios");
  const [data, setData] = useState([]);

  // Fetch once on mount; calling axios in the component body would fire
  // a new request (and a setData re-render loop) on every render.
  useEffect(() => {
    const links = [];
    axios.get("https://akashmittal.com/").then((urlResponse) => {
      const $ = cheerio.load(urlResponse.data);

      $("div.cs-homepage-category").each((i, element) => {
        const link = $(element).find("a.cs-overlay-link").attr("href");
        links.push(link);
      });
      // console.log(links)
      setData(links);
    });
  }, []);

  return (
    <ul>
      {data.map((read) => {
        return (
          <li key={read}>
            <a href={read}>{read}</a>
          </li>
        );
      })}
    </ul>
  );
};
export default Scrapper;

import Scrapper from "./Scrapper";
import "./styles.css";

export default function App() {
  return (
    <div className="App">
      <h2>Web Scraping data from our main website</h2>
      <div>
        <Scrapper />
      </div>

      <h3>Click above for the scraped data links!</h3>
    </div>
  );
}

Let’s break it down and understand it piece by piece.

axios is used to fetch the source code of a website in the form of a string. Here we are using our main domain – https://akashmittal.com.
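To see what axios actually returns, here is a standalone sketch (separate from the component code below) that logs the type and the first characters of the response –

const axios = require("axios");

// The response body of an HTML page arrives as one long string
axios.get("https://akashmittal.com/").then((res) => {
  console.log(typeof res.data);        // "string"
  console.log(res.data.slice(0, 100)); // first 100 characters of the markup
});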

cheerio converts that string into parse-able elements. With its help, we can pick elements based on their ids, classes, tags, etc.
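As a quick standalone illustration (the markup here is made up for the example), cheerio lets you query a loaded document much like jQuery –

const cheerio = require("cheerio");

// Load a small HTML string and pick elements by tag, class, and id
const $ = cheerio.load('<div id="box"><p class="msg">Hello</p><p>World</p></div>');

console.log($("p").length);             // 2, selected by tag
console.log($(".msg").text());          // "Hello", selected by class
console.log($("#box p").last().text()); // "World", selected by id + descendant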

In this code, we fetch a few anchor links from the scraped webpage. To store these links, we use a temporary array links and a state variable data, and we run the fetch inside a useEffect with an empty dependency array so it executes only once, on mount.

import { useState, useEffect } from "react";

const Scrapper = () => {
  ...
  ...
  const [data, setData] = useState([]);

  useEffect(() => {
    const links = [];
    ...
    ...
  }, []);
};
export default Scrapper;


In the next step, we are going to fetch the source code using axios –

import { useState, useEffect } from "react";

const Scrapper = () => {
  ...
  const axios = require("axios");
  const [data, setData] = useState([]);

  useEffect(() => {
    const links = [];

    axios.get("https://akashmittal.com/").then((urlResponse) => {
      ...
      ...
    });
  }, []);

  return (
    ...
  );
};
export default Scrapper;

Here we call the axios.get() function; the response is received in the urlResponse parameter.

Now that we have the source code, the next step is to make it parse-able; this is where Cheerio comes in. The cheerio.load() function takes the HTML source string and converts it into an internal DOM tree. The returned value exposes jQuery-flavored semantics for querying it.

import { useState, useEffect } from "react";

const Scrapper = () => {
  const cheerio = require("cheerio");
  const axios = require("axios");
  const [data, setData] = useState([]);

  useEffect(() => {
    const links = [];

    axios.get("https://akashmittal.com/").then((urlResponse) => {
      const $ = cheerio.load(urlResponse.data);

      ...
    });
  }, []);

  return (
    ...
  );
};
export default Scrapper;

We now have the whole DOM tree available; it’s time to pick some anchor urls. If you open the source code of https://akashmittal.com, you will find the links inside divs with class cs-homepage-category, and the anchors themselves have class cs-overlay-link. Check this image –

[Image: scraping anchor tags on akashmittal.com]

With the help of a loop, we fetch all the div blocks with class cs-homepage-category and take their child anchors with class cs-overlay-link. Then we store those urls in our links array and in the data state variable to trigger a re-render.

import { useState, useEffect } from "react";

const Scrapper = () => {
  const cheerio = require("cheerio");
  const axios = require("axios");
  const [data, setData] = useState([]);

  useEffect(() => {
    const links = [];

    axios.get("https://akashmittal.com/").then((urlResponse) => {
      const $ = cheerio.load(urlResponse.data);

      $("div.cs-homepage-category").each((i, element) => {
        const link = $(element).find("a.cs-overlay-link").attr("href");
        links.push(link);
      });
      // console.log(links)
      setData(links);
    });
  }, []);

  return (
    ...
  );
};
export default Scrapper;

The last step is displaying the scraped links. We render a <ul> and map each url in data to an <li> containing an anchor, giving every item a key so React can track the list –

import { useState, useEffect } from "react";

const Scrapper = () => {
  const cheerio = require("cheerio");
  const axios = require("axios");
  const [data, setData] = useState([]);

  useEffect(() => {
    const links = [];
    axios.get("https://akashmittal.com/").then((urlResponse) => {
      const $ = cheerio.load(urlResponse.data);

      $("div.cs-homepage-category").each((i, element) => {
        const link = $(element).find("a.cs-overlay-link").attr("href");
        links.push(link);
      });
      // console.log(links)
      setData(links);
    });
  }, []);

  return (
    <ul>
      {data.map((read) => {
        return (
          <li key={read}>
            <a href={read}>{read}</a>
          </li>
        );
      })}
    </ul>
  );
};
export default Scrapper;

Import the Scrapper.jsx file into App.js and re-run your React application. The output will look like this –

[Image: result of web scraping akashmittal.com]


Web Scraping Dynamic Websites Code Example

Let’s set up the project. You should have Node.js installed for this.

1. Create your project folder somewhere on your computer. I am creating it on the Desktop. Let’s call it ScraperProject.

2. Open your cmd/terminal and move to this folder using the cd command –

cd %userprofile%\Desktop\ScraperProject

3. Initialize your node project –

npm init -y

4. Install nightmarejs and cheerio –

npm install nightmare cheerio --unsafe-perm=true

5. Create index.js file and put this code in it –

const Nightmare = require("nightmare");
const cheerio = require("cheerio");

// show: true displays the Electron window so we can watch the automation
const nightmare = Nightmare({ show: true });

const url = "https://www.flipkart.com/";
const data = [];

nightmare
    .goto(url)                             // open flipkart.com
    .wait("body")                          // wait for the page body to load
    .click("div._3OO5Xc")                  // click on the search bar
    .type("input._3704LK", "nodejs books") // type the search text
    .click("button.L0Z3Pu")                // click the search button
    .wait("div._13oc-S")                   // wait for the results list
    .evaluate(() => document.querySelector("body").innerHTML) // grab the page HTML
    .end()                                 // close the browser
    .then((response) => {
        getData(response);
    })
    .catch((err) => {
        console.log(err);
    });

// Parse the results page with cheerio and collect [title, link] pairs
let getData = (html) => {
    const $ = cheerio.load(html);
    $("div._13oc-S div div._4ddWXP a.s1Q9rs").each((i, elem) => {
        const title = $(elem).text();
        const link = $(elem).attr("href");
        data.push([title, link]);
    });
    console.log(data);
};

6. Run the file –

node index.js

It will open an Electron window and the Flipkart website will load automatically, as shown in the image below –

[Image: nightmarejs opens the flipkart website in Electron]

Then these automatic operations will take place –

  • Clicking on the search bar
  • Typing nodejs books
  • Clicking the search button
  • Waiting for the page to load
  • Getting the page data

The final output you will get in your node console window is –

[Image: titles and urls of nodejs books on the flipkart website, scraped using nightmarejs]

This is the array of titles and urls of the nodejs books listed on the Flipkart website. Since you received the whole page source, you can extract any data you want from it.
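If you want to persist the result instead of only logging it, a small addition (a sketch using Node’s built-in fs module; the books.json file name is arbitrary) at the end of getData() would write it to disk –

const fs = require("fs");

// Sketch: save the scraped [title, link] pairs as JSON next to index.js
fs.writeFileSync("books.json", JSON.stringify(data, null, 2));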

Let’s understand the code.

First of all, we import nightmarejs and cheerio, set the url variable, and create an array data to hold the result.

const Nightmare = require("nightmare");
const cheerio = require("cheerio");

const url = "https://www.flipkart.com/";
const data = [];

Next, we create an instance of Nightmare with the show option set to true. This option decides whether to display the Electron window. If you want to run the automation headless, you can set it to false.

const Nightmare = require("nightmare");
const cheerio = require("cheerio");

const nightmare = Nightmare({ show: true });

const url = "https://www.flipkart.com/";
const data = [];
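For example, the headless variant is a one-line change to the constructor call –

const nightmare = Nightmare({ show: false }); // no visible Electron window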

Now we are ready to open our url and perform some automation. We want to load all the nodejs books on Flipkart, so we follow the same steps we would perform manually to find those books –

  • Step 1: Open flipkart.com
  • Step 2: Wait for the website to load.
  • Step 3: Click on the search bar
  • Step 4: Type the search text, e.g. nodejs books.
  • Step 5: Click on search button
  • Step 6: Wait for page to load
  • Step 7: Get the list of books
  • Step 8: Close website

We will follow the same steps through nightmare commands –

  • Step 1: goto(url)
  • Step 2: wait("body")
  • Step 3: click("div._3OO5Xc") (Clicking search box)
  • Step 4: type("input._3704LK", "nodejs books")
  • Step 5: click("button.L0Z3Pu")
  • Step 6: wait("div._13oc-S")
  • Step 7: evaluate(() => document.querySelector("body").innerHTML)
  • Step 8: end()

After end(), we attach a then() handler to receive the scraped HTML string and a catch() block for catching errors.

const Nightmare = require("nightmare");
const cheerio = require("cheerio");

const nightmare = Nightmare({ show: true });

const url = "https://www.flipkart.com/";
const data = [];

nightmare
    .goto(url)
    .wait("body")
    .click("div._3OO5Xc")
    .type("input._3704LK", "nodejs books")
    .click("button.L0Z3Pu")
    .wait("div._13oc-S")
    .evaluate(() => document.querySelector("body").innerHTML)
    .end()
    .then((response) => {
        ...
    })
    .catch((err) => {
        console.log(err);
    });

Now we can parse the data of interest using cheerio, as we did for the static website –

const Nightmare = require("nightmare");
const cheerio = require("cheerio");

const nightmare = Nightmare({ show: true });

const url = "https://www.flipkart.com/";
const data = [];

nightmare
    .goto(url)
    .wait("body")
    .click("div._3OO5Xc")
    .type("input._3704LK", "nodejs books")
    .click("button.L0Z3Pu")
    .wait("div._13oc-S")
    .evaluate(() => document.querySelector("body").innerHTML)
    .end()
    .then((response) => {
        getData(response);
    })
    .catch((err) => {
        console.log(err);
    });

let getData = (html) => {
    const $ = cheerio.load(html);
    $("div._13oc-S div div._4ddWXP a.s1Q9rs").each((i, elem) => {
        const title = $(elem).text();
        const link = $(elem).attr("href");
        data.push([title, link]);
    });
    console.log(data);
};

Conclusion

In this article we saw how to scrape websites using JavaScript in both ReactJS and NodeJS. Along with scraping, we saw how to automate browser operations using NightmareJS. Static websites, which ship their complete source code, are easy to scrape since you get the markup directly after page load. In the case of dynamic websites, JavaScript generates the HTML, so we need to wait for it to finish executing; NightmareJS handled this for us. You can get the dynamic scraper code in my GitHub repository.