What is Web Scraping?
Web scraping is the process of extracting data (usually HTML files) from a website. It saves us a lot of time because it gives us a large amount of data in a structured form.
As we already know, data used in programming should be in a structured form; if it is not, we have to convert it into one.
How to Scrape Data from any Website
We can extract the data of a website in two ways.
- By using the website's API. Some websites like Twitter, Facebook, and YouTube provide an API, and we can use it to extract their data.
- But not all websites have an API. In that case we extract the data by accessing the site's HTML pages; this is called web scraping.
Web Scraping Using Python
To scrape data with Python we first have to install some libraries:
- requests
- html5lib
- bs4
You can install them with pip (pip install requests html5lib bs4). We will see where each of them is used.
Let's start with an example
# Importing libraries
import requests
from bs4 import BeautifulSoup
Note
- requests is a Python HTTP library. We use it to send a request to a web page.
- BeautifulSoup is a library for parsing the HTML documents we pull down from the web. See the official BeautifulSoup documentation.
# Get the HTML content
url = "https://goeduhub.com/"
response = requests.get(url)
print(response)
htmlcontent = response.content
# print(htmlcontent)
Output (e.g. <Response [200]> if the request succeeded)
Note
- In this code we first take the website whose data we want to extract and request its URL with the help of Python's requests module.
- After that we print the response variable. Output like <Response [200]> means the request succeeded.
- The HTML content is commented out here because it would print the site's entire HTML file, which is very long. You can try it if you want.
# Parse the HTML content
soup = BeautifulSoup(htmlcontent, "html.parser")
# print(soup)
print(soup.prettify())
Output
Note
- Here we use BeautifulSoup to parse the HTML content with the help of the html.parser.
- I did not print the soup itself but the prettified soup; prettify() shows the HTML page in a more readable way.
- A small snapshot of it is shown in the output above.
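To see what prettify() does without fetching a live page, here is a minimal sketch using a tiny inline HTML string (hypothetical markup, standing in for the goeduhub.com page):

```python
from bs4 import BeautifulSoup

# A tiny stand-in document, just to illustrate prettify
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the parsed document as a string with one tag per line,
# indented to show nesting -- note the parentheses: prettify is a method call
print(soup.prettify())
```

The same call works on the real htmlcontent fetched above; the output is just much longer.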
# Printing the title of the web page
print(soup.findAll('title'))
# It will give all the div tags on the HTML page
print(soup.findAll('div'))
# It will give all the text tags on the HTML page
print(soup.findAll('text'))
# It will give the fifth anchor tag (index 4) on the page
one_tag = soup.findAll('a')[4]
print(one_tag)
Output 1 (title of page)
Output 2 (divs of page)
Output 3 (text in page)
Output 4 (anchor of page)
Note
- If we want to access parts of an HTML file, we can do it only by accessing HTML tags, as some of the tags are accessed in the code above.
- If you look at the third output carefully, it is an empty list, meaning there is no <text> tag on our web page. Also note that the tags we get from BeautifulSoup come in a list format.
- We already know how to index a list in Python; that is what the fourth output shows.
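The list behavior of findAll can be demonstrated offline with a small inline page (hypothetical markup standing in for the real site):

```python
from bs4 import BeautifulSoup

# Three anchors in a tiny inline document, standing in for a fetched page
html = "<div><a href='/a'>A</a><a href='/b'>B</a><a href='/c'>C</a></div>"
soup = BeautifulSoup(html, "html.parser")

anchors = soup.findAll('a')   # list-like: supports len() and indexing
print(len(anchors))           # 3
print(anchors[1])             # the second anchor: <a href="/b">B</a>
print(soup.findAll('text'))   # [] -- 'text' is not a tag name, so the list is empty
```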
# Alternate way of accessing tags
# Get the title of the HTML page
title = soup.title
print(title)
Output
Note
In a similar way we can access other HTML tags. The limitation of this approach is that it returns only one tag at a time (generally the first one on the page).
HTML Tree
Note: This is just to show you how the tags of an HTML page are connected to each other. We can also use this tree to access the HTML tags.
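Navigating the tree can be sketched on a minimal document (an assumed structure, chosen only to illustrate the idea):

```python
from bs4 import BeautifulSoup

# Minimal tree for illustration: html -> body -> p -> b
html = "<html><body><p>Some <b>bold</b> text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk down the tree through tag attributes...
p = soup.html.body.p
print(p.b.string)        # the text inside <b>: "bold"

# ...and back up through .parent
print(p.b.parent.name)   # "p"
```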
# Get all the links from the HTML page
anchor = soup.findAll('a')
all_links = set()
for link in anchor:
    href = link.get('href')
    if href and href != '#':
        all_links.add("https://goeduhub.com/" + href)
for link in all_links:
    print(link)
Output
Note
- Here we are trying to find all the links on the URL we have taken.
- Now you might think we already did this with the findAll method. Yes, but there we only see the href values; we cannot click those links and go to the actual pages.
- Here we first store the anchor tags in a variable named anchor.
- After that we loop over it. Because pages often use # in place of a link, we add a condition to skip those empty links.
- After that we join each link to the original URL.
- We use a Python set instead of a list to avoid repetition of links (unique values only).
- When you run this code, all the links in the output work, and you can open the pages by clicking on them.
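One caveat: the loop above assumes every href is relative to the site root, so an absolute href would be mangled by the string concatenation. A more robust sketch uses urljoin from Python's standard library, shown here on a small inline page (hypothetical hrefs, for illustration):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Inline HTML standing in for a fetched page: a mix of '#', relative,
# and absolute links
html = ("<a href='#'>skip</a>"
        "<a href='qa'>relative</a>"
        "<a href='https://example.com/page'>absolute</a>")
soup = BeautifulSoup(html, "html.parser")

base = "https://goeduhub.com/"
all_links = set()
for link in soup.findAll('a'):
    href = link.get('href')
    if href and href != '#':
        # urljoin resolves relative hrefs against base
        # and leaves absolute URLs untouched
        all_links.add(urljoin(base, href))

print(sorted(all_links))
```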
Click here for - Scraping Data from the Live Flipkart Website - an Example