What is Web Scraping?
Web scraping is the process of extracting data (usually HTML files) from a website. It saves us a lot of time because it gives us a large amount of data in a structured form.
As we already know, data used in programming should be in a structured form; if it is not, we have to convert it into one.
How to Scrape Data from any Website
We can extract the data of a website in two ways.
- By using the website's API. Some websites like Twitter, Facebook, and YouTube provide an API, and we can use it to extract their data.
- But not all websites have an API. In that case we extract the data by accessing the site's HTML pages; this is called web scraping.
Web Scraping Using Python
To scrape data with Python we first have to install some libraries:
- requests
- html5lib
- bs4
You can install them with pip (pip install requests html5lib bs4). We will see where each of them is used.
Let's start with an example
# Importing libraries
import requests
from bs4 import BeautifulSoup
Note
- requests is a Python HTTP library. We use it to send a request to a web page.
- BeautifulSoup is a library for parsing the HTML documents we pull down from the web. See the official BeautifulSoup documentation.
# Get the HTML content
url = "https://goeduhub.com/"
response = requests.get(url)
print(response)
htmlcontent = response.content
# print(htmlcontent)
Output (e.g. <Response [200]> if the request succeeded)
Note
- In this code we first take the website whose data we want to extract and request its URL with the help of Python's requests module.
- After that we print the response variable. Output like <Response [200]> means the request succeeded.
- The HTML content is commented out here because it would print the site's entire HTML file, which is very long. You can try it if you want.
# Parse the HTML content
soup = BeautifulSoup(htmlcontent, "html.parser")
# print(soup)
print(soup.prettify())
Output
Note
- Here we use BeautifulSoup to parse the HTML content with the help of the html.parser.
- I did not print the soup itself but the prettified soup; prettify() shows the HTML page in a more readable way.
- A small snapshot of it is shown in the output above.
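To see what prettify() does without fetching a live page, here is a minimal sketch using a tiny inline HTML string (hypothetical markup, standing in for the goeduhub.com page):

```python
from bs4 import BeautifulSoup

# A tiny stand-in document, just to illustrate prettify
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the parsed document as a string with one tag per line,
# indented to show nesting -- note the parentheses: prettify is a method call
print(soup.prettify())
```

The same call works on the real htmlcontent fetched above; the output is just much longer.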
# Printing the title of the web page
print(soup.findAll('title'))
# It will give all the div tags on the HTML page
print(soup.findAll('div'))
# It will give all the text tags on the HTML page
print(soup.findAll('text'))
# It will give the fifth anchor tag (index 4) on the page
one_tag = soup.findAll('a')[4]
print(one_tag)
Output 1 (title of page)
Output 2 (divs of page)
Output 3 (text in page)
Output 4 (anchor of page)
Note
- If we want to access parts of an HTML file, we can do it only by accessing HTML tags, as some of the tags are accessed in the code above.
- If you look at the third output carefully, it is an empty list, meaning there is no <text> tag on our web page. Also note that the tags we get from BeautifulSoup come in a list format.
- We already know how to index a list in Python; that is what the fourth output shows.
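The list behavior of findAll can be demonstrated offline with a small inline page (hypothetical markup standing in for the real site):

```python
from bs4 import BeautifulSoup

# Three anchors in a tiny inline document, standing in for a fetched page
html = "<div><a href='/a'>A</a><a href='/b'>B</a><a href='/c'>C</a></div>"
soup = BeautifulSoup(html, "html.parser")

anchors = soup.findAll('a')   # list-like: supports len() and indexing
print(len(anchors))           # 3
print(anchors[1])             # the second anchor: <a href="/b">B</a>
print(soup.findAll('text'))   # [] -- 'text' is not a tag name, so the list is empty
```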
# Alternate way of accessing tags
# Get the title of the HTML page
title = soup.title
print(title)
Output
Note
In a similar way we can access other HTML tags. The limitation of this approach is that it returns only one tag at a time (generally the first one on the page).
HTML Tree
Note: This is just to show you how the tags of an HTML page are connected to each other. We can also use this tree to access the HTML tags.
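Navigating the tree can be sketched on a minimal document (an assumed structure, chosen only to illustrate the idea):

```python
from bs4 import BeautifulSoup

# Minimal tree for illustration: html -> body -> p -> b
html = "<html><body><p>Some <b>bold</b> text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk down the tree through tag attributes...
p = soup.html.body.p
print(p.b.string)        # the text inside <b>: "bold"

# ...and back up through .parent
print(p.b.parent.name)   # "p"
```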
# Get all the links from the HTML page
anchor = soup.findAll('a')
all_links = set()
for link in anchor:
    href = link.get('href')
    if href and href != '#':
        all_links.add("https://goeduhub.com/" + href)
for link in all_links:
    print(link)
Output
Note
- Here we are trying to find all the links on the URL we have taken.
- Now you might think we already did this with the findAll method. Yes, but there we only see the href values; we cannot click those links and go to the actual pages.
- Here we first store the anchor tags in a variable named anchor.
- After that we loop over it. Because pages often use # in place of a link, we add a condition to skip those empty links.
- After that we join each link to the original URL.
- We use a Python set instead of a list to avoid repetition of links (unique values only).
- When you run this code, all the links in the output work, and you can open the pages by clicking on them.
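One caveat: the loop above assumes every href is relative to the site root, so an absolute href would be mangled by the string concatenation. A more robust sketch uses urljoin from Python's standard library, shown here on a small inline page (hypothetical hrefs, for illustration):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Inline HTML standing in for a fetched page: a mix of '#', relative,
# and absolute links
html = ("<a href='#'>skip</a>"
        "<a href='qa'>relative</a>"
        "<a href='https://example.com/page'>absolute</a>")
soup = BeautifulSoup(html, "html.parser")

base = "https://goeduhub.com/"
all_links = set()
for link in soup.findAll('a'):
    href = link.get('href')
    if href and href != '#':
        # urljoin resolves relative hrefs against base
        # and leaves absolute URLs untouched
        all_links.add(urljoin(base, href))

print(sorted(all_links))
```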
Click here for - Scraping Data from the Live Flipkart Website - an Example