How to Scrape Data from Flipkart
We need to follow certain steps for data extraction:
- Import the necessary libraries, such as BeautifulSoup, requests, csv, and pandas.
- Find the URL we want to extract data from.
- Inspect the page to identify which HTML elements hold the content we want to extract.
- Write the scraping code.
- Store the result in the desired format.
Step 1: Importing the necessary libraries
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
Requests is a Python HTTP library. With its help, we send an HTTP request to a web page and retrieve its contents.
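A minimal sketch of how such a request might be wrapped (the User-Agent header is an assumption on my part, since sites like Flipkart often reject requests sent with the default `python-requests` header):

```python
import requests

# Browser-like header: an assumption, because many sites block the
# default `python-requests` User-Agent string.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def fetch_page(url: str) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # surface 4xx/5xx instead of silently parsing an error page
    return response.text
```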
Step 2: Find the URL we want to extract data from
In this example we will extract data from the Flipkart website and compare the prices and ratings of different laptops. The URL of the Flipkart search page containing laptop information is:
https://www.flipkart.com/search?q=laptop&sid=6bo%2Cb5g&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_6_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_6_na_na_na&as-pos=1&as-type=RECENT&suggestionId=laptop%7CLaptops&requestId=7ec220e8-4f02-4150-9e0b-9e90cf692f4b&as-searchtext=laptop
To get the contents of the specified URL, submit a request using the requests library.
url="https://www.flipkart.com/search?q=laptop&sid=6bo%2Cb5g&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_6_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_6_na_na_na&as-pos=1&as-type=RECENT&suggestionId=laptop%7CLaptops&requestId=7ec220e8-4f02-4150-9e0b-9e90cf692f4b&as-searchtext=laptop"
response = requests.get(url)
htmlcontent = response.content
soup = BeautifulSoup(htmlcontent,"html.parser")
print(soup.prettify())
- Here we use BeautifulSoup to parse the HTML content with the built-in html parser.
- The raw soup is hard to read, so we print the prettified soup instead; prettify() shows the HTML page with proper indentation.
- Check that your output is the page's HTML code.
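The parsing step can be tried offline on a small HTML snippet. The markup below is a made-up stand-in for the Flipkart page, just to show what the parser and prettify() do:

```python
from bs4 import BeautifulSoup

# A tiny stand-in document; the real Flipkart page is far larger.
html = "<div><p class='title'>HP 14s</p><p class='title'>Lenovo Ideapad</p></div>"

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())  # the same HTML, re-indented with one tag per line

# find_all returns every matching tag; .text extracts the visible text.
titles = [p.text for p in soup.find_all("p", attrs={"class": "title"})]
print(titles)  # ['HP 14s', 'Lenovo Ideapad']
```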
This is a snapshot of the Flipkart page, which lists information about different laptops. We want to extract the product name, product price, and product rating.
Step 3: Inspect the page and identify the HTML elements we want to extract
Right-click on the Flipkart page, choose Inspect, and then select the elements you want to extract.
A "Browser Inspector Box" opens after clicking Inspect. We observe that the class name of the description is '_4rR01T', so we use the find method to extract a laptop's description.
products=[]
prices=[]
ratings=[]
product=soup.find('div',attrs={'class':'_4rR01T'})
print(product.text)
Output-
HP 14s Core i3 10th Gen - (8 GB/256 GB SSD/Windows 10 Home) 14s-cf3074TU Thin and Light Laptop
Here we get a single laptop's description. To collect all the laptop descriptions, we need to write a loop.
Step 4: Writing the scraping code
Get the element for each complete product card, and within each card the classes corresponding to the name, price, and rating:
for a in soup.find_all('a', href=True, attrs={'class': '_1fQZEK'}):
    name = a.find('div', attrs={'class': '_4rR01T'})
    price = a.find('div', attrs={'class': '_30jeq3 _1_WHN1'})
    rating = a.find('div', attrs={'class': '_3LWZlK'})
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)
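On the live page, some product cards have no rating, in which case find returns None and .text raises an AttributeError. A defensive version of the loop, demonstrated here against made-up markup that reuses the article's class names (the real class names are auto-generated and change over time):

```python
from bs4 import BeautifulSoup

# Made-up markup reusing the class names above; the second card has no rating.
html = """
<a class="_1fQZEK" href="/x"><div class="_4rR01T">HP 14s</div>
  <div class="_30jeq3 _1_WHN1">₹36,990</div><div class="_3LWZlK">4.2</div></a>
<a class="_1fQZEK" href="/y"><div class="_4rR01T">Lenovo Ideapad</div>
  <div class="_30jeq3 _1_WHN1">₹30,990</div></a>
"""

soup = BeautifulSoup(html, "html.parser")
products, prices, ratings = [], [], []
for a in soup.find_all("a", href=True, attrs={"class": "_1fQZEK"}):
    name = a.find("div", attrs={"class": "_4rR01T"})
    price = a.find("div", attrs={"class": "_30jeq3 _1_WHN1"})
    rating = a.find("div", attrs={"class": "_3LWZlK"})
    # Fall back to an empty string whenever an element is missing.
    products.append(name.text if name else "")
    prices.append(price.text if price else "")
    ratings.append(rating.text if rating else "")

print(list(zip(products, prices, ratings)))
```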
Step 5: Store the result in the desired format
import pandas as pd
df = pd.DataFrame({'Product Name':products,'Prices':prices,'Ratings':ratings})
df.head()
Output-
   Product Name                                         Prices    Ratings
0  HP 14s Core i3 10th Gen - (8 GB/256 GB SSD/Win...    ₹36,990   4.2
1  HP 15 Ryzen 3 Dual Core 3200U - (4 GB/1 TB HDD...    ₹29,990   4.1
2  Lenovo Ideapad S145 Core i3 7th Gen - (4 GB/1 ...    ₹30,990   4.1
3  HP 15s Ryzen 5 Quad Core 3450U - (8 GB/1 TB HD...    ₹40,990   4.2
4  Asus TUF Gaming A17 Ryzen 5 Hexa Core 4600H - ...    ₹63,990   4.7
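To actually compare prices and ratings, the scraped strings need to become numbers. A sketch using a few sample rows matching the output above:

```python
import pandas as pd

# Sample rows shaped like the scraped output.
df = pd.DataFrame({
    "Product Name": ["HP 14s", "HP 15", "Asus TUF Gaming A17"],
    "Prices": ["₹36,990", "₹29,990", "₹63,990"],
    "Ratings": ["4.2", "4.1", "4.7"],
})

# Strip the rupee sign and the thousands separator, then convert to numbers.
df["Prices"] = df["Prices"].str.replace("₹", "").str.replace(",", "").astype(int)
df["Ratings"] = df["Ratings"].astype(float)

print(df.sort_values("Prices"))  # cheapest laptops first
```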
Store in a csv file-
df.to_csv('products.csv')
A file named "products.csv" is created, and this file contains the extracted data.
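As a quick sanity check, the file can be read back with pandas. A minimal round trip with one sample row (`index=False` is an optional tweak that drops the extra unnamed index column from the file):

```python
import pandas as pd

df = pd.DataFrame({"Product Name": ["HP 14s"], "Prices": ["₹36,990"], "Ratings": [4.2]})
df.to_csv("products.csv", index=False)  # index=False skips the row-number column

check = pd.read_csv("products.csv")
print(check.shape)  # (1, 3)
```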