
Step-by-Step Tutorial on Python Web Scraping

Scope of the Article

→ In this article, we will discuss what Python is and why it is important.

→ Then, we will discuss what web scraping is and its applications.

→ The process of collecting huge amounts of information from different websites is known as web scraping.

→ We will also see why Python is a good choice for web scraping.

→ Web scraping in Python relies on a few specific libraries, which we will cover briefly.

→ Next, we will walk through the steps for web scraping with Python.

Now, let us get started with the topic!

Introduction

Python is a high-level programming language that supports object-oriented principles and concepts, so we call it an object-oriented programming language.

Python is also very simple to learn and a friendly language for starting a programming journey.

Web scraping is defined as the process of collecting large amounts of information from different websites.

Web scraping extracts information from websites according to user specifications.

Web scraping is mainly used to compare data across different applications and websites, and the collection itself can be fully automated.

Web scraping can be done with dedicated software tools called web scrapers.

There are different ways to scrape websites, such as using online services or writing the code yourself; in this article, we will look at implementing web scraping with Python.

Web Scraping

Web scraping is the automatic process of collecting vast amounts of information from different websites, and it is mainly used to compare different applications and websites.

Web scraping has applications such as job listings, R&D (research and development), price comparison, and social media scraping.

We also have real-world applications of web scraping such as NLP (Natural Language Processing), machine learning, predictive analytics, real-time analysis, price monitoring, and many more.

Is web scraping legal? Web scraping is generally considered acceptable when the information we want to use is publicly and freely available.

We can check which parts of a site may be crawled with the help of its robots.txt file: appending “/robots.txt” to the site's root URL shows the rules the site owner has published. So, some web scraping is allowed and some is not; a small sketch of such a check follows.
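As a rough sketch of how to check this programmatically, Python's standard library includes urllib.robotparser; the Flipkart URL below is only an illustrative assumption, and the answer depends entirely on the rules the site actually publishes.

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt file (example URL).
rp = RobotFileParser()
rp.set_url("https://www.flipkart.com/robots.txt")
rp.read()

# can_fetch() reports whether the given user agent may crawl the path.
print(rp.can_fetch("*", "https://www.flipkart.com/mobiles/"))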

Why Python for Web Scraping?

Python is a good fit for web scraping because it has an easily readable syntax, requires little code, offers a large collection of libraries, and is easy to use, which makes it well suited for data extraction.

→ Since Python has a simple, compact syntax, programs are also small and simple; there is no need for braces (‘{’, ‘}’) or semicolons (‘;’) anywhere in the program.

→ We can handle large amounts of data easily using libraries like NumPy, Pandas, and Matplotlib, which help in manipulating the extracted data.

→ Web scraping is largely about saving time, and Python lets us perform large tasks with little code, which reduces that time further.

Python web scraping uses some specific libraries such as Requests, Beautiful Soup, lxml, Selenium, Scrapy, urllib, and MechanicalSoup.

→ The Requests library is used for sending HTTP requests such as GET and POST to servers. It is effectively the standard Python HTTP library and is both simple to use and effective.

→ Beautiful Soup is a Python library used in web scraping to extract data from websites, HTML, or XML files with the help of a parser; a minimal example combining it with Requests is shown after this list.

→ Selenium is an open-source browser automation tool whose Python bindings are often used in web scraping, especially for pages that render content with JavaScript.

→ Scrapy is an open-source Python framework for web scraping that is designed for speed and large crawls.
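To make the roles of Requests and Beautiful Soup concrete, here is a minimal sketch that downloads a page and prints its title; the URL and the choice of the built-in html.parser are assumptions for illustration only.

import requests
from bs4 import BeautifulSoup

# Download the page (the URL is only an example).
response = requests.get("https://www.flipkart.com/mobiles/")

# Parse the HTML and print the page title, if one exists.
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.text if soup.title else "No <title> found")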

Steps for Python Web Scraping

  1. Choose the URL which has to be scraped.

  2. Check the HTML content by inspecting the page.

  3. Select the information/data to be extracted.

  4. Start writing the code.

  5. Import the suitable libraries.

  6. Create a Chrome driver object.

  7. Use the driver.get method to open the URL.

  8. Extract the data in <div> tags in HTML with respect to the class names.

  9. Execute the code.

  10. Save the extracted information in the required format.

Now let us discuss each step using the Flipkart website as an example.

Step 1: Choose the URL which has to be scraped


⇒ As discussed, we take Flipkart as our example; from it we can extract data such as price, name, ratings, and more.

⇒ Consider this URL:

https://www.flipkart.com/mobiles/~cs-6f6pseptap/pr?sid=tyy%2C4io&collection-tab-name=Samsung+Galaxy+F13&param=2355&otracker=clp_bannerads_1_41.bannerAdCard.BANNERADS_Samsung%2BF13_mobile-phones-store_H1SUOYYL67LZ

Step 2: Check the HTML content by inspecting the page


⇒ Right-click on the page and select the Inspect option.

Step 3: Select the information/data to be extracted

⇒ Select the highlighted part that contains the product's properties; it is written inside a <div> tag.

Step 4: Start writing the code


  1. Import the suitable libraries

from bs4 import BeautifulSoup

from selenium import webdriver

import requests

import pandas as pd

2. Create a Chrome driver object

driver = webdriver.Chrome("/user/chromium-browser/chromedriver")
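Note that newer Selenium releases (4.x) expect the chromedriver path to be passed through a Service object rather than directly; a rough sketch, assuming the same driver location as above:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4+: wrap the chromedriver path in a Service object.
driver = webdriver.Chrome(service=Service("/user/chromium-browser/chromedriver"))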

3. Use the driver.get method to open the URL

Prices = []

Reviews = []

driver.get("https://www.flipkart.com/mobiles/~cs-6f6pseptap/pr?sid=tyy%2C4io&collection-tab-name=Samsung+Galaxy+F13&param=2355&otracker=clp_bannerads_1_41.bannerAdCard.BANNERADS_Samsung%2BF13_mobile-phones-store_H1SUOYYL67LZ")

Step 5: Extract data in <div> tags in HTML with respect to the class names

content = driver.page_source

soup = BeautifulSoup(content, "html.parser")

for a in soup.find_all("a", href=True, attrs={"class": "_1fQZEK"}):

    # Each matching <a> block is one product card; pull out its price and review <div>s.
    prices = a.find("div", attrs={"class": "_30jeq3 _1_WHN1"})

    reviews = a.find("div", attrs={"class": "_2_R_DZ"})

    # Skip products where either element is missing.
    if prices and reviews:

        Prices.append(prices.text)

        Reviews.append(reviews.text)

Step 6: Execute the Code

To run the program, use the following command:

python web-s.py

Step 7: Saving the results of information extracted in the required format

df = pd.DataFrame({'Prices': Prices, 'Reviews': Reviews})

df.to_csv('details.csv', index=False, encoding='utf-8')

Finally, we extracted the information and stored it in a CSV file named details.csv.

This is the procedure for web scraping with Python.

Conclusion

1. We discussed Python: it is very simple to learn and a good language to start a programming journey with, as it is an object-oriented programming language.

2. We saw why Python is a good choice for web scraping.

3. Web scraping is mostly done in Python because of its easy syntax and because large tasks need little code.

4. Web Scraping is the process of collecting huge amounts of information from different websites.

5. We discussed some applications of web scraping like job listings, price comparison, and social media scraping.

6. Some more real-world applications include NLP (Natural Language Processing), machine learning, predictive analytics, real-time analysis, and price monitoring.

7. We also looked at when web scraping is legal and when it is not.

8. Python uses some specific libraries for web scraping such as Requests, Beautiful Soup, lxml, Selenium, Scrapy, urllib, and MechanicalSoup.

9. Then we walked through the steps for web scraping in Python.

Hope you learned how web scraping can be done using Python and gained something useful from this short article. ❤
