Learn Python this summer Day 14: Web Scraping

Welcome back! Yesterday, we learned how to create graphical user interfaces (GUIs) with tkinter. Today, we’ll dive into web scraping, which lets you extract data from websites using Python. By the end of today, you’ll know how to scrape web data with the requests and BeautifulSoup libraries. Let’s get started!

What is Web Scraping?
Web scraping is the process of extracting data from websites. It involves making HTTP requests to a website and parsing the HTML to retrieve the desired information.

Libraries for Web Scraping
requests: Lets you send HTTP requests and retrieve the content of web pages.
BeautifulSoup: A library for parsing HTML and XML documents. It is installed as beautifulsoup4 and imported as bs4.

Installing the Libraries
You can install the requests and BeautifulSoup libraries using pip:

pip install requests
pip install beautifulsoup4

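To confirm that both installs worked, a quick check like the one below can help. It is only a sketch: it assumes both packages were installed into the same Python environment you run it with, and it shows that the beautifulsoup4 package is imported under the name bs4.

```python
import bs4
import requests

# Both packages expose their installed version as a __version__ string
print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)
```

If either import fails with ModuleNotFoundError, the package was probably installed for a different Python interpreter.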
Making HTTP Requests
Use the requests library to make HTTP requests and get the content of a web page.

Example:

import requests

url = "https://example.com"
response = requests.get(url)

print(response.text)  # Output: HTML content of the page

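Real requests can fail: the server may be down, slow, or return an error page. The sketch below shows one way to handle that, assuming a 10-second timeout is acceptable. The helper name fetch_html is hypothetical and used only for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails.

    fetch_html is a hypothetical helper name used only for this sketch.
    """
    try:
        # timeout stops the call from waiting forever on a slow server
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
        response.raise_for_status()
    except requests.RequestException as error:
        # RequestException is the base class for errors raised by requests
        print(f"Request failed: {error}")
        return None
    return response.text
```

You could then call fetch_html("https://example.com") and check whether the result is None before parsing it.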
Parsing HTML with BeautifulSoup
Use the BeautifulSoup library to parse the HTML content and extract the desired information.

Example:

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract the title of the page
title = soup.title.text
print(title)  # Output: Example Domain

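BeautifulSoup does not need a live website; it parses any HTML string. The sketch below uses a small made-up document (not taken from any real site) so you can experiment without making a request.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document used only for illustration
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>  Hello, scraper!  </h1>
    <p>First paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.text            # text inside the <title> tag
heading = soup.h1.get_text(strip=True)  # strip=True trims surrounding whitespace

print(page_title)
print(heading)
```

Working from a string like this is a handy way to test your parsing logic before pointing it at a real page.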
Finding Elements
You can use BeautifulSoup methods to find specific elements in the HTML.

find(): Returns the first matching element, or None if there is no match.
find_all(): Returns a list of all matching elements (an empty list if there are none).

Example:

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find the first paragraph
first_paragraph = soup.find("p")
print(first_paragraph.text)

# Find all paragraphs
all_paragraphs = soup.find_all("p")
for paragraph in all_paragraphs:
    print(paragraph.text)

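One detail worth knowing: when nothing matches, find() returns None, so calling .text on the result raises an AttributeError. The sketch below, again using a made-up HTML snippet, shows a simple guard.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two paragraphs and no table
html = "<div><p>One</p><p>Two</p></div>"
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table")  # there is no <table>, so this is None
if table is None:
    print("No table found")
else:
    print(table.text)

# find_all() never returns None; it returns an empty list when nothing matches
tables = soup.find_all("table")
paragraph_texts = [p.get_text() for p in soup.find_all("p")]

print(len(tables))
print(paragraph_texts)
```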
Extracting Attributes
You can read an attribute of an HTML element with dictionary-style access, such as tag["href"]. The attrs property holds all of an element's attributes as a dictionary.

Example:

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract the href attribute of the first link
first_link = soup.find("a")
print(first_link["href"])  # Output: https://www.iana.org/domains/example

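Links on real pages are often relative, and some anchor tags have no href at all. The sketch below shows one way to cope with both, using tag.get() and urljoin from the standard library. The base_url value and the HTML are assumptions made up for the example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed address of the page the HTML came from (illustrative only)
base_url = "https://example.com/articles/"

# Made-up HTML: one root-relative link, one relative link, one anchor without href
html = """
<a href="/about">About</a>
<a href="page2.html">Next</a>
<a>No link here</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for anchor in soup.find_all("a"):
    # .get() returns None for a missing attribute; anchor["href"] would raise KeyError
    href = anchor.get("href")
    if href is None:
        continue
    # urljoin turns a relative link into an absolute URL
    links.append(urljoin(base_url, href))

print(links)
```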
Practice Time!
Let’s put what we’ve learned into practice. Write a Python program that scrapes data from a website and extracts specific information.

Example: Scraping the titles of articles from a news website.

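import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find all article titles.
# The method is find_all(), and the keyword is class_ because class is reserved in Python.
# "storylink" must match the class name in the site's current markup; adjust it if nothing is printed.
titles = soup.find_all("a", class_="storylink")

# Print the titles
for title in titles:
    print(title.text)
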
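Because a site's markup can change at any time, it can help to keep the parsing step in its own function and try it on saved or sample HTML first. The sketch below does that. The function name extract_titles, its css_class parameter, and the sample markup are all hypothetical and not taken from any real site.

```python
from bs4 import BeautifulSoup


def extract_titles(html, css_class="storylink"):
    """Return the text of every <a> tag that has the given CSS class.

    extract_titles and css_class are hypothetical names for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_=css_class)
    return [link.get_text(strip=True) for link in links]


# Made-up markup that imitates a list of article links
sample_html = """
<a class="storylink" href="https://example.com/1">First story</a>
<a class="other" href="https://example.com/2">Not a story</a>
<a class="storylink" href="https://example.com/3">Second story</a>
"""

print(extract_titles(sample_html))
```

If the site later changes its class names, only the css_class argument needs to change, and the function can be checked against sample HTML without making a request.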
Conclusion
Great job today! You’ve learned how to scrape web data using Python’s requests and BeautifulSoup libraries. Tomorrow, we’ll dive into working with APIs, which let you interact with web services and retrieve data programmatically. Keep practicing, and have fun coding!