# Web Scraping with Python

You don't need to be a Python guru; a basic understanding of HTML and Python is sufficient for this <b>web scraping</b> tutorial.

#### Let's dive in..

The tools we're going to use are:
<ul>
<li><b>Requests</b> will allow us to send HTTP requests to get HTML files</li>
<li><b>BeautifulSoup</b> will help us parse the HTML files</li>
<li><b>Pandas</b> will help us assemble the data into a DataFrame to clean and analyze it</li>
<li><b>csv</b> (optional): if you want to share the data in CSV file format</li>
</ul>

#### Let's Begin..
In this tutorial, we're going to scrape the <b><a href="https://www.imdb.com/search/title/?groups=top_1000">IMDB</a></b> website, from which we can get each movie's title, year, rating, genre, etc.

First, we'll import the tools to build our scraper:
``` python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
```
Next, we fetch the contents of the web page into the <b>results</b> variable:
``` python
url = "https://www.imdb.com/search/title/?groups=top_1000"
results = requests.get(url)
```
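A request can fail or be rejected, so it can help to check the response before parsing it. The following is a minimal sketch, not the code used in this tutorial: the `fetch_html` helper, the `HEADERS` values, and the 10-second timeout are all illustrative assumptions.

```python
import requests

# Hypothetical headers; some sites respond differently to the default client.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)",
    "Accept-Language": "en-US,en;q=0.8",
}


def fetch_html(url, timeout=10):
    """Return the page body as text, raising an error on a 4xx/5xx response."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on a bad status
    return response.text
```

With a helper like this, the text passed to BeautifulSoup could come from `fetch_html(url)` instead of `results.text`.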
To make the content easy to navigate, we parse it with <b>BeautifulSoup</b> and store the parsed result in the <b>soup</b> variable:
``` python
soup = BeautifulSoup(results.text, "html.parser")
```
Now we initialize the lists that will store the data:
``` python 
titles = []        # movie titles
years = []         # release years
time = []          # movie durations
imdb_ratings = []  # IMDB ratings
genre = []         # genres
votes = []         # number of votes
```
Now find the right movie container: open the browser's developer tools, inspect the page, and hover over a movie div, as in the image below.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/i/z2p7b4j24xlggm3sk8sr.png)


We can see 50 divs with the class name <code>lister-item mode-advanced</code>.
So, find all divs with that class name:
``` python
movie_div = soup.find_all("div", class_="lister-item mode-advanced")
```
The <b>find_all</b> method extracts all the divs whose class attribute is "lister-item mode-advanced".
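To see how this search behaves without hitting the network, here is a small sketch that runs against a made-up HTML string. `SAMPLE_HTML` and `movie_div_css` are hypothetical names used only for illustration; the real page contains far more markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure described above.
SAMPLE_HTML = """
<div class="lister-item mode-advanced"><h3><a href="#">Movie A</a></h3></div>
<div class="lister-item mode-advanced"><h3><a href="#">Movie B</a></h3></div>
<div class="sidebar">Not a movie</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A string containing a space matches the class attribute's exact value.
movie_div = soup.find_all("div", class_="lister-item mode-advanced")

# A CSS selector matches both classes regardless of their order.
movie_div_css = soup.select("div.lister-item.mode-advanced")

print(len(movie_div), len(movie_div_css))
```

The CSS selector form may be a little more forgiving if the order of the class names ever changes.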

Now go into each <code>lister-item mode-advanced</code> div and get the title, year, rating, genre, and movie duration.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/i/a0gyg59chadij4tj86wp.png)

So we iterate over every div to get the title, year, rating, etc. The extraction snippets in the following sections all go inside this loop:
``` python
for movieSection in movie_div:
    ...  # loop body: the extraction snippets from the sections below
```
#### Extracting the title

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/i/dh2m0s3o9verzg5lu54h.png)

From the image, we can see that the movie name is placed under <b>div>h3>a</b>:
```python
name = movieSection.h3.a.text  # movieSection is the current div in the loop
titles.append(name)            # append the movie name to the titles list
```
#### Extracting Year

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/i/y8yfx5mdglbqqv8xgpsy.png)

From the image, we can see that the release year is placed under div>h3>span (class name "lister-item-year"), and we extract it using the <b>text</b> attribute:
``` python
year = movieSection.h3.find("span", class_="lister-item-year").text
years.append(year)   # append the year to the years list
```
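Since the snippets above are shown without the loop around them, here is a sketch of how the title and year steps fit together inside it. It runs on a made-up HTML string (`SAMPLE_HTML` is a hypothetical stand-in for the real page), so the values are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating two movie containers.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <h3><span>1.</span> <a href="#">Movie A</a>
      <span class="lister-item-year">(2019)</span></h3>
</div>
<div class="lister-item mode-advanced">
  <h3><span>2.</span> <a href="#">Movie B</a>
      <span class="lister-item-year">(I) (2020)</span></h3>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
movie_div = soup.find_all("div", class_="lister-item mode-advanced")

titles = []
years = []

for movieSection in movie_div:
    # Both steps are indented so that they run once per movie div.
    name = movieSection.h3.a.text
    titles.append(name)

    year = movieSection.h3.find("span", class_="lister-item-year").text
    years.append(year)

print(titles)
print(years)
```

Note that the year still carries its parentheses at this point, which is why a cleaning step is needed later.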

Similarly, we can get the rating from the <b>strong</b> tag, and the genre and movie duration by their class names:
```python
ratings = movieSection.strong.text
imdb_ratings.append(ratings)   # append the rating to the imdb_ratings list
category = movieSection.find("span", class_="genre").text.strip()
genre.append(category)         # append the category to the genre list
runTime = movieSection.find("span", class_="runtime").text
time.append(runTime)           # append the runtime to the time list
```
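The code above assumes every movie has a rating, a genre, and a runtime. When `find` matches nothing it returns `None`, and calling `.text` on `None` raises an `AttributeError`. One possible way to guard against that is sketched below; `text_or_none` and `SAMPLE_HTML` are hypothetical names, and the second sample movie deliberately has no runtime.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the second movie has no runtime span.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <strong>8.5</strong>
  <span class="runtime">120 min</span>
  <span class="genre">
Drama, Crime            </span>
</div>
<div class="lister-item mode-advanced">
  <strong>7.9</strong>
  <span class="genre">Comedy</span>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.text.strip() if tag is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
imdb_ratings, genre, time = [], [], []

for movieSection in soup.find_all("div", class_="lister-item mode-advanced"):
    imdb_ratings.append(text_or_none(movieSection.strong))
    genre.append(text_or_none(movieSection.find("span", class_="genre")))
    time.append(text_or_none(movieSection.find("span", class_="runtime")))

print(time)
```

Appending `None` for a missing value keeps all the lists the same length, which matters when they are combined into a table.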
#### Extracting votes

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/i/59fniprg1lixxk2geu64.png)

As we can see from the image, there are two span tags with the attribute <code>name="nv"</code>. The first, <b>nv[0]</b>, holds the votes, and the second, <b>nv[1]</b>, holds the gross collections:
```python
nv = movieSection.find_all("span", attrs={"name": "nv"})
vote = nv[0].text
votes.append(vote)
```
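The sketch below shows why `attrs` is used here and how the second span could be read safely. It runs on made-up HTML; `SAMPLE_HTML` and the `grosses` list are hypothetical additions, and the second sample movie is assumed to have no gross figure.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the second movie has only the votes span.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <p><span name="nv" data-value="1234567">1,234,567</span>
     <span name="nv" data-value="28,340,000">$28.34M</span></p>
</div>
<div class="lister-item mode-advanced">
  <p><span name="nv" data-value="54321">54,321</span></p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
votes, grosses = [], []

for movieSection in soup.find_all("div", class_="lister-item mode-advanced"):
    # find_all's first parameter is itself called "name" (the tag name),
    # so the HTML attribute name="nv" has to be passed through attrs.
    nv = movieSection.find_all("span", attrs={"name": "nv"})
    votes.append(nv[0].text)
    # The gross is only read when a second span exists.
    grosses.append(nv[1].text if len(nv) > 1 else None)

print(votes)
print(grosses)
```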
<b>Now we will build a DataFrame with pandas</b>
To make the data easy to read, we store it in a table, which we can do like this:
``` python
movies = pd.DataFrame(
    {
        "Movie": titles,
        "Year": years,
        "RunTime": time,
        "imdb": imdb_ratings,
        "Genre": genre,
        "votes": votes,
    }
)
``` 
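As a quick check of what this produces, here is a reduced sketch with made-up values in place of the scraped lists. Each dictionary key becomes a column, and pandas raises a `ValueError` if the lists differ in length.

```python
import pandas as pd

# Made-up values standing in for the scraped lists.
titles = ["Movie A", "Movie B"]
years = ["(2019)", "(I) (2020)"]
votes = ["1,234,567", "54,321"]

# Each key becomes a column; every list must have the same length.
movies = pd.DataFrame({"Movie": titles, "Year": years, "votes": votes})

print(movies.shape)   # (rows, columns)
print(movies.dtypes)  # the values are still text at this point
```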
And now let's print the DataFrame:

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/i/w9cos6fh7kghxzqe7w3n.png)

As we can see in rows 16 and 25, some of the data is inconsistent, so we need to clean it:
``` python
movies["Year"] = movies["Year"].str.extract(r"(\d+)", expand=False).astype(int)  # keep only the digits, so we can omit "I"
movies["RunTime"] = movies["RunTime"].str.replace("min", "minutes")  # replace "min" with "minutes"
movies["votes"] = movies["votes"].str.replace(",", "").astype(int)  # remove "," so the votes become integers
```
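To make the effect of the cleaning step visible, here is a sketch on a tiny made-up DataFrame. The `RunTimeMinutes` column is a hypothetical alternative to renaming "min", shown only because a numeric runtime can be easier to sort and average.

```python
import pandas as pd

# Made-up rows imitating the uncleaned data.
movies = pd.DataFrame(
    {
        "Year": ["(2019)", "(I) (2020)"],
        "RunTime": ["120 min", "95 min"],
        "votes": ["1,234,567", "54,321"],
    }
)

# Take the first run of digits; expand=False returns a Series.
movies["Year"] = movies["Year"].str.extract(r"(\d+)", expand=False).astype(int)

# Alternative to renaming "min": keep the number itself as an integer.
movies["RunTimeMinutes"] = (
    movies["RunTime"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Plain (non-regex) replacement of the thousands separator.
movies["votes"] = movies["votes"].str.replace(",", "", regex=False).astype(int)

print(movies)
```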
And now, after cleaning, let's see how it looks:
``` python
print(movies)
```

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/i/j81csuuq2iuujmugtr21.png)

You can also export the data in <b>.csv</b> file format.
To export it, write the DataFrame to a file with the <b>.csv</b> extension:
``` python
movies.to_csv(r"C:\Users\Aleti Sunil\Downloads\movies.csv", index=False, header=True)
```
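The path above is specific to one Windows machine. As a portable sketch, the example below writes to a temporary directory and reads the file back to confirm the round trip. The sample rows and the use of `tempfile` are assumptions made for illustration; in practice you would pick your own output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows standing in for the cleaned data.
movies = pd.DataFrame({"Movie": ["Movie A", "Movie B"], "Year": [2019, 2020]})

# A temporary directory is used here only so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "movies.csv"
    movies.to_csv(csv_path, index=False, header=True)

    # Read the file back to confirm that the rows and columns survived.
    reloaded = pd.read_csv(csv_path)

print(reloaded)
```

Passing `index=False` keeps the row numbers out of the file, so the CSV contains only the data columns.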

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/i/tw1e7uwf8m60pzcahe8e.png)

You can get the final code from my <a href="https://github.com/aletisunil/Scraping_IMDB/blob/master/IMDB.py">GitHub repo</a>.

I hope you find it useful.

