Jun Hyuk Kim's Blog
[Terms] Web Crawling & Web Scraping
Web crawling and web scraping are often confused. I ran into some problems because the only term I knew was web crawling.
Web crawling is used to discover URLs starting from a web page. Those URLs can then be scraped, or their pages can be used to build a search index. You can think of it as a spider spinning a web that connects URLs together. The crawler starts at one URL and collects every URL found on that page, then visits each of those pages and repeats the process until no unvisited URLs are left. Since pages tend to link to other pages, this eventually produces a list of all the URLs reachable from the initial site. Crawling can return far more data than we need, so filtering the URLs is a necessity. Some useful filters are the domain, the response code, and the HTML format.
Web crawling can serve several purposes. We can restrict the domain and HTML format so that we only collect URLs whose pages hold the data we want to scrape. We can also run the crawler over a whole website and record the response codes to check for broken URLs. Most web crawling is done by big companies to build search indexes.
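The crawl loop described above can be sketched with only the Python standard library. This is a minimal illustration and not a production crawler: `crawl`, `broken_links`, `fake_fetch`, and `FAKE_SITE` are hypothetical names made up for this example, and the page fetcher is passed in as a function so the sketch can run against an in-memory site instead of the real network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl that stays on the starting domain.

    fetch(url) is an assumed callable returning (status_code, html_text).
    Returns a dict mapping each visited URL to its response code.
    """
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    statuses = {}

    while queue and len(statuses) < max_pages:
        url = queue.popleft()
        status, html = fetch(url)
        statuses[url] = status
        if status != 200:
            continue  # response-code filter: do not parse error pages

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link, _ = urldefrag(urljoin(url, href))
            # Domain filter: skip links that leave the starting site.
            if urlparse(link).netloc != domain:
                continue
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return statuses


def broken_links(statuses):
    """Returns the visited URLs whose response code signals an error."""
    return sorted(url for url, status in statuses.items() if status >= 400)


# A tiny in-memory "website" standing in for real HTTP requests.
FAKE_SITE = {
    "https://example.com/": (200, '<a href="/a">A</a> <a href="https://other.org/">x</a>'),
    "https://example.com/a": (200, '<a href="/">home</a> <a href="/missing">gone</a>'),
}


def fake_fetch(url):
    # Any URL not in FAKE_SITE behaves like a missing page.
    return FAKE_SITE.get(url, (404, ""))


statuses = crawl("https://example.com/", fake_fetch)
print(broken_links(statuses))
```

The `seen` set is what stops the crawler from looping forever when pages link back to each other, and the domain check is one of the filters mentioned above. To crawl a real site, `fake_fetch` would be swapped for a function that makes actual HTTP requests.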
Web scraping is the term we are more familiar with. Web scraping means extracting the data we want from a page's HTML. Given the URL of a page, the scraper goes through the HTML and pulls out that data. Many companies use it, and it is a common way to collect data for ML training.
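As a rough illustration of "going through the HTML to get the data we want", here is a small scraper built on the standard library's `html.parser`. The names `TextScraper`, `scrape`, and `SAMPLE_HTML` are made up for this sketch, and the choice of tags to extract (`title`, `h1`, `h2`) is only an assumption about what data might be wanted. The HTML is given as a string here; in practice it would be the body downloaded from the page's URL.

```python
from html.parser import HTMLParser


class TextScraper(HTMLParser):
    """Collects the text inside every tag listed in `wanted`."""

    def __init__(self, wanted=("title", "h1", "h2")):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # the wanted tag we are currently inside, if any
        self.buffer = []     # text pieces seen inside the current tag
        self.results = []    # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted and self.current is None:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self.buffer).split())
            self.results.append((tag, text))
            self.current = None


def scrape(html, wanted=("title", "h1", "h2")):
    """Returns a list of (tag, text) pairs for the wanted tags in html."""
    parser = TextScraper(wanted)
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE_HTML = """
<html><head><title>Demo page</title></head>
<body><h1>Web <em>scraping</em></h1><p>ignored</p><h2>Details</h2></body></html>
"""

print(scrape(SAMPLE_HTML))
# [('title', 'Demo page'), ('h1', 'Web scraping'), ('h2', 'Details')]
```

Text inside nested tags such as `<em>` is kept because the scraper keeps collecting until the wanted tag closes, while the `<p>` content is skipped because `p` is not in `wanted`.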
Most of the time, web crawling comes first and web scraping is then used to get the data. Scrapy is a good Python library for this. It supports both web crawling and web scraping and, judging from the tutorial code, is easy to use and understand. When I first tried to get data using the plain webbrowser and os modules, it was really hard to get the URLs, set the filters, and make the code run fast. I did not test both approaches on the same page, so I cannot compare them directly, but Scrapy was faster and easier to run and understand.