How To Build A Web Scraper Using Python/Replit

A Python coding recipe for a web scraper that retrieves Dogecoin prices.

In this Open Labs tutorial the presenter walks through how to write Python code that ‘scrapes’ [1] Dogecoin prices every six minutes.

To achieve this he uses a Replit [2] account and Beautiful Soup [3], a Python library for web scraping. The Replit repo for this project is available here [4].

From 0m:50s: To begin he creates a new Python Repl on Replit [5] and imports the library, which makes the scraping very easy. The first line of code simply specifies the URL to be retrieved, so he enters the address of the Dogecoin page on CoinMarketCap.


url = "https://coinmarketcap.com/currencies/dogecoin/"

He then defines a variable called content, which is populated with the text retrieved from the website. This requires installing the Requests module. He notes that a key benefit of using Replit is that it makes installing and importing packages like this very easy.


import requests

def run_scraper():
  print("running")
  # download the page and keep its body as text
  content = requests.get(url).text

run_scraper()

From 2m:28s: To make the content easier to navigate he then uses Beautiful Soup to parse the retrieved text as HTML. This initially causes an error, which is resolved by installing lxml [6].


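from bs4 import BeautifulSoup

content = requests.get(url).text
# parse the text with the lxml parser so it can be searched by tag and class
soup = BeautifulSoup(content, 'lxml')
# print(soup)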

From 3m:18s: The next main step is to find and extract the pricing data from the raw HTML. The page content is organized into elements, each with a tag and classes, so he inspects the price field to identify its specific class.

He then uses Beautiful Soup to find elements with that class, noting that the class name also includes random numbers which are generated dynamically on the web page each time, so those shouldn’t be included in the match. He creates a new variable which is assigned the result.


import re

# match any class name that contains 'priceValue'
regex = re.compile('.*priceValue.*')

# find the first div in the HTML whose class matches the pattern
current_price = soup.find('div', {'class': regex}).text
print(current_price)
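
As a rough illustration of why the pattern above tolerates the dynamically generated part of the class name, the sketch below applies the same regular expression to a few class names. The class names are invented for this example and are not taken from the CoinMarketCap page. Beautiful Soup's documentation describes regular-expression filters as being applied with search(), which is what the sketch mimics.

```python
import re

# Same pattern as in the tutorial: any class name containing 'priceValue'.
price_class = re.compile('.*priceValue.*')

# Hypothetical class names, invented for illustration only.
class_names = ['priceValue', 'priceValue-x7f3', 'priceValue-9k2q', 'statsValue']

# Keep the names the pattern finds a match in, whatever suffix follows.
matching = [name for name in class_names if price_class.search(name)]
print(matching)
```

Because search() looks for the pattern anywhere in the string, re.compile('priceValue') would presumably select the same elements; the leading and trailing .* are harmless.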

From 6m:13s: He then moves on to enhancing the app so that it scrapes repeatedly on a schedule and saves the price over time into a JSON file.

This is achieved by preparing a JSON object named export_object, with the fields time and price. The price extracted from the HTML is written into the JSON file, which is saved within the Repl. The time is retrieved using the datetime module [7].


import datetime
import json
from datetime import timezone

# get current time
dt = datetime.datetime.now(timezone.utc)

utc_time = dt.replace(tzinfo=timezone.utc)
utc_timestamp = utc_time.timestamp()

# prepare json object
export_object = {
  'time': utc_timestamp,
  'price': current_price
}
print(export_object)

# data is the list of saved records, defined earlier in the script
data.append(export_object)

# save data to json file
with open('dogecoin_prices.json', 'w') as f:
  json.dump(data, f, ensure_ascii=False, indent=4)
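
The record-building step above stores the price exactly as it was scraped, which is text. A minimal sketch of a variant that stores a number instead is shown here. It assumes the scraped text looks like '$0.3215', which the tutorial does not confirm, and the helper names parse_price and make_record are hypothetical.

```python
import datetime
import json


def parse_price(text):
    """Turn a display string such as '$0.3215' into a float (assumed format)."""
    return float(text.replace('$', '').replace(',', '').strip())


def make_record(price_text, now=None):
    """Build a {'time', 'price'} record; `now` defaults to the current UTC time."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return {'time': now.timestamp(), 'price': parse_price(price_text)}


# A fixed, timezone-aware moment keeps the example reproducible.
fixed = datetime.datetime(2021, 5, 1, 12, 0, tzinfo=datetime.timezone.utc)
record = make_record('$0.3215', now=fixed)
print(json.dumps(record))
```

Storing a float rather than the raw string makes later steps such as plotting or averaging simpler, at the cost of failing loudly if the page ever changes how the price is displayed.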

From 9m:03s: Each run overwrites the previous data, so he precedes this with code that uses the os module [8] to check whether the file already exists and, if it does, loads its contents so that new data is appended rather than overwriting what was saved before.


import os

# load existing prices, or start with an empty list on the first run
data = []
if os.path.exists('dogecoin_prices.json'):
  with open('dogecoin_prices.json') as f:
    data = json.load(f)
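
The load, append, and save steps can be gathered into one function, sketched here under the assumption that the file always holds a JSON list. The name append_record is hypothetical, and the demonstration writes to a temporary directory rather than to the Repl.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Load the existing list (if any), append `record`, and write it back."""
    data = []
    if os.path.exists(path):
        with open(path) as f:
            data = json.load(f)
    data.append(record)
    with open(path, 'w') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
    return data


# Two consecutive "runs" against the same file in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'dogecoin_prices.json')
    append_record(path, {'time': 1.0, 'price': '$0.30'})
    history = append_record(path, {'time': 2.0, 'price': '$0.31'})

print(len(history))
```

The second call finds the file written by the first, so both records survive instead of the later one replacing the earlier one.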

From 10m:33s: As written, the user has to click Run manually each time, so the Schedule library [9] is added along with code that makes the scraper repeat continually. This also requires importing the time module [10].


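import schedule
import time

# as written this runs every 6 seconds; use .minutes for an interval in minutes
schedule.every(6).seconds.do(run_scraper)
# run_scraper()

# keep checking once a second whether a scheduled run is due
while True:
  schedule.run_pending()
  time.sleep(1)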
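
For readers who want to see the repeat-on-an-interval idea without an extra dependency, here is a minimal standard-library stand-in. It is not the Schedule library's API: run_every and its parameters are hypothetical, the 360-second interval simply encodes six minutes, and the sleep function is injectable so the example finishes instantly.

```python
import time


def run_every(interval_seconds, job, max_runs, sleep=time.sleep):
    """Call job() max_runs times, pausing interval_seconds between calls."""
    for run in range(max_runs):
        job()
        # No pause is needed after the final run.
        if run < max_runs - 1:
            sleep(interval_seconds)


calls = []
pauses = []

# Record the requested pauses instead of actually sleeping.
run_every(360, lambda: calls.append('scraped'), max_runs=3, sleep=pauses.append)
print(calls, pauses)
```

Bounding the loop with max_runs makes the behaviour easy to check; passing the default time.sleep and a large max_runs would approximate the endless loop used in the tutorial.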

End.

References

[1] https://en.wikipedia.org/wiki/Web_scraping
[2] https://replit.com/
[3] https://www.crummy.com/software/BeautifulSoup/
[4] https://replit.com/@openlabs/dogecoin-price-scraper
[5] https://replit.com/languages/python3
[6] https://lxml.de/
[7] https://docs.python.org/3/library/datetime.html
[8] https://docs.python.org/3/library/os.html
[9] https://schedule.readthedocs.io/en/stable/
[10] https://docs.python.org/3/library/time.html
