As a follow-on to my Chrome post, here is the process needed for headless Firefox. It's the same 5 steps re-tailored to a different browser. You should read that post first. At a minimum, have it open in a separate browser window, as I refer back to it many times.
The Process
Get Firefox Working
We are going with Ubuntu again because of Firefox dependencies. And again I strongly advise against building things yourself.
Spin up an Ubuntu machine and just run "apt-get install firefox" -- this is a lot easier than Chrome.
Get Scraping Working
Install selenium per the Chrome instructions.
Then we need to install the geckodriver. Find the latest version number at https://github.com/mozilla/geckodriver/releases. Then:
# set driverversion to the release number you found above (without the leading "v")
mkdir /geckodriver
wget -q --continue -P /geckodriver "https://github.com/mozilla/geckodriver/releases/download/v${driverversion}/geckodriver-v${driverversion}-linux64.tar.gz"
tar -zxvf "/geckodriver/geckodriver-v${driverversion}-linux64.tar.gz" -C /geckodriver
export PATH=/geckodriver:$PATH
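If you would rather script that step in Python, here is a minimal sketch of the same download and unpack. The function names (geckodriver_url, install_geckodriver) are mine, not part of any library, and the URL pattern is simply the one from the wget line above. You still need to put /geckodriver on your PATH afterwards.

```python
import tarfile
import urllib.request
from pathlib import Path

# Base of the release download URLs, taken from the wget command above.
RELEASES = "https://github.com/mozilla/geckodriver/releases/download"


def geckodriver_url(version: str) -> str:
    """Build the linux64 tarball URL for a geckodriver release number."""
    version = version.lstrip("v")  # accept either "v1.2.3" or "1.2.3"
    return f"{RELEASES}/v{version}/geckodriver-v{version}-linux64.tar.gz"


def install_geckodriver(version: str, target: str = "/geckodriver") -> Path:
    """Download the tarball and unpack it into target (hypothetical helper)."""
    target_dir = Path(target)
    target_dir.mkdir(parents=True, exist_ok=True)
    archive = target_dir / "geckodriver.tar.gz"
    urllib.request.urlretrieve(geckodriver_url(version), archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir)
    return target_dir
```

Call it as install_geckodriver(driverversion), where driverversion is the release number you looked up on the releases page.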
Take a pause and make sure your scraper works with regular (non-headless) Firefox before proceeding.
If you aren’t selenium-python fluent start with this simple code snippet adapted from the official docs here:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://google.com/?hl=en")
driver.quit()
quit()
That should open a Firefox window, load the Google home page and then shut everything down.
Get Headless Working
As with Chrome there are a bunch of options you need to set in/via selenium to get a fully-functioning environment. You also need to install xvfb with "apt-get install xvfb", for reasons I haven't pinned down. Headless shouldn't require it but it appears to.
Create a temporary place to download files. Handle this however you’d like.
import tempfile
downloadDirectory = tempfile.mkdtemp()
Create an options object so we can muck around.
from selenium.webdriver.firefox.options import Options
options = Options()
Set directory and make sure the browser will actually download.
profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.dir", downloadDirectory)
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
"application/pdf,application/x-pdf,application/octet-stream")
Run headless. Note that we need both a profile and an options object here, and the argument naming is different from Chrome.
options.add_argument('-headless')
theDriver = webdriver.Firefox(firefox_profile=profile, options=options)
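For convenience, the pieces above can be gathered into one function. This is only a sketch: download_preferences and build_headless_firefox are names I made up, and it assumes a selenium version that still accepts the firefox_profile keyword exactly as used above (newer releases may want the profile passed differently).

```python
def download_preferences(download_dir,
                         mime_types=("application/pdf",
                                     "application/x-pdf",
                                     "application/octet-stream")):
    """Return the Firefox preferences used above as a plain dict."""
    return {
        "browser.download.dir": download_dir,
        # 2 tells Firefox to use the custom directory set in browser.download.dir
        "browser.download.folderList": 2,
        # MIME types that are saved straight to disk without a prompt
        "browser.helperApps.neverAsk.saveToDisk": ",".join(mime_types),
    }


def build_headless_firefox(download_dir):
    """Create a headless Firefox driver that downloads into download_dir."""
    # Imported here so download_preferences works even without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    profile = webdriver.FirefoxProfile()
    for name, value in download_preferences(download_dir).items():
        profile.set_preference(name, value)

    options = Options()
    options.add_argument("-headless")
    # Same call as above; assumes firefox_profile is still an accepted keyword.
    return webdriver.Firefox(firefox_profile=profile, options=options)
```

With that, theDriver = build_headless_firefox(downloadDirectory) gives the same driver as the step-by-step version.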
With that done we can use theDriver for headless scraping. Rerun your tests from before.
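When you rerun download tests headless there is no window to watch, so it helps to poll the download directory. Below is a rough sketch of that idea. wait_for_downloads is a hypothetical helper of mine, and it assumes Firefox's usual habit of writing in-progress downloads to files ending in .part; adjust if your setup behaves differently.

```python
import time
from pathlib import Path


def wait_for_downloads(directory, timeout=60.0, poll=0.5):
    """Return the finished files once at least one exists and no .part files remain.

    Raises TimeoutError if that does not happen within timeout seconds.
    """
    folder = Path(directory)
    deadline = time.monotonic() + timeout
    while True:
        entries = [p for p in folder.iterdir() if p.is_file()]
        partial = [p for p in entries if p.suffix == ".part"]
        finished = sorted(p for p in entries if p.suffix != ".part")
        if finished and not partial:
            return finished
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no completed download in {folder} after {timeout}s")
        time.sleep(poll)
```

Typical use would be to trigger the download with theDriver and then call wait_for_downloads(downloadDirectory) before reading the file.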
Containerize It
We use a Docker container here. Again you can probably do this with other technologies. This one works and plays well with AWS. Put whatever you need in requirements.txt for Python (including selenium) and your Dockerfile is:
FROM ubuntu:17.10
RUN apt-get update \
&& apt-get install -y sudo \
&& apt-get install -y wget \
&& apt-get install -y curl \
&& apt-get install -y dpkg \
&& apt-get install -y software-properties-common \
&& apt-get install -y gdebi unzip
RUN apt-get update \
&& apt-get install -y python3.6 python3.6-dev python3-pip python3.6-venv \
&& apt-get install -y xvfb firefox
# pass the geckodriver release number with: docker build --build-arg driverversion=<version> .
ARG driverversion
RUN mkdir /geckodriver
RUN wget -q --continue -P /geckodriver "https://github.com/mozilla/geckodriver/releases/download/v${driverversion}/geckodriver-v${driverversion}-linux64.tar.gz"
RUN tar -zxvf "/geckodriver/geckodriver-v${driverversion}-linux64.tar.gz" -C /geckodriver
ENV PATH /geckodriver:$PATH
ADD requirements.txt .
RUN pip3 install --trusted-host pypi.python.org -r ./requirements.txt
That container can fire up a working headless Firefox that can download files. Use whatever CMD and/or ENTRYPOINT suits you.
Firefox requires a lot of space in /dev/shm. If you are going to run docker yourself be sure to add --shm-size 2g to the command. If you are going to deploy in AWS read this. Prior to that feature we ran a customized machine image for a while that upped docker's default shm size with --default-shm-size in /etc/sysconfig/docker. If your orchestration doesn't support shm-size you'll need to do something different.
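Since a too-small /dev/shm tends to show up as odd browser crashes rather than a clear error, a quick check at container start can save some head scratching. Here is a small sketch using only the standard library. The 2 GiB threshold just mirrors the --shm-size 2g suggestion above and is an assumption, not a hard requirement; the function names are mine.

```python
import shutil

# Mirrors "--shm-size 2g" from the docker command; treat it as a rule of thumb.
TWO_GIB = 2 * 1024 ** 3


def shm_total_bytes(path="/dev/shm"):
    """Total size in bytes of the filesystem mounted at path."""
    return shutil.disk_usage(path).total


def shm_is_large_enough(path="/dev/shm", minimum=TWO_GIB):
    """True if the filesystem at path is at least minimum bytes."""
    return shm_total_bytes(path) >= minimum
```

Calling shm_is_large_enough() before starting the driver lets you fail fast with a message that actually mentions shm.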
Deploy
With a working container in hand, deploy however you’d like. We’ve run that image on AWS EC2 directly, through AWS ECS and Batch onto EC2, and via Docker on macOS, various Linuxes and Windows. This image is about the same size as the Chrome one.
Memory usage is also heavy and depends a lot on what sorts of pages you are loading.
Some Comments
Far Cleaner Than Chrome
Firefox is a lot cleaner to set up and run than Chrome. With the shm issue ironed out there is nothing too weird to do here. The profile/options/parameters documentation could use some work. That is the main reason for this post: getting downloads to work required a lot of grepping and guessing.
Firefox Deals With Changing HTML Better
When a page uses JavaScript to modify HTML without page reloads, Firefox seems to do a better job. That's why we use it sometimes. I have no opinion between the two browsers except to say there are some complex pages out there that only scrape on one of them.
So no matter which browser you prefer it's worth getting both running. I have found that where both browsers work the scraping code -- except for the initialization discussed here -- is identical. And when I can't get one of them to work it's not a matter of modifying the code; it's just not going to work.
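To make that concrete, here is a rough sketch of keeping the browser choice in one place so the scraping code never cares which one it got. make_headless_driver and page_title are hypothetical names, and the initialization is stripped down to the headless flag only; the download preferences from this post and the extra Chrome flags from the Chrome post are left out.

```python
def make_headless_driver(browser):
    """Return a headless selenium driver for "firefox" or "chrome"."""
    name = browser.strip().lower()
    if name not in ("firefox", "chrome"):
        raise ValueError(f"unsupported browser: {browser!r}")

    # Imported late so the argument check works without selenium installed.
    from selenium import webdriver

    if name == "firefox":
        options = webdriver.FirefoxOptions()
        options.add_argument("-headless")
        return webdriver.Firefox(options=options)

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def page_title(driver, url):
    """Browser-agnostic scraping code: works with whichever driver it is given."""
    driver.get(url)
    return driver.title
```

If a page only scrapes on one browser, switching is then a one-word change in the call to make_headless_driver.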
Surely there are edge cases I haven't found. If you come across anything please let me know.