
Job Scraper for Switzerland

Hello, this small project was created to improve my Python skills by automating a daily task. As you can see, my code and skills still need improvement, but I'm happy to have achieved the goal of this project. I've included a short to-do list of potential improvements in the main file, but I'm not working on them for the moment, as I want to focus on another project.

Conception

The different scrapers were developed in the following order:

  • Linkedin
  • Jobup
  • Glassdoor
  • Indeed

As you can see, each scraper is written quite differently. I used Playwright, various SeleniumBase features, and, of course, BeautifulSoup. My conclusion regarding these frameworks is that I highly recommend SeleniumBase for extracting data from these types of websites, as it performs best at bypassing bot detection (CDP mode) and at data extraction in general. Playwright is more useful if you want to run automated tests on a website you've developed yourself. I haven't migrated my LinkedIn and Jobup scrapers to SeleniumBase's CDP mode, so they show how the scraping works with other frameworks, but if I were to recreate them, I would use CDP mode. Why not SeleniumBase's UC mode? Because UC mode is deprecated for bypassing bot detection and has other issues, although it still works correctly on some websites, such as LinkedIn.
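As a rough illustration of the SeleniumBase + BeautifulSoup pattern used throughout this project (shown here with UC mode, which as noted above still works on sites like LinkedIn; the URL and CSS selector are placeholders, not the project's real ones):

from bs4 import BeautifulSoup
from seleniumbase import SB

# UC mode opens a stealth Chrome session; the URL and selector below are
# placeholders for illustration only.
with SB(uc=True) as sb:
    sb.uc_open_with_reconnect("https://www.example.com/jobs?q=software+engineer", reconnect_time=4)
    soup = BeautifulSoup(sb.get_page_source(), "html.parser")
    for card in soup.select("div.job-card"):  # hypothetical selector
        print(card.get_text(" ", strip=True))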

It's important to note that this code currently works, but because websites are constantly updated, it might stop working at any time; you'll then need to modify it to adapt to the changes. Another issue arises when performing rapid, intensive extractions from certain websites (Indeed and Glassdoor) from the same IP address: your requests may be blocked or your IP address banned, so use a proxy if you want to perform rapid extractions. The code may also raise errors; I probably haven't accounted for every issue that can come up when extracting data from these websites. I'll do my best to fix them as I identify them while using the code. Please excuse in advance any problems you may encounter.
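If you do use a proxy, SeleniumBase can take one directly when the browser is created. A minimal sketch (the proxy address is obviously a placeholder):

from seleniumbase import SB

# SeleniumBase's proxy argument accepts "host:port" or "user:pass@host:port";
# the address below is a placeholder.
with SB(uc=True, proxy="user:pass@123.45.67.89:8080") as sb:
    sb.open("https://ch-fr.indeed.com/jobs?q=ing%C3%A9nieur+informatique&l=&fromage=7")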

Prerequisites

I used the JetBrains PyCharm IDE on Windows 10 to run and debug this code. I haven't tested it with other IDEs, so I don't know if it works everywhere. To extract the data, Chrome must be installed on your computer.

Installation

pip install playwright
playwright install
pip install beautifulsoup4
pip install seleniumbase
pip install pandas
pip install xlsxwriter

How it works

Once the installation is complete, you will simply need to modify the main file according to your needs. Here is the first part to modify in main.py:

linkedin_dict = linkedin_scraper("https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Software+Engineer+OR+Embedded+Engineer&location=Switzerland&f_TPR=r86400")
jobup_dict = jobup_scraper("https://www.jobup.ch/fr/emplois/?publication-date=1&term=software%20engineer")
glassdoor_dict = glassdoor_scraper("https://fr.glassdoor.ch/Emploi/software-engineer-emplois-SRCH_KO0,17.htm?fromAge=7")
indeed_dict = indeed_scraper("https://ch-fr.indeed.com/jobs?q=ing%C3%A9nieur+informatique&l=&fromage=7")

all_job_dict = linkedin_dict | jobup_dict | glassdoor_dict | indeed_dict

You'll first need to find the correct URL for each website. For Jobup, Glassdoor, and Indeed, simply go to the job search site (you can start from the links already in the code and modify the search terms), run your search, apply the various filters, and then copy and paste the resulting link into the function call corresponding to that website. For LinkedIn, I'm using the public guest API, so you'll need to enter the URL yourself. You can find all the necessary information here.
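As a rough sketch of how the LinkedIn guest-API URL can be built from the parameters visible in the example above (keywords, location, and f_TPR; linkedin_scraper is the function from main.py):

from urllib.parse import urlencode

# Build the LinkedIn guest-API search URL; f_TPR=r86400 limits results to
# postings from the last 24 hours (86400 seconds).
base = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search"
params = {
    "keywords": "Software Engineer OR Embedded Engineer",
    "location": "Switzerland",
    "f_TPR": "r86400",
}
linkedin_dict = linkedin_scraper(f"{base}?{urlencode(params)}")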

If you don't want to use certain websites, comment out the line calling the corresponding function and remove its *_dict from all_job_dict. Here's an example without Jobup or Glassdoor:

linkedin_dict = linkedin_scraper("https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Software+Engineer+OR+Embedded+Engineer&location=Switzerland&f_TPR=r86400")
# jobup_dict = jobup_scraper("https://www.jobup.ch/fr/emplois/?publication-date=1&term=software%20engineer")
# glassdoor_dict = glassdoor_scraper("https://fr.glassdoor.ch/Emploi/software-engineer-emplois-SRCH_KO0,17.htm?fromAge=7")
indeed_dict = indeed_scraper("https://ch-fr.indeed.com/jobs?q=ing%C3%A9nieur+informatique&l=&fromage=7")

all_job_dict = linkedin_dict | indeed_dict

Now that you've set up the websites to analyse, you can filter the job postings that interest you using keywords. These keywords exclude postings that don't contain any of them in their description, and you'll get an Excel file with one sheet containing the unfiltered postings and another containing the filtered postings.
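The exact export code lives in the project, but as a rough sketch of how two sheets can be written with pandas and XlsxWriter (the sheet names and the dictionary layout here are assumptions, not necessarily the project's actual ones):

import pandas as pd

# all_job_dict / filtered_job_dict are assumed to map a job title (or URL)
# to its details; adjust to the project's real structure.
all_jobs_df = pd.DataFrame.from_dict(all_job_dict, orient="index")
filtered_jobs_df = pd.DataFrame.from_dict(filtered_job_dict, orient="index")

with pd.ExcelWriter("jobs.xlsx", engine="xlsxwriter") as writer:
    all_jobs_df.to_excel(writer, sheet_name="All jobs")
    filtered_jobs_df.to_excel(writer, sheet_name="Filtered jobs")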

To configure your keywords, modify the filters array in main.py. Here is an example:

filters = ["python", "vhdl", r"\Wc\W", "linux", "IOT", "systemverilog"]

Keywords are not case-sensitive, and you can add as many as you like. If a keyword is a single character, like the "c" in my example, you must surround it with \W and add the r prefix to the beginning of the string, exactly as shown in the example.
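To see why the r"\Wc\W" form matters, here is a rough sketch of how such a keyword list can be matched case-insensitively with re.search (the actual matching logic in main.py may differ):

import re

filters = ["python", "vhdl", r"\Wc\W", "linux", "IOT", "systemverilog"]

def description_matches(description: str) -> bool:
    # A posting is kept if any keyword pattern appears in its description.
    # Plain words match as substrings; r"\Wc\W" only matches a standalone "c"
    # surrounded by non-word characters, so "C developer" matches but "code" doesn't.
    return any(re.search(pattern, description, re.IGNORECASE) for pattern in filters)

print(description_matches("Embedded C developer, Linux"))  # True
print(description_matches("Java and JavaScript only"))     # False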

You can now run the code. I hope you find this useful.
