Hands-on webscraping
It is time to do some webscraping ourselves. What follows is a short first tutorial on webscraping, in which we will collect data from webpages on the internet. Our use case is the staff of the political science department of Leiden University.
What do they publish? Where? And with whom do they collaborate? We assume you have at least some experience with coding in R. In the rest of this part of the tutorial, we will switch between base R and Tidyverse (just a bit), whatever is most convenient. (Note that this will happen often if you become an applied computational sociologist.) We also offer Python code.
There are different strategies in scraping. There is often a trade-off between complex scraping techniques and complex string manipulations. In this first example, we will use quite a lot of string manipulations.
For even more info see our SNASS book - Chapter 11
rm(list = ls())
gc()
fpackage.check: Check if packages are installed (and install if not) in R.
fsave: Save to processed data in repository.
fload: To load the files back after an fsave.
fshowdf: To print objects (tibbles / data.frame) nicely on screen in .rmd.
rm(list = ls()) #clean up your environment
fpackage.check <- function(packages) {
    lapply(packages, FUN = function(x) {
        if (!require(x, character.only = TRUE)) {
            install.packages(x, dependencies = TRUE)
            library(x, character.only = TRUE)
        }
    })
}

fsave <- function(x, file = NULL, location = "./data/processed/") {
    ifelse(!dir.exists("data"), dir.create("data"), FALSE)
    ifelse(!dir.exists("data/processed"), dir.create("data/processed"), FALSE)
    if (is.null(file))
        file = deparse(substitute(x))
    datename <- substr(gsub("[:-]", "", Sys.time()), 1, 8)
    totalname <- paste(location, datename, file, ".rda", sep = "")
    save(x, file = totalname) #need to fix if file is reloaded as input name, not as x.
}

fload <- function(filename) {
    load(filename)
    get(ls()[ls() != "filename"])
}

fshowdf <- function(x, ...) {
    knitr::kable(x, digits = 2, "html", ...) %>%
        kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
        kableExtra::scroll_box(width = "100%", height = "300px")
}

colorize <- function(x, color) {
    sprintf("<span style='color: %s;'>%s</span>", color, x)
}
tidyverse: for piping etc.
httr: Tools for Working with URLs and HTTP
xml2: Work with XML files using a simple, consistent interface.
rvest: Wrappers around the ‘xml2’ and ‘httr’ packages to make it easy to download, then manipulate, HTML and XML.
reshape2: Flexibly Reshape Data
packages = c("tidyverse", "httr", "rvest", "reshape2", "xml2")
fpackage.check(packages)
Make sure the required Python libraries are installed by running the command below in your terminal:
pip install requests beautifulsoup4 pandas
Then import the necessary modules:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
What do we mean by anchor data? Our goal is to get to know who is on the staff, what they publish, and with whom they collaborate.
So that means there are at least three data sources we need to collect from somewhere. What would be a nice starting (read: anchor) point? First, we have to know who is on the staff. Let’s check out the Leiden political science staff website. Here we see a nice list of staff members, spread over several pages. How do we get that data? It is actually quite simple: the package rvest has a very nice function read_html() (which actually comes from the xml2 package) that simply retrieves the source html of a static webpage.
Let’s first simply get the staff page. read_html() is a function that downloads an html webpage and parses it into an xml document.
lpol_staff <- read_html("https://www.universiteitleiden.nl/en/social-behavioural-sciences/political-science/staff#tab-1")
head(lpol_staff)
#> $node
#> <pointer: 0x000001ba7b7fedd0>
#>
#> $doc
#> <pointer: 0x000001ba87cfd050>
That looks kinda weird. What type of object did we create by storing the html in lpol_staff?
class(lpol_staff)
#> [1] "xml_document" "xml_node"
So it is stored in something that’s called an xml object. What that is exactly is not important for now. What is important is that we extract the relevant table that we saw on the staff website. How do we do that? Open the staff page in a browser and press “Inspect” (usually: right click -> Inspect). In the html code we extracted, we need to go to one of the nodes first. If you move your cursor over “body” in the html code on the screen, the entire body of the page should become some shade of blue. This means that the “body” node encapsulates everything that turned blue.
# Fetch the webpage
url = "https://www.universiteitleiden.nl/en/social-behavioural-sciences/political-science/staff#tab-1"
response = requests.get(url)
webpage = response.content
# Parse the webpage
soup = BeautifulSoup(webpage, 'html.parser')
#print(soup.prettify())
#or to print just a few lines
result = soup.prettify().splitlines()
print('\n'.join(result[:10]))
<!DOCTYPE html>
<html data-version="1.178.00" lang="en">
<head>
<!-- standard page html head -->
<title>
Staff - Leiden University
</title>
<meta content="o8KYuFAiSZi6QWW1wxqKFvT1WQwN-BxruU42si9YjXw" name="google-site-verification"/>
<meta content="hRUxrqIARMinLW2dRXrPpmtLtymnOTsg0Pl3WjHWQ4w" name="google-site-verification"/>
<link href="https://www.universiteitleiden.nl/en/social-behavioural-sciences/political-science/staff" rel="canonical"/>
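To turn the parsed page into something we can work with, one option is to go to the “body” node and pull the text out of every list item inside it. Below is a minimal sketch of that idea. Note that sample_html is a made-up, heavily simplified stand-in for the staff page (so the example runs without an internet connection), and that selecting every “li” element is an assumption for illustration, not necessarily the exact step used for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the staff page.
sample_html = """
<html><body>
  <nav><ul><li>Students</li><li>Library</li></ul></nav>
  <section class="tab active">
    <ul class="table-list">
      <li><a href="#"><div><strong>Adina Akbik</strong><span>Senior Assistant Professor</span></div></a></li>
      <li><a href="#"><div><strong>Femke Bakker</strong><span>Senior assistant professor</span></div></a></li>
    </ul>
  </section>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Everything visible on the page lives inside the body node.
body = soup.body

# Collect the text of every list item in the body (navigation included).
# get_text(" ", strip=True) joins the text pieces with a space and trims them.
items = [li.get_text(" ", strip=True) for li in body.find_all("li")]
print(items)
```

The result is a plain list of strings that still contains navigation items, which is why some cleaning is needed afterwards.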
This already looks more useful. But we can improve on it by deleting the elements we do not need, such as the navigation items at the start and the end of the vector.
head(lpol_staff)
#> [1] "Leiden University" "Students" "Staff members"
#> [4] "Organisational structure" "Library" "\n"
lpol_staff <- lpol_staff[-c(1:39)]
lpol_staff <- lpol_staff[-c(133:length(lpol_staff))]
head(lpol_staff)
#> [1] "\n\n\n\n\n \n Adina Akbik\n Senior Assistant Professor\n \n"
#> [2] "\n\n\n\n\n \n Femke Bakker\n Senior assistant professor\n \n"
#> [3] "\n\n\n\n\n \n Ingrid van Biezen\n Professor of Comparative Politics\n \n"
#> [4] "\n\n\n\n\n \n Nicolas Blarel\n Associate Professor\n \n"
#> [5] "\n\n\n\n\n \n Arjen Boin\n Professor of Public Institutions and Governance\n \n"
#> [6] "\n\n\n\n\n \n Theo Brinkel\n Professor by Special Appointment Military-social studies\n \n"
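In Python, the same position-based trimming can be sketched with list slicing. The list below is made up for illustration, and the counts n_leading and n_trailing are hypothetical: on the real page they depend on how many navigation items surround the staff list at the moment of scraping, so they should be checked by eye, just like the 39 and 133 in the R code.

```python
# Hypothetical scraped items: navigation first, then staff, then footer links.
items = [
    "Leiden University",
    "Students",
    "Adina Akbik Senior Assistant Professor",
    "Femke Bakker Senior assistant professor",
    "Bachelor's programmes",
]

n_leading = 2   # number of unwanted items at the start (assumed)
n_trailing = 1  # number of unwanted items at the end (assumed)

# Keep only the items in between.
staff_items = items[n_leading:len(items) - n_trailing]
print(staff_items)
```

Trimming by position is quick but fragile: if the website adds or removes a menu item, the counts have to be updated.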
fshowdf(lpol_staff)
x |
---|
Adina Akbik Senior Assistant Professor |
Femke Bakker Senior assistant professor |
Ingrid van Biezen Professor of Comparative Politics |
Nicolas Blarel Associate Professor |
Arjen Boin Professor of Public Institutions and Governance |
Theo Brinkel Professor by Special Appointment Military-social studies |
Manuel Cabal Lopez Assistant professor |
Valentina Carraro Assistant Professor in Global Transformations and Governance Challenges |
Stefan Cetkovic Assistant Professor |
Leila Demarest Associate Professor |
Matthew di Giuseppe Director of Studies / Associate Professor |
Roos van der Haer Assistant Professor |
Gisela Hirschmann Senior Assistant Professor |
Joop van Holsteijn Professor Political Behaviour and Research Methods |
Corinna Jentzsch Assistant Professor |
Petr Kopecky Professor of Comparative Studies Political Parties and Party Systems |
Matthew Longo Senior Assistant Professor |
Tom Louwerse Director of Research / Associate Professor |
Floris Mansvelt Beck Assistant Professor |
Juan Masullo Jimenez Assistant Professor |
Hilde van Meegdenburg Assistant Professor |
Michael Meffert Assistant Professor |
Frits Meijerink Assistant professor |
Tim Mickler Senior assistant professor |
Martijn Mos Assistant Professor |
Katharina Natter Senior assistant professor |
Paul Nieuwenburg Professor Political Philosophy |
Simon Otjes Senior Assistant Professor |
Hans Oversloot Senior Assistant Professor |
Jonathan Phillips Assistant Professor |
Rebecca Ploof Assistant Professor |
Karolina Pomorska Associate professor |
Francesco Ragazzi Associate professor |
Babak Rezaeedaryakenari Senior Assistant Professor |
Josh Robison Assistant Professor |
Michael Sampson Senior Assistant Professor |
Jan Aart Scholte Professor Global Transformations and Governance Challenges |
Jonah Schulhofer-Wohl Senior assistant professor |
Maria Spirova Associate Professor |
Tom Theuns Senior Assistant Professor |
Daniel Thomas Professor of International Relations |
Christina Luise Toenshoff Assistant professor |
Vasiliki (Billy) Tsagkroni Senior assistant professor |
Wouter Veenendaal Professor by Special Appointment Kingdom Relations |
Claire Vergerio Guest |
Marco Verschoor Assistant Professor |
Cynthia van Vonno Assistant Professor |
Niels van Willigen Director of Studies/Associate Professor |
Nikoleta Yordanova Associate professor |
Yuan Yi Zhu Assistant professor |
Frank de Zwart Guest |
Alessia Aspide PhD candidate |
Cyan Bae PhD candidate |
Kathleen Brown PhD candidate |
Mateo Cohen External PhD candidate |
Josette Daemen Postdoc |
Jesse Doornenbal Lecturer |
Eleftherios Karchimakis Lecturer |
Aleksandra Khokhlova PhD candidate |
Stijn Koenraads PhD candidate |
Hannah Kuhn PhD candidate |
Alex Schilin Guest researcher |
Pawan Sen PhD candidate |
Ruben van de Ven PhD candidate |
Anouk van Vliet Lecturer |
Denny van der Vlist PhD candidate |
Thijs Vos PhD candidate |
Rick van Well PhD candidate |
Daan van den Wollenberg PhD candidate / self funded |
Elina Zorina PhD candidate |
Rudy Andeweg Professor emeritus of Empirical Political Science |
Ivan Bakalov Lecturer |
Agha Bayramov Lecturer |
Jelena Belic Lecturer |
Jelke Bethlehem Professor emeritus Survey Methodology |
Peter Castenmiller Lecturer |
Diana Davila Gordillo Guest - researcher |
Henk Dekker Emeritus professor of Political Socialization and Integration |
Katerina Galanopoulou Lecturer |
Rutger Hagen Lecturer |
Henriëtte van den Heuvel Scientific Director ad interim |
Galen Irwin |
Devrim Kabasakal Badamchi- Guest researcher |
Eleftherios Karchimakis Lecturer |
Müge Kinacioglu Lecturer |
Ruud Koole Professor emeritus Politicologie |
Amber Lauwers Lecturer |
José Lourenço Lecturer |
Gjovalin Macaj Assistant professor |
Jan Meijer Lecturer |
Marijn Nagtzaam Lecturer |
Christoph Niessen Guest |
Alexandros Ntaflos Lecturer |
Joyce Outshoorn Emeritus professor vrouwenstudies |
Jimena Pacheco Miranda Lecturer |
Julia Puente Duyn Lecturer |
Ellen van Reuler Lecturer |
Thomas Scarff Lecturer |
Radostina Sharenkova-Toshkova Guest |
Justin Spruit Lecturer |
Vishwesh Sundar Lecturer |
Harmen van der Veer Lecturer |
Amy Verdun Guest professor |
Ruben Verheul Web editor |
Anouk van Vliet Lecturer |
Carina van de Wetering Lecturer |
Gul-i-Hina van der Zwan Postdoc / guest |
Wencke Appelman Study adviser |
Ester Blom Study Adviser |
Anna van Dijk Management/office-assistent |
Nathalie van Dooren Coordinator Marketing & Student Recruitment |
Desiree van Drongelen Staff Member Studentregistrations |
Nynke Heegstra Study Advisor |
Ingrid van Heeringen-Göbbels Institute manager |
Marit van der Heide Student member institute board |
Katie Hudson Data Steward |
Lianne Janssen Policy Officer Education and Quality Assurance |
Eline Joor Internationalisation Officer |
Ian Lau Study adviser |
Daniëlle Lovink Study Adviser |
Carien Nelissen Education manager |
Caroline Remmerswaal Secretary board of examiners |
Marjan Rijnja Teaching Coordinator |
Elka Smith Research Project Manager |
Judy Spruit Management/office-assistent |
Tessa Thomas Management assistant |
Debbie Tromper Management assistant |
Gerard van der Veer Secretary board of examiners |
Ruben Verheul Web editor |
Jeanne Viet Communication staff member |
Denise Zeeuw-van Veen Management/office assistant |
Bachelor’s programmes |
Still looks a bit messy. Can we get it into a dataframe and split the column into useful columns?
lpol_staff <- data.frame(lpol_staff)
lpol_staff <- colsplit(lpol_staff$lpol_staff, " ", names = c("v1", "v2", "v3", "v4", "v5"))
fshowdf(lpol_staff, caption = "lpol_staff")
v1 | v2 | v3 | v4 | v5 |
---|---|---|---|---|
NA | Adina Akbik | Senior Assistant Professor | NA | |
NA | Femke Bakker | Senior assistant professor | NA | |
NA | Ingrid van Biezen | Professor of Comparative Politics | NA | |
NA | Nicolas Blarel | Associate Professor | NA | |
NA | Arjen Boin | Professor of Public Institutions and Governance | NA | |
NA | Theo Brinkel | Professor by Special Appointment Military-social studies | NA | |
NA | Manuel Cabal Lopez | Assistant professor | NA | |
NA | Valentina Carraro | Assistant Professor in Global Transformations and Governance Challenges | NA | |
NA | Stefan Cetkovic | Assistant Professor | NA | |
NA | Leila Demarest | Associate Professor | NA | |
NA | Matthew di Giuseppe | Director of Studies / Associate Professor | NA | |
NA | Roos van der Haer | Assistant Professor | NA | |
NA | Gisela Hirschmann | Senior Assistant Professor | NA | |
NA | Joop van Holsteijn | Professor Political Behaviour and Research Methods | NA | |
NA | Corinna Jentzsch | Assistant Professor | NA | |
NA | Petr Kopecky | Professor of Comparative Studies Political Parties and Party Systems | NA | |
NA | Matthew Longo | Senior Assistant Professor | NA | |
NA | Tom Louwerse | Director of Research / Associate Professor | NA | |
NA | Floris Mansvelt Beck | Assistant Professor | NA | |
NA | Juan Masullo Jimenez | Assistant Professor | NA | |
NA | Hilde van Meegdenburg | Assistant Professor | NA | |
NA | Michael Meffert | Assistant Professor | NA | |
NA | Frits Meijerink | Assistant professor | NA | |
NA | Tim Mickler | Senior assistant professor | NA | |
NA | Martijn Mos | Assistant Professor | NA | |
NA | Katharina Natter | Senior assistant professor | NA | |
NA | Paul Nieuwenburg | Professor Political Philosophy | NA | |
NA | Simon Otjes | Senior Assistant Professor | NA | |
NA | Hans Oversloot | Senior Assistant Professor | NA | |
NA | Jonathan Phillips | Assistant Professor | NA | |
NA | Rebecca Ploof | Assistant Professor | NA | |
NA | Karolina Pomorska | Associate professor | NA | |
NA | Francesco Ragazzi | Associate professor | NA | |
NA | Babak Rezaeedaryakenari | Senior Assistant Professor | NA | |
NA | Josh Robison | Assistant Professor | NA | |
NA | Michael Sampson | Senior Assistant Professor | NA | |
NA | Jan Aart Scholte | Professor Global Transformations and Governance Challenges | NA | |
NA | Jonah Schulhofer-Wohl | Senior assistant professor | NA | |
NA | Maria Spirova | Associate Professor | NA | |
NA | Tom Theuns | Senior Assistant Professor | NA | |
NA | Daniel Thomas | Professor of International Relations | NA | |
NA | Christina Luise Toenshoff | Assistant professor | NA | |
NA | Vasiliki (Billy) Tsagkroni | Senior assistant professor | NA | |
NA | Wouter Veenendaal | Professor by Special Appointment Kingdom Relations | NA | |
NA | Claire Vergerio | Guest | NA | |
NA | Marco Verschoor | Assistant Professor | NA | |
NA | Cynthia van Vonno | Assistant Professor | NA | |
NA | Niels van Willigen | Director of Studies/Associate Professor | NA | |
NA | Nikoleta Yordanova | Associate professor | NA | |
NA | Yuan Yi Zhu | Assistant professor | NA | |
NA | Frank de Zwart | Guest | NA | |
NA | Alessia Aspide | PhD candidate | NA | |
NA | Cyan Bae | PhD candidate | NA | |
NA | Kathleen Brown | PhD candidate | NA | |
NA | Mateo Cohen | External PhD candidate | NA | |
NA | Josette Daemen | Postdoc | NA | |
NA | Jesse Doornenbal | Lecturer | NA | |
NA | Eleftherios Karchimakis | Lecturer | NA | |
NA | Aleksandra Khokhlova | PhD candidate | NA | |
NA | Stijn Koenraads | PhD candidate | NA | |
NA | Hannah Kuhn | PhD candidate | NA | |
NA | Alex Schilin | Guest researcher | NA | |
NA | Pawan Sen | PhD candidate | NA | |
NA | Ruben van de Ven | PhD candidate | NA | |
NA | Anouk van Vliet | Lecturer | NA | |
NA | Denny van der Vlist | PhD candidate | NA | |
NA | Thijs Vos | PhD candidate | NA | |
NA | Rick van Well | PhD candidate | NA | |
NA | Daan van den Wollenberg | PhD candidate / self funded | NA | |
NA | Elina Zorina | PhD candidate | NA | |
NA | Rudy Andeweg | Professor emeritus of Empirical Political Science | NA | |
NA | Ivan Bakalov | Lecturer | NA | |
NA | Agha Bayramov | Lecturer | NA | |
NA | Jelena Belic | Lecturer | NA | |
NA | Jelke Bethlehem | Professor emeritus Survey Methodology | NA | |
NA | Peter Castenmiller | Lecturer | NA | |
NA | Diana Davila Gordillo | Guest - researcher | NA | |
NA | Henk Dekker | Emeritus professor of Political Socialization and Integration | NA | |
NA | Katerina Galanopoulou | Lecturer | NA | |
NA | Rutger Hagen | Lecturer | NA | |
NA | Henriëtte van den Heuvel | Scientific Director ad interim | NA | |
NA | Galen Irwin | NA | ||
NA | Devrim Kabasakal Badamchi- | Guest researcher | NA | |
NA | Eleftherios Karchimakis | Lecturer | NA | |
NA | Müge Kinacioglu | Lecturer | NA | |
NA | Ruud Koole | Professor emeritus Politicologie | NA | |
NA | Amber Lauwers | Lecturer | NA | |
NA | José Lourenço | Lecturer | NA | |
NA | Gjovalin Macaj | Assistant professor | NA | |
NA | Jan Meijer | Lecturer | NA | |
NA | Marijn Nagtzaam | Lecturer | NA | |
NA | Christoph Niessen | Guest | NA | |
NA | Alexandros Ntaflos | Lecturer | NA | |
NA | Joyce Outshoorn | Emeritus professor vrouwenstudies | NA | |
NA | Jimena Pacheco Miranda | Lecturer | NA | |
NA | Julia Puente Duyn | Lecturer | NA | |
NA | Ellen van Reuler | Lecturer | NA | |
NA | Thomas Scarff | Lecturer | NA | |
NA | Radostina Sharenkova-Toshkova | Guest | NA | |
NA | Justin Spruit | Lecturer | NA | |
NA | Vishwesh Sundar | Lecturer | NA | |
NA | Harmen van der Veer | Lecturer | NA | |
NA | Amy Verdun | Guest professor | NA | |
NA | Ruben Verheul | Web editor | NA | |
NA | Anouk van Vliet | Lecturer | NA | |
NA | Carina van de Wetering | Lecturer | NA | |
NA | Gul-i-Hina van der Zwan | Postdoc / guest | NA | |
NA | Wencke Appelman | Study adviser | NA | |
NA | Ester Blom | Study Adviser | NA | |
NA | Anna van Dijk | Management/office-assistent | NA | |
NA | Nathalie van Dooren | Coordinator Marketing & Student Recruitment | NA | |
NA | Desiree van Drongelen | Staff Member Studentregistrations | NA | |
NA | Nynke Heegstra | Study Advisor | NA | |
NA | Ingrid van Heeringen-Göbbels | Institute manager | NA | |
NA | Marit van der Heide | Student member institute board | NA | |
NA | Katie Hudson | Data Steward | NA | |
NA | Lianne Janssen | Policy Officer Education and Quality Assurance | NA | |
NA | Eline Joor | Internationalisation Officer | NA | |
NA | Ian Lau | Study adviser | NA | |
NA | Daniëlle Lovink | Study Adviser | NA | |
NA | Carien Nelissen | Education manager | NA | |
NA | Caroline Remmerswaal | Secretary board of examiners | NA | |
NA | Marjan Rijnja | Teaching Coordinator | NA | |
NA | Elka Smith | Research Project Manager | NA | |
NA | Judy Spruit | Management/office-assistent | NA | |
NA | Tessa Thomas | Management assistant | NA | |
NA | Debbie Tromper | Management assistant | NA | |
NA | Gerard van der Veer | Secretary board of examiners | NA | |
NA | Ruben Verheul | Web editor | NA | |
NA | Jeanne Viet | Communication staff member | NA | |
NA | Denise Zeeuw-van Veen | Management/office assistant | NA | |
Bachelor’s programmes | NA | NA |
Nice! I think we only need the two columns holding the names and the functions (v3 and v4)? Let’s name them nicely and delete any leading or trailing whitespace.
lpol_staff <- lpol_staff[, c("v3", "v4")]
names(lpol_staff) <- c("name", "func")
lpol_staff$name <- trimws(lpol_staff$name, which = c("both"), whitespace = "[ \t\r\n]")
lpol_staff$func <- trimws(lpol_staff$func, which = c("both"), whitespace = "[ \t\r\n]")
fshowdf(lpol_staff, caption = "lpol_staff")
name | func |
---|---|
Adina Akbik | Senior Assistant Professor |
Femke Bakker | Senior assistant professor |
Ingrid van Biezen | Professor of Comparative Politics |
Nicolas Blarel | Associate Professor |
Arjen Boin | Professor of Public Institutions and Governance |
Theo Brinkel | Professor by Special Appointment Military-social studies |
Manuel Cabal Lopez | Assistant professor |
Valentina Carraro | Assistant Professor in Global Transformations and Governance Challenges |
Stefan Cetkovic | Assistant Professor |
Leila Demarest | Associate Professor |
Matthew di Giuseppe | Director of Studies / Associate Professor |
Roos van der Haer | Assistant Professor |
Gisela Hirschmann | Senior Assistant Professor |
Joop van Holsteijn | Professor Political Behaviour and Research Methods |
Corinna Jentzsch | Assistant Professor |
Petr Kopecky | Professor of Comparative Studies Political Parties and Party Systems |
Matthew Longo | Senior Assistant Professor |
Tom Louwerse | Director of Research / Associate Professor |
Floris Mansvelt Beck | Assistant Professor |
Juan Masullo Jimenez | Assistant Professor |
Hilde van Meegdenburg | Assistant Professor |
Michael Meffert | Assistant Professor |
Frits Meijerink | Assistant professor |
Tim Mickler | Senior assistant professor |
Martijn Mos | Assistant Professor |
Katharina Natter | Senior assistant professor |
Paul Nieuwenburg | Professor Political Philosophy |
Simon Otjes | Senior Assistant Professor |
Hans Oversloot | Senior Assistant Professor |
Jonathan Phillips | Assistant Professor |
Rebecca Ploof | Assistant Professor |
Karolina Pomorska | Associate professor |
Francesco Ragazzi | Associate professor |
Babak Rezaeedaryakenari | Senior Assistant Professor |
Josh Robison | Assistant Professor |
Michael Sampson | Senior Assistant Professor |
Jan Aart Scholte | Professor Global Transformations and Governance Challenges |
Jonah Schulhofer-Wohl | Senior assistant professor |
Maria Spirova | Associate Professor |
Tom Theuns | Senior Assistant Professor |
Daniel Thomas | Professor of International Relations |
Christina Luise Toenshoff | Assistant professor |
Vasiliki (Billy) Tsagkroni | Senior assistant professor |
Wouter Veenendaal | Professor by Special Appointment Kingdom Relations |
Claire Vergerio | Guest |
Marco Verschoor | Assistant Professor |
Cynthia van Vonno | Assistant Professor |
Niels van Willigen | Director of Studies/Associate Professor |
Nikoleta Yordanova | Associate professor |
Yuan Yi Zhu | Assistant professor |
Frank de Zwart | Guest |
Alessia Aspide | PhD candidate |
Cyan Bae | PhD candidate |
Kathleen Brown | PhD candidate |
Mateo Cohen | External PhD candidate |
Josette Daemen | Postdoc |
Jesse Doornenbal | Lecturer |
Eleftherios Karchimakis | Lecturer |
Aleksandra Khokhlova | PhD candidate |
Stijn Koenraads | PhD candidate |
Hannah Kuhn | PhD candidate |
Alex Schilin | Guest researcher |
Pawan Sen | PhD candidate |
Ruben van de Ven | PhD candidate |
Anouk van Vliet | Lecturer |
Denny van der Vlist | PhD candidate |
Thijs Vos | PhD candidate |
Rick van Well | PhD candidate |
Daan van den Wollenberg | PhD candidate / self funded |
Elina Zorina | PhD candidate |
Rudy Andeweg | Professor emeritus of Empirical Political Science |
Ivan Bakalov | Lecturer |
Agha Bayramov | Lecturer |
Jelena Belic | Lecturer |
Jelke Bethlehem | Professor emeritus Survey Methodology |
Peter Castenmiller | Lecturer |
Diana Davila Gordillo | Guest - researcher |
Henk Dekker | Emeritus professor of Political Socialization and Integration |
Katerina Galanopoulou | Lecturer |
Rutger Hagen | Lecturer |
Henriëtte van den Heuvel | Scientific Director ad interim |
Galen Irwin | |
Devrim Kabasakal Badamchi- | Guest researcher |
Eleftherios Karchimakis | Lecturer |
Müge Kinacioglu | Lecturer |
Ruud Koole | Professor emeritus Politicologie |
Amber Lauwers | Lecturer |
José Lourenço | Lecturer |
Gjovalin Macaj | Assistant professor |
Jan Meijer | Lecturer |
Marijn Nagtzaam | Lecturer |
Christoph Niessen | Guest |
Alexandros Ntaflos | Lecturer |
Joyce Outshoorn | Emeritus professor vrouwenstudies |
Jimena Pacheco Miranda | Lecturer |
Julia Puente Duyn | Lecturer |
Ellen van Reuler | Lecturer |
Thomas Scarff | Lecturer |
Radostina Sharenkova-Toshkova | Guest |
Justin Spruit | Lecturer |
Vishwesh Sundar | Lecturer |
Harmen van der Veer | Lecturer |
Amy Verdun | Guest professor |
Ruben Verheul | Web editor |
Anouk van Vliet | Lecturer |
Carina van de Wetering | Lecturer |
Gul-i-Hina van der Zwan | Postdoc / guest |
Wencke Appelman | Study adviser |
Ester Blom | Study Adviser |
Anna van Dijk | Management/office-assistent |
Nathalie van Dooren | Coordinator Marketing & Student Recruitment |
Desiree van Drongelen | Staff Member Studentregistrations |
Nynke Heegstra | Study Advisor |
Ingrid van Heeringen-Göbbels | Institute manager |
Marit van der Heide | Student member institute board |
Katie Hudson | Data Steward |
Lianne Janssen | Policy Officer Education and Quality Assurance |
Eline Joor | Internationalisation Officer |
Ian Lau | Study adviser |
Daniëlle Lovink | Study Adviser |
Carien Nelissen | Education manager |
Caroline Remmerswaal | Secretary board of examiners |
Marjan Rijnja | Teaching Coordinator |
Elka Smith | Research Project Manager |
Judy Spruit | Management/office-assistent |
Tessa Thomas | Management assistant |
Debbie Tromper | Management assistant |
Gerard van der Veer | Secretary board of examiners |
Ruben Verheul | Web editor |
Jeanne Viet | Communication staff member |
Denise Zeeuw-van Veen | Management/office assistant |
Not bad.
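The same wrangling can be sketched in Python. Note that this is not a line-by-line translation of the R code: instead of splitting on a fixed separator and picking columns by name, the sketch splits each raw string on newlines, trims the pieces, and drops the empty ones. The raw_entries below are made up to resemble the scraped strings, and split_entry is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical raw strings, resembling the scraped list items.
raw_entries = [
    "\n\n  Adina Akbik\n  Senior Assistant Professor\n  \n",
    "\n\n  Femke Bakker\n  Senior assistant professor\n  \n",
    "\n\n  Galen Irwin\n  \n",  # an entry without a function
]

def split_entry(entry):
    """Split one raw string into (name, func), trimming whitespace."""
    parts = [part.strip() for part in entry.split("\n")]
    parts = [part for part in parts if part]  # drop empty pieces
    name = parts[0] if parts else ""
    func = parts[1] if len(parts) > 1 else ""
    return name, func

staff = pd.DataFrame(
    [split_entry(entry) for entry in raw_entries],
    columns=["name", "func"],
)
print(staff)
```

Dropping the empty pieces first means we do not have to guess in which column the name and the function end up.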
Suppose you do not like data wrangling but simply love scraping. In that case, you can select the names and functions directly with CSS selectors.
# Read the page once and select the list entries of the active tab
lpol_page <- read_html("https://www.universiteitleiden.nl/en/social-behavioural-sciences/political-science/staff#tab-1")
lpol_items <- lpol_page %>%
    html_element("section.tab.active") %>%
    html_elements("ul.table-list") %>%
    html_elements("li") %>%
    html_elements("a") %>%
    html_elements("div")
# Names are in the <strong> elements, functions in the <span> elements
lpol_staff_names <- lpol_items %>%
    html_elements("strong") %>%
    html_text()
lpol_staff_functions <- lpol_items %>%
    html_elements("span") %>%
    html_text()
lpol_staff2 <- data.frame(name = lpol_staff_names, funct = lpol_staff_functions)
fshowdf(lpol_staff2)
name | funct |
---|---|
Adina Akbik | Senior Assistant Professor |
Femke Bakker | Senior assistant professor |
Ingrid van Biezen | Professor of Comparative Politics |
Nicolas Blarel | Associate Professor |
Arjen Boin | Professor of Public Institutions and Governance |
Theo Brinkel | Professor by Special Appointment Military-social studies |
Manuel Cabal Lopez | Assistant professor |
Valentina Carraro | Assistant Professor in Global Transformations and Governance Challenges |
Stefan Cetkovic | Assistant Professor |
Leila Demarest | Associate Professor |
Matthew di Giuseppe | Director of Studies / Associate Professor |
Roos van der Haer | Assistant Professor |
Gisela Hirschmann | Senior Assistant Professor |
Joop van Holsteijn | Professor Political Behaviour and Research Methods |
Corinna Jentzsch | Assistant Professor |
Petr Kopecky | Professor of Comparative Studies Political Parties and Party Systems |
Matthew Longo | Senior Assistant Professor |
Tom Louwerse | Director of Research / Associate Professor |
Floris Mansvelt Beck | Assistant Professor |
Juan Masullo Jimenez | Assistant Professor |
Hilde van Meegdenburg | Assistant Professor |
Michael Meffert | Assistant Professor |
Frits Meijerink | Assistant professor |
Tim Mickler | Senior assistant professor |
Martijn Mos | Assistant Professor |
Katharina Natter | Senior assistant professor |
Paul Nieuwenburg | Professor Political Philosophy |
Simon Otjes | Senior Assistant Professor |
Hans Oversloot | Senior Assistant Professor |
Jonathan Phillips | Assistant Professor |
Rebecca Ploof | Assistant Professor |
Karolina Pomorska | Associate professor |
Francesco Ragazzi | Associate professor |
Babak Rezaeedaryakenari | Senior Assistant Professor |
Josh Robison | Assistant Professor |
Michael Sampson | Senior Assistant Professor |
Jan Aart Scholte | Professor Global Transformations and Governance Challenges |
Jonah Schulhofer-Wohl | Senior assistant professor |
Maria Spirova | Associate Professor |
Tom Theuns | Senior Assistant Professor |
Daniel Thomas | Professor of International Relations |
Christina Luise Toenshoff | Assistant professor |
Vasiliki (Billy) Tsagkroni | Senior assistant professor |
Wouter Veenendaal | Professor by Special Appointment Kingdom Relations |
Claire Vergerio | Guest |
Marco Verschoor | Assistant Professor |
Cynthia van Vonno | Assistant Professor |
Niels van Willigen | Director of Studies/Associate Professor |
Nikoleta Yordanova | Associate professor |
Yuan Yi Zhu | Assistant professor |
Frank de Zwart | Guest |
url = "https://www.universiteitleiden.nl/en/social-behavioural-sciences/political-science/staff#tab-1"
response = requests.get(url)
webpage = response.content
# Parse the webpage
soup = BeautifulSoup(webpage, 'html.parser')
#save the tags 'li' in a list
soup2 = soup.select_one('section.tab.active').select_one('ul.table-list').find_all("li")
#loop through the list to find what we want
lpol_staff_names = [taga.select_one("strong").get_text() for taga in soup2]
lpol_staff_functions = [taga.select_one("span").get_text() for taga in soup2]
# Combine data into a DataFrame
lpol_staff_df = pd.DataFrame({
'name': lpol_staff_names,
'funct': lpol_staff_functions
})
# Display the DataFrame
print(lpol_staff_df.head())
name funct
0 Adina Akbik Senior Assistant Professor
1 Femke Bakker Senior assistant professor
2 Ingrid van Biezen Professor of Comparative Politics
3 Nicolas Blarel Associate Professor
4 Arjen Boin Professor of Public Institutions and Governance
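The list comprehensions above assume that every list item contains both a “strong” and a “span” element. If one of them is missing (in the full staff list, for instance, Galen Irwin appears without a function), select_one() returns None and calling get_text() on it raises an AttributeError. Below is a slightly more defensive sketch that builds the table row by row. The sample_html is a made-up stand-in for the staff page, and using an empty string for a missing function is just one possible choice.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the staff page; the last entry has no <span>.
sample_html = """
<section class="tab active">
  <ul class="table-list">
    <li><a href="#"><div><strong>Adina Akbik</strong><span>Senior Assistant Professor</span></div></a></li>
    <li><a href="#"><div><strong>Femke Bakker</strong><span>Senior assistant professor</span></div></a></li>
    <li><a href="#"><div><strong>Galen Irwin</strong></div></a></li>
  </ul>
</section>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Guard against the section itself being absent.
section = soup.select_one("section.tab.active")
list_items = section.select("ul.table-list li") if section is not None else []

rows = []
for li in list_items:
    name_tag = li.select_one("strong")
    funct_tag = li.select_one("span")
    if name_tag is None:
        continue  # skip entries without a name
    rows.append({
        "name": name_tag.get_text(strip=True),
        # Use an empty string when no function is listed.
        "funct": funct_tag.get_text(strip=True) if funct_tag is not None else "",
    })

staff_df = pd.DataFrame(rows, columns=["name", "funct"])
print(staff_df)
```

Building one row per list item also keeps each name paired with its own function, which is not guaranteed when names and functions are collected in two separate passes.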
Copyright © 2024 Jochem Tolsma