Instructor: Thomas Monk (t.d.monk@lse.ac.uk)

Room 2.01 H, Centre for Economic Performance, London School of Economics

This intensive course will introduce students to the Python programming language as a tool for applied data science. In PP455 we utilised Stata, the primary environment used by economists and public policy academics for regression analysis. Python is a more general-purpose tool from which we can perform a range of tasks, from data cleaning, transformation, and visualisation to more advanced techniques on the social science research frontier, such as natural language processing and machine learning.

The two-week course, containing a semester’s worth of material, will take students from the first principles of Python programming to the application of data science packages such as NumPy and Pandas. Each class will be practical and hands-on, with the course focusing on the application of these tools in the public policy space.

Prerequisite: a pass mark in PP455, or equivalent.

We meet daily 11:00-13:00 in NAB 2.09, with an additional class scheduled on Tuesday, 30 August 2022 from 14:00-16:00.

This course is designed as an intensive two-week introduction to programming and data science for public policy students. The content covered will include:

- Programming in a setting of public policy - why should we care?
- Thinking algorithmically, taking policy questions to the data via a general-purpose programming language.
- Fundamentals of programming: code syntax, libraries, variables, data types, program control, functions, and IO.
- Data science through open-source Python libraries: obtaining, cleaning and analysing structured data.
- Introducing more advanced applications, such as natural language processing and machine learning.
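
As a rough illustration of the fundamentals listed above (variables, data types, functions and program control), here is a minimal sketch. The regions, rates and the `flag_high` helper are made up for this example and are not course material.

```python
# Made-up unemployment rates (percent) by region, for illustration only.
unemployment = {"North": 6.2, "South": 3.9, "East": 5.1, "West": 4.4}


def flag_high(rate, threshold=5.0):
    """Return True when a rate is strictly above the threshold (hypothetical helper)."""
    return rate > threshold


# Program control: loop over the dictionary and keep the regions that are flagged.
high_regions = [region for region, rate in unemployment.items() if flag_high(rate)]

# A simple aggregate using built-in functions.
average_rate = sum(unemployment.values()) / len(unemployment)

print(high_regions)
print(round(average_rate, 2))
```

The same pattern (store data, define a rule as a function, loop and filter) carries over directly to the larger datasets used later in the course.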

- Class 1 - introduction to data science from the perspective of public policy
- Class 2 - variables, functions and conditionals
- Class 3 - lists and strings
- Class 4 - loops and other flow controls
- Class 5 - data assignment
- Class 6 - introduction to NumPy and Pandas
- Class 7 - more advanced Pandas, merging
- Class 8 - text as data, sentiment analysis
- Class 9 - introduction to machine learning concepts, relation to causal inference with linear models
- Class 10 - machine learning with non-linear models

Date | Class | Content I | Content II | Application |
---|---|---|---|---|
Tuesday, 30 August 2022 | Class 1 - AM | Intro to Programming | Python basics | Setting up the Python environment |
Tuesday, 30 August 2022 | Class 2 - PM | Python basics | Functions & conditionals | Working with notebooks |
Wednesday, 31 August 2022 | Class 3 | Lists, strings and dictionaries | Loops and list comprehensions | |
Thursday, 1 September 2022 | Class 4 | Recap: lists, strings and dictionaries | Loops | Nested loops: in the casino |
Friday, 2 September 2022 | Class 5 | Data assignment | Data assignment | Chicago city employee data |
Tuesday, 6 September 2022 | Class 6 | NumPy | Introduction to Pandas | Wine ratings & crime data |
Wednesday, 7 September 2022 | Class 7 | More advanced Pandas | Merging with Pandas | Chicago city employee data |
Thursday, 8 September 2022 | Class 8 | Text as Data | Sentiment analysis | Twitter as data |
Friday, 9 September 2022 | Class 9 - AM | Introduction to machine learning - linear model | Applied machine learning task | House price data |
Friday, 9 September 2022 | Class 10 - PM | Machine learning: non-linear models | Applied machine learning: Random Forests and XGBoost | House price data: better predictions? |

- There is no assigned textbook for this course, but Automate the Boring Stuff with Python is a well-written, free resource. It covers everything we discuss in class on the Python side, and lets you dive as deep as you wish into the language.
- I’m grateful to Eric Potash for his 30550 course at the University of Chicago, which provided a useful basis for the material presented in this course, and some of the assigned Problem Sets.
- We didn’t have time to cover the Split-Apply-Combine paradigm. This is covered well in the official Pandas documentation, and Problem Set 7b is available for you to practice this.
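
For readers who want a first look at Split-Apply-Combine before turning to the Pandas documentation, here is a minimal sketch using `groupby`. The departments and salaries are made up for illustration and are not taken from the Chicago city employee data or the Problem Sets.

```python
import pandas as pd

# Made-up employee records, for illustration only.
employees = pd.DataFrame(
    {
        "department": ["Police", "Police", "Fire", "Fire", "Parks"],
        "salary": [90000, 80000, 85000, 75000, 50000],
    }
)

# Split the rows by department, apply the mean to each group's salaries,
# and combine the per-group results into a single Series indexed by department.
mean_salary = employees.groupby("department")["salary"].mean()

print(mean_salary)
```

Other aggregations (for example `sum`, `median` or `count`) can be swapped in for `mean` in the same way.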

- For more fundamental programming instruction, Google’s Kaggle Python and Pandas courses are excellent, as are the rest of their learning materials.

- Chris Bail (Duke) provides an excellent course in Data Scraping and Text Analysis.
- His undergraduate level course Data Science and Society provides a useful accompaniment, and his graduate level Computational Social Science extends the topics we have discussed.

- The courses above use R. I also recommend the R-oriented Text Mining with R as a useful reference resource. Now that we have worked to understand Python and programming as an abstract concept, it will be simple for you to transition to R if you wish.

- An Introduction to Statistical Learning (2021, James et al.) is the basis of the material I presented.
- LSE’s ME314, Introduction to Data Science and Machine Learning, uses this book as a basis, and is an excellent adaptation and extension. The course as a whole goes further than we were able to, and is clearly presented with many useful assignments.
- The mathematics and statistics used may be a barrier to understanding the material.
- If you’re interested in developing these skills, Sydsaeter et al’s (2008) Essential & Further Mathematics for Economic Analysis and Miller & Miller’s (2021) John E. Freund’s Mathematical Statistics with Applications are my preferred resources for mathematics and statistics respectively.
- As an aside, to link this topic more formally to the econometrics we’ve covered in PP455, Bruce Hansen’s (2022) Econometrics is a well-written, rigorous presentation of the material.

- Chen et al’s (2021) Data Science for Public Policy touches on and expands upon the major beats we were able to cover in this course. It also covers some interesting qualitative aspects of the use of data science in a public policy setting.