Getting started in NLP/ML research

June 17, 2020

Over the years, a number of friends have asked me how they can get started in NLP and ML research. While I am no expert in CS curriculum design, I made this transition myself coming out of college (I majored in ECE, with a focus on computer hardware and high-power energy systems). This guide is a summary of how to get started in NLP/ML research. Following this guide will not make you an expert; that would require a formal education and years of practice. Rather, it aims to help you acquire enough experience to quickly pursue work in this area (e.g. working with a research lab).

Who is this for?

This guide is for someone who wants to pursue NLP/ML research, but does not have the prerequisite experience. For example:

  • an undergraduate student looking to get involved in graduate research who has not yet taken the necessary classes.
  • a graduate student in another area who wants to transition into AI.
  • someone in industry who wants to work in research.

Python

This guide aims to take you from knowing linear algebra and programming to being productive in a research project. The fastest way to be productive is to be able to contribute code. At the time of writing, the majority of NLP/ML code is written in Python. If you know how to program but are not proficient in Python, I recommend going through Google’s Python Class, taught by Nick Parlante, which is aimed at people who already know how to code.
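As a quick self-check of Python fluency for this kind of work, a small text-processing exercise covers the core idioms (string handling, dictionaries, sorting). Here is a minimal sketch; the function name and toy sentence are mine, not from any course:

```python
from collections import Counter

def top_words(text, k=3):
    """Return the k most frequent lowercase tokens in a text."""
    tokens = text.lower().split()
    return Counter(tokens).most_common(k)

print(top_words("the cat sat on the mat and the dog sat too"))
```

If writing something like this feels effortless, you are probably ready to move on; if not, the class above is a good investment.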

Introduction to NLP

At this point, we can go one of two ways. The first is to learn neural networks, applying them to toy tasks such as MNIST to get a sense of how they work and how to implement them. The second is to learn a problem domain such as NLP, identify its subproblems, and discuss how to solve them. I tend to favor the second approach, because it emphasizes the problem instead of the method du jour.

For NLP, Dan Jurafsky and James H. Martin’s book Speech and Language Processing provides a nice overview of the important problems. I suggest that you read only the chapters essential to your problem of interest. For example, if you would like to work in dialogue research, then you can probably skip chapters such as information extraction, word sense disambiguation, etc. These are important topics, but you should quickly get to a point where you can make valuable contributions to the project rather than initially attempting to understand every sub-area of NLP. Some important chapters regardless of your research topic include:

  • Introduction
  • Language modeling
  • Naive Bayes
  • Logistic regression
  • Vector semantics and embeddings
  • Neural nets and neural language models
  • Part-of-speech tagging (the problem is not so interesting but it should help you gain modeling intuitions)
  • Sequence processing with recurrent neural networks
  • Encoder-decoder models
  • Machine translation

I encourage you to work through some of the exercises in this book to test your understanding of the chapters. In particular, try implementing the algorithms discussed in the book on real text data, in Python.
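As one example of such an exercise, here is a minimal sketch of a bigram language model with add-one (Laplace) smoothing, one of the techniques covered early in the book. The toy corpus and function names are mine; a real exercise would use a larger text:

```python
from collections import defaultdict

def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = defaultdict(int), defaultdict(int)
    vocab = set()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(tokens)
        for a, b in zip(tokens, tokens[1:]):
            unigrams[a] += 1
            bigrams[(a, b)] += 1
    return unigrams, bigrams, vocab

def bigram_prob(a, b, unigrams, bigrams, vocab):
    """P(b | a) with add-one smoothing over the vocabulary."""
    return (bigrams[(a, b)] + 1) / (unigrams[a] + len(vocab))

uni, bi, vocab = train_bigram(["the cat sat", "the dog sat"])
print(bigram_prob("the", "cat", uni, bi, vocab))  # (1 + 1) / (2 + 6) = 0.25
```

Implementing even something this small forces you to make the decisions (padding, smoothing, vocabulary) that the book discusses abstractly.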

PyTorch

PyTorch is currently the most popular ML framework in academic research due to its simplicity and flexibility. Once you understand the basics of neural networks (e.g. layers, backpropagation), you can treat PyTorch as an automation tool that performs automatic differentiation. Deep Learning with PyTorch: A 60 Minute Blitz provides a quick introduction to PyTorch. You can proceed with this section in parallel with the NLP section.
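To make the automatic differentiation point concrete, here is a minimal sketch (assuming PyTorch is installed): you build expressions out of tensors, and calling `backward()` computes gradients for you.

```python
import torch

# y = x^2 + 3x, so dy/dx = 2x + 3; at x = 2 the gradient should be 7
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x
y.backward()  # autograd walks the recorded computation graph
print(x.grad)  # tensor(7.)
```

Every layer and loss function in PyTorch composes this same mechanism, which is why understanding it early pays off.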

Experimentation

Now you should be able to conduct your own experiments. Pick some problems of interest and try to write up a model for them in PyTorch. Here are some example tasks that should not require an expensive GPU:

The above links refer directly to raw data. There are many frameworks built on top of PyTorch. Should you use them? Generally, I recommend against using other people’s frameworks while learning, because it is helpful to understand the details yourself. For example:

  • What does the data look like?
  • What is the evaluation metric computing?
  • What kind of tokenization is used?
  • How do you represent (e.g. featurize) the problem?
  • Are there any quirks with the dataset?
  • How are the inputs and outputs distributed?

By coding these details yourself, you will gain intuition about why things work and where they go wrong. For this reason, you should always write your own preprocessing and your own training loop.
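To illustrate the first few questions above, here is a minimal sketch of hand-rolled preprocessing: whitespace tokenization, a vocabulary with an unknown-word id, and encoding text as integer ids. The helper names are mine, and a real project would need more care (punctuation, lowercasing policy, frequency cutoffs):

```python
def build_vocab(texts, min_count=1):
    """Map each token to an integer id; id 0 is reserved for <unk>."""
    counts = {}
    for text in texts:
        for tok in text.lower().split():
            counts[tok] = counts.get(tok, 0) + 1
    vocab = {"<unk>": 0}
    for tok, c in sorted(counts.items()):
        if c >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab):
    """Convert text into ids, falling back to <unk> for unseen tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]

vocab = build_vocab(["the cat sat", "the dog ran"])
print(encode("the cat flew", vocab))  # "flew" maps to the <unk> id 0
```

Writing this yourself makes dataset quirks visible immediately, whereas a framework would hide them behind a loader.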

Once you have successfully done some experiments on your own, you are ready to contribute to research projects.

How to find a research project as an undergraduate student

The easiest way to find a research project is through emailing researchers (e.g. PhD students) at your local institution.

What should you expect? In large research labs, it is unlikely that you will work with the professor directly. Instead, you will likely work with a PhD student on their project, and occasionally meet with the professor alongside the PhD student. Once you have shown a capacity for independent work, you may lead your own projects with an advisor.

What opportunities should you look for? Because you are just getting started, you should look for projects that are well defined. Open-ended projects can be more impactful, but oftentimes they do not work out even for experienced researchers. Working on well-scoped projects first will give you the experience and confidence to take on more complex, open-ended problems. On a related note, you may want to find experienced mentors (e.g. senior PhD students) who know how to scope a project for you.

Useful tools

Here are some additional tools you might find helpful when experimenting with your own data and models:

That's it! I hope you found this helpful.

Copyright 2020, Victor Zhong