I do not know whether anyone noticed, but for a number of reasons I have not been writing here. For over half a year I have been busy writing my student thesis (called a “Großer Beleg”; the closest equivalent would be a bachelor’s thesis). The next goal will be the diploma thesis, with a similar time span.
Well, originally the plan was to post updates here regularly, but as you can clearly see that did not happen. The idea was to keep myself motivated, and perhaps to inspire others or let them relate to the problems I would encounter. I ended up not doing so because I did not prioritize this blog and focused on the thesis itself. This very week my thesis defense took place, so now it is time to summarize what happened, what could have gone better, and what conclusions to draw for next time.
Dissecting my approach to my thesis
As you can see, the last update was about the draft of my topic. That topic has since shifted to a new one called “Creating automatic text snippets for news events”. The new challenge is to summarize a cluster of news articles about the same news event. Whereas the Wikipedia approach dealt with a single semi-structured document, here we have a corpus of unstructured text. Being eager, I set the goal of surpassing the current state-of-the-art ROUGE score (a rough estimate of how well a summary conveys the topic, computed as the n-gram overlap with a set of model summaries). In the end I did not implement a completely novel approach, but built on an existing one that yielded very good results and offered a nice framework allowing simple modifications.
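To make the metric concrete, here is a minimal sketch of ROUGE-N recall for a single reference summary. It is deliberately simplified: the official toolkit additionally handles stemming, stopword removal and aggregation over multiple reference summaries, so treat `rouge_n_recall` as an illustration, not as the real scorer.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Simplified ROUGE-N recall against a single reference:
    the fraction of reference n-grams covered by the candidate,
    with clipped counts so repeated n-grams are not over-credited."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())
```

With higher `n`, the score rewards longer matching word sequences, which is why ROUGE-2 is usually a stricter measure than ROUGE-1.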
As good as this sounds in theory, it did not end up as I had initially wished. This was due to personal organizational reasons as well as reasons inherent to the topic and to the domain of Natural Language Processing itself. At the beginning of my work I started with a lengthy phase of experimenting with different techniques. I built my own dataset and inspected the results manually. The results were not easy to interpret, either content-wise or linguistically: was there an improvement over the last state or not? I should have strived for a more systematic approach instead.
First of all, you need a dataset, and you need to get familiar with the evaluation metrics. I ended up using the DUC 2004 dataset and the current ROUGE evaluation framework. After figuring out how to use this (not so well documented) framework, you will know whether you are improving or not. Before starting any experiment, always keep in mind how you will evaluate it. Otherwise you will waste time for too long on things that are not worthwhile. Simple debug prints and partial checks will not suffice in the long term. Write unit tests when implementing new functionality, and use existing gold standards/test sets when modelling a (sub-)problem.
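As a concrete example of the kind of unit test I mean: below is a hypothetical preprocessing component (a naive sentence splitter standing in for a real one; not my actual thesis code) together with a couple of test cases in the standard `unittest` style.

```python
import re
import unittest

def split_sentences(text):
    """Naive sentence splitter: break on whitespace that follows
    sentence-final punctuation. A stand-in example only; real
    preprocessing has to cope with abbreviations, quotes, etc."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

class TestSplitSentences(unittest.TestCase):
    def test_two_sentences(self):
        self.assertEqual(
            split_sentences("ROUGE is a metric. It counts n-grams."),
            ["ROUGE is a metric.", "It counts n-grams."])

    def test_empty_input(self):
        self.assertEqual(split_sentences(""), [])
```

Run it with `python -m unittest`; once such tests exist, every later change to the preprocessing immediately tells you whether you broke something.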
Investigating the topic of automatic summarization can be really cumbersome due to the sheer number of sources. The discipline goes back to the early work of Luhn in the 1950s. When you start diving into the topic, I can recommend reading the following sources:
- “A Survey on Automatic Text Summarization”, Dipanjan Das and André F.T. Martins (2007) — A survey giving an overview of the approaches and challenges.
- “Automatic Summarization”, Ani Nenkova and Kathleen McKeown (2011) — Exhaustive roundup covering many topics and approaches (both solving and evaluating the problem). Very good source!
- “Multilingual Natural Language Processing Applications: From Theory to Practice”, Daniel Bikel and Imed Zitouni (2012) — I recommend reading the initial chapters and especially Chapter 12, which explains how to approach the problem practically.
- “Multi-source, Multilingual Information Extraction and Summarization”, Thierry Poibeau, Horacio Saggion, Jakub Piskorski, Roman Yangarber (2012) — A very good and comprehensible book emphasizing summarization with IE techniques (e.g. template filling) and more.
Also, look carefully at how results are presented, as the datasets differ considerably in difficulty. Authors also tend to use a configuration that makes their approach look extraordinarily good. Every so often I have seen papers without a full description of how their results were measured. In that regard it is useful to look for technical reports which explain those aspects in detail. I have also experienced that you sometimes get no answer when asking for details like “What kind of preprocessing did you use?” or “Which parameter settings did you use for the ROUGE toolkit?”. Accept that some will never answer.
Before I started collecting my sources, a friend advised me to organize my collected papers really well. The software Mendeley Desktop turned out to be handy in that respect, because I could easily index and tag my sources, keep track of which papers I had read so far, and rate their usefulness. Other features I appreciated were the extraction of a paper’s meta-information while browsing the web, and the easy export of BibTeX entries for LaTeX. The only downside I found is that there is only a single simple search box, which limits its true potential.
When it comes to writing, I have always had difficulties. Creating the general outline and writing concisely yet precisely troubled me. The general strategy would be to find the state-of-the-art approaches and see how they address the problems that arise. But as mentioned above, being exposed to so many sources and validating them was time-consuming, especially while I was testing my early experiments in the meantime.
The ultimate conclusion is that you should not be intimidated and should simply write your findings down. Not in lengthy notes which you maybe look up once or twice later, but directly in their final form.
This serves two purposes: first, you have written something you can use and re-order later anyway; second, you train your writing skills.
I spent too much time thinking about what I could write down instead of actually doing it. In the long term this results in hastily written chunks of text, produced just to have at least something to show your supervisor.
Getting into the topic can be troublesome too: in the discipline of automatic text summarization there are generally two key aspects to take care of: how well the generated content reflects the events, and how well it is understood by readers.
There is a multitude of approaches, each focusing on one of these aspects or both. At that point it is quite difficult to choose which general basis to work on. I was always afraid that by pursuing a highly complex approach I would end up with no working outcome, whereas many sources showed that going beyond the simple approaches does not increase performance tremendously. You can observe, for instance, that many approaches perform indistinguishably similarly in terms of ROUGE scores.
In the end I wrote a concept addressing both key aspects: keeping it simple by extracting sentences to form a summary, and utilizing an NLP- and IE-driven postprocessing step to improve readability.
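To illustrate the extraction step, here is a bare-bones frequency-based sentence extractor in the spirit of Luhn’s early work: score each sentence by how frequent its content words are across the whole cluster, then keep the top-scoring sentences in document order. My actual prototype followed a published system with more elaborate scoring, so `extract_summary` and its tiny stopword list are illustrative only.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a proper one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was"}

def extract_summary(sentences, max_sentences=2):
    """Pick the sentences whose content words are most frequent in the
    cluster, normalized by sentence length, and return them in their
    original order to keep the summary readable."""
    words = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    freq = Counter(w for ws in words for w in ws if w not in STOPWORDS)
    scores = [sum(freq[w] for w in ws if w not in STOPWORDS) / max(len(ws), 1)
              for ws in words]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]
```

The postprocessing step of my concept would then operate on the extracted sentences, e.g. rewriting references that point outside the summary.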
My experimentation was two-fold: first, I measured the information coverage with the ROUGE metric; second, I performed a user study measuring the linguistic quality of the generated text.
In the former part I compared the performance of variants of my prototype against a random baseline, the standard baseline, and the best participant. Looking at the results, I was not able to beat the best participant, but I still landed in a rather moderate position within the whole spectrum. This is still upsetting, considering that I used an approach that initially set the bar. My conclusion was that some implementation details were not mentioned in the paper.
The second part rated the quality of the output and examined the impact of the postprocessing step by comparing results with and without it. The general setup of the user study was acceptable, but still had some unnecessary rough edges.
For instance, I presented my generated summaries, summaries produced by humans given the same task, and the introduction of an article from the cluster written by a professional writer.
My mistake there was that these samples were not uniformly distributed, which resulted in big differences in their measured standard errors. Another point of improvement is the evaluation itself, which could have been richer through an extensive ANOVA, t-tests for hypothesis testing, and so on. Arrange more time for evaluation.
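For basic statistics like these, even the Python standard library goes a long way. Below is a sketch of the standard error of the mean and of Welch’s t statistic for comparing two independent rating samples with possibly unequal variances; the ratings in the usage comment are made-up example data, and the resulting t value would still have to be compared against the threshold for the matching degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def standard_error(sample):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples, which does not
    assume equal variances (unlike the classic Student t-test)."""
    var_a = stdev(sample_a) ** 2 / len(sample_a)
    var_b = stdev(sample_b) ** 2 / len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(var_a + var_b)

# e.g. readability ratings with vs. without postprocessing (made-up data):
# welch_t([4, 5, 4, 5], [2, 3, 2, 3])
```

Computing the standard errors per condition up front would have revealed the non-uniform sample distribution immediately, before running the full study.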
While writing my student thesis I have learned many things. Here are my conclusions:
- Do something every day. Set tasks and meet personal deadlines.
- Get familiar with the tools you use. Master them.
- Before proposing something, think how you can evaluate it.
- Always write your thoughts down when doing research.
- Don’t be afraid to ask people.
- … and finally do sports and get some fresh air to clear your head!
If you are interested in Natural Language Processing in general, you can do the following: check out the free Stanford NLP course and watch the lectures, read Jurafsky and Martin’s book Speech and Language Processing, and if you are more interested in programming, play around with the Natural Language Toolkit (version 3.0 is currently in beta and supports Python 3).