Thesis recap

I do not know if anyone noticed, but for a number of reasons I have not been writing here. For over half a year I have been busy writing my student thesis (called "Großer Beleg"; the closest equivalent would be a bachelor's thesis). The next goal will be the diploma thesis, with a similar time span.

Well, originally the plan was to give updates on this site regularly, but as you can clearly see, that has not been the case. The rationale was to keep myself motivated and perhaps to inspire others or help them relate to the problems I encountered. I ended up not doing so because I did not prioritize this blog and focused on the thesis itself. This very week my thesis defense took place, and now it is time to give some sort of summary: what happened, what could have been better, and what the conclusions are for next time.

Dissecting my approach to my thesis

As you can see, the last update was about the draft of my topic. That topic has now shifted to a new one called "Creating automatic text snippets for news events". The new challenge is to summarize a cluster of news articles about the same news event. Whereas the Wikipedia approach dealt with a single semi-structured document, here we have a corpus of unstructured text. Being eager, I set the goal of surpassing the current state-of-the-art ROUGE score (a rough estimate of how well your summarization conveys the topic, computed from the overlap of n-grams with a set of model summarizations). In the end I did not implement a completely novel approach, but used an existing one that yielded very good results and offered a nice framework allowing simple modifications.
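Just to make the n-gram overlap idea concrete, here is a minimal sketch of a ROUGE-N recall computation. It is only an illustration; the official toolkit additionally handles stemming, stopwords, multiple reference summaries and confidence intervals.

```python
# Minimal sketch of ROUGE-N recall on whitespace-tokenized, lowercased text.
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=2):
    """Fraction of reference n-grams covered by the candidate summary."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(cand[g], ref[g]) for g in ref)
    return overlap / sum(ref.values())

print(rouge_n_recall("the cat sat on the mat",
                     "the cat lay on the mat"))  # 3 of 5 reference bigrams -> 0.6
```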

Kickoff

As good as this sounds in theory, it did not end up as I had initially wished. This is due to personal organizational reasons as well as reasons inherent to the topic and to the domain of Natural Language Processing itself. At the beginning of my work I started with a lengthy phase of experimenting and trying different techniques. I made my own dataset and inspected the results manually. The results were not easy to interpret, both content-wise and linguistically. Was there an improvement over the last state or not? I should have strived for a more systematic approach instead.

First of all, you need a dataset, and you need to get familiar with the evaluation metrics. I ended up using the DUC 2004 dataset and the current ROUGE evaluation framework. After figuring out how to use this (not so well documented) framework, you will know whether you are improving or not. Before starting any experiment, always think about how you will re-evaluate its performance. Otherwise you will waste too much time on things that are not worthwhile. Simple debug prints and partial checks will not suffice in the long term. Write unit tests when implementing new functionality, and use existing gold standards/test sets when modelling a (sub-)problem.
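To illustrate what I mean by unit tests for new functionality, here is a toy example for a naive sentence splitter. Both the splitter and the tests are hypothetical, not taken from my actual code:

```python
# Hypothetical example: a naive sentence splitter plus a unit test for it.
import re
import unittest

def split_sentences(text):
    """Naively split text on sentence-final punctuation followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

class TestSplitSentences(unittest.TestCase):
    def test_two_sentences(self):
        self.assertEqual(
            split_sentences("Country X attacks Y. Country Y retaliates!"),
            ["Country X attacks Y.", "Country Y retaliates!"])

    def test_empty_input(self):
        self.assertEqual(split_sentences(""), [])

if __name__ == "__main__":
    unittest.main()
```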

Reading

Investigating the topic of automatic summarization can be really cumbersome due to the sheer amount of sources. The discipline goes back to the early work of Luhn in the 1950s. When you start diving into the topic, I can recommend the following sources:

Papers:

  • “A Survey on Automatic Text Summarization”, Dipanjan Das and André F.T. Martins (2007) — Survey to get an overview of the approaches and challenges.
  • “Automatic Summarization”, Ani Nenkova and Kathleen McKeown (2011) — Exhaustive roundup covering many topics and approaches (solving and evaluating the problem). Very good source!

Books:

  • “Multilingual Natural Language Processing Applications: From Theory to Practice”, Daniel Bikel and Imed Zitouni (2012) — I recommend reading the initial chapters and especially Chapter 12, which explains how to approach the problem practically.
  • “Multi-source, Multilingual Information Extraction and Summarization”, Thierry Poibeau, Horacio Saggion, Jakub Piskorski, Roman Yangarber (2012) — A very good and comprehensible book emphasizing summarization with IE techniques (e.g. template filling) and more.

Also look carefully at how results are presented, as the datasets differ greatly in difficulty. Authors also tend to use a configuration that makes their approach look extraordinarily good. Every so often I have seen papers without a full description of how their results were measured exactly. In that regard it is useful to look for technical reports, which explain those aspects in detail. I also experienced that you sometimes get no answer when asking for details like "What kind of preprocessing did you use?" or "Which parameter settings did you use for the ROUGE toolkit?". Accept that some will never answer.

Before I started to collect my sources, a friend told me to organize my collected papers really well. The software Mendeley Desktop turned out handy in that respect, because I could easily index and tag my sources, keep track of which papers I had read so far, and rate their usefulness. Other features I appreciated were extracting a paper's meta-information while browsing the web, and the ability to easily extract BibTeX entries for LaTeX. The only downside I found is that there is only a single simple search box, which limits its true potential.

Writing

When it comes to writing, I always have difficulties. Making the general outline and writing concisely yet precisely troubled me. The general strategy would be to find the state-of-the-art approaches and how they address the occurring problems. But as mentioned above, being exposed to so many sources and validating them was time-consuming, especially while I was testing my early experiments at the same time.
The ultimate conclusion is that you should not be intimidated and should simply write your findings down. Not in lengthy notes which you maybe look up once or twice later, but directly in the final form.
This serves two purposes: first, you have written something which you can use and re-order later anyway, and second, you train your writing skills.
I spent too much time thinking about what I could write down instead of actually doing it. In the long term this results in hastily written chunks of text, produced just to have at least something to show your supervisor.

Getting into the topic can be troublesome too. In the discipline of automatic text summarization there are generally two key aspects to take care of: how well the generated content reflects the events, and how well it is understood by readers.
There is a multitude of approaches focusing on either one of these aspects or on both. At that point it is pretty difficult to choose which general basis you want to work on. I was always afraid that by continuing down a highly complex path I would end up with no working outcome, whereas many sources showed that going beyond the simple approaches does not increase performance tremendously. You can observe, for instance, that many approaches perform indistinguishably well in terms of ROUGE scores.
In the end I wrote a concept that followed both key aspects: keeping it simple by extracting sentences to form a summarization, and utilizing an NLP- and IE-driven postprocessing step to improve readability.
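To give a rough idea of what "keeping it simple by extracting sentences" can mean, here is a toy frequency-based extraction sketch in the spirit of Luhn's early work. My actual prototype and its postprocessing step are considerably more involved:

```python
# Toy extractive summarizer: pick sentences whose words are frequent
# across the cluster. A real system would at least remove stopwords
# and punctuation before counting.
from collections import Counter

def summarize(sentences, num_sentences=2):
    """Score sentences by average corpus word frequency and keep the top ones."""
    freq = Counter(w for s in sentences for w in s.lower().split())
    ranked = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in s.lower().split()) / max(len(s.split()), 1),
        reverse=True)
    picked = set(ranked[:num_sentences])
    return [s for s in sentences if s in picked]  # preserve original order

cluster = [
    "Profits at the company rose sharply this quarter.",
    "The company reported record profits for the quarter.",
    "Its headquarters are in Berlin.",
]
print(summarize(cluster))  # the off-topic Berlin sentence is left out
```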

Experimenting

My experimentation was twofold: first, I measured the information coverage with the ROUGE metric, and second, I performed a user study measuring the linguistic quality of the generated text.

In the former part I compared the performance of variants of my prototype to random and baseline systems and to the best participant. Looking at the results, I was not able to beat the best participant, but I still landed in a rather moderate position within the whole spectrum. This is still upsetting, considering that I used an approach that initially set the bar. My conclusion was that some implementation details are not mentioned in the paper.

The second part rated the quality of the output and examined the impact of the postprocessing by comparing results with and without it. The general setup of the user study was acceptable, but it still had some unnecessary rough edges.
For instance, I presented my generated summaries, summaries produced by humans given the same task, and the introduction of an article from the cluster written by a professional writer.
My mistake was that these samples were not uniformly distributed, which resulted in big differences in their measured standard errors. Another point of improvement is the evaluation itself, which could have been richer by using ANOVA, t-tests for hypothesis testing, and so on. Plan more time for evaluation.
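To make this concrete: something as small as the following could have been a first hypothesis test, comparing ratings with and without the postprocessing step. The rating values here are entirely made up:

```python
# Hypothetical significance test: do readability ratings differ between
# summaries with and without postprocessing? Data below is invented.
from scipy import stats

ratings_with = [4, 5, 3, 4, 4, 5, 3, 4]     # with postprocessing
ratings_without = [3, 3, 4, 2, 3, 4, 3, 2]  # without postprocessing

t_stat, p_value = stats.ttest_ind(ratings_with, ratings_without)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # p < 0.05 would hint at a real effect
```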

Lessons Learned

While writing my student thesis I have learned many things. Here are my conclusions:

  • Do something every day. Set tasks and meet personal deadlines.
  • Get familiar with the tools you use. Master them.
  • Before proposing something, think how you can evaluate it.
  • Always write your thoughts down when doing research.
  • Don’t be afraid to ask people.
  • … and finally do sports and get some fresh air to clear your head!

Addendum:
If you are interested in Natural Language Processing in general, you can do the following: check out the free Stanford NLP course and watch its lectures, read Jurafsky and Martin's book Speech and Language Processing, and if you are more interested in programming, play around with the Natural Language Toolkit (version 3.0 is currently in beta and supports Python 3).

Task refinement, blog-organization, etc.

It’s been about a full month since the last update. In spite of there being no textual feedback during that time (excluding the occasional tweets), a lot of things happened that will determine my activities for the following months: my part-time job as a student assistant, a project adding some major functionality to my university’s website, plus another rather time-consuming course task spanning the whole semester. As a consequence, I will first tackle those urgent tasks as fast as possible and then focus back on my thesis.

Speaking of which, today I met my academic advisor to settle on a more refined task description, replacing the one I presented in my very first blog entry:

Wikipedia summary and overview generation (current draft):

The amount of semi-structured information, such as Wikipedia, presents a problem similar to information overload, where the quantity of information may confuse or overwhelm the user.

To address this problem we propose the generation of an intermediate stage between a document index and the detailed documents by generating a dynamic graph animation of semantically related document summaries.

This would provide a conceptual map of structured information where important or central concepts can be found with little difficulty. A dynamic visual representation could alleviate the cognitive load of the user.

The approach for generating results focuses on the use of natural language techniques and semantic web information to add to the available structured information, although structured information can and should be used to bootstrap the process.

A special case of the results is the generation of a causal-temporal graph to represent the progression of events or activities, such as a visually animated timeline or flowchart.

  • Subtasks:
    1. Define semantic relations that are not explicit or complete within the structured data, i.e. Wikipedia.
    2. Define mechanisms to enhance the available semi-structured information using natural language techniques and semantic web information.
    3. Implement a visualization that updates with streaming information.
    4. Present the generated information in the visualization.

This description conveys a more granular view of what I am doing. As you might see, the semantic relations have shifted from the original strong ones (logical relations like “consequence”, etc.) to the weaker, yet very beneficial, causal-temporal ones. As a side remark: even though Discourse Analysis differentiates a variety of potential relations, their formalization seems vague. The visualization aspect was of personal importance, since I would like to end up with something that helps the reader get a clear and intuitive overview.

*Important*  For reasons of readability and general blog organisation, I will try to distinguish between thesis-related and off-topic, personal entries. I also stipulate that everything written in this blog reflects my personal opinion/thinking and not any other party’s (neither Wikipedia’s, TU Dresden’s, etc.) – just to prevent any potential confusion.

Literature & Preparations


It’s been some days of investigating and diving deep into the ocean of NLP. My current strategy is to get a glimpse of the overall approaches of certain sub-areas. Especially the techniques from Text Mining and Computational Linguistics are important when dealing with Topic Detection and Tracking (TDT).

For the moment, my main sources of information are Wikipedia articles and some of their linked sources, but I will soon shift to specific books and papers. I am planning to read the book “Natural Language Processing with Python” intensively, which I have found easily readable and comprehensive so far, given its tons of examples. [As a little side note: contrary to the standard Python IDE (IDLE), IPython turns out to be very handy when dealing with huge text output, and it offers nice features like the object? command, which introspects an object and provides excerpts from its documentation.] Additionally, “The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data” will provide deeper insight into the formal aspects.
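To give a small taste of the kind of examples the book is full of, here is a tiny NLTK snippet that tokenizes a sentence and counts word frequencies. The sentence is just an arbitrary example of mine:

```python
# Tokenize a sentence and count word frequencies with NLTK.
import nltk
nltk.download('punkt', quiet=True)  # tokenizer models, needed once

text = "Google News aggregates news articles by topic and presents them to the user."
tokens = nltk.word_tokenize(text)
fdist = nltk.FreqDist(w.lower() for w in tokens if w.isalpha())
print(fdist.most_common(3))
```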

The most recent article I got my hands on is “Themenentdeckung und -verfolgung und ihr Einsatz bei Informationsdiensten für Nachrichten” (“Topic detection and tracking and its use in news information services”) [PDF, German] by Wolfgang G. Stock, who illustrates TDT using Google News, which automatically aggregates interesting news by recency and topic and presents it according to the user’s preferences on demand. Recently (in fact, today) Google News underwent a relaunch improving its content and presentation. They added a new section for news written manually by authorized people from newspapers – without involving those articles in the automation process. The reasons for this are still open: probably editors provide better topics that people want to read, and those serve as a basis for improving the generated ones, but that is just speculation on my part.

Twitter seems to be fun. I hopefully won’t get overwhelmed by hundreds of tweets.

Hello world!

This is yet another blog on the big internet, published for merely egoistic reasons. The purpose of this site is to record the progress of my student thesis, which addresses problems both in the field of Natural Language Processing (NLP) and in graphs for illustrating the results. My particular task involves the “Generation of Event Chains from Wikipedia Articles”. The general idea behind this is that events have a so-called cause-effect relationship; in news, for example, the cause would be the event “Country X attacks Y” and the effect “Country Y announces retaliation against Country X”, and so on. Those relationships will be extracted using certain keywords like “causes”, “background” or “trigger”. The overall result will be presented in a summary graph which readily presents the sources of events.
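As a first rough illustration of the keyword idea (not my actual implementation), one could scan sentences for causal cue words like this; the cue list and the regex are illustrative only:

```python
# Rough sketch: find sentences containing causal cue words in plain text.
import re

CUES = ["causes", "caused by", "background", "trigger", "led to"]
pattern = re.compile(r"\b(" + "|".join(map(re.escape, CUES)) + r")\b", re.IGNORECASE)

def find_causal_sentences(text):
    """Return sentences containing at least one causal cue word."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if pattern.search(s)]

article = ("Country X attacks Y. The attack was caused by a border dispute. "
           "Country Y announces retaliation.")
print(find_causal_sentences(article))  # ['The attack was caused by a border dispute.']
```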

Since NLP offers a huge collection of algorithms and I have not worked in that field so far, I will have to learn a lot from scratch, building up a basic vocabulary and getting familiar with the most fundamental approaches. On the practical side, I will make intensive use of the Natural Language Toolkit (written in Python), having been told it is a great introduction to the field.

About me: I am a computer science student at TU Dresden, Germany, majoring in “Intelligent Systems”. My other interests are writing little tools to improve daily life, using mnemonics whenever possible, indie games, canoeing and Japanese language/culture.

PS.: There will be also off-topic stuff here I find interesting or funny. 😉