
2. Article language processing

Part 2

Articles: 472 892
Period: 2000-2019
Technologies: Python, Pandas, Matplotlib, spaCy

All article data was tokenized using regular expressions. Lemmatization and named-entity recognition (NER) were applied using spaCy.
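As a rough illustration of the regex tokenization step, here is a minimal sketch. The pattern and the `tokenize` helper are assumptions for illustration; the exact expression used in the analysis is not given.

```python
import re

# Assumed pattern: runs of word characters excluding digits and underscores,
# so Lithuanian letters such as "ž" and "ė" are kept inside tokens.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Return the word tokens found in text, in their original order."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Rusijos ir ES lyderiai susitiko 2014 m.")
print(tokens)
```

Numbers and punctuation are dropped by this pattern, which is usually what you want before counting words.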

The most common words in titles are: “Jav”, “Lietuvos”, “žuvo”, “Rusijos”, “ES”. Here stop words were removed, but lemmatization was not applied.
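Counting title words after stop-word removal could look roughly like this. The stop-word list, the sample titles and the `top_title_words` helper are all made up for illustration; the real list used in the analysis is not shown.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption, not the original one).
STOP_WORDS = {"ir", "į", "su", "dėl", "per"}


def top_title_words(titles, n=5):
    """Count title tokens, skipping stop words; no lemmatization is applied."""
    counts = Counter()
    for title in titles:
        for token in re.findall(r"[^\W\d_]+", title):
            if token.lower() not in STOP_WORDS:
                counts[token] += 1
    return counts.most_common(n)


# Invented sample titles, only to show the shape of the input.
titles = [
    "Lietuvos ir JAV susitikimas",
    "JAV ir ES derybos",
    "Avarijoje žuvo vairuotojas",
]
print(top_title_words(titles, 1))
```

Because no lemmatization is applied, inflected forms such as “Lietuvos” and “Lietuva” are counted as separate words.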

Using spaCy NER, every named entity was assigned one of six labels:

GPE        Geopolitical location
LOC        Location  
ORG        Organisation  
PERSON     Person 
PRODUCT    Product   
TIME       Time 

The most frequent labels are geopolitical location (192.6 K) and person (122.9).
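Label frequencies like these can be tallied with a simple counter. The sketch below uses invented (entity text, label) pairs; in a spaCy pipeline such pairs would typically come from `ent.text` and `ent.label_` for each `ent` in `doc.ents`.

```python
from collections import Counter

# Hypothetical (entity text, label) pairs, in the shape an NER step might produce.
entities = [
    ("JAV", "GPE"),
    ("Lietuva", "GPE"),
    ("R. Paksas", "PERSON"),
    ("ES", "ORG"),
    ("Rusija", "GPE"),
]

# Count how often each label occurs, ignoring the entity text.
label_counts = Counter(label for _, label in entities)

for label, count in label_counts.most_common():
    print(f"{label:<8}{count}")
```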

The top named entities in articles are: ‘JAV’, ‘Lietuva’, ‘Rusija’, ‘ES’.

It is quite strange that, among Lithuanian politicians, only R. Paksas is in the top 30. Other frequently mentioned people are: D. Kedys, A. Butkevičius, L. Graužinienė, A. Kubilius, D. Grybauskaitė.

When analyzing the most popular named entities in article titles, we notice a huge spike in the keyword ‘Russia’ in 2014. Here all variations, such as ‘Rusija’, ‘Rusijos’, ‘Rusijoje’, were summed up. ‘USA’ is also used very widely and is steadily gaining popularity.
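Summing the inflected forms per year might be sketched as follows. The prefix rule (“starts with rusij”), the sample data and the `russia_mentions_by_year` helper are assumptions; the post only says the variations were summed, not how they were matched.

```python
from collections import Counter


def russia_mentions_by_year(year_tokens):
    """Count tokens per year that look like a form of 'Rusija'.

    year_tokens: iterable of (year, token) pairs taken from article titles.
    Assumed rule: any token starting with "rusij" (case-insensitive) counts.
    """
    per_year = Counter()
    for year, token in year_tokens:
        if token.lower().startswith("rusij"):
            per_year[year] += 1
    return per_year


# Invented sample data, only to show the shape of the input.
sample = [
    (2013, "Rusija"),
    (2014, "Rusijos"),
    (2014, "Rusijoje"),
    (2014, "JAV"),
    (2014, "Rusija"),
]
counts = russia_mentions_by_year(sample)
print(sorted(counts.items()))
```

A prefix match is a crude stand-in for lemmatization; it is cheap, but it can also catch unrelated words that happen to share the prefix.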

Published in LT press analysis
