2. Article language processing

Part 2

Articles: 472 892
Period: 2000-2019
Technologies: Python, pandas, Matplotlib, spaCy

All article data was tokenized using regular expressions. Lemmatization and named-entity recognition (NER) were applied using spaCy.
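As a rough illustration of the regex tokenization step, here is a minimal sketch. The pattern and the `tokenize` name are assumptions made for this example; the exact expression used in the project is not stated.

```python
import re

# Assumed pattern: runs of Unicode letters (no digits, no underscore),
# so Lithuanian characters such as "ž" or "ė" stay inside one token.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Return the word tokens found in `text`, in their original order."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Avarijoje žuvo 2 žmonės")
```

Digits and punctuation are dropped by this pattern; a different pattern would be needed if numbers or hyphenated words should be kept as tokens.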

The most common words in titles are: “Jav”, “Lietuvos”, “žuvo”, “Rusijos”, “ES”. Stop words were removed for this count, but lemmatization was not applied, so inflected forms are counted separately.
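A count like this could be produced along the following lines. This is only a sketch: the stop-word set is a tiny hypothetical sample, and `top_title_words` is a name invented for the example.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[^\W\d_]+")  # assumed tokenization pattern

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"ir", "į", "su", "per", "dėl", "po"}


def top_title_words(titles, n=5):
    """Count surface forms in titles, skipping stop words (no lemmatization)."""
    counts = Counter()
    for title in titles:
        for token in TOKEN_RE.findall(title):
            if token.lower() not in STOP_WORDS:
                counts[token] += 1
    return counts.most_common(n)


sample_titles = ["Lietuvos ir Rusijos derybos", "Lietuvos sprendimas dėl ES"]
top = top_title_words(sample_titles, n=1)
```

Because tokens are counted as written, forms such as “Lietuvos” and “Lietuva” end up in separate buckets, which matches the note above that lemmatization was not applied.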

Using spaCy NER, each named entity was assigned one of six labels:

GPE        Geopolitical location
LOC        Location  
ORG        Organisation  
PERSON     Person 
PRODUCT    Product   
TIME       Time 

The most frequent labels are geopolitical location (192.6 K) and person (122.9 K).
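Label totals of this kind can be obtained by counting the label of every recognized entity. The sketch below assumes the Lithuanian spaCy model `lt_core_news_sm`, since the model actually used is not named here. Both function names are hypothetical.

```python
from collections import Counter


def extract_entities(texts, model_name="lt_core_news_sm"):
    """Yield (entity text, label) pairs. Needs spaCy and the named model installed."""
    import spacy  # imported lazily so count_labels works without spaCy

    nlp = spacy.load(model_name)  # model name is an assumption
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            yield ent.text, ent.label_


def count_labels(entities):
    """Count how often each NER label occurs in (text, label) pairs."""
    return Counter(label for _, label in entities)


# Hand-written pairs standing in for real model output.
label_counts = count_labels([("JAV", "GPE"), ("Rusija", "GPE"), ("R. Paksas", "PERSON")])
```

With the model installed, `count_labels(extract_entities(article_texts))` would give the per-label totals for a list of article strings.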

The top named entities in articles are: ‘JAV’, ‘Lietuva’, ‘Rusija’, ‘ES’.

Quite strangely, R. Paksas is the only Lithuanian politician in the top 30. The other most frequently mentioned people in articles are: D. Kedys, A. Butkevičius, L. Graužinienė, A. Kubilius, D. Grybauskaitė.

When analyzing the most popular named entities in article titles, we notice a huge spike in the keyword ‘Russia’ in 2014. Here all variants, such as ‘Rusija’, ‘Rusijos’ and ‘Rusijoje’, were summed up. ‘USA’ is also used very widely and is steadily gaining popularity.
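Summing inflected variants per year can be sketched as below. The `VARIANTS` mapping and the `(year, entity)` input shape are assumptions made for illustration. The same grouping could equally be done with a pandas `groupby`.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from inflected forms to a single keyword.
VARIANTS = {
    "Rusija": "Russia",
    "Rusijos": "Russia",
    "Rusijoje": "Russia",
    "JAV": "USA",
}


def keyword_counts_by_year(rows):
    """Sum mapped keyword mentions per year from (year, entity_text) pairs."""
    counts = defaultdict(Counter)
    for year, entity in rows:
        keyword = VARIANTS.get(entity)
        if keyword is not None:  # entities outside the mapping are skipped
            counts[year][keyword] += 1
    return counts


rows = [(2013, "Rusija"), (2014, "Rusijos"), (2014, "Rusijoje"), (2014, "JAV"), (2014, "ES")]
by_year = keyword_counts_by_year(rows)
```

Plotting `by_year[year]["Russia"]` against the year would then show whether a given year stands out.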

