
I'm a curious person and analysing world news is fun. Here I'm gathering all my Gdelt-related projects.


albertovpd/analysing_world_news_with_Gdelt


Analysing world news with the Gdelt Project.

This repo collects all my Gdelt-related leisure projects. For each project you will find:

  • Tools and languages used.
  • Dashboard url.
  • Repository url.
  • Preview of each project (after the list).

The goal of this repo is to reinforce my knowledge of Google Cloud engineering, data analysis, the Gdelt Project and dashboarding. It is mainly written in Python and SQL.


1. Socioeconomic Portrait Project.

5. Automated ML regression within a Cloud Function to infer unemployment searches on Google, in Spain.

2. Elon Musk influence in world news.

3. Monitoring unemployment in Spain.

I really like this one. It is the most visual, it tells a story, and the sources that feed the graphs are shown, so you can browse through them.

4. Controversial public figures with Gdelt.

6. Gdelt project as marketing tool.


Socioeconomic Portrait Project. A Google Cloud ETL.


Is there a way of monitoring some aspects of the global crisis in Spain? I believe so, and that is the motivation for this automated ETL process in Google Cloud. It combines Google Trends, sentiment analysis and influence in news through the Gdelt Project and Twitter, from raw data acquisition to the final dashboard. Thanks to it, I have been fighting with credentials, permissions, storage locations, processing locations, third-party authentication, Cloud Functions, pipelines, trigger schedulers with different time formats, Dataprep global updates, etc. I learned a lot along the way. Quarantine fun! :D

It is worth mentioning that the selector buttons are there just to keep the graphs readable. With them you can select the curves you want to display.


Automated ML regression within a Cloud Function to infer unemployment searches on Google, in Spain:


Taking advantage of this project ( https://github.com/albertovpd/automated_etl_google_cloud-social_dashboard ), I am using the gathered data to feed an ML model that infers unemployment searches on Google in Spain.

  • Cloud Function A: Loads data from BigQuery tables to Cloud Storage, both in the US region. These tables contain requested and filtered info from the Gdelt Project, used to analyse online news media in Spain (news section in the automated ETL link).
  • Cloud Function B:
    • Reads the output of Cloud Function A, plus other data from a bucket in the EU. This bucket contains requested info from Google Trends in Spain (Google searches section in the automated ETL link).
    • Merges datasets with different lengths and dates.
    • Processes them and creates a column and score for each keyword.
    • Normalises the final dataset.
    • Associates dates with the index, but dates are not used as a feature, so a time series problem was turned into a linear regression one. Check out the full script explanation here.
    • Performs a Recursive Feature Elimination to select the best 20 of the 130 features I have to play with.
    • Applies a linear regression to infer my keyword, in this case unemployment.
    • Loads the results into a Cloud Storage bucket.
  • Both Cloud Functions are triggered by Pub/Sub and Scheduler. Scripts can be found here.

  • Results are loaded weekly into BigQuery tables with Transfer. Some results are appended to the existing tables and some are overwritten.

  • The BigQuery tables are then plotted.
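The merge and normalisation steps of Cloud Function B can be illustrated with a minimal sketch. It assumes an inner join on date and min-max scaling, which may differ from what the real script does; the function names and sample values are hypothetical.

```python
from datetime import date


def merge_on_common_dates(news, searches):
    """Inner-join two {date: value} series, keeping only dates present in both."""
    common = sorted(set(news) & set(searches))
    return [(d, news[d], searches[d]) for d in common]


def min_max_normalise(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up weekly scores: the two sources cover different date ranges.
news = {date(2020, 3, 1): 12.0, date(2020, 3, 8): 20.0, date(2020, 3, 15): 16.0}
searches = {date(2020, 3, 8): 40.0, date(2020, 3, 15): 100.0, date(2020, 3, 22): 70.0}

merged = merge_on_common_dates(news, searches)
dates = [row[0] for row in merged]
news_norm = min_max_normalise([row[1] for row in merged])
searches_norm = min_max_normalise([row[2] for row in merged])
```

Only the two weeks present in both series survive the merge, and each column is then scaled independently so that keywords with very different magnitudes become comparable.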

Explanation available here => https://github.com/albertovpd/automated_ML_regression/blob/master/script_explained.ipynb
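The feature selection and regression steps listed above can be sketched with scikit-learn. This is only an illustration on synthetic data, assuming scikit-learn and NumPy are installed; the array shapes mirror the 130 features and 20 selected ones mentioned above, but the data, coefficients and variable names are made up and the linked notebook remains the reference for the real script.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the normalised dataset: 200 rows, 130 keyword features.
n_rows, n_features = 200, 130
X = rng.random((n_rows, n_features))

# Hypothetical target: depends on the first 5 features plus a little noise.
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]
y = X @ true_coef + rng.normal(scale=0.01, size=n_rows)

# Recursive Feature Elimination: fit, drop the weakest feature, repeat
# until only 20 features remain.
selector = RFE(LinearRegression(), n_features_to_select=20, step=1)
selector.fit(X, y)

# Fit the final linear regression on the selected features only.
X_selected = selector.transform(X)
model = LinearRegression().fit(X_selected, y)
predictions = model.predict(X_selected)
```

`selector.support_` is a boolean mask over the original columns, so it can be used to recover the names of the keywords that were kept.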



Elon Musk influence in world news:


What do the world media say about Elon Musk or his companies? What is the sentiment associated with the news about him? What were the most positive and negative articles ever written about him? Let's check it out.

It has been really interesting to discover that "cheap clickbait webpages" are the ones that mention Mr. Musk most often, even more than his own webpages like "teslamotors" or similar. In the end, he is a controversial public figure who always has a very personal point of view.

Also interesting is the fact that the webpages I was expecting to see, like Forbes or the New York Times, only appear from the 22nd position downwards.



Monitoring Unemployment


Everyone is afraid right now of a global crisis like the one in 2008. Can we check how often unemployment-related topics are mentioned in the national press, and compare today's results with those from 2008?

The answer: yes. If we also want to check the articles involved: only from 2015 onwards.



Controversial public figures with Gdelt:


I like reading "alternative" sources, like reddit, hackernews or meneame, and once in a while I read news about delicate matters involving the King Emeritus of Spain. These articles always express a deep frustration about how this news is not being published in his country.

So, the questions I am trying to answer are the following:

Are the Spanish news outlets not publishing the same as the rest of the world about the King Emeritus of Spain?

Do we have a method to impartially contrast it?



Gdelt project as marketing tool.


Using the different sentiment analysis metrics provided by The Gdelt Project.

Query:

SELECT
  -- The GKG date column is an integer in YYYYMMDDHHMMSS format
  EXTRACT(DATE FROM PARSE_TIMESTAMP('%Y%m%d%H%M%S', CAST(date AS STRING))) AS Date,
  -- V2Tone is a comma-separated list of seven metrics
  CAST(SPLIT(V2Tone, ",")[OFFSET(0)] AS FLOAT64) AS tone,       -- average tone
  CAST(SPLIT(V2Tone, ",")[OFFSET(1)] AS FLOAT64) AS pos_score,  -- positive score
  CAST(SPLIT(V2Tone, ",")[OFFSET(2)] AS FLOAT64) AS neg_score,  -- negative score
  CAST(SPLIT(V2Tone, ",")[OFFSET(3)] AS FLOAT64) AS polarity,   -- polarity
  CAST(SPLIT(V2Tone, ",")[OFFSET(4)] AS FLOAT64) AS arf,        -- activity reference density
  CAST(SPLIT(V2Tone, ",")[OFFSET(5)] AS FLOAT64) AS sg_rf,      -- self/group reference density
  CAST(SPLIT(V2Tone, ",")[OFFSET(6)] AS FLOAT64) AS wc          -- word count
FROM
  `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE
  DATE(_PARTITIONTIME) >= "2018-01-01"
  AND LOWER(DocumentIdentifier) LIKE '%ironhack%'

If you are going to display results in Data Studio, always save the query results to a BigQuery table and display that table.
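If the raw V2Tone strings are exported and post-processed outside BigQuery, the same split can be done in Python. The sketch below reuses the column aliases from the query; the `Tone` class, the `parse_v2tone` helper and the sample value are hypothetical and only meant as an illustration.

```python
from typing import NamedTuple


class Tone(NamedTuple):
    """The seven comma-separated V2Tone metrics, named as in the query aliases."""
    tone: float
    pos_score: float
    neg_score: float
    polarity: float
    arf: float
    sg_rf: float
    wc: float


def parse_v2tone(raw: str) -> Tone:
    """Split a raw V2Tone string and convert each field to float."""
    parts = raw.split(",")
    if len(parts) != 7:
        raise ValueError(f"expected 7 V2Tone fields, got {len(parts)}")
    return Tone(*(float(p) for p in parts))


# Made-up sample value, not taken from real Gdelt data.
sample = "-2.5,1.25,3.75,5.0,20.5,0.5,400"
parsed = parse_v2tone(sample)
```

Using a named tuple keeps the positional order of V2Tone explicit while letting the rest of the code refer to fields by name, for example `parsed.tone` or `parsed.wc`.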



Alberto Vargas.