In this section we dive a little deeper into the technical considerations and the framework used to build http://www.sentcrypt.com/bitcoin/

5 Implementing a sentiment analyser

My main objective for the closing project of the CAS in Blockchain was to build something real and tangible combining two of my hobbies: machine learning and crypto currencies. I wanted to experiment with a fun, real-life project in order to familiarize myself with sentiment analysis and NLP technologies while producing a service or a good related to crypto currencies. I quickly realized that building a website from scratch while experimenting with these new technologies that I am passionate about was an interesting project in itself and would provide me with valuable knowledge that I could eventually leverage in my portfolio in the future. I settled for sentiment analysis using the Twitter API and the collection of bitcoin prices from APIs, plotting both on a single webpage; everything else is a bonus.

I usually learn by doing, so I logged in to my GoDaddy account and bought a domain name that was simple, easy to remember and actually quite relevant to the nature of the project: www.sentcrypt.com.

Once I had settled on the name, I had to think about the actual content to be displayed on the website, the technology to be used, the infrastructure and all the data aspects. I was adamant about using Python for the backend scripts and the data collection and processing, I decided to give Heroku a shot for the infrastructure and hosting, and finally I benchmarked a few different APIs and feeds for the data collection process.

5 b. Data collection

Once the domain name was secured and the minimum infrastructure deployed on Heroku, I first created a database. Heroku natively provides a free PostgreSQL DB to experiment with. The problem is that the free DB is limited to 10,000 rows and 1 GB of storage space. While the storage space is not a big problem, the number of rows is a serious limitation as I am planning on collecting hundreds of thousands of Bitcoin prices and tweets. I started with the free plan just to set up the basics: connect to the DB via the PostgreSQL GUI, create a table, insert a few lines, then connect via a Python script and insert lines from there. Everything was working as expected, so I upgraded to the Hobby Basic plan, which offers a comfortable 10,000,000 row limit.
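
As an illustration, here is a minimal sketch of such a connectivity test; the connection parameters are placeholders for the Heroku Postgres credentials, and the table name is a throwaway example.

import psycopg2

# Placeholder credentials: in practice they come from the Heroku Postgres add-on settings
conn = psycopg2.connect(host="<heroku-host>", database="<db-name>",
                        user="<db-user>", password="<db-password>", port=5432)
cur = conn.cursor()

# Create a throwaway table and insert one row to validate the round trip
cur.execute("CREATE TABLE IF NOT EXISTS connectivity_test (id SERIAL PRIMARY KEY, note TEXT);")
cur.execute("INSERT INTO connectivity_test (note) VALUES (%s);", ("hello from python",))
conn.commit()

cur.execute("SELECT COUNT(*) FROM connectivity_test;")
print(cur.fetchone()[0])

cur.close()
conn.close()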

With the DB in place, I then focused on retrieving the price of Bitcoin. Even here the choice is quite wide: do I want to retrieve only daily prices, or intraday prices with more granularity, every 5 minutes? 10 minutes? Which API to use, and how to make it scalable and robust? And how do I want to retrieve the prices, via an API or a web crawler that would scrape the precious data from the internet? While I was first interested in using Beautiful Soup to do some scraping, I then decided against it. I wanted the website to be robust and maintainable; web scrapers might work one day and be broken the next, potentially requiring manual workarounds and code adaptation. While benchmarking the data acquisition process I actually tried to scrape the data from coinmarketcap.com at first; the code worked the first day but could not retrieve the data the next, cementing my decision to settle for a more durable approach using an API.

The first data I acquired was the closing price of Bitcoin. I first used the coinmarketcap.com API and was surprised to see that the free tier does not offer any historical prices, not even the closing price. So I first collected the closing price from the website directly, as it is quite easy to collect it manually from the historical data page of any cryptocurrency, here. While the historical data goes back quite far, what if I miss one day through the API? Should I collect the price myself manually and insert it into the DB? Out of the question. So I had a look at the paid API subscriptions. The price for a hobbyist feed from coinmarketcap.com for personal projects is $29 per month, for 1 month of historical data. A scandalous price in my opinion.

I then looked for alternatives to collect the closing price of Bitcoin and settled for Nomics, which offered a free API allowing me to swiftly retrieve the daily prices, with historical prices as well. The API documentation was clear, the API key was sent right away, and I then experimented with Nomics to retrieve the prices and inject them into my PostgreSQL DB on Heroku via Python scripts.

The next problem was that even if Nomics offered historical daily prices, it did not offer intraday historical prices. I was then entering the phase of the project where I actually had to decide what I wanted to plot precisely on the website. I realized that plotting daily prices is fine, but if I want to calculate sentiment on the fly and plot it alongside the price of bitcoin, then I also need a more granular, intraday feed. Nomics could offer this partly, but I also wanted to retrieve past values and secure access to at least a few days of intraday history to mitigate the risk of a deficient script.

In parallel, I was also experimenting with retrieving and plotting stock tickers such as MSFT, TSLA or the S&P 500 in order to build a custom dashboard to follow my own stocks, and I was actually also including BTC and ETH in that dashboard:

I then realized that I was actually using the Yahoo Finance API to retrieve the data for my tickers, which as a matter of fact provides intraday data and, best of all, contains the last 30 days of intraday data for bitcoin, free of charge. The API is easy to use and only requires pip installing the yfinance library. Retrieving intraday prices for bitcoin is literally done in two lines of code:

bitcoin = yf.Ticker("BTC-USD")
hist_bitcoin = bitcoin.history(start=date_today,interval="5m" )

Yahoo Finance was a revelation and I quickly modified my daily closing + high + low price code to retrieve data from this source. I now also had a clean feed for my intraday data. With an available backlog of 30 days it is quite easy to retrieve any missing data in case of service interruption or network problems. But how to retrieve data older than 30 days? I could of course simply go back 30 days in time and start gathering data from that point. But I wasn't really comfortable with that; I wanted to build a fully functional and complete database containing closing, low, high and intraday prices of Bitcoin. I searched the internet for days to retrieve the data and finally bought it on Fiverr from a user named "data_dealer". This gentleman sold me 7 years of clean intraday data for Bitcoin for CHF 9. I was curious to know how he managed to get it, and he was kind enough to provide the explanation: he was scraping it directly from the graphs on coinmarketcap.com. In any case, I loaded the data and picked up from there with the Yahoo Finance feed going forward. Here is the link to the Yahoo Finance API: link.

My script could simply run with a timer, querying the API every x seconds and then sleeping. But running a script indefinitely is not really a best practice. So I settled for a script that is launched every 5 minutes, retrieves the data published since the last run and inserts it into the DB. On every run it also checks whether a price for the previous day is present in the DB and, if not, queries and inserts it as well, to make sure that we maintain both the daily prices and the intraday prices.

Concerning the tweets there is a limited number of options. I can either try to scrape the data via specific searches or request a Twitter developer API key and start from there. I was interested in trying the streaming API from Twitter so I decided to go that way. The developer keys are not granted instantly so I had to wait a few days after making the request here: link. Once the keys were received I searched for a clean way to query the API and retrieve the tweets. The market solution seems to be Tweepy, which is simply described as "An easy-to-use Python library for accessing the Twitter API". In order to start receiving tweets, I had to fill in the API credentials from Twitter, create a listener class to receive the streaming tweets as input and then open the connection with our credentials. The tweets are received in the form of a JSON document that has to be parsed; an example of such a tweet is provided in the annex. I also specified a tracking option to only retrieve tweets containing the word "bitcoin", in order to only receive relevant tweets.

5 c. Applying sentiment analysis

Once the tweets are correctly parsed in our listener and streamed live from the Twitter API, what is left to do is to retrieve the text of the tweet itself and apply sentiment analysis to it. I could build my own sentiment analyser using one of the following techniques:

  • Supervised machine learning
  • Lexicon based
  • Hybrid (ML+LB)
  • Graph based approach

 For the sake of time I decided to go instead with an out-of-the-box solution leveraging one of the available Python libraries. While researching the subject, two libraries stood out and appeared to be the standard:

  1. TextBlob : Simplified Text Processing

TextBlob is a library for processing textual data. It offers a simple and straightforward API to dive into common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification and translation (a short usage example follows the feature list below). Here is the complete list of features natively available in the library:

Features

  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • Add new models or languages through extensions
  • WordNet integration
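
As a minimal example of the TextBlob sentiment API (the sample sentence is invented): the sentiment property returns a polarity between -1 and 1 and a subjectivity between 0 and 1.

from textblob import TextBlob

# Invented sample tweet, just to illustrate the API
blob = TextBlob("Bitcoin is looking great today, very bullish!")
print(blob.sentiment)            # Sentiment(polarity=..., subjectivity=...)
print(blob.sentiment.polarity)   # polarity in [-1, 1]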

  2. Vader : Valence Aware Dictionary and sEntiment Reasoner

Vader is a lexicon and rule-based sentiment analysis library that is specifically attuned to sentiments expressed in social media. From the GitHub repo we can read: "It implements the grammatical and syntactical rules described in the paper, incorporating empirically derived quantifications for the impact of each rule on the perceived intensity of sentiment in sentence-level text. Importantly, these heuristics go beyond what would normally be captured in a typical bag-of-words model. They incorporate word-order sensitive relationships between terms. For example, degree modifiers (also called intensifiers, booster words, or degree adverbs) impact sentiment intensity by either increasing or decreasing the intensity."

The GitHub repository also offers a lot of examples and a good explanation of the scoring mechanism.

Vader outputs the sentiment in the following way:

{'neg': 0.026, 'neu': 0.492, 'pos': 0.482, 'compound': 0.9798}

The compound score is obtained by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to lie between -1 (extreme negative) and +1 (extreme positive). It is the most straightforward metric to use for a text if we want a unidimensional output.

In general, the consensus is to classify the compound score as follows (a short sketch of this rule follows the list):

  • positive sentiment: compound score >= 0.05
  • neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
  • negative sentiment: compound score <= -0.05
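
A minimal sketch of this classification rule (the helper function name is mine, not part of Vader):

def classify_compound(compound):
    # Thresholds follow the conventional Vader guidance quoted above
    if compound >= 0.05:
        return "positive"
    elif compound <= -0.05:
        return "negative"
    else:
        return "neutral"

print(classify_compound(0.9798))   # positive
print(classify_compound(-0.02))    # neutral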

At first sight it seems quite natural to go with Vader, as I will be exclusively running sentiment analysis on social media content. After conducting additional research, Vader also appears to be better at handling slang and emojis, cementing my first opinion. It would nevertheless be interesting to benchmark both libraries on our dataset of tweets.

In practice, using Vader is extremely simple. After installing the library via pip and importing it, the only thing to do is instantiate the analyzer and call it on a text:

analyzer = SentimentIntensityAnalyzer()
vs = analyzer.polarity_scores(tweet)
sentiment = vs['compound']

And that’s it! Retrieval of the sentiment done!

5 d. Storage and Archiving

Now that we have tackled data retrieval and the application of the sentiment analysis library, the challenge is to store the data in a smart way and industrialize the process to limit the manual work involved. With our Heroku database we are limited to 10,000,000 records, and while that may appear to be a high number, it is not that high.

We are already using 800,000 rows for our intraday historical bitcoin prices, and we are storing 288 additional rows per day. We also have 2,500 records of daily prices and growing (albeit slowly). So the Bitcoin part seems to be under control and we do not have to worry about it for the next two years if we settle for a limit of around 1 million rows for our cryptocurrency data.

But what about the tweets? Based on a quick benchmark I am receiving around 2,500 tweets per hour, that is 60,000 tweets per day. How am I supposed to store them in a reliable way and plot them on the website? It seems quite tedious to do.
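
To put a rough number on it: with around 1 million rows reserved for the bitcoin price tables, the remaining ~9 million rows of the 10-million-row limit would be consumed by 60,000 tweets per day in roughly 150 days, about five months, which is why the rolling 30-day deletion described in the storage plan below is needed.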

The way I want to plot the data on the website is also a parameter to take into account while designing the way we are going to store it, and vice versa. Both are quite interconnected, and a global solution and vision have to be established before releasing anything into production. After careful consideration, here is the solution that appeared to be the most pragmatic:

Display:

  1. Daily price of bitcoin with high, low and closing values for every day over the last few years. Updated daily to incorporate the values of the previous day, with an interval of one day between two values
  2. Intraday price of bitcoin, with a granularity of 5-minute intervals
  3. Daily sentiment aggregate, showing the average daily sentiment for every past day; no history is available at the moment but the historical values are built daily by our scripts
  4. Intraday average sentiment on a rolling window of 5 minutes. The graph is updated every 5 minutes with the average value of the tweets streamed in this interval
  5. A table displaying the latest streamed tweets with their polarity and a colour code to identify the negative and positive ones

Storing:

  1. One daily table containing the daily prices of bitcoin, one row per day
  2. One intraday table of prices for bitcoin, one row every 5 minutes
  3. One intraday tweets table, storing all the tweets streamed from the API plus the calculated sentiment. Entries older than 30 days are deleted on a daily basis
  4. One daily table containing an average of the sentiment of the past day, calculated daily

This way the data is clearly segmented: we can query the tables to retrieve the tweets currently being streamed, calculate and display the rolling average sentiment, and plot the daily and intraday prices of bitcoin. A schema sketch for these four tables is given below.
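
The sketch below assumes an open psycopg2 connection (conn and cur, obtained as in the scripts of section 7); the column names match the INSERT statements used there, while the data types are my own assumptions.

# Schema sketch; assumes conn/cur from psycopg2.connect as in section 7
cur.execute("""
    CREATE TABLE IF NOT EXISTS crypto_price_day (
        ccid TEXT PRIMARY KEY, crypto TEXT, crypto_name TEXT,
        date TIMESTAMP, open NUMERIC, high NUMERIC, low NUMERIC, close NUMERIC);
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS crypto_price_intraday (
        ccid TEXT PRIMARY KEY, crypto TEXT, crypto_name TEXT,
        date TIMESTAMP, price NUMERIC);
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS sent (
        id_tweet BIGINT, date TIMESTAMP, tweet TEXT, sentiment NUMERIC);
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS statistics_daily (
        day DATE, tweet_nb INTEGER, av_sent NUMERIC,
        google_search INTEGER, transactions_nb INTEGER, wallet_nb INTEGER);
""")
conn.commit()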

Automating and scheduling all those scripts is a challenge, but once in place everything should run smoothly and with minimal supervision.

5 g. Infrastructure and deployment

Concerning the deployment and the tools used by the website, I chose the following options:

Hosting/Deployment: Heroku

The name comes from "heroic" and "haiku", a shout-out to "Matz", the Japanese creator of the Ruby language. Heroku is a cloud platform as a service: a container-based cloud platform built for developers to deploy, manage and scale modern applications. It simply offers an easy-to-use framework to deploy web apps quickly. It is supposed to be beginner friendly, and while I had no big problems using it, I found some of the processes quite complex. I was for instance expecting the upgrade of a database to be a one-button operation, whereas it actually required bash/shell scripts to be run on my side and a manual copy of the data from the initial free DB to the new paid DB. I hesitated to deploy the apps on a regular cloud provider, but Heroku seemed more "out-of-the-box" oriented and better adapted for personal projects. I am already the developer, the change manager and the project manager, I cannot wear a DevOps hat in addition!

The products offered by Heroku are:

  • Platform
  • Postgres
  • Redis
  • Teams
  • Enterprise
  • Connect
  • Elements

I used Platform to deploy our Python scripts and Postgres to store the data.

Web Development: Flask

Flask is a micro web framework in Python; it is called a microframework because it does not require any additional libraries or technologies to build a web application. The application can take the form of web pages similar to what we are trying to build at www.sentcrypt.com, or a blog, a wiki, or even a web-based calendar or project management tool. The main features of Flask are:

  • Built-in development server, fast debugger
  • Integrated support for unit testing
  • RESTful request dispatching
  • Jinja2 Templating
  • Support for secure cookies
  • Lightweight and modular design allows for a flexible framework

A possible alternative to Flask would have been Django, but as I am a beginner in the domain I preferred to stick to something simpler. Django is a full-featured framework and as such it comes with tons of features out of the box; I prefer a more iterative approach here, starting with a bare-bones skeleton and expanding on top of it with additional libraries when needed. It is nice to have all features included, but especially in a "simple" project like ours we don't need all those fancy tools. Flask offered us simplicity and flexibility.
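
To give a taste of how lightweight the framework is, here is a minimal Flask sketch; the route and the returned text are hypothetical and not the actual sentcrypt.com code.

from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # Placeholder response; a real page would typically render a template embedding the charts
    return "sentcrypt.com - bitcoin price and sentiment dashboard"

if __name__ == "__main__":
    app.run(debug=True)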

Plotting: Bokeh

Bokeh is a visualization library in Python that is especially well adapted to building data applications; it offers a wide range of visualization options and chart types. As we will be plotting price data and sentiment time series, having a performant way to display them is paramount. I also considered using Dash, another popular dashboard-building option.
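
A minimal sketch of the kind of time-series line chart we need, with made-up sample prices (this is not the production plotting code):

from datetime import datetime, timedelta
from bokeh.plotting import figure, output_file, show

# Made-up sample data: a few hourly bitcoin prices
times = [datetime(2020, 12, 9) + timedelta(hours=i) for i in range(5)]
prices = [18200, 18350, 18300, 18500, 18450]

output_file("btc_price.html")
p = figure(x_axis_type="datetime", title="BTC-USD (sample data)",
           x_axis_label="Time", y_axis_label="Price (USD)")
p.line(times, prices, line_width=2)
show(p)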

All the coding was done in Python with a vast array of libraries that you can see in the code overview in the annex.

Handling authentication

An interesting point to raise is the management of the credentials used for the connections to the different APIs and to the database, and how to industrialize it without hardcoding the keys into the code. When you are working with Heroku or with GitHub it can be quite tempting to upload your code with hardcoded access keys in order to make it work directly; it is just faster. But this approach has two direct flaws:

The first one is that from a confidentiality perspective this is just plain bad. There are crawlers on GitHub scanning all the public repos to gather exactly this, API keys, which are then sold on the black market for other people to use. In general it is best practice to decouple your code and your credentials; both should be stored in two different places and your code should access those centralized credentials via a config parser.

The second reason is that it makes your code more flexible and maintainable: in the future, if you have hundreds of files accessing those credentials, you don't want to have to change them in every single file when the credentials are updated. You want to be able to do it in one single file that all your code accesses.

You can do that with the ConfigParser class, which implements a basic configuration language providing a structure similar to what's found in Microsoft Windows INI files. You can use this to write Python programs which can be customized by end users easily.

Here is an example of my credentials.ini file (without real keys, obviously):
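
Since the real file cannot be shown, here is an illustrative sketch with placeholder values; the postgresql key names are the usual parameters expected by psycopg2.connect (an assumption on my part), while the TWITTER keys match those read by the script in section 7.b.

[postgresql]
host = <heroku-postgres-host>
database = <database-name>
user = <database-user>
password = <database-password>
port = 5432

[TWITTER]
ckey = <consumer-key>
csecret = <consumer-secret>
atoken = <access-token>
asecret = <access-token-secret>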

And here is a sample of code accessing the credentials:

#Get Connection details in .ini file for POSTGRES
def config_POSTGRES(filename='credentials.ini', section='postgresql'):
    # create a parser
    parser = ConfigParser()
    # read config file
    parser.read(filename)
    # get section, default to postgresql
    db = {}
    if parser.has_section(section):
        params = parser.items(section)
        for param in params:
            db[param[0]] = param[1]
    else:
        raise Exception('Section {0} not found in the {1} file'.format(section, filename))
    return db

Using the ConfigParser I simply open the credentials.ini file, access the relevant section, in this case the credentials for the database, and retrieve the values!

5 e. Optimizing the data

Once the data pipeline was running smoothly and collecting all the information that I needed, it was time to start working on the data itself. No additional effort was needed on the bitcoin prices for now; we might expand at a later stage to gather other metrics concerning bitcoin such as:

  • Number of wallets
  • Number of transactions
  • Wikipedia reads
  • Google searches
  • Mentions in mainstream media

We might also expand the scope of our project to include other cryptocurrencies, but for now we are OK on that side.

The same cannot be said of the tweets. We are calculating a sentiment on text; the calculation is quite subjective and highly dependent on the quality of the data and on the processing. After reviewing samples of the tweets in the database, it was very clear that the calculation of the sentiment was polluted by various factors and that, as such, the sentiment was not reliable enough.

My first observation was that it was hard for me to benchmark the quality of the sentiment analysis on tweets in languages I did not understand. Vader is "supposed" to work all right with languages other than English, but to avoid any problems I preferred to filter out all the tweets that are not in English. This can be done quite easily as the language of the tweet is part of the API payload; it was only a matter of identifying the right field.

After filtering on the language I also decided to filter out all the tweets from accounts with fewer than 50 followers. The threshold of 50 is completely arbitrary on my part, but I figured that this would filter out at least part of the bots and noise. As we are trying to do sentiment analysis in the context of bitcoin's price, I also considered that smaller accounts with fewer followers do not hold the same reach as bigger accounts, and that as such the sentiment of their tweets is probably less impactful.

By reviewing the tweets streaming into the database I also observed that I was getting a high number of retweets. Especially in the crypto space, Twitter is full of contests and people trying to win some obscure alt coins by retweeting. This noise adds no value, hence I also filtered out all the retweets. I did this by removing all the tweets whose first two letters are "RT".

Finally, I also made changes to the tweets themselves. I decided to remove all mentions of other users: on Twitter you can tag another user with "@", and user names are obviously not relevant in the context of sentiment analysis, so I simply deleted the strings starting with "@". I then also deleted all the URLs from the tweets, as those do not add much to the content, and finally I removed all the special characters (except #) and punctuation in order to further reduce the noise. Hashtags do hold value in the context of this study, so I decided to keep those in. A condensed sketch of these filters and cleaning steps is shown below; the full version lives in the script of section 7.b.
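
A minimal sketch of the filtering and cleaning logic described above; the helper names are mine, and the production version, based on tweepy and numpy, is in section 7.b.

import re

def is_relevant(tweet_json):
    # Keep only English tweets from accounts with more than 50 followers, excluding retweets
    return (tweet_json['lang'] == 'en'
            and tweet_json['user']['followers_count'] > 50
            and not tweet_json['text'].startswith('RT'))

def clean_text(text):
    text = re.sub(r"@[\w]*", "", text)                   # drop user mentions
    text = re.sub(r"https?://[A-Za-z0-9./]*", "", text)  # drop URLs
    text = re.sub(r"[^a-zA-Z# ]", " ", text)             # drop special chars and numbers, keep #
    return text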

Let's have a look at a few tweets and the calculated sentiment:

“#BTC could drop to $16,000 or possibly to the $13,800 level.  However, the probability for a drop to the $13,000 r… ”            -0.4939

“I’m buying bitcoin in bulk 🙂 At the best rates💯You will be amazed by the awesome offers…I pay fast #dontmissout ” 0.9022

“What are you afraid of? Invest today and see your life changing!! 100% guarantee of profit earning! DM ME NOW!💰💵g… ” 0.7243

“NOO WAY!!!! BITCOIN READY TO SHOCK US AGAIN!!!! [WATCH BEFORE MONDAY] AL…  via  ” -0.3111

“Congrats  you have won $33 in Bitcoin #BSV 🐉 Pickaw seed: Cd55adhDu3RxEYow” 0.7964

As we can easily observe, the sentiment analysis is working all right; the general sentiment is correctly estimated in my opinion. But the tweets are rarely relevant. In this small sample that I took randomly from the table, 3 tweets are obviously self-interested scams or shills. It is therefore fair to ask whether the analysis of such tweets has any added value in determining the future price of bitcoin, or whether this noise is only bad luck in our random walk.

6 Results and conclusions

The website is still under construction but is already capable of displaying most of the information that I want it to display. It is showcasing:

  • Historical daily closing, high and low price of bitcoin (updated daily)
  • Historical intraday price of bitcoin (updated every 5 minutes)
  • Today's sentiment on a rolling window of 5 minutes (updated every 5 minutes)
  • Sample of tweets being streamed with the calculated sentiment (updated every refresh). Green tweets are positive and red tweets are negative
  • (future work) : historical daily sentiment aggregate
  • (future work) : additional metrics
  • (future work) : Live calculation of VAR and correlation between all values

Here is a snapshot of the current state of the website:

We demonstrated in the first sections of this work that the literature has generally been converging towards the idea that there is a clear link and correlation between tweets and other social media indicators and the price of bitcoin. In our work we were not able to go as far as benchmarking the nature of this relationship. Indeed, we have been more focused on building the website, the infrastructure and the industrialization of the whole process rather than analysing the data and trying to get insights from it. But now that all the foundations have been laid and the data is being streamed and acquired in a 100% automatic way, we will be able to use it and see whether we can indeed confirm the hypothesised link between the price of Bitcoin and Twitter sentiment.

From a technical perspective we were able to solve all our problems and challenges, from the hosting and data pipelines to DB administration and continuous deployment and integration. But we have clearly identified limiting factors concerning the quality of the data. We identified some problems in the previous section of this document; let's elaborate further.

The Twitter API only gives access to about 1% of the total volume of tweets going through Twitter. This in itself is a limitation on what we can retrieve from the famous micro-blogging website. By only reaching 1% of the volume we miss a lot of the content, and it is quite clear that we are subject to sampling randomness, with the possibility of getting highly skewed data. In addition, tweets are only 280 characters long and usually contain hashtags, tags, URLs, emojis and various "fillers" that further limit the number of relevant words available for analysis.

We also observed that, in terms of quality, tweets have limited value. Bots (estimated to be around 1-14% of total volume according to Olivier Kraaijeveld 2018), spam and contests represent significant noise.

Specific to sentiment analysis, a fast-evolving jargon and lexicon (a consequence of the lack of maturity of the technology), and the difficulty of quantifying affective states and subjective information without a proper interpretation of emojis, represent challenges to efficiently extracting an accurate sentiment benchmark. We will cover this part of the analysis as a next step of this work.

7 Code

In this section I will show some of the back-end code for the retrieval of data and the tweets. The Heroku part and the Bokeh part will not be discussed or shown here.

7.a Code for BTC prices gathering and storing

#/*******************************************************
#Name ........ : Insert Streaming BTC Price Into postgres_v02.py
#Context ..... : Natural language processing and Crypto Prices
#Role ........ : Get BTC prices and insert in DB
#Author ...... : JDO
#Version ..... : V1
#Date ........ : 09.12.2020
#Language : Python
#Version : 3.7.8
#********************************************************/

from datetime import date, timedelta, datetime
import time
import json
from unidecode import unidecode
import sqlite3 
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import psycopg2
from configparser import ConfigParser
import yfinance as yf
from pytz import timezone
import pytz

#Dates Variables
date_today = date.today()
date_yesterday = date.today() - timedelta(days=1)
#Get BITCOIN TICKER from yahoo finance
bitcoin = yf.Ticker("BTC-USD")
#Update table with missing daily prices for BITCOIN
COUNTER_INSERT = 0

#Function to get Connection details in .ini file for POSTGRES DB
def config_POSTGRES(filename='credentials.ini', section='postgresql'):
    # create a parser
    parser = ConfigParser()
    # read config file
    parser.read(filename)
    # get section, default to postgresql
    db = {}
    if parser.has_section(section):
        params = parser.items(section)
        for param in params:
            db[param[0]] = param[1]
    else:
        raise Exception('Section {0} not found in the {1} file'.format(section, filename))
    return db

#Get Connection details in .ini file for POSTGRES DB and open connection
conn = None
params = config_POSTGRES()
print('Connecting to the PostgreSQL database...')
conn = psycopg2.connect(**params)       
cur = conn.cursor()     


#---------------------- UPDATE CRYPTO_PRICE_DAY TABLE ----------------------
#if we need all history we can use period = max
#hist_bitcoin_daily = bitcoin.history(period="max")

#If not let's get the last record from the table
cur.execute("SELECT MAX(date) FROM crypto_price_day;")
x=cur.fetchone()

#We query the potential missing data from yahoo finance
hist_bitcoin_daily = bitcoin.history(start=x[0],end=date_yesterday)

#Then we loop over the data retrieved from yahoo finance
for index, row in hist_bitcoin_daily.iterrows():
    ccid='BTC'+'_'+str(index)
    cur.execute("SELECT * FROM crypto_price_day where ccid=%s;",(ccid,))
    x=cur.fetchone()
    #If no record concerning the retrieved row then we insert
    if x is None:     
        cur.execute("INSERT INTO crypto_price_day (ccid, crypto, crypto_name, date, open, high, low, close) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)",(ccid,'BTC','BITCOIN',index,row['Open'],row['High'],row['Low'],row['Close']))
        conn.commit()
        COUNTER_INSERT += 1
    else:
        continue 

#Print metrics
print('-------------------------------------')
print('FINISHED CRYPTO_PRICE_DAY UPDATE. INSERTED %s rows'%(COUNTER_INSERT))

#---------------------- UPDATE CRYPTO_PRICE_INTRADAY TABLE ----------------------
#Reinitialize counter
COUNTER_INSERT = 0

#let's get the last record from the table
cur.execute("SELECT MAX(date) FROM crypto_price_intraday;")
x=cur.fetchone()

#We query the potential missing data from yahoo finance
hist_bitcoin_intraday = bitcoin.history(start=x[0],interval="5m")

#Then we loop over the data retrieved from yahoo finance
for index, row in hist_bitcoin_intraday.iterrows():
    ccid='BTC'+'_'+str(index)
    cur.execute("SELECT * FROM crypto_price_intraday where ccid=%s;",(ccid,))
    x=cur.fetchone()
    #If no record we insert into the table
    if x is None:           
        cur.execute("INSERT INTO crypto_price_intraday (ccid, crypto, crypto_name, date, price) VALUES (%s, %s, %s, %s, %s)",(ccid,'BTC','BITCOIN',index,row['Close']))
        conn.commit()
        COUNTER_INSERT += 1
    else:
        continue

#Print metrics
print('-------------------------------------')
print('FINISHED CRYPTO_PRICE_INTRADAY UPDATE. INSERTED %s rows'%(COUNTER_INSERT))

#---------------------- STREAMING UPDATES ----------------------
print('-------------------------------------')
print('ENTERING STREAMING')

while True:
    #now = datetime.now()- timedelta(hours=1)
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    #CRYPTO_PRICE_INTRADAY Streaming update
    hist_bitcoin_intraday = bitcoin.history(start=date_today,interval="5m" )
    ccid='BTC'+'_'+str(hist_bitcoin_intraday.index[-1])
    Date = hist_bitcoin_intraday.index[-1]
    price=hist_bitcoin_intraday.iloc[-1]['Close']
    cur.execute("SELECT * FROM CRYPTO_PRICE_INTRADAY where ccid = %s", (ccid,))
    x=cur.fetchone()
    if x is None:           
        cur.execute("INSERT INTO CRYPTO_PRICE_INTRADAY (ccid, crypto, crypto_name, date, Price) VALUES (%s, %s, %s, %s, %s)",(ccid,'BTC','BITCOIN',Date,price))    
        conn.commit()
        print("%s : Value for BTC INTRADAY inserted PRICE = %s for DATE = %s"%(current_time,price, Date))
    else:
        print("%s : No new BTC INTRADAY price"%(current_time))

    #CRYPTO_PRICE_DAY Streaming update
    hist_bitcoin_daily = bitcoin.history(start=date_yesterday, end=date_yesterday)
    ccid='BTC'+'_'+str(hist_bitcoin_daily.index[-1])
    Date = hist_bitcoin_daily.index[-1]
    Open=hist_bitcoin_daily.iloc[-1]['Open']
    High=hist_bitcoin_daily.iloc[-1]['High']
    Low=hist_bitcoin_daily.iloc[-1]['Low']
    Close=hist_bitcoin_daily.iloc[-1]['Close']

    cur.execute("SELECT * FROM CRYPTO_PRICE_DAY where ccid = %s", (ccid,))
    x=cur.fetchone()
    if x is None:           
        cur.execute("INSERT INTO CRYPTO_PRICE_DAY (ccid, crypto, crypto_name, date, Open, High, Low, Close) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)",(ccid,'BTC','BITCOIN',Date,Open, High, Low, Close))    
        conn.commit()
        print("%s : Value for BTC DAY inserted PRICE = %s for DATE = %s"%(current_time,Close, Date))
    else: 
        print("%s : No new BTC DAY closing price"%(current_time))


    time.sleep(120)

7.b Code to gather tweets, calculate a sentiment and store them in the DB

#/*******************************************************
#Name ........ : Insert Streaming Tweets Into postgres_v03.py
#Context ..... : Natural language processing and Crypto Prices
#Role ........ : Get tweets, apply sentiment analysis and store in DB
#Author ...... : JDO
#Version ..... : V1.1
#Date ........ : 09.12.2020
#Language : Python
#Version : 3.7.8
#********************************************************/

from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import json
from unidecode import unidecode
import time
import sqlite3 
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import psycopg2
from configparser import ConfigParser
import yfinance as yf
import random 
import numpy as np
import re
from datetime import date, timedelta, datetime
import datetime

#Get Connection details in .ini file for POSTGRES
def config_POSTGRES(filename='credentials.ini', section='postgresql'):
    # create a parser
    parser = ConfigParser()
    # read config file
    parser.read(filename)
    # get section, default to postgresql
    db = {}
    if parser.has_section(section):
        params = parser.items(section)
        for param in params:
            db[param[0]] = param[1]
    else:
        raise Exception('Section {0} not found in the {1} file'.format(section, filename))
    return db

#Opening connection to postgres database
conn = None
params = config_POSTGRES()
print('Connecting to the PostgreSQL database...')
conn = psycopg2.connect(**params)       
cur = conn.cursor()     

#Import the sentiment analyzer from VADER
analyzer = SentimentIntensityAnalyzer()

#Get Connection details in .ini file for TWITTER
def config_TWITTER(filename='credentials.ini', section='TWITTER'):
    # create a parser
    parser = ConfigParser()
    # read config file
    parser.read(filename)
    # get section, default to postgresql
    twit_config = {}
    if parser.has_section(section):
        params = parser.items(section)
        for param in params:
            twit_config[param[0]] = param[1]
    else:
        raise Exception('Section {0} not found in the {1} file'.format(section, filename))
    return twit_config


#Get the connection details     
param_twitter=config_TWITTER()


def remove_pattern(input_txt, pattern):
    r = re.findall(pattern, input_txt)
    for i in r:
        input_txt = re.sub(i, '', input_txt)        
    return input_txt

def clean_tweets(tweets):
    #remove twitter Return handles (RT @xxx:)
    tweets = np.vectorize(remove_pattern)(tweets, r"RT @[\w]*:")
    #remove twitter handles (@xxx)
    tweets = np.vectorize(remove_pattern)(tweets, r"@[\w]*")
    #remove URL links (httpxxx)
    tweets = np.vectorize(remove_pattern)(tweets, r"https?://[A-Za-z0-9./]*")
    #remove special characters, numbers, punctuations (except for #)
    tweets = np.vectorize(lambda t: re.sub(r"[^a-zA-Z# ]", " ", t))(tweets)

    return tweets


class listener(StreamListener):
    def on_data(self, data):
        try:
            data = json.loads(data)
            print("---------STARTING----------------")
            #print(data)
            #Cleaning the tweets
            if data['text'][:2] != 'RT' and data['user']['followers_count'] > 50 and data['lang']=='en':
                id_tweet=data['id']
                #print("---------PRINT RAW----------------")
                #print(data['id_str'])
                #print(data['text'])
                tweet_list = []
                tweet_list.append(data['text'])
                tweet = clean_tweets(tweet_list)
                #print("---------PRINT CLEANED----------------")
                #print(tweet[0])
                time_ms = int(data['timestamp_ms'])/1000
                time_dt=datetime.datetime.fromtimestamp(time_ms).isoformat()
                vs = analyzer.polarity_scores(tweet)
                sentiment = vs['compound']
                #print("---------PRINT  TIME----------------")
                print(time_dt)
                #print("---------PRINT  SENTIMENT----------------")
                #print( sentiment)
                if sentiment != 0.0:
                    print("Tweet Insertion started")
                    cur.execute("INSERT INTO sent (id_tweet, date, tweet, sentiment) VALUES (%s,%s, %s, %s)",(id_tweet,time_dt, tweet[0], sentiment))                 
                    conn.commit()
                    print("Tweet Inserted")
                else:
                    print("Sentiment unclear")
            else:
                #print(data['text'])
                print("Tweet does not look relevant")
                #print("Language =%s"%data['lang'])
                #print("Follower count =%s"%data['user']['followers_count'])
                #print("RT Status = %s"%data['text'][:2])
        except KeyError as e:
            print(str(e))
        return(True)

    def on_error(self, status):
        print(status)

while True:
    try:
        auth = OAuthHandler(param_twitter['ckey'],param_twitter['csecret'])
        auth.set_access_token(param_twitter['atoken'], param_twitter['asecret'])
        twitterStream = Stream(auth, listener())
        twitterStream.filter(track=["Bitcoin"])
    except Exception as e:
        print(str(e))
        time.sleep(5)

7.c Archiving of tweets

#/*******************************************************
#Name ........ : CleanUp and Insert History_v01.py
#Context ..... : Natural language processing and Crypto Prices
#Role ........ : Clean up of daily tweets and archiving of sentiment
#Author ...... : JDO
#Version ..... : V1.1
#Date ........ : 09.12.2020
#Language : Python
#Version : 3.7.8
#********************************************************/

from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import json
from unidecode import unidecode
import time
import sqlite3 
from configparser import ConfigParser
import yfinance as yf
import random 
import numpy as np
import re
from datetime import date, timedelta, datetime
import datetime
import pandas as pd
import pandas.io.sql as sqlio
import psycopg2



#Get Connection details in .ini file for POSTGRES
def config_POSTGRES(filename='credentials.ini', section='postgresql'):
    # create a parser
    parser = ConfigParser()
    # read config file
    parser.read(filename)
    # get section, default to postgresql
    db = {}
    if parser.has_section(section):
        params = parser.items(section)
        for param in params:
            db[param[0]] = param[1]
    else:
        raise Exception('Section {0} not found in the {1} file'.format(section, filename))
    return db


#Opening connection to postgres database
conn = None
params = config_POSTGRES()
print('Connecting to the PostgreSQL database...')
conn = psycopg2.connect(**params)       
cur = conn.cursor()     
date_yesterday = date.today() - timedelta(days=1)
date_beforeyesterday = date.today() - timedelta(days=2)
date_Month = date.today() - timedelta(days=30)

#Get all the sentiments from the sent table for the last day
print(date.today())
print(date_beforeyesterday)
dat = pd.read_sql_query("SELECT * FROM sent where sentiment is not NULL and CAST(date AS DATE)<%s and CAST(date AS DATE)>%s ;",conn, params=(date.today(),date_beforeyesterday,))
Number_tweets = len(dat.index)
Average_sentiment = 0

#calculate sum of sentiments
for index, row in dat.iterrows():
    Average_sentiment += float(row['sentiment'])

#calculate the average sentiment
if Number_tweets > 0:
    Average_sentiment=Average_sentiment/Number_tweets

    #Check if there is already a line for yesterday
    cur.execute("SELECT * FROM statistics_daily where day=%s;",(date_yesterday,))
    x=cur.fetchone()

    #insert the aggregated sentiment for yesterday
    if  x is None:
        cur.execute("INSERT INTO statistics_daily (day, tweet_nb, av_sent, google_search, transactions_nb, wallet_nb) VALUES (%s, %s, %s, %s, %s, %s)",(date_yesterday,Number_tweets,Average_sentiment,0,0,0))
        conn.commit()
    else:
        print('Already in the table bro')

    #Delete values older than a month
    cur.execute("DELETE FROM sent where date < %s;",(date_Month,))
    conn.commit()

You can see the result of the work at www.sentcrypt.com. Even if the layout is not great, my goal of building a working data pipeline, calculating a sentiment and showing it on a website is complete!