Crypto Tweets Fetch using Flume & Hadoop (PRACTICAL)
Simran: Hey! I am new to investing in cryptocurrency.
Me: Nice! At least you started investing! That’s good!
Simran: But I feel the crypto market depends heavily on news. Since we both come from technical backgrounds, can we do something about it?
Me: Yeah, sure! I guess we can. I have heard of Apache Flume, a great tool for collecting and moving large volumes of streaming data. We can collect the tweets of Elon Musk 😏 and see what we find.
Simran: That sounds interesting. Can you tell me in brief how I can also analyze them?
Me: Sure! So let’s start!
Me: Basically, we will stream data from Twitter. To get the tweets, we need to set up a Twitter application, pick keywords related to the cryptocurrency Doge 🪙, and then run Hadoop and Flume.
Simran: As far as I remember, your last Medium article on Hadoop (WordCounter in Hadoop! (Windows PRACTICAL) | by Shubham Kumar Gupta | Jan, 2022 | Medium) was about handling big data, but what is this Flume now?
Me:
Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.
It can stream live data from different sources, including social platforms such as Facebook and Twitter. The streamed data can then be passed to Hive and Hadoop for further analysis.
Flume accepts data from a source and stores it in a channel. Data usually arrives from the source faster than it can be written to the destination, so the channel acts as a buffer that matches the two speeds. A sink then takes the data from the channel and stores it in HDFS.
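To make the source, channel, and sink idea concrete, here is a toy Python sketch. It is not Flume code: the function name run_pipeline and the capacity value are made up for illustration, and real Flume channels are transactional, which this model ignores. It only shows how a bounded buffer lets a fast producer and a slower consumer work together.

```python
import queue
import threading


def run_pipeline(events, channel_capacity=3):
    """Toy model of Flume's flow: source -> channel (buffer) -> sink."""
    channel = queue.Queue(maxsize=channel_capacity)  # bounded buffer, like a channel
    stored = []          # stands in for the files a sink would write to HDFS
    done = object()      # marker telling the sink that the source has finished

    def source():
        for event in events:
            channel.put(event)   # blocks while the channel is full
        channel.put(done)

    def sink():
        while True:
            event = channel.get()  # blocks while the channel is empty
            if event is done:
                break
            stored.append(event)

    workers = [threading.Thread(target=source), threading.Thread(target=sink)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return stored


if __name__ == "__main__":
    tweets = ["tweet-%d" % i for i in range(10)]
    print(run_pipeline(tweets, channel_capacity=3))
```

With one source and one sink, the events come out in the order they went in, even though the channel only ever holds a few of them at a time.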
Simran: Can you tell me how to do this straight away 🙄? Practically
Me: 😅 Sure! Let's start! First, let's create a Twitter application.
Twitter Application
i) First, visit http://apps.twitter.com/
ii) Give the application a name and click Get Keys.
iii) You will now get API_KEY, API_KEY_SECRET, and BEARER_ACCESS_TOKEN.
iv) You need a couple more, so click on set up OAuth. You can choose v2 and provide your description, T&C URL, and privacy policy URL. Then click generate to get ACCESS_TOKEN and ACCESS_TOKEN_SECRET.
v) The number of tweets you fetch may exceed the limits set by Twitter, so apply for Elevated access on your Twitter developer account.
Me: Cool! Now, let's see how to set up Apache Flume.
Flume Setup
i) Download Apache Flume: [DOWNLOAD LINK]
ii) Extract the tar file with tar -xvf flume.tar.gz, or use WinRAR
iii) Inside the conf folder, rename flume-conf.properties.template to flume-conf.properties
iv) Write this inside the flume-conf.properties file
v) Set the environment variable FLUME_HOME=D:\apache\flume and append D:\apache\flume\bin to the Path variable.
vi) Here, you can see that in sources we mentioned Twitter, named our channel MemChannel, mentioned the jar file that needs to be used, and put in all the tokens.
vii) Next, we named our sink, set its path in HDFS, and set the output format of the stream to text.
viii) We set the batch size (the number of tweets in a batch), the capacity (the number of events that can be stored in the channel), and the transaction capacity (the number of events the channel accepts per transaction).
Now, to fetch all tweets related to the cryptocurrency Doge, we will use keywords like
TwitterAgent.sources.Twitter.keywords= elon musk, doge, doge coin, bitcoin, crypto, forex, tesla, coin, rocket, ether, mining
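To tie steps iv) to viii) together, here is a minimal Python sketch that prints a flume-conf.properties of the shape described above. It is not the exact file used in this walkthrough. The sink name HDFS, the source class, the namenode address hdfs://localhost:9000, and the numeric values are all assumptions for illustration, so replace them with the values from your own setup. Whether the keywords property is honoured also depends on which Twitter source implementation you use.

```python
import os

# Assumption: the Twitter source class that ships with Apache Flume 1.x.
# Use whichever class (and jar) your own setup relies on.
DEFAULT_SOURCE_CLASS = "org.apache.flume.source.twitter.TwitterSource"


def render_flume_conf(credentials, keywords, hdfs_path,
                      source_class=DEFAULT_SOURCE_CLASS):
    """Return the text of a Flume agent config: Twitter -> MemChannel -> HDFS."""
    agent = "TwitterAgent"
    lines = [
        # Name the components of the agent.
        f"{agent}.sources = Twitter",
        f"{agent}.channels = MemChannel",
        f"{agent}.sinks = HDFS",  # the sink name "HDFS" is hypothetical
        "",
        # Source: the Twitter stream, with the tokens from the Twitter app.
        f"{agent}.sources.Twitter.type = {source_class}",
        f"{agent}.sources.Twitter.channels = MemChannel",
        f"{agent}.sources.Twitter.consumerKey = {credentials['API_KEY']}",
        f"{agent}.sources.Twitter.consumerSecret = {credentials['API_KEY_SECRET']}",
        f"{agent}.sources.Twitter.accessToken = {credentials['ACCESS_TOKEN']}",
        f"{agent}.sources.Twitter.accessTokenSecret = {credentials['ACCESS_TOKEN_SECRET']}",
        f"{agent}.sources.Twitter.keywords = {', '.join(keywords)}",
        "",
        # Sink: plain-text files in HDFS.
        f"{agent}.sinks.HDFS.type = hdfs",
        f"{agent}.sinks.HDFS.channel = MemChannel",
        f"{agent}.sinks.HDFS.hdfs.path = {hdfs_path}",
        f"{agent}.sinks.HDFS.hdfs.fileType = DataStream",
        f"{agent}.sinks.HDFS.hdfs.writeFormat = Text",
        f"{agent}.sinks.HDFS.hdfs.batchSize = 100",  # illustrative value
        "",
        # Channel: in-memory buffer between source and sink.
        f"{agent}.channels.MemChannel.type = memory",
        f"{agent}.channels.MemChannel.capacity = 10000",  # illustrative value
        f"{agent}.channels.MemChannel.transactionCapacity = 100",  # illustrative value
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    names = ["API_KEY", "API_KEY_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"]
    # Read the tokens from environment variables; fall back to placeholders.
    creds = {name: os.environ.get(name, f"<{name}>") for name in names}
    print(render_flume_conf(creds, ["elon musk", "doge", "bitcoin"],
                            "hdfs://localhost:9000/flume_tweets"))
```

Reading the tokens from environment variables keeps them out of the script itself. You can redirect the printed output into conf/flume-conf.properties.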
Simran: Blah Blah! When will we get results? 😟
Me: 😅 Not to worry, we just have to run it now.
Steps to run
i) Start Hadoop by running start-all.cmd
ii) From the terminal, run this command (I'm in D:\apache\flume)
>> bin\flume-ng agent --conf conf --conf-file conf/flume-conf.properties -property "flume.root.logger=INFO,console" -n TwitterAgent
iii) Now we can go to the path we set in flume-conf.properties, i.e. /flume_tweets. Using the command hdfs dfs -ls /flume_tweets we can see which files are in this directory.
iv) Now we can read a file using the cat command hdfs dfs -cat /flume_tweets/FlumeData
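If you would rather pull the tweets into Python for the analysis step, the two hdfs commands from steps iii) and iv) can be wrapped as in the sketch below. The helper names hdfs_dfs_command and run_hdfs are made up for illustration, and the sketch assumes the hdfs launcher is on your PATH and Hadoop is running. By default the HDFS sink names its files FlumeData.<timestamp>, so pass the exact file name shown by the listing.

```python
import shutil
import subprocess


def hdfs_dfs_command(action, path):
    """Build the argument list for `hdfs dfs -ls <path>` or `hdfs dfs -cat <path>`."""
    if action not in ("ls", "cat"):
        raise ValueError("only 'ls' and 'cat' are supported in this sketch")
    return ["hdfs", "dfs", "-" + action, path]


def run_hdfs(action, path):
    """Run the command and return its output as text."""
    launcher = shutil.which("hdfs")  # resolves hdfs.cmd on Windows as well
    if launcher is None:
        raise FileNotFoundError("hdfs was not found on PATH")
    command = hdfs_dfs_command(action, path)
    command[0] = launcher
    result = subprocess.run(command, capture_output=True, check=True,
                            encoding="utf-8", errors="replace")
    return result.stdout


if __name__ == "__main__":
    if shutil.which("hdfs") is None:
        print("hdfs not found on PATH; start Hadoop and check your PATH first")
    else:
        # List the files Flume has written so far.
        print(run_hdfs("ls", "/flume_tweets"))
```

Decoding as UTF-8 with errors="replace" avoids crashes on emoji and other non-ASCII characters in tweets.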
Me: Tada! We got our tweets! Now, let's move on to analysing them properly!
Now that we have tweets related to crypto, we can analyse them using Google NLP to get further information! So it's time for another Medium blog. Till then,
Thank You for reading this!
Simran: Thank you for this! I will wait for your next blog.