Crypto Tweets Fetch using Flume & Hadoop (PRACTICAL)
Simran: Hey! I am new to investing in cryptocurrency.
Me: Nice! At least you started investing! That’s good!
Simran: But I feel the crypto market depends heavily on news. Since we both come from technical backgrounds, can we do something about it?
Me: Yeah, sure! I guess we can do something. I have heard of Apache Flume, an awesome tool for collecting large volumes of streaming log data. We can use it to fetch tweets, like the ones from Elon Musk 😏, and analyze them to get something useful.
Simran: That sounds interesting. Can you tell me in brief how I can also analyze them?
Me: Sure! So let’s start!
Me: So, basically, we will stream data from Twitter. To get tweets, we first need to set up a Twitter application. Then we need to pick keywords related to the cryptocurrency Doge 🪙, and finally we need to run Hadoop and Flume.
Simran: As far as I remember from your last Medium article (WordCounter in Hadoop! (Windows PRACTICAL) | by Shubham Kumar Gupta | Jan, 2022 | Medium), Hadoop is for handling big data, but what is this Flume now?
Me: Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.
It can stream live data from different sources, including social platforms such as Facebook and Twitter. The streamed data can then be passed to Hadoop and Hive for further analysis.
Flume accepts data from a source and stores it in a channel. Data usually arrives from the source faster than it can be written out, so the channel acts as a buffer that evens out the pace between the two. A sink then takes the data from the channel and stores it in HDFS.
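To picture the source, channel, and sink working together, here is a minimal Python sketch. It is not Flume code: `run_pipeline`, the two thread functions, and the capacity value are made-up names for illustration, and a bounded `queue.Queue` stands in for the channel. In this toy version the source simply waits whenever the channel is full, which may differ from how a real Flume channel reacts.

```python
import queue
import threading


def run_pipeline(events, capacity=3):
    """Move events from a 'source' to a 'sink' through a bounded 'channel'."""
    channel = queue.Queue(maxsize=capacity)  # the buffer between source and sink
    stored = []                              # stands in for files written to HDFS
    done = object()                          # marker telling the sink to stop

    def source():
        for event in events:
            channel.put(event)  # waits here while the channel is full
        channel.put(done)

    def sink():
        while True:
            event = channel.get()  # waits here while the channel is empty
            if event is done:
                break
            stored.append(event)

    threads = [threading.Thread(target=source), threading.Thread(target=sink)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stored


if __name__ == "__main__":
    print(run_pipeline([f"tweet-{i}" for i in range(10)]))
```

With one source and one sink, the events come out in the same order they went in, even though the channel never holds more than `capacity` of them at a time.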
Simran: Can you show me how to do this practically, straight away 🙄?
Me: 😅 Sure! Let's start! First, let's create a Twitter application.
i) First, we need to visit http://apps.twitter.com/
ii) Give the application a name and click Get Keys
iii) Now you will get API_KEY, API_KEY_SECRET, & BEARER_ACCESS_TOKEN
v) It may happen that the number of tweets you are fetching exceeds the limits set by Twitter, so apply for Elevated access on your Twitter developer account
Me: Cool! Now, let’s see how to set up Apache Flume.
i) Download Apache Flume: [ DOWNLOAD LINK ]
ii) Extract the tar file
tar -xvf flume.tar.gz or using WinRAR
iii) Inside the conf folder, Rename
iv) Write this inside the
v) We need to set a path
FLUME_HOME = D:\apache\flume and append to the path
vi) Here, you can see that in sources we mentioned Twitter, we named our channel MemChannel, we pointed to the jar file that needs to be used, and we put in all the tokens.
vii) Next, we named our sink and set its path in HDFS. We set the output stream's data type to text.
viii) We set the batch size (the number of tweets that should be in a batch), the capacity (the maximum number of events stored in the channel), and the transaction capacity (the maximum number of events the channel accepts or hands over in one transaction).
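These three numbers depend on each other, so a quick sanity check can save a failed run. The sketch below is a small Python helper, not part of Flume. `check_flume_sizes` is a made-up name, the sink name `HDFS` is an assumption, and the values are hypothetical. `capacity` and `transactionCapacity` are the memory channel's property names, and `hdfs.batchSize` is the HDFS sink's. As far as I know, the memory channel rejects a transaction capacity larger than its capacity, and a batch size larger than the transaction capacity tends to cause errors, so those are the two things it checks.

```python
def check_flume_sizes(props, agent="TwitterAgent", channel="MemChannel", sink="HDFS"):
    """Return a list of problems found in the size settings (empty list = looks fine)."""
    capacity = int(props[f"{agent}.channels.{channel}.capacity"])
    tx_capacity = int(props[f"{agent}.channels.{channel}.transactionCapacity"])
    batch_size = int(props[f"{agent}.sinks.{sink}.hdfs.batchSize"])

    problems = []
    if tx_capacity > capacity:
        problems.append("transactionCapacity is larger than capacity")
    if batch_size > tx_capacity:
        problems.append("hdfs.batchSize is larger than transactionCapacity")
    return problems


# Hypothetical values, only for illustration (the sink name "HDFS" is assumed)
sample = {
    "TwitterAgent.channels.MemChannel.capacity": "10000",
    "TwitterAgent.channels.MemChannel.transactionCapacity": "1000",
    "TwitterAgent.sinks.HDFS.hdfs.batchSize": "1000",
}

if __name__ == "__main__":
    print(check_flume_sizes(sample))
```

An empty list means the three settings are consistent with each other. It says nothing about whether they are large enough for your tweet volume.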
Now, to fetch tweets related to the cryptocurrency Doge, we will use keywords like
TwitterAgent.sources.Twitter.keywords= elon musk, doge, doge coin, bitcoin, crypto, forex, tesla, coin, rocket, ether, mining
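To get a feel for which tweets a keyword list like this would catch, here is a rough Python sketch. It only imitates keyword filtering on our side: `parse_keywords` and `matches` are made-up helper names, and the whole-word, case-insensitive matching is an assumption that may not be exactly how Twitter or the Flume source decides.

```python
import re


def parse_keywords(line):
    """Split a 'name= a, b, c' properties line into a clean list of keywords."""
    _, _, value = line.partition("=")
    return [k.strip().lower() for k in value.split(",") if k.strip()]


def matches(text, keywords):
    """True if any keyword appears in the text as a whole word or phrase."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(k) + r"\b", lowered) for k in keywords
    )


line = ("TwitterAgent.sources.Twitter.keywords= elon musk, doge, doge coin, "
        "bitcoin, crypto, forex, tesla, coin, rocket, ether, mining")
keywords = parse_keywords(line)

if __name__ == "__main__":
    for tweet in ["Doge to the moon", "The weather is nice today"]:
        print(tweet, "->", matches(tweet, keywords))
```

Whole-word matching is used so that "ether" does not accidentally match "weather". Broad keywords such as coin or rocket will still pull in plenty of unrelated tweets, so expect some noise in the results.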
Simran: Blah blah! When will we get results? 😟
Me: 😅 Not to worry, we just have to run it now.
Steps to run
ii) From the terminal, just run this command (I'm in this location: D:\apache\flume)
>> bin\flume-ng agent --conf conf --conf-file conf/flume-conf.properties -property "flume.root.logger=INFO,console" -n TwitterAgent
iii) Now we can go to the path which we set in flume-conf, i.e. /flume_tweets. Using the command
hdfs dfs -ls /flume_tweets
we can see which files are there in this directory
iv) Now we can read using the cat command
hdfs dfs -cat /flume_tweets/FlumeData
Me: Tada! We got our tweets! Now, let's move on to analysing them properly!
We now have tweets related to crypto, and we can analyse them using Google NLP to get further information! So, time for another Medium blog. Till then,
Thank You for reading this!
Simran: Thank you for this! I will wait for your next blog.