WordCounter in Hadoop! (Windows PRACTICAL)
Hey! This is Shubham, and I am back with another tech write-up, this time on HADOOP!
Ram and I were once thinking of exploring some new tech, so we decided to explore Hadoop!
Ram: Hey Shubham, what the heck is Hadoop? I see it in almost every job requirement nowadays!
Me: Sure Ram, I have watched some videos and built small projects with it, so I'll tell you about Hadoop and also how I did it!
Ram: Also, when I tried to set it up I faced a lot of errors 😓, can you cover that too?
Me: Sure! There will even be an easy bonus for you (a GitHub repo 😅).
Ram: Oh, nice! But has Hadoop ever actually benefited anyone? Very few people seem to study it.
Me: Cool, let me take the example of Walmart. In 2004, when Hurricane Frances was heading towards Florida, analysts at Walmart dug through huge amounts of historical sales data and found that before a hurricane people buy not just emergency supplies but also strawberry Pop-Tarts, whose sales rose to about seven times the normal rate. Walmart stocked its stores accordingly and the extra stock sold. That is exactly the kind of insight big-data tools like Hadoop are built to extract.
Me: Cool, let's start then.
After reading about Hadoop and how it works, I think I can now define it in layman's terms: "Hadoop is an open-source software framework from Apache for storing and processing BIG DATA; it helps you analyze those datasets and retrieve vital information from them, while easing scaling, management, and cost."
Ram: Hey, What’s Big Data then?
Me: Big Data, as the name suggests, is a collection of very large datasets that can be structured, semi-structured, or unstructured.
The online user base keeps growing day by day, and it generates enormous amounts of data. Examples include per-second temperature readings recorded all day by weather stations, or click-stream data such as which spots on a page a user taps, what made them close the app or hit the buy button, which posts they like, and how many clicks they typically make on a website.
Me: Now, let's get back to Hadoop. It follows a master-slave architecture, and storage is handled by HDFS, Hadoop's distributed file system, which provides high-throughput access to application data. In short, HDFS is the storage module of Hadoop.
I feel storage in HDFS is a bit like a doubly-linked list: it stores data by dividing it into multiple blocks with the same maximum size (128 MB by default) and keeps each block at a total of 3 locations (DataNodes). There is a NameNode, a Secondary NameNode, and DataNodes.
If we write a node like the one below, the data is linked to its current location plus two more; similarly, HDFS stores each block on its current DataNode plus two replicas, which is what makes Hadoop fault-tolerant.
// A node "knows" about two other locations (prev and next), just like an HDFS
// block has two extra replica locations besides its primary copy.
class doubly_linked_list {
    int data;                   // the payload (think of it as the block's contents)
    doubly_linked_list* next;   // one extra location
    doubly_linked_list* prev;   // another extra location
};
i) The client requests the NameNode, which returns the IP addresses of free DataNodes.
ii) Assume a file of 200 MB: it is split into two blocks, BLK A (128 MB) and BLK B (200 - 128 = 72 MB).
iii) The client checks that the nodes are ready and free, then writes BLK A and BLK B, each at 3 locations.
The NameNode and Secondary NameNode keep track of this metadata (note that the Secondary NameNode is not a live backup of the NameNode; it only helps with checkpointing):
i) The NameNode stores the HDFS filesystem metadata in a file called FsImage.
ii) The Secondary NameNode periodically checkpoints the file system metadata present on the NameNode.
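To make the write flow a bit more concrete, here is a minimal sketch of writing a file to HDFS from Java using the FileSystem API. The cluster address (hdfs://localhost:9000), the path, and the class name are my own illustrative assumptions; the splitting into blocks and the replication to 3 DataNodes happen behind the scenes.

// Minimal sketch: writing a file to HDFS through the FileSystem API.
// Assumption: a local single-node cluster is running at hdfs://localhost:9000.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // address of the NameNode

        // The client asks the NameNode where to write; block placement and the
        // replicas are handled for us, we just see a normal output stream.
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/inputdir/word.txt"))) {
            out.writeBytes("mango beer beer\n");
        }
        fs.close();
    }
}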
Me: Now let me tell you how Hadoop makes processing all this data easy to handle; it has a component named MapReduce.
Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Let's take the example of a kitchen: there is only one chef and one box to take ingredients from. When orders are few it is easy for that chef to keep up, but when orders increase it becomes problematic.
Now let's assume we hire 5 more chefs. Will this solve the issue? No, because there is still only one box, so everyone else waits in a queue while one chef takes ingredients. So what can we do?
i) We can think of it in map and reduce terms: we have a total of 6 chefs. 🤔
ii) 2 chefs will prepare the meat and 2 will prepare the sausage, each pair working on its own share of the ingredients,
iii) and the remaining 2 head chefs will assemble everything. This solves our issue!
Now chefs are happy! Cool
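If the kitchen story feels abstract, the same map-then-reduce idea can be sketched in a few lines of plain Java (just an illustration with made-up order names, not Hadoop code): the parallel stream is the team of chefs each working on their own share, and the collector is the head chefs merging the partial results.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class KitchenSketch {
    public static void main(String[] args) {
        List<String> orders = Arrays.asList("meat", "sausage", "meat", "sausage", "meat");

        // "Map" phase: workers process their own share of the orders in parallel.
        // "Reduce" phase: the partial counts are merged into one result.
        Map<String, Long> assembled = orders.parallelStream()
                .collect(Collectors.groupingBy(o -> o, Collectors.counting()));

        System.out.println(assembled); // prints something like {meat=3, sausage=2}
    }
}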
Ram: Nice! But I'm tired now; can you just show me how to build something?
Me: 😅 Sure, let's build a WordCounter. This is what I followed! Make sure you follow it precisely, because on Windows you will otherwise face a lot of errors!
In the WordCounter system we have a word.txt file (maybe in TBs 🤔) and a jar containing our job; we pass these to Hadoop, which runs them through its map-reduce machinery and finally returns the counts.
So what it does is: first it splits the input sentences, then it maps each word that is present, then it shuffles to bring identical words together, then it reduces them, and then we get our final result.
Let’s dive into the coding section
This is how our basic code segments look:
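(The snippet below is a sketch along the lines of the classic Hadoop WordCount example; the class and variable names are illustrative, and the exact version I exported into the jar is in the GitHub repo.)

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: tokenizes each input line and emits (word, 1) for every word seen.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // key-value pair, default value 1
      }
    }
  }

  // Reducer: receives (word, [1, 1, ...]) after the shuffle/sort and sums the values.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wires the mapper and reducer into a Job and reads the input/output
  // paths from the command line (e.g. /inputdir/word.txt and /outputdir).
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The main method is the driver: it sets up the Job and takes the input and output paths from the command line, which is exactly how we will launch it from the terminal in the setup steps below.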
So here we basically tokenize our input data using StringTokenizer, then pass it through a mapper function that emits each word as a key-value pair with the default value 1; the pairs are then shuffled and sorted and passed to a reduce function, where we sum all the values for each key and write out the totals.
Done!
Ram: Wow! That was easy. Can you tell me how to set it up? I faced a lot of errors 😅.
Me: Sure Ram, here you go, let's get your hands dirty! 😅
i) First install JDK 1.8 at location 'D:\Java\jdk1.8.0_202'
ii) Now install Git Bash on your system
iii) Now follow this guide: Step by step Hadoop 2.8.0 installation on Windows 10 (securityandtechinfo.blogspot.com)
v) Add the Java and Hadoop paths to the environment variables (Control Panel > System > Advanced system settings > Environment Variables); a sketch of the typical variables is given after these steps
vi) Install Eclipse IDE for Java
vii) Create a folder named 'data' inside the Hadoop installation folder, or just copy the folder there
viii) Now run this command in the terminal: hadoop namenode -format
ix) Now run this command in the terminal to start Hadoop: start-all.cmd
x) Now export WordCount.jar using JDK 1.8; see the code here [GitHub]
xi) Now create a word.txt file with words
mango beer beer
beer beer
mango mango chicken
chicken soup mango beer
fish soup mango beer
xii) Create a directory inside the HDFS system using: hdfs dfs -mkdir /inputdir
then put word.txt inside this directory using: hdfs dfs -put word.txt /inputdir
xiii) Now run the job using the jar file in the terminal: hadoop jar D:\WordCount.jar WordCount /inputdir/word.txt /outputdir
xiv) Now simply check output like this hdfs dfs -ls /ouputdir
, Now choose the file created to see contents as hdfs dfs -cat /ouputdir/part-r-00000
xv) This shows
beer 6
chicken 2
fish 1
mango 5
soup 2
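For reference, the environment variables from step v) on my machine looked roughly like this (the Hadoop folder name is an assumption; point HADOOP_HOME at wherever you extracted Hadoop):

JAVA_HOME   = D:\Java\jdk1.8.0_202
HADOOP_HOME = D:\hadoop-2.8.0        (assumed extract location)
Path        = ...;%JAVA_HOME%\bin;%HADOOP_HOME%\bin;%HADOOP_HOME%\sbin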
Done! That is how easily you can create and run a simple WordCounter program in Hadoop.
Thank You!
PS: I'm a newbie at this, so maybe I missed many things, or maybe the same thing can be explained in a much easier way. I tried my best to share my understanding with you all!