Hadoop Notes on Amr Awadallah's Lecture

Limitations of the Traditional System

A traditional system has an instrumentation layer, which is the point where data is created: web servers, cash registers, mobile devices, application servers, syslog from Linux machines, network devices, etc.

This data is then stored on a storage-only grid, which may hold PBs of data but has no ability to process it.

An ETL (Extract, Transform, Load) step is needed to process the data, i.e. convert unstructured data into structured data so it can be loaded into a database.

Difficulty I: The amount of data generated is so huge that processing cannot finish before the next day's data arrives. Also, moving all the data from the storage grid to the compute grid (for ETL) puts a heavy load on the network between them.

Difficulty II: Aging data has to be archived (it is rarely deleted), typically to tape or Blu-ray. Archiving is economical for storage but extremely expensive for retrieval, yet we want the data to stay accessible for longer. (ROB, Return on Bytes: how much value you can retrieve from the data as a function of the cost of keeping it.)

Difficulty III: The database schema is modelled up front around the business information that needs to be extracted. If a new piece of information is needed later, the schema has to be remodelled in the DB, the ETL logic modified, etc., which is very expensive.

HADOOP

A scalable fault-tolerant distributed system for data storage and processing.

Core Hadoop has 2 main systems:

-Hadoop Distributed File System: self-healing high-bandwidth clustered storage

-MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.

Hadoop is the opposite of a VM: it takes many physical servers and makes them look like one big virtual server.

[Image: schema_point]

This is the key diagram that makes Hadoop stand out.

RDBMS employs the schema-on-write model: the schema has to be created first.

So we first have to do the load operation, which means putting the data into the DB in compliance with the defined schema; it is then stored in an internal format proprietary to the DB. Benefits: indexing, compression, partitioning, optimization, and special data structures, which make reads very fast for operations like joins and multi-table joins.

Problem: because of this explicit schema, new data cannot flow in until we have designed for it up front (created a column for it, created ETL logic for it). Adding a column can take up to a month due to governance and implementation.

Hadoop supports both models: ingesting data is just a copy (not a load). If a new field is required, just change the parser accordingly, and the entire history of data comes along with it.
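To make the schema-on-read idea concrete, here is a minimal sketch in Java, assuming a hypothetical tab-separated web log (the field names and layout are illustrative, not from the lecture). The raw lines sit unchanged in HDFS; adding a field only means changing the parser.

```java
// Illustrative schema-on-read: structure is imposed only when the raw line is read.
public class LogRecord {
    public final String timestamp;
    public final String userId;
    public final String url;
    public final String referrer; // "new" field: only the parser changes, and the
                                  // entire history already in HDFS can be re-read with it

    private LogRecord(String timestamp, String userId, String url, String referrer) {
        this.timestamp = timestamp;
        this.userId = userId;
        this.url = url;
        this.referrer = referrer;
    }

    // Parse one tab-separated line at read time; older lines without the
    // extra column still parse, they simply have no referrer.
    public static LogRecord parse(String line) {
        String[] f = line.split("\t", -1);
        return new LogRecord(f[0], f[1], f[2], f.length > 3 ? f[3] : null);
    }
}
```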

An analogy for the schema explanation above:

Schema-on-write: the kitchen staff arrange all the vegetables in separate clean containers, and the chef cooks the meal from them. The containers are the schema; the vegetables are the data.

Schema-on-read: the kitchen staff hand over a bag full of tomatoes, carrots, livestock, etc., and the chef makes the meal out of it. Here the chef gets to control a lot more parameters than in the first case.

Hadoop gives the flexibility to use any language, unlike SQL. SQL has limitations: it cannot do image processing or complex calculations.

[Image: dp-language]

Java MapReduce: it is like the assembly language of Hadoop, because it gives the most flexibility.

Scalability

The system itself is scalable: when you add new nodes, the data is distributed accordingly (it partitions and redistributes data to take advantage of the new node). As a developer, you don't need to worry about how to run jobs in parallel across multiple machines, or how to take advantage of new machines as they are added.

Data beats algorithms: the hardest algorithms can be beaten if we extract value from loads of data (Google paper: "The Unreasonable Effectiveness of Data"). E.g. natural language translation: many algorithms tried to tackle the problem of translating between human languages, but Google came along, looked at lots and lots of data, found n-grams, correlations, etc., and was able to build a translator that rivaled the most sophisticated algorithms. In other words, if you throw enough data at a problem, it can beat the best algorithm.

When to choose RDBMS vs. Hadoop?

[Image: rdbms-hadoop]

A car has more acceleration than a train, but once the train reaches its top speed it is faster than the car, and the train's throughput is higher, especially when you consider that the cost is higher for the car than for the train.

Complex queries with joins etc. take milliseconds on an RDBMS but might take seconds on Hadoop. However, queries that take hours or days on an RDBMS will take only seconds in Hadoop (at the same cost). On latency, RDBMS wins; on throughput, Hadoop wins.

ACID properties make an RDBMS ideal for bank transactions.

HDFS

[Image: hdfs]

When a file is put into HDFS, it is chopped up into blocks that are spread across different nodes and replicated (the default replication factor is 3). Replication provides high availability and minimizes latency.
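As a rough illustration, here is a small sketch using the standard org.apache.hadoop.fs API to copy a file into HDFS and set its replication factor; the paths are made up, and defaults may differ between Hadoop versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path local  = new Path("/tmp/weblog.txt");        // hypothetical local file
        Path remote = new Path("/data/raw/weblog.txt");   // hypothetical HDFS path

        // The client streams the file into HDFS, where it is stored as blocks
        // spread across the data nodes.
        fs.copyFromLocalFile(local, remote);

        // Replication factor is per-file; 3 is the usual default.
        fs.setReplication(remote, (short) 3);

        fs.close();
    }
}
```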

[Image: mapreduce]

MapReduce: as a developer, you write two little functions, the map function and the reduce function, and Hadoop takes care of distribution, fault tolerance, aggregation, shuffling, and sorting.
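The canonical example is word count; a minimal sketch of the two functions is below. Everything around them, including the shuffle and sort between map and reduce, is supplied by the framework.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: one input line in, (word, 1) pairs out. Hadoop calls this once per record.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: all counts for one word arrive together after the shuffle/sort.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```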

MapReduce also acts as a resource manager.

[Image: mapreduce2]

In the diagram, the red box is the job and the blue box is the data.

The scheduler tries to place tasks on the same nodes where the data resides. All the nodes continuously send heartbeats to say they are alive, and the manager also learns the status of each task from them. If any task appears slow (a laggard), the scheduler fires up a duplicate task on another node; whichever copy completes first wins, and the other one is killed.
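A small driver sketch, reusing the mapper and reducer from the word-count example above, shows where the developer can control this speculative execution; the input/output paths are hypothetical, and speculative execution is normally on by default.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Speculative execution: run a second copy of a straggling task elsewhere
        // and keep whichever finishes first (usually enabled by default).
        job.setMapSpeculativeExecution(true);
        job.setReduceSpeculativeExecution(true);

        FileInputFormat.addInputPath(job, new Path("/data/raw"));       // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/data/counts"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```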

[Image: high-level]

The bottom layer is the set of servers running Hadoop, each of which runs two pieces: a DataNode and a TaskTracker.

DataNode: manages the blocks, which it stores on a local file system (ext4, NTFS, etc.). It reports back to the NameNode, which maps each filename to its blocks and the server(s) holding them.

TaskTracker: runs the tasks given to it by the JobTracker.

The NameNode and the JobTracker communicate with each other so that the JobTracker knows how to place tasks closest to where the data resides.

The NameNode and the JobTracker are single points of failure (SPOF): not in terms of durability, because the state of these servers is replicated, but in terms of availability.

Conclusion

Other Hadoop Related Components:

[Image: components]

Hadoop is like a kernel; the other components build on top of it.

Apache Hive / Apache Pig provide higher-level languages that make it easier to work with MapReduce.

Apache HBase is modelled on Google's Bigtable. It is a low-latency store with access by key/index. HDFS is optimized for appends; HBase adds transactional support, with row-level atomic operations.
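A minimal client sketch of this keyed, row-atomic access pattern, using the standard HBase Java client; the table, column family, and values are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table

            // Row-level atomic write: a single Put to one row is applied atomically.
            Put put = new Put(Bytes.toBytes("user123"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("a@b.com"));
            table.put(put);

            // Low-latency read by key.
            Result r = table.get(new Get(Bytes.toBytes("user123")));
            System.out.println(Bytes.toString(
                r.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));
        }
    }
}
```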

Apache Flume: data collection; it has agents that pick up data from web servers, network equipment, etc. and put it into Hadoop.

Apache Sqoop: a framework used to transfer data between SQL databases and Hadoop. In Sqoop you specify the file in HDFS and the table in the SQL database, and the transfer is done for you.

Apache Oozie: schedules tasks and executes them in the right order.

Apache Mahout: a collection of data-mining and machine-learning algorithms, e.g. support vector machines and clustering algorithms.

Hue (Hadoop User Experience): a web application that runs in the browser and talks to the Hadoop environment.

Apache Bigtop: a project for developing the packaging and testing of the Apache Hadoop ecosystem.
