Monday, October 22, 2018

Big Data: Spark

Apache Spark is an open-source, general-purpose cluster-computing framework for distributed data processing. Its architectural foundation is the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way.
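The two properties in that definition - read-only data split into partitions, rebuilt from lineage when a machine fails - can be sketched in plain Python. This is a toy stand-in, not real Spark; the class and method names here are made up for illustration.

```python
# Toy sketch (not real Spark) of the RDD idea: an immutable dataset split
# into partitions, where each derived dataset remembers its lineage
# (parent dataset + transformation) so a lost partition can be recomputed.

class ToyRDD:
    def __init__(self, partitions, lineage=None):
        self.partitions = [list(p) for p in partitions]  # read-only in spirit
        self.lineage = lineage  # (parent, function) used to rebuild partitions

    def map(self, fn):
        # Transformations never mutate; they produce a new dataset + lineage.
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, lineage=(self, fn))

    def recompute_partition(self, i):
        # Fault tolerance: rebuild one partition from the parent's data,
        # instead of replicating every partition up front.
        parent, fn = self.lineage
        return [fn(x) for x in parent.partitions[i]]

    def collect(self):
        return [x for part in self.partitions for x in part]

base = ToyRDD([[1, 2], [3, 4]])       # two partitions, as if on two machines
doubled = base.map(lambda x: x * 2)
print(doubled.collect())               # [2, 4, 6, 8]
print(doubled.recompute_partition(1))  # [6, 8] rebuilt from lineage
```

Real Spark does the same thing at cluster scale: if the node holding a partition dies, the lineage graph says how to recompute just that partition.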
Spark uses cluster computing for both its computational (analytics) power and its storage. This means it can draw on many computer processors linked together for its analytics. It's a scalable solution: if more oomph is needed, you can simply add more processors to the system. With distributed storage, the huge datasets gathered for Big Data analysis can be spread across many smaller individual physical hard discs. This speeds up read/write operations, because many discs can be read from and written to in parallel instead of queuing behind a single drive. As with processing power, more storage can be added when needed, and because it runs on commonly available commodity hardware (standard computer hard discs and servers), infrastructure costs stay down.
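The scale-out idea above - same job, more workers - can be sketched with Python's standard thread pool standing in for a cluster. The workload and function names are invented for the example; the point is that partitioning the data lets you add workers without changing the result.

```python
# Minimal sketch of scale-out: one workload split across a configurable
# number of workers. A thread pool is a pure-Python stand-in for cluster
# nodes; each chunk plays the role of data living on a separate node/disc.
from concurrent.futures import ThreadPoolExecutor

def summarize(chunk):
    # Stand-in for per-node analytics over a local slice of the data.
    return sum(x * x for x in chunk)

def run_job(data, n_workers):
    # Partition the dataset into one slice per worker, process slices
    # in parallel, then combine the partial results.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(summarize, chunks))

data = list(range(1000))
# Adding workers changes throughput, never the answer.
print(run_job(data, 2) == run_job(data, 4))  # True
```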
Unlike Hadoop, Spark does not come with its own file system. Instead, it can be integrated with many storage systems, including Hadoop's HDFS, MongoDB, and Amazon's S3.
Another element of the framework is Spark Streaming, which lets developers build applications that perform analytics on streaming, real-time data - such as automatically analyzing video or social media feeds - on the fly, as the data arrives.
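Spark Streaming's classic approach is micro-batching: the engine slices the incoming stream into small batches and runs analytics on each batch as it closes. Here is a toy pure-Python sketch of that model (the stream and tag names are made up), not the actual Spark Streaming API.

```python
# Toy sketch of micro-batch stream processing, the model Spark Streaming
# popularized: events arrive continuously, the engine groups them into
# small batches, and analytics run on each batch as soon as it closes.
from collections import Counter

def micro_batches(events, batch_size):
    # Slice an (in principle unbounded) event stream into fixed-size batches.
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# Hypothetical stream of social-media tags; counts arrive batch by batch.
stream = ["spark", "spark", "ads", "spark", "ads", "video", "ads"]
results = [Counter(batch) for batch in micro_batches(stream, 3)]
print(results[0]["spark"])  # 2 mentions within the first three events
```

In real Spark Streaming the batch boundary is a time interval (e.g. every second) rather than a fixed count, but the process-per-batch shape is the same.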
In fast-changing industries such as marketing, real-time analytics has huge advantages. For example, ads can be served based on a user's behavior at a particular moment, rather than on historical behavior, increasing the chance of prompting an impulse purchase.

Sunday, October 21, 2018

Snowflake: Cloud Data Warehouse

Snowflake allows corporate users to store and analyze data using cloud-based hardware and software. The data is stored in Amazon S3. It is built for speed, even with the most intense workloads. Its patented architecture separates compute from storage so you can scale up and down on the fly, without delay or disruption. You get the performance you need exactly when you need it.
Snowflake is a fully columnar database with vectorized execution, making it capable of addressing even the most demanding analytic workloads. Snowflake’s adaptive optimization ensures queries automatically get the best performance possible – no indexes, distribution keys or tuning parameters to manage. Snowflake supports very high concurrency with its unique multi-cluster, shared data architecture, which allows multiple compute clusters to operate simultaneously on the same data without degrading performance. Snowflake can even scale automatically to handle varying concurrency demands with its multi-cluster virtual warehouse feature, transparently adding compute resources during peak load periods and scaling down when loads subside.
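Because compute and storage are separate, this scaling is done by resizing or reconfiguring a virtual warehouse with plain SQL, without touching the data. The sketch below builds the relevant statements; the warehouse name and sizes are invented, and the ALTER WAREHOUSE syntax follows Snowflake's documented form but should be checked against current docs before use.

```python
# Sketch of the statements behind Snowflake's on-the-fly scaling.
# "ANALYTICS_WH" and the size values are hypothetical examples.

def resize_warehouse(name, size):
    # Scale a single virtual warehouse up or down. Only compute changes;
    # storage is untouched because compute and storage are separate layers.
    return f"ALTER WAREHOUSE {name} SET WAREHOUSE_SIZE = '{size}';"

def set_cluster_range(name, min_clusters, max_clusters):
    # Multi-cluster warehouses: Snowflake spins clusters up toward the max
    # under peak concurrency and shrinks back toward the min as load subsides.
    return (f"ALTER WAREHOUSE {name} SET "
            f"MIN_CLUSTER_COUNT = {min_clusters} "
            f"MAX_CLUSTER_COUNT = {max_clusters};")

print(resize_warehouse("ANALYTICS_WH", "LARGE"))
print(set_cluster_range("ANALYTICS_WH", 1, 4))
```

In practice these statements would be sent through a Snowflake client session; since a resize applies to new work immediately, running queries are not disrupted.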
Snowflake’s mission is to enable every organization to be data-driven with instant elasticity, secure data sharing and per-second pricing, across multiple clouds. Snowflake combines the power of data warehousing, the flexibility of big data platforms and the elasticity of the cloud at a fraction of the cost of traditional solutions.