Big Data refers to huge amounts of heterogeneous data from both traditional and new sources, growing at a higher rate than ever. Due to their high heterogeneity, it is a challenge to build systems to centrally process and analyze efficiently such huge amount of data which are internal and external to an organization. A Big data architecture describes the blueprint of a system handling massive volume of data during its storage, processing, analysis and visualization. Several architectures belonging to different categories have been proposed by academia and industry but the field is still lacking benchmarks. Therefore, a detailed analysis of the characteristics of the existing architectures is required in order to ease the choice between architectures for specific use cases or industry requirements. The types of data sources, the hardware requirements, the maximum tolerable latency, the fitment to industry, the amount of data to be handled are some of the factors that need to be considered carefully before making the choice of an architecture of a Big Data system. However, the wrong choice of architecture can result in huge decline for a company reputation and business. This paper reviews the most prominent existing Big Data architectures, their advantages and shortcomings, their hardware requirements, their open source and proprietary software requirements and some of their real-world use cases catering to each industry. For each architecture, we present a set of specific problems related to particular applications domains, it can be leveraged to solve. Finally, a trade-off comparison between the various architectures is presented as the concluding remarks. The purpose of this body of work is to equip Big Data architects with the necessary resource to make better informed choices to design optimal Big Data systems.
Figures - uploaded by Rajat Kumar Behera
Author content
All figure content in this area was uploaded by Rajat Kumar Behera
Content may be subject to copyright.
Discover the world's research
- 20+ million members
- 135+ million publications
- 700k+ research projects
Join for free
Big Data Architectures : A detailed and application
oriented review
Abstract— Big Data refers to huge amounts of
heterogeneous data from both traditional and new sources,
growing at a higher rate than ever. Due to their high
heterogeneity, it is a challenge to build systems to centrally
process and analyze efficiently such huge amount of data which
are internal and external to an organization. A Big data
architecture describes the blueprint of a system handling
massive volume of data during its storage, processing, analysis
and visualization. Several architectures belonging to different
categories have been proposed by academia and industry but
the field is still lacking benchmarks. Therefore, a detailed
analysis of the characteristics of the existing architectures is
required in order to ease the choice between architectures for
specific use cases or industry requirements. The types of data
sources, the hardware requirements, the maximum tolerable
latency, the fitment to industry, the amount of data to be
handled are some of the factors that need to be considered
carefully before making the choice of an architecture of a Big
Data system. However, the wrong choice of architecture can
result in huge decline for a company reputation and business.
This paper reviews the most prominent existing Big Data
architectures, their advantages and shortcomings, their
hardware requirements, their open source and proprietary
software requirements and some of their real-world use cases
catering to each industry. For each architecture, we present a
set of specific problems related to particular applications
domains, it can be leveraged to solve. Finally, a trade-off
comparison between the various architectures is presented as
the concluding remarks. The purpose of this body of work is to
equip Big Data architects with the necessary resource to make
better informed choices to design optimal Big Data systems.
Keywords— Big Data Architecture , Big Data Architectural
Patterns, Big Data Use Cases
The non-stop growth of data, the frantic releases of new
electronic devices and the data-driven decision-making
trend in companies is fueling a constant demand for more
efficient Big Data processing systems. The investment in
Big Data architecture has been rapidly growing these past
years and according to Gartner, businesses will keep
investing more in IT in 2018 and 2019 focusing on IOT,
Block-chain and Big Data [1,2]. 178 billion dollars were
spent on Data Center Systems in 2017 and that number is
expected to increase in the coming years [5]. Considering
the important funds companies invest in their Big Data
solutions, it appears obvious that a careful planning
should be done ahead of time before the actual
implementation of a solution. However, according to the
McKinsey institute, many organizations, today, are facing
difficulties because of the absence of architectural
planning of their data management solutions [38]. They
develop overlapping functionalities and are not able to
achieve sustainability because they usually develop
technology driven solutions.
An early and careful Big data system architecting
considers a holistic data strategy while focusing on real
business objectives and requirements. It is of the utmost
importance to write down the current but also future needs
in order to take scalability considerations into account
from the earliest stages of the design of the Big data
system. Once that list of use cases and requirements is
made clear, a company can move forward and select
among the many existing Big Data architectures the most
suitable for its use.
The Lambda architecture was one of the first architectures
to be proposed for Big Data processing and it has been
established as the standard over time [4]. The Kappa
architecture came next, followed by several other
architectures [3] designed to be able to address some of
the limitations of the lambda architecture for use cases
where the former standard failed to offer satisfying results.
In this paper, we discuss different architectures with their
optimal use cases along with some of the factors that need
to be considered to make the best choice from a pool of
candidate architectures. The paper also highlighted
whether the architecture to be adopted for a given use case
should be built from scratch or incrementally constructed
from an existing architecture.
The structure of this paper is described as follows. Section
2 presents an overview of the rise of Big Data and the
challenges that have accelerated the need of new tools and
architectures. The next section reviews the work that has
been done in the Big Data field to survey the domain,
propose architectures and eventually compare them.
Section 4 gives, for each architecture, a brief description,
its advantages and disadvantages, a set of problems it can
solve, some of the fields where it can be used and the
hardware and open source configuration required to set up
an environment based on that architecture. An overall
comparison of the architectures discussed is presented in
Section 5 and Section 6 concludes the paper.
Since In 2013, the McKinsey institute reported that there
were more than 2 billion Internet users worldwide [55]. In
2018, according to an article published by Forbes, that
number has jumped to 3.7 billion users who are performing
School of Computer Engineering
School of Computer Engineering
Kalinga Institute of Industrial Technology (Deemed to be
University)
Kalinga Institute of Industrial Technology (Deemed to be
University)
Bhubaneswar, Odisha , India
Bhubaneswar, Odisha, India
over 5 billion searches every day [61, 62]. Social media
remains one of the biggest sources of the data produced in
the world. According to Domo's "Data Never Sleeps 6.0"
report, every minute, Internet users watch more than 4
million videos on YouTube, close to 13 million text
messages are exchanged, the weather channel receives 18
million forecast requests and 97 000 hours of video content
are streamed on the Internet [63]. The growth is particularly
apparent with social media considering companies like
Instagram, one of the most used social media platforms in
the world, which has grown its active user database,
between December 2016 and 2018, from 600 million to 800
million users who are now posting 95 million photos and
videos everyday [64]. Companies across various industries
are experiencing a similar frenetic data growth. In 2018,
Amazon ships 1111 packages every minute and Uber is
used to book 1389 rides every single minute [63]. Another
main contributor to the data flood is the Internet of Things
industry. The International Data Corporation (IDC) and Intel
predict that there will be 200 billion iot devices in use by
2020 [65]. Considering that only 15 billion devices were
identified in 2015 up from 2 billion in 2006, it is easier to
start getting an idea of the exponential rate at which data is
growing in size. And of course, all those devices transmit
information across networks, sometimes carrying sensitive
data intended to trigger immediate reactions. The case of the
voice control feature which is now used by 8 million people
every month illustrates the point [63].
From all that has been previously described, it is evident
that the traditional single machines can no longer process
the diverse and humongous amount of data being produced
at such a high speed. Several challenges have arisen due to
the birth of Big Data. They include data storage issues of
course, but also for instance, the need to separate qualitative
data from noise and error as fast as possible because of the
volatility of the data. Other challenges are faced during the
entirety of the data analysis process. During the acquisition
of the data, there is a need to filter it, reduce it and associate
it with metadata. There is also a need to transform
structureless data and eliminate errors from it in a cleaning
process before consuming it. Heterogeneous data
proceeding from various sources have to be integrated into
single data repositories, requiring new designs and systems
more complex than the traditional ones. Additionally, to that,
most use cases require that integration to be automated. Also,
queries need to be easily scaled over different amounts of
data and executed in a matter of second for critical use cases.
Finally, there is a need to reflect on the design of specific
tools to present human friendly interpretation of the data
that is being generated. Another category of challenges that
have led to the conception of Big Data ecosystems is the
management related issues such as privacy and security
among so many others [66, 67, 68].
The landscape of Big Data has kept changing since its birth
and the storage devices' prices have been considerably
reduced while the data collection methods have kept
increasing. Nowadays, in the same system, some data arrive
at a very fast rate in constant streams while others arrive in
big batches periodically. That diversity has led to the
creation of Big Data architectures with the intention to
accommodate various data flows and solve the issues
specific to each of them [69].
Many reviews have been done in the field of Big Data. Most
of them cover technologies, tools, challenges and
opportunities in the field [55]. They try to shed more light
on the field of Big Data, present its advantages and
inconvenient [56]. For the majority, they review for each of
traditional Big Data processing steps from data generation to
its analysis, the background, the technical challenges, and
the latest advances. There has also been some work done to
review Big Data analytics methods and tools [60].
Reference architectures for Big Data ecosystem have been
published by top tech companies as IBM [53], Oracle [51],
Microsoft [52] and the National Institute of Standards and
Technology (NIST) [54]. Various approaches have been
used by researchers in order to try to come up with a
reference architecture that could be used across industries in
a wide variety of use cases. Pekka, P. and Daniel, P. [39]
have proposed a technology independent architecture based
on the study of seven major Big Data use cases at top tech
companies like Facebook, Linkedin or Netflix. They have
decomposed the 7 reviewed architectures in a set of
components which they have then classified according to
their roles in 8 components forming their reference
architecture. Nevertheless, not all use cases required all the
components of their architecture and by considering other
use cases than the reviewed ones, they have acknowledged
more components might need to be added. Other authors
have followed a similar approach to propose a five-layer
reference architecture in [47]. Mert, O. G., & al. [40] have
fetched among more than 19 million projects in the Github
database and Apache Software Foundation projects list, the
ones related to Big Data. They have extracted from 113
documents, including whitepapers and projects
documentations across diverse industries, 241 most popular
and actively developed open-source tools. The authors have
then classified the tools in 11 groups constituting the
components of their reference architecture. They have also
discussed the suitability of different tools for their
architecture's implementation taking in account factors such
as timing, data size, platform independency and data-storage
model requirements. Reference architectures have also been
proposed to address specific issues such as security in Big
Data ecosystems. An example is the Big Data Security
reference architecture proposed in [50]. Their architecture
was extended from the NIST reference architecture to
include for each component, tools and specifications to
ensure the protection of the elements of interest: encryption
for data, authentication and authorization for networks and
containerization and isolation for processes' execution. The
authors also presented a brief and high-level comparison of
their architecture with other existing reference architectures.
There have been several industry specific propositions too,
based on the set of requirements of special use cases.
Architectural solutions have been proposed in the field of
Supply Chain Management [48], Intelligent Transportation
Systems [46], telecommunications [45], healthcare [44],
communication networks security (for fault detection and
monitoring) [43], smart grids in electrical networks [42],
Higher education and universities [41]. Those architectures
all reuse all or some of the layers defined in the common
reference architectures namely: the data sources layer, the
extraction/collection/aggregation layer, the storage layer, the
analysis layer and the visualization layer. Each of these
industry specific architectures defines its layers'
components in terms of the technological tools or features
required by the use case.
Existing architectures have extensively been documented
over time as they gained popularity. The biggest part of the
existing research focuses on two of the most popular ones:
The Lambda and Kappa architectures. Zhelev and Rozeva [5]
worked to equip data architects with decision-making
information by reviewing cloud types, data persistence
options, data processing paradigms and tools and also
briefly both the Lambda and Kappa architecture specifying
each one's strengths and flaws and mentioning in which
situation, each one would be suitable to use. Other works
have presented both Lambda and Kappa architecture along
with some of their strengths and weaknesses [6]. The
authors have also presented a short comparison of both
architectures before proposing a new architecture to
overcome the deficiencies of both the previously discussed
ones. The most exhaustive work has been done in [7] where
seven popular architectures were described with the
software requirements necessary to implement them. Our
aim is to extend the work done in [7], by describing not only
existing related use cases but also a set of specific problems
each architecture can solve given an industrial context.
From an industrial application point of view, a lot of work
has been done to provide exposure on how Big Data can be
leveraged to provide better services or increase business
profit in various fields [8, 9, 10]. [8] provides insights on the
kind of hardware required to build a Big Data processing
system discussing electric energy, storage, processing and
network requirements at a very high level. None of the
existing addressed detailed hardware requirements or
attempted to classify use cases and target problems
architecture wise.
There does not exist yet to the best of our knowledge any
reference document using which, a Big Data System
architect can be guided to choose among the most popular
Big Data architectures
knowing the industry of application, the existing hardware
architecture, the budget allotted to purchasing new
components and the problems the system is expected to
solve.
IV. BIG D ATA A RCHITECTURES
Big Data architectures are designed to manage the ingestion,
processing, visualization and analysis of data that are too
large or too complex to handle with traditional tools. From
one organization to the other, that data might consist of
hundreds of gigabytes or hundreds of terabytes. In the
context of this paper, the minimum amount we consider as
Big data is 1 TB.
A Big data architecture determines how the collection,
storing, analysis and visualization of data is done. We also
refer to it to define how to transform structured,
unstructured and semi-structured data for analysis and
reporting. We discuss in this section, five of the most
prominent Big Data architectures that have gained
recognition in the industry over the years.
The lambda architecture is an approach to big data
processing that aims to achieve low latency updates while
maintaining the highest possible accuracy. It is divided in 3
layers.
The first, "the batch layer" is composed of a distributed
file system which stores the entirety of the collected data.
The same layer stores a set of predefined functions to be run
on the dataset to produce what is called a batch view.
Those views are stored in a database constituting the
"serving layer" from which they can be queried interactively
by the user.
The third layer called "speed layer" computes
incremental functions on the new data as it arrives in the
system. It processes only data which is generated between
two consecutive batch views re-computation producing and
it produces real-time views which are also stored in the
serving layer. The different views are queried together to
obtain the most accurate possible results. A representation of
this architecture is given in Figure 1.
Fig. 1. Lambda Architecture
Nathan Marz proposed the Lambda architecture (LA) with
as first objective to palliate the problems encountered while
using fully incremental systems. Such kinds of system have
exhibited problems such as operational complexities (online
compaction for example), the need to handle eventual
consistency in highly available systems and the lack of
human fault tolerance. On the contrary, a lambda
architecture-based system provides better accuracy, higher
throughput and lower latency for reads and updates
simultaneously without compromise on data consistency. A
LA based architecture is also more resilient thanks to the
Distributed File System used to store the master dataset,
mostly because it is less subject to human errors (such as
unintended bulk deletions) than a traditional RDBMS.
Finally, the lambda architecture helps achieve the main
requirements of a reliable Big Data system among which are
robustness and fault tolerance provided through the batch
layer. Each layer of the architecture is scalable
independently and the lambda architecture can be easily
generalized or extended for a great number of use cases
while requiring only minimal maintenance [4]. This
architecture provides both real-time data analysis through
the ad-hoc querying of real-time views and historical data
analysis [11].
The main challenge that comes with the Lambda
Architecture is maintaining the synchronization of the batch
and speed layers. It consists in regularly discarding the
recent data from the speed layer once they have been
committed to the immutable dataset in the batch layer.
Another limitation to keep in mind is the fact that only
analytical operations are possible from the serving layer; no
transactional operation is possible. Finally, one of the major
disadvantages of this architecture is the need to maintain
two similar code bases: one in the speed layer and another in
the batch layer to perform the same computation on
different sets of data. That implies redundancy and it
requires two different sets of skills in order to write the logic
for the streaming and for the batch data [3].
Several companies spanning across multiple industries have
adopted the Lambda Architecture over time. Many of them
are referenced in [29] where specific use cases and best
practices around the lambda architecture are collected and
made available to those who are interested to work with it.
A particularly suitable application of the Lambda
architecture is found in Log ingestion and analytics. The
reason is that log messages are immutable and often
generated at a high speed in systems that need to offer high
availability [12]. The Lambda Architecture is preferred in
cases where there is an equal need for real-time/fluid
analysis of incoming data and for periodic analysis of the
entire repository of data collected. Social media and
especially tweets analysis is a perfect example of such an
application [12]. But the Lambda architecture can be used in
other types of systems to keep track of users subscribing to a
meet-up online for instance [13]. The system in [13] is
based on the Azure platform and HDInsight Blob Storage is
used to permanently store the data and compute the batch
views every 60 seconds while a Redis key-value storage is
used to persist and display the new registrations between
two computation of batch views. The serving layer returns a
combination of the results of the two other layers in real-
time, via REST webservices, always providing up-to-date
information without much overhead. [14] presents an
Amazon EC2 based system processing data from various
sensors across a city in order to make efficient decisions.
While some of those decisions require an on-the-fly
analyses of the sensed data, others require that the analyses
be performed on massive batches of data accumulated over
a long period of time. In such a case, the Lambda
architecture, again, reveals itself to be ideal to achieve both
objectives.
The lambda architecture is a good choice when data loss or
corruption is not an option and where numerous clients
expect a rapid feedback, for example, in the case of
fraudulent claims processing system [15]. Here, the speed
layer using Spark runs in real-time a machine learning model
that detects whether a claim is genuine or needs further
checking. In that manner, the overall processing time per
claim from a user's point of view is considerably reduced.
Batch layer. The requirements of the batch layer make
Hadoop the most suitable framework to use for its
implementation. HDFS provide the perfect append-only
technology to accommodate the master dataset. MapReduce,
PIG and Hive can be used to develop the batch functions.
Speed layer. The speed layer can be implemented using real-
time processing tools such as Storm or S4. Spark Streaming
can also be used although it treats data in micro-batches
rather than in real streams. The advantage is that the Spark
code can be reused of in the batch layer [30].
Serving layer. Any random-access NoSQL database can
host the real-time and batch views. Some examples are:
HBase, CouchDB, Voldemort or even MongoDB.
Cassandra is particularly preferred because of the write-fast
option that it provides.
Queuing system. A queuing system is necessary to ensure
asynchronous and fault-tolerant transmission of the real-
time data to the batch and speed layer. Popular options
include Apache Kafka or Flume.
The hardware requirements presented here are estimated for
1 TB of data. For the calculation, we use a method detailed
in [15]. In order to exploit this, one can make the naïve
assumption that the hardware requirements grow
proportionally with the amount of data to process. The data
in the batch layer is usually not stored in a normalized form
thus some additional storage space is required,
approximately 30% of the original size of the data
amounting to a total of 1.3 TB in our case.
TABLE I: LAMBDA ARCHITECTURE HARDWARE
REQUIREMENTS
Each worker node's raw storage per node (rpsn) was
calculated using the formula in equation (1). 2% of the
total storage per node (tspn) is reserved for the Operating
System and other applications and the remaining storage is
divided by Hadoop's default replication factor (rf) 3.
Finally, for each 4TB worker node, 653 GB rough space is
available to store data.
...(1)
The Spark documentation recommends to run Apache Spark
on the same node as Hadoop if possible [32]. Either way, to
get a proper idea of the exact Spark hardware requirements,
it is necessary to load the data in the Spark system and use
the Spark monitoring feature to see how much memory it
consumes.
Another important point to note is that, according to the
Cassandra's documentation, it is recommended to keep the
utilization of each 1TB node to around 600GB [33]. Beyond
that threshold, it is not uncommon to observe timeout rates
and mean latencies explode and node crashes.
1 replicated master node (6 cores CPU, 4 GB memory,
RAID-1 storage, 64-bit operating system)
2 worker nodes (12 cores CPU, 4 GB memory, 2 TB
storage, 1 GbE NIC)
1 dedicated resource manager (YARN) node (4 GB
memory, and 4core)
2 nodes (1TB, 4 cores, 16 GB memory)
The Kappa architecture was proposed to reduce the lambda
architecture's overhead that came with handling two
separate code bases for stream and batch processing. Its
author, Jay Kreps, observed that the necessity of a batch
processing system came from the need to reprocess
previously streamed data again when the code changed. In
Kappa architecture the batch layer was removed and the
speed layer enhanced to offer reprocessing capabilities. By
using specific stream processing tools such as Apache Kafka,
it is henceforth possible to store streamed data over a period
of time and create new stream processing jobs to reprocess
that data when it's needed replacing batch processing jobs.
The functioning process is depicted in Figure 2.
Fig. 2. Kappa Architecture
Kappa architecture-based systems bring a lot of
simplification to data processing. We get the best of both
worlds by maintaining a single code base while still
allowing historical data querying and analysis through
replays. It has fewer moving parts than the Lambda
architecture which allows for a simpler programming model
as well. Also, it allows replaying streams and recomputing
results in case the processing code changes. The incoming
data can still be stored in HDFS but we don't rely on it to
run reprocessing tasks on historical data.
One of the challenges faced while using this
architecture is that only analytical operations are possible
not transactional ones. Also, it is not possible to implement
the Kappa architecture with native cloud services because
they do not support streams with a long Time to live (TTL).
It is important to know that the data is not conserved for a
long term. In a Kappa architecture-based system, data is
kept for a limited predefined period of time after which it is
discarded [11].
The Kappa architecture is particularly suited for real-
time applications because it focuses on the speed layer. The
author of this architecture, Jay Kreps's company, Linkedin
itself has already adopted it. Seyvet & Vielahave [16]
presented a detailed implementation of a Kappa architecture
for real time analytics of users, network and social data
collected by a telco operator. We have inventoried two other
use cases, a system for real time calculation of Key
Performance Indicator (KPI) in telecommunication and
another in the IOT field [17].
The software requirements for the Kappa architecture
are quite similar to those of the Lambda Architecture minus
the Hadoop platform used to implement the batch layer
which is absent here.
The preferred ingestion tool is Apache Kafka because
of its ability to retain ordered data logs allowing data
reprocessing which is essential to the Kappa architecture.
Apache Flink is particularly suitable also for
implementing Kappa architecture because it allows building
time windows for computations. A popular alternative to it
is Apache Samza.
Table 2 summarizes the hardware requirements for a Kappa
architecture based system. IBM knowledge center published
a sizing example, recommending the above reported
hardware requirements to ingest 1 TB of data [35]. [34]
specifies the minimal hardware requirement in a production
environment to run Apache Storm in the speed layer.
TABLE II. KAPPA ARCHITECTURE HARDWARE REQUIREMENTS
Apache Zookeeper is necessary for the functioning of
Apache Kafka and can be installed on the primary Apache
Kafka server
C. The Microservice Architecture
A system based on the microservice architecture is
composed of a collection of loosely coupled services that
are able to run independently and to communicate with each
other via REST web services, remote calls or Push
Messaging. Each service is implemented with the tools and
in the language that are most suitable for the task it performs.
Each service runs on a dedicated server and has a dedicated
storage. The main difference between the microservice
architecture and a simple SOA based system is that in the
microservice architecture, each service focuses on
accomplishing only one specific task and represents a
standalone application on its own [20]. The microservice
architecture is described in figure 3.
As compared to monolithic systems, microservice
based systems allow for faster development, faster tests and
deployments because each service is small and independent
from others thus, easier to understand. Thanks to that
independence between services, fault tolerance is higher and
each service can be developed or rewritten at any time using
the newest technology stacks without compromising the
other services.
10 SERVERS HAVING EACH :12 PHYSICAL PROCESSORS , 16 GB
RAM
Minimum one server having : 16 GB RAM, 6 core CPUs of 2
GHz (or more) each, 4 x 2 TB, 1 GB Ethernet
IDEM . TO LAMBDA ARCHITECTURE
Fig. 3. Microservice architecture
Different teams can work more efficiently by being
allocated specific services each. Moreover, services are
reusable across a business and any function can be scaled
independently from the others.
On the other hand, an inter-service communication
mechanism is required and its development is quite complex.
There is a need for a strong team coordination and the
network communication among the components has to be
heavily secured. When two services using two different
technological stacks need to communicate, the changes of
format (marshalling and unmarshalling) also create an
overhead. Though the deployments are faster, they are more
complex to setup. Each service usually runs in its own
container (possibly a JVM) thus the overall memory
consumption is way higher than what is required for a
monolithic application [18].
The microservice architecture has provided a solution
for many tech giants such as Amazon, Netflix and eBay as
they have to handle a huge number of requests dayly [20].
Modularity can be achieved to a certain extent in
monolith applications so some of the factors indicating we
need to use a microservice architecture can : the need for
decentralization, a high existing (or predictable) traffic.
Before making the choice of a microservice architecture, it
is also important to keep in mind the consequent investment
in time and manpower required from the early stages of the
development before the production stage [21]. In [21], the
authors describe the implementation of a microservice based
scalable mobility service that helps blind users find the most
suitable paths for them throughout a city leveraging
facilities like bus stops, stairs and audible traffic lights. A
microservice is particularly adapted for that use case
because some of the services required such as dynamic
planner service, crowd-sensing service and travelling
service already exist (or can be developed independently as
standalone applications and reused). The only services that
needed to be developed were the one that the user invokes
and a high-level orchestrator service to fetch and provide to
the user the useful information. [22] describes the
application of microservices in a fraud detection system.
Fraud detection systems are extremely time sensitive
because, in a matter of seconds, a lot of processing has to be
done in order to determine whether or not a transaction is
genuine to prevent a potential fraudster to get away with a
customer's money. Several microservices leveraging
different databases (user past activity database, blacklist
database, white list activities database etc.) can quickly and
simultaneously perform the necessary checking required and
their results are further evaluated by another service to
decide if the user should be allowed to proceed or not.
Each microservice is technologically independent. It can be
developed using any language or technology. Many types of
components intervene in the development of microservices
and we will be listing giving examples of corresponding
tools as used in [31] where the Spring boot framework was
used to develop Java based microservices.
Container. Every microservice runs within a container. One
tool that can be used to build, deploy and manage containers
is OpenShift (particularly suitable for Docker based
containers). It orchestrates how and when container-based
applications run and allows developers to fix and scale those
applications seamlessly. The container management service
Docker is used to create containers in which the applications
will be developed, shifted and run anywhere.
Distributed version control system. Microservice
architecture projects generally imply the existence of several
independent teams working on separate microservices. In
order to facilitate the collaboration between teams, Git is
generally leveraged as source code repository.
Continuous integration tool. Continuous
integration/continuous delivery (CI/CD) pipelines are built
to facilitate the deployment of services by automating their
deployment after they have passed a test suite. Its main
objective is to allow early detection of integration bugs. The
most popular framework used for that purpose is Jenkins.
Other popular options include GitLab CI, Buildbot, Drone
and Concourse.
Table 3 summarizes the hardware requirements for the
microservice architecture. The hardware requirements for a
multi-node cluster deployment of Docker, as specified in the
IBM Cloud Private documentation, are described in [36].
The recommended hardware configuration for Jenkins in
small teams has been specified in their documentation [37].
TABLE III . MICROSERVICE ARCHITECTURE HARDWARE
REQUIREMENTS
The Zeta architecture proposes a novel approach in which
the technological solution of a company is directly integrated
with the business/enterprise architecture. Any application
required by a business can be "plugged in" this architecture.
It provides containers which are isolated environments in
which softwares can be run and made to interact together
independently of the platform incompatibilities. Due to that,
Container
management
system
1 boot node (1+ core, 4 GB RAM, 100+ GB storage)
1, 3 or 5 master nodes (2+ cores, 4+ GB RAM, 151+
GB storage)
1, 3 or 5 proxy nodes (2+ cores, 4 GB RAM, 40+ GB
storage)
1+ worker nodes (1+ cores, 4GB RAM, 100+GB
storage)
1+ optional management node (4+ cores, 8+ GB RAM,
100+ GB storage)
2.4 GHz cores recommended
1 node (1+ GB RAM, 50+ GB storage)
many types of applications can be accommodated and run in
a zeta architecture.
Fig. 4. Zeta architecture
Since the hardware is not specifically dedicated to any set
of services in particular but is common to the entire system,
it is better utilized and it can be allocated to serve the most
pressing need at any moment. The near real time backups
also help avoid over extended recovery periods from
failures. The architecture helps to discover issues more
quickly too. It facilitates the testing and deployment phases
by allowing the creation of binaries that can be deployed
seamlessly in any environment without the need to modify
them. The example of an advertising platform based on the
zeta architecture is presented in [24]. It shows how
intermediaries are suppressed by logs directly being saved,
read and processed from the same Distributed File System.
The zeta architecture is suitable for organizations handling
real-time data processing as part of their internal business
operations. For instance, the example of dynamic allocation
of parking lots based on data coming from sensors is a good
use case that has been evoked in [23]. It is the architecture
leveraged by Google for systems such as Gmail. The zeta
architecture is also particularly suitable for complex data-
centric web applications, machine learning based systems
and for Big data analytics solutions [24].
There are many components in the zeta architecture playing
different roles that can each fulfilled by several existing tools.
We list next some of the tools that can be useful to build a
decent zeta architecture-based system.
The Distributed File System hosting the master data is
generally implemented using Hadoop Distributed File
System while the real-time storage can be implemented
using NoSQL or NewSQL databases (HBase, MongoDB,
VoldDB etc.). The enterprise applications on the diagram
generally consist in web servers or any other business
application (varying from one business to the other).
The compute model/execution engine is destined to perform
all analytics operation. Any data processing tool that is
pluggable can be used for that purpose: MapReduce, Apache
Spark and even Apache Drill. Apache Mesos or Apache
YARN can serve as global resource manager. The container
management system can be chosen among Docker,
Kubernetes or Mesos.
The requirements for each of the software components of
this architecture have been described in the previous
architectures' sections already. The reader can refer to
hardware requirements section for the Lambda, Kappa and
microservice for more details.
The Internet of Things domain is so vast that no uniform
architecture has been defined so far in the field.
Nevertheless, several architectures have been proposed by
scholars over time [25]. Michael Hausenblas has made an
attempt to propose a high abstraction architecture for all
IOT projects based on the requirements of an IOT data
processing system [26]. The architecture is called iot-a and
it is the one we discuss here. It is represented by Figure 5.
Fig. 5. iot-a architecture
1) Advantages and Inconvenients
There has not yet been enough feedback on
projects done using this architecture to provide any thorough
evaluation of its performance and eventually of its flaws.
The discussed architecture is a solution designed to be a
good fit for use cases such as smart homes and smart cities
[27]. A specific example in the automotive sector describes
how the Message Queue/Stream Processing layer helps alert
in real-time a car user about failures thus preventing
eventual accidents [28]. The Database layer is used here to
query the system and obtain information about the status of
a car for checkup or in order to develop a repair strategy.
Finally, the Distributed File System layer can allow the
owner of a car to weekly or monthly assess the overall
metrics and performance of his car and possibly identify
problems. [28] also lists three other potential use cases of
this architectures respectively in biometric database creation
(the example of the Aadhaar system in India), financial
Services and waste collection and recycling. Each of those
use cases requires real-time processing of the data whether
to trigger instantaneous notifications or fraud detection
alerts. But the interactive aspect is also important in order to
generate better routes for trucks or to help target specific
companies with banking offers for instance.
Again, the Distributed File System layer can be leveraged
with aggregation-like operations to investigate fraud cases
that have been flagged over a certain period of time to build
a better detection model. The same layer can help
municipalities to generate useful reports on local waste
recycling activities on a monthly basis.
The MQ/SP (Message queuing and Stream processing) layer
can be implementing Apache Kafka or fluentd for data
collection and Apache Spark or Storm for its processing.
The interactive storage layer can be implemented using any
NoSQL database along with tools like Apache Drill to
interact with it. The DFS layer can use HDFS along with
Hive and Apache Mahout for machine learning over the
master dataset.
The requirements for each of the components of this
architecture have been described in the previous
architectures already. The reader can refer to hardware
requirements section for the Lambda, Kappa and
microservice for more details.
V. ARCHITECTURES COMPARISON
Table 4 summarizes the discussion about the 5
architectures into a simple format where it can be referred to
help Big Data architects make the right choice during the
design of a Big Data ecosystem, depending on their needs
and requirements.
This paper presents a review of the different prominent
existing Big Data architectures. We present an overall
assessment of the recent review work done in the field of
Big Data as a whole. Although there is a plethora of work
concerning the characteristics of Big Data itself, its
application domains, opportunities, challenges and
technologies, there is a lack of comprehensive review work
concerning its architecting process. We present reviews of
several architectures that have been proposed in industry
and in academics in the past years in an attempt to solve real
world problems or to serve as roadmap for Big Data
architects for a wide range of problems. Then, we focus on
five architectures: the lambda architecture, the kappa
architecture, the iot-a architecture, the micro service
architecture and the zeta architecture. We describe and
compare them in terms of the type of processing they can
perform and the type of use cases they are suitable for. We
Query and
reporting/
Analytical/
Predictive
analysis
Query and
reporting/
Analytical
Structured
, Semi-
structured
&
Unstructur
ed
Structured,
Semi-
structured &
Unstructured
Structured,
Semi-
structured &
Unstructured
Structured,
Semi-
structured &
Unstructured
Structured,Se
mi-structured
&
Unstructured
Human &
Machine
generated
, web or
social
media
Machine &
Human
generated,
Web or social
media
Internal data
sources,
machine
generated
Web and
social media,
Internal Data
sources
Human/ Other
data
repositories
TABLE IV. COMPARISON OF THE
list and briefly explain, for each architecture, a list of real
world reliable use cases that can be referred to get
guidelines concerning how to make the most out of each
architecture during its implementation. We also present
known advantages and inconvenient of each architecture as
well as a hardware requirements assessment which can help
stakeholders plan wisely in terms of budget and
infrastructures before going for a Big Data solution.
Big Data architecting is still in its early age and there is still
a lack of reliable information about the technical aspects of
how the industry is leveraging it. There will need to be a lot
more experimentation and applications in order to establish
standards and performance statistics to refine the choice of
an appropriate architecture. Our work can be further
extended with more architectures provided the data
concerning their performance and requirements in real
world use cases is made available. Nevertheless, even at this
stage, we hope it will of great contribution to efficient Big
REFERENCES
[1] Gartner Says Global IT Spending to Reach $3.7 Trillion in 2018. (2018,
January 16). Retrieved from
https://www.gartner.com/newsroom/id/3845563
[2] Press, G. (2017, January 20). 6 Predictions For The $203 Billion Big
Data Analytics Market.Retrieved
fromhttps://www.forbes.com/sites/gilpress/2017/01/20/6-predictions-for-
the-203-billion-big-data-analytics-market/#599b23752083
[3] Kreps, J. (2014, July 2). Questioning the Lambda
Architecture. Retrieved May 26, 2018,
https://www.oreilly.com/ideas/questioning-the-lambda-architecture
[4] Marz, N., & Warren J. (2015). Big Data : Principles and best practices
of scalable realtime data systems. Retrieved from
https://www.manning.com/books/big-data
[5] Zhelev, S.& Rozeva, A.(2017, December).Big Data Processing in the
Cloud - Challenges and Platforms. Paper presented atthe 43rd international
conference applications of mathematics in engineering and economics,
Sozopol, Bulgaria. http://dx.doi.org/10.1063/1.5014007
[6] Ounacer S., Talhaoui M. A., Ardchir S., Daif A.& Azouazi M. (2017).
A New Architecture for Real Time Data Stream Processing. International
Journal of Advanced Computer Science and Applications,8(11), 44-51.
http://dx.doi.org/10.14569/IJACSA.2017.081106
[7] Singh, K. , Behera R. J. &Mantri, J. K.(2018, February).Big Data
Ecosystem - Review On Architectural Evolution. Paper presented at the
International Conference on Emerging Technologies in Data Mining and
Information Security, Kolkata, India. Retrieved from
https://www.researchgate.net/publication/323387483_Big_Data_Ecosystem
_-_Review_on_Architectural_Evolution
[8] Kambatla, K., Kollias, G., Kumar,V. &Grama, A. (2014). Trends in Big
Data Analytics. Journal of Parallel and Distributed Computing,
74(7),2561-2573. https://doi.org/10.1016/j.jpdc.2014.01.003
[9] Chen, M., Mao, S. & Liu, Y.(2014). Big Data : A Survey . Mobile
Networks and Applications, 19(2),171-
209.https://doi.org/10.1007/s11036-013-0489-0
[10] Latinović, T. S., Preradović, D. M., Barz, C. R., Latinović, M. T.,
Petrica, P. P. & Pop-Vadean A. (2015, November).Big Data in Industry.
Paper presented at theInternational Conference on Innovative Ideas in
Science (IIS2015) , Baia Mare, Romania.https://doi.org/10.1088/1757-
899X/144/1/012006
[11] Buckley-Salmon, O. (2017). Using Hazelcast as the Serving Layer in
the Kappa Architecture [PowerPoint slides]. Retrieved from
https://fr.slideshare.net/OliverBuckleySalmon/using-hazelcast-in-the-
kappa-architecture
[12] Kumar, N. (2017, January 31). Twitter's tweets analysis using Lambda
Architecture [Blog post]. Retrieved
fromhttps://blog.knoldus.com/2017/01/31/twitters-tweets-analysis-using-
lambda-architecture/
[13] Dorokhov, V. (2017, March 23). Applying Lambda Architecture on
Azure. Retrieved
fromhttps://www.codeproject.com/Articles/1171443/Applying-Lambda-
Architecture-on-Azure
[14] Katkar, J. (2015).Study of Big Data Architecture Lambda Architecture
(Master's thesis). Retrieved from
http://scholarworks.sjsu.edu/etd_projects/458/
[15] Lakhe, B. (2016). Case Study : implementing Lambda Architecture. In
R. Hutchinson, M. Moodie & C. Collins (Eds.)Practical Hadoop Migration
(pp. 209-251). https://doi.org/10.1007/978-1-4842-1287-5
[16] Seyvet, N. & Viela, I. M. (2016, May 19). Applying the Kappa
Architecture in the telco industry. Retrieved
fromhttps://www.oreilly.com/ideas/applying-the-kappa-architecture-in-the-
telco-industry
[17] Garcia, J. (2015). Kappa Architecture [PowerPoint slides]. Retrieved
from https://fr.slideshare.net/juantomas/aspgems-kappa-architecture
[18] Richardson, C. (n.d.). Pattern : Microservice architecture. Retrieved
from http://microservices.io/patterns/microservices.html
[19] Huston, T. (n.d.).What is microservice architecture?. Retrieved
fromhttps://smartbear.com/learn/api-design/what-are-microservices/
[20] Kumar, M. (2016, January 5).Microservices Architecture : What,
When, And How?. Retrieved fromhttps://dzone.com/articles/microservices-
architecture-what-when-how
[21] Melis, A., Mirri, S., Prandi, C., Prandini, M., Salomoni, P. & Callegati,
F. (2016, November). A Microservice Architecture Use Case for Persons
with disabilities.Paper presented at Smart Objects and Technologies for
Social Good : Second International Conference, GOODTECHS 2016,
Venice, Italy. https://doi.org/10.1007/978-3-319-61949-1_5
[22] Scott, J. (2017,February 21).Using microservices to evolve beyond the
data lake. Retrieved fromhttps://www.oreilly.com/ideas/using-
microservices-to-evolve-beyond-the-data-lake
[23] Pal, K. (2015, September 28). What can the zeta Architecture do for
Enterprise?. Retrieved
fromhttps://www.techopedia.com/2/31357/technology-trends/what-can-the-
zeta-architecture-do-for-enterprise
[24] Konieczny, B. (2017, April 9). General Big Data. Retrieved from
http://www.waitingforcode.com/general-big-data/zeta-architecture/read
[25] Madakam, S., Ramaswamy, R. & Tripathi, S. (2015). Internet of
Things (IoT) : A Literature Review. Journal of Computer Science&
Communications, 3(5), 164-173. http://dx.doi.org/10.4236/jcc.2015.35021
[26] Hausenblas, M. (2015, January 19). Key Requirements for an IOT data
platform. Retrieved fromhttps://mapr.com/blog/key-requirements-iot-data-
platform/
[27] Hausenblas, M. (2014, September 9). iot-a : the internet of things
architecture. Retrieved from https://github.com/mhausenblas/iot-a.info
[28] Hausenblas, M. (2015, April 4). A Modern IoT data processing
toolbox [PowerPoint slides]. Retrieved from
https://fr.slideshare.net/Hadoop_Summit/a-modern-iot-data-processing-
toolbox
[29] Hausenblas, M. & Bijnens, N. (2014, July 1). Lambda Architecture.
Retrieved from http://lambda-architecture.net/
[30] Chu, A. (2016, March 28). Implementing Lambda Architecture to
track real-time updates. Retrieved from
https://blog.insightdatascience.com/implementing-lambda-architecture-to-
track-real-time-updates-f99f03e0c53
[31] Eudy, K. (2018, March 7). A healthcare use case for Business Rules in
a Microservices Architecture. Retrieved from
https://blog.vizuri.com/business-rules-in-a-microservices-architecture
[32] Hardware provisioning - Spark 2.3.1 documentation (n.d.) . Retrieved
from https://spark.apache.org/docs/latest/hardware-provisioning.html
[33] Cassandra/Hardware (2017, May 12). Retrieved from
https://wikitech.wikimedia.org/wiki/Cassandra/Hardware
[34] Simplilearn (n.d.). Apache Storm - Installation and Configuration
Tutorial. Retrieved from https://www.simplilearn.com/apache-storm-
installation-and-configuration-tutorial-video
[35] Example sizing (n.d.). Retrieved from
https://www.ibm.com/support/knowledgecenter/en/SSPFMY_1.3.5/com.ib
m.scala.doc/config/iwa_cnf_scldc_hw_exmple_c.html
[36] Hardware requirements and recommendations (n.d.). Retrieved from
https://www.ibm.com/support/knowledgecenter/en/SSBS6K_2.1.0/supporte
d_system_config/hardware_reqs.html
[37] Installing Jenkins (n.d.). Retrieved from
https://jenkins.io/doc/book/installing/
[38] Blumberg, G., Bossert, O., Grabenhorst, H. & Soller, H. (2017,
November). Why you need a digital data architecture to build a sustainable
digital business. Retrieved from https://www.mckinsey.com/business-
functions/digital-mckinsey/our-insights/why-you-need-a-digital-data-
architecture
[39] Pekka , P., & Daniel, P. (2015). Reference Architecture and
Classification of Technologies, Products and Services for Big Data Systems.
Big Data Research 2(4). 166-186. doi :
https://doi.org/10.1016/j.bdr.2015.01.001
[40] Mert, O. G., & al. (2017). Big-Data Analytics Architecture for
Businesses: a comprehensive review on new open-source big-data tools.
Cambridge Service Alliance. Retrieved from
https://cambridgeservicealliance.eng.cam.ac.uk/news/2017OctPaper
[41] Peter, M., Ján, Š. & Iveta Z. (2014). Concept Definition for Big Data
Architecture in the Education System. Paper presented at the 12th
International Symposium on Applied Machine Intelligence and Informatics,
Herl'any, Slovakia, 2014. https://doi.org/10.1109/SAMI.2014.6822433
[42] Xing, H., Qi & al. (2017). A Big Data Architecture Design for Smart
Grids based on Random Matrix Theory. IEEE Transactions on Smart Grid
8(2). 674-686. Doi : https://doi.org/10.1109/TSG.2015.2445828
[43] Samuel, M., Xiuyan, J., Radu, S. & Thomas, E. (2014). A Big Data
architecture for Large Scale Security Monitoring. Paper presented at IEEE
International Congress of Big Data, Anchorage, AK, USA, 2014.
https://doi.org/10.1109/BigData.Congress.2014.18
[44] Yichuan, W., LeeAnn, K. & Terry, A., B. (2016). Big Data Analytics :
Understanding its capabilities and potential benefits for healthcare
organizations. Technological forecasting and social change 126 . 3-13. doi :
https://doi.org/10.1016/j.techfore.2015.12.019
[45] Fei, S., Yi, P., Xu, M., Xinzhou, C., & Weiwei, C. (2016). The
research of Big Data on Telecom industry. Paper presented at 16th
International Symposium on Communications and Information
Technologies (ISCIT), QingDao, China, 2016.
https://doi.org/10.1109/ISCIT.2016.7751636
[46] Guilherme, G., Paulo, F., Ricardo, S., Ruben, C. & Ricardo, J. (2016).
An Architecture for Big Data Processing on Intelligent Transportation
Systems. Paper presented at IEEE 8th International Conference on
Intelligent Systems, Sofia, Bulgaria, 2016.
https://doi.org/10.1109/IS.2016.7737393
[47] Go, M. S., Lai, X., & Paul, V. (2016). A reference Architecture for Big
Data Systems. Paper presented at 10th International Conference on
Software, Knowledge, Information Management & Applications (SKIMA),
Chengdu, China, 2016. Doi : 10.1109/SKIMA.2016.7916249
[48] Sanjib, B. & Jaydip, S. (2017). A Proposed Architecture for Big Data
Driven Supply Chain Analytics. ICFAI University Press (IUP) Journal of
Supply Chain Management 13(3). 7-34. Doi :
https://arxiv.org/ct?url=https%3A%2F%2Fdx.doi.org%2F10.2139%252Fss
rn.2795906&v=b6e0857f
[49] Julio, M., Manuel A. S., Eduardo, F. & Eduardo, B. F. ( 2018).
Towards a Security Reference Architecture for Big Data. Paper presented at
21st International Conference on Extending Database Technology and 21st
International Conference on Database Theory joint conference, Vienna,
Austria, 2018. Retrieved from
https://www.semanticscholar.org/paper/Towards-a-Security-Reference-
Architecture-for-Big-Moreno-
Serrano/3966ce99e50c741dcb401707c6c8cacd8420d27e
[50] Yuri, D., Canh, N. & Peter, M. (2013). Architecture Framework and
Components for the Big Data Ecosystem. Paper presented at International
Conference on Collaboration Technologies and Systems (CTS),
Minneapolis, MN, USA, 2014. Doi :
https://doi.org/10.1109/CTS.2014.6867550
[51] Doug, C., Oracle. (2014). Information Management and Big Data : A
Reference Architecture [White paper]. Retrieved from
https://www.oracle.com/technetwork/topics/entarch/articles/info-mgmt-big-
data-ref-arch-1902853.pdf
[52] Microsoft. (2014). Microsoft Big Data : Solution Brief. Retrieved from
http://download.microsoft.com/download/f/a/1/fa126d6d-841b-
4565bb26d2add4a28f24/microsoft_big_data_solution_brief.pdf
[53] IBM Corporation. (2014). IBM Big Data & Analytics Reference
Architecture v1. Retrieved from
https://www.ibm.com/developerworks/community/files/form/anonymous/a
pi/library/e747a4bd-614d-4c5d-a411-856255c9ddc4/document/bbc80340-
3bf4-4e0a-8caf-a43f64a22f05/media
[54] NIST NBD-WG. (2017). Draft NIST Big Data Interoperability
Framework : Volume 6, Reference Architecture. Retrieved
https://bigdatawg.nist.gov/_uploadfiles/M0639_v1_9796711131.docx
[55] Nawsher, K. & al. (2014). Big Data: Survey, Technologies,
Opportunities, and Challenges. The Scientific World Journal 2014 (2014).
1-19. doi : http://dx.doi.org/10.1155/2014/712826
[56] Seref, S. & Duygu, S., (2013). Big Data : A Review. Paper presented
at International Conference on Collaboration Technologies and Systems
(CTS), San Diego, CA, USA. Doi :
https://doi.org/10.1109/CTS.2013.6567202
[57] Andrea, M., Marco, G., & Michele, G. (2015). What is Big Data? A
Consensual Definition and a Review of Key Research Topics. Paper
presented at 4th International Conference on Integrated Information,
Madrid, Spain, 2014. Doi : https://doi.org/10.1063/1.4907823
[58] Amir, G. & Murtaza, H. (2014). Beyond the hype : Big data concepts,
methods and analytics. International Journal of Information Management
35(2). 137–144. Doi : https://doi.org/10.1016/j.ijinfomgt.2014.10.007
[59] Chen, M., Mao, S. & Liu, Y. (2014). Big Data: A Survey. Mobile
Networks and Applications 19(2). 171-209. Doi :
https://doi.org/10.1007/s11036-013-0489-0
[60] Elgendy N. & Elragal A. (2014). Big Data Analytics: A Literature
Review Paper. Paper presented at Industrial Conference on Data Mining, St.
Petersburg, Russia, 2014. doi : https://doi.org/10.1007/978-3-319-08976-
8_16
[61] Bernard M. (2018, May 21). How Much Data Do We Create Everyday?
The Mind-Blowing Stats Everyone Should Read. Retrieved from
https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-
we-create-every-day-the-mind-blowing-stats-everyone-should-
read/#4c87a72b60ba
[62] Tom, H. (2017, July 26). How much data does the world generate
every minute? Retrieved from https://www.iflscience.com/technology/how-
much-data-does-the-world-generate-every-minute/
[63] Josh J. (DOMO) , (2018, June 5). Data Never Sleeps 6.0. Retrieved
from https://www.domo.com/blog/data-never-sleeps-6/
[64] Mary, L. (WordStream) (2018, October 2017). 33 Mind-Boggling
Instagram Stats & Facts for 2018. Retrieved from
https://www.wordstream.com/blog/ws/2017/04/20/instagram-statistics
[65] International Data Corporation (IDC), Intel. A Guide to the Internet of
Things. Retrieved from https://www.intel.in/content/www/in/en/internet-of-
things/infographics/guide-to-iot.html
[66] Nasser, T., & Tariq, R. S. (2015). Big Data Challenges. Journal of
Computer Engineering and Informatiion Technology 4(3). 1-10.
http://dx.doi.org/10.4172/2324-9307.1000135
[67] Chaowei, Y., Qunying, H., Zhenlong, L., Kai, L. & Fei H. (2017). Big
Data and Cloud Computing : Innovation Opportunities and Cloud
Computing. International Journal of Digital Earth 10 (1). 13-53.
https://doi.org/10.1080/17538947.2016.1239771
[68] Uthayasankar, S., Muhammad, M. K., Zahir, I. & Vishanth, W. (2016).
Critical analysis of Big Data Challenges and Analytical Methods. Journal
of Business Research 70. 263-286.
https://doi.org/10.1016/j.jbusres.2016.08.001
[69] Zoiner, T., Mike, W. (2018, March 31). Big Data architectures.
Retrieved from https://docs.microsoft.com/en-us/azure/architecture/data-
guide/big-data/
... To the best of our knowledge, there is no airline ancillary services recommendation system. In this paper, we propose an architecture based on Big Data Lambda Architecture [6] for distributed ancillary services data. In the proposed architecture, we use Apriori and FP-Growth algorithms as association rule mining algorithms to generate recommendations. ...
... Typical examples are microservices, Lambda, Kappa, and Zeta architectures. [6]. Among them, Lambda and Kappa architectures stand out as the most modern architectures. ...
... • BigData -основное хранилище с данными большого объёма, требующими значительных ресурсов на использование и обслуживание [14][15][16][17][18] . ...
- V. M. Alexeev
- L. A. Baranov
- M. A. Kulagin
- V. G. Sidorenko
The increase in the volume of passenger transportation in megalopolises and large urban agglomerations is efficiently provided by the integration of urban public transit systems and city railways. Traffic management under those conditions requires creating intelligent centralised multi-level traffic control systems that implement the required indicators of quality, comfort, and traffic safety regarding passenger transportation. Besides, modern control systems contribute to traction power saving, are foundation and integral part of the systems of digitalisation of urban transit and the cities. Building systems solving the traffic planning and control tasks is implemented using algorithms based on the methods of artificial intelligence, principles of hierarchically structured centralised systems, opportunities provided by Big Data technology. Under those conditions it is necessary to consider growing requirements towards software as well as theoretical design and practical implementation of network organisation. This article discusses designing architecture and shaping requirements for developed applications and their integration with databases to create a centralised intelligent control system for the urban rail transit system (CICS URTS). The article proposes the original architecture of the network, routing of information flows and software of CICS URTS. The routing design is based on a fully connected network. This allows to significantly increase the network bandwidth and meet the requirements regarding information protection, since information flows are formed based on the same type of protocols, which prevents emergence of covert transmission channels. The implementation of the core using full connectivity allows, according to the tags of information flows, to pre-form the routes for exchange of information between servers and applications deployed in CICS URTS. The use of encrypted tags of information flows makes it much more difficult to carry out attacks and organise collection of information about the network structure. Platforms for development of intelligent control systems (ICS), which include CICS URTS, high computing power, data storage capacity and new frameworks are becoming more available for researchers and developers and allow rapid development of ICS. The article discusses the issues of interaction of applications with databases through a combination of several approaches used in the field of Big Data, substantiates combination of Internet of Things (IoT) methodology and microservice architecture. This combination will make it possible to single out business processes in the system and form streaming data processing requiring operational analysis by a human, which is shown by relevant examples. Thus, the objective of the article is to formalise the principles of organising data exchange between CICS URTS and automated control systems (ACS) of railway companies (in our case, using the example of JSC Russian Railways), URTS services providers, and city government bodies, implement the developed approaches into the architecture of CICS URTS and formalise principles of organisation of microservice architecture of CICS URTS software. The main research methods are graph theory, Big Data and IoT methods.
... Further, in this discussion, we compare the merits of our work in this paper with a review on various architectural models and their stereotypical use cases that were profiled recently [57]. We observe from such reviews that the hybrid architecture is preferred in situations where there is a need for both real-time analytics and periodic batch analytics on the entire data collection. ...
Highly populated cities depend highly on intelligent transportation systems (ITSs) for reliable and efficient resource utilization and traffic management. Current transportation systems struggle to meet different stakeholder expectations while trying their best to optimize resources in providing various transport services. This paper proposes a Microservice-Oriented Big Data Architecture (MOBDA) incorporating data processing techniques, such as predictive modelling for achieving smart transportation and analytics microservices required towards smart cities of the future. We postulate key transportation metrics applied on various sources of transportation data to serve this objective. A novel hybrid architecture is proposed to combine stream processing and batch processing of big data for a smart computation of microservice-oriented transportation metrics that can serve the different needs of stakeholders. Development of such an architecture for smart transportation and analytics will improve the predictability of transport supply for transport providers and transport authority as well as enhance consumer satisfaction during peak periods.
Choosing the appropriate architecture and technologies for a big data project is a difficult task, which requires extensive knowledge in both the problem domain and in the big data landscape. The paper analyzes the main big data architectures and the most widely implemented technologies used for processing and persisting big data. Clouds provide for dynamic resource scaling, which makes them a natural fit for big data applications. Basic cloud computing service models are presented. Two architectures for processing big data are discussed, Lambda and Kappa architectures. Technologies for big data persistence are presented and analyzed. Stream processing as the most important and difficult to manage is outlined. The paper highlights main advantages of cloud and potential problems.
Applications supporting the independent living of people with disabilities are usually built in a monolithic fashion for a specific purpose. On the other hand, a crucial sector for the livability of urban spaces such as mobility is undergoing a deep transformation, heading towards flexible composition of standardized services. This paper shows how this approach allows to build better applications for people with specific needs, making them seamlessly integrated in the most modern approach to smart mobility.
Big Data has emerged in the past few years as a new paradigm providing abundant data and opportunities to improve and/or enable research and decision-support applications with unprecedented value for digital earth applications including business, sciences and engineering. At the same time, Big Data presents challenges for digital earth to store, transport, process, mine and serve the data. Cloud computing provides fundamental support to address the challenges with shared computing resources including computing, storage, networking and analytical software; the application of these resources has fostered impressive Big Data advancements. This paper surveys the two frontiers – Big Data and cloud computing – and reviews the advantages and consequences of utilizing cloud computing to tackling Big Data in the digital earth and relevant science domains. From the aspects of a general introduction, sources, challenges, technology status and research opportunities, the following observations are offered: (i) cloud computing and Big Data enable science discoveries and application developments; (ii) cloud computing provides major solutions for Big Data; (iii) Big Data, spatiotemporal thinking and various application domains drive the advancement of cloud computing and relevant technologies with new requirements; (iv) intrinsic spatiotemporal principles of Big Data and geospatial sciences provide the source for finding technical and theoretical solutions to optimize cloud computing and processing Big Data; (v) open availability of Big Data and processing capability pose social challenges of geospatial significance and (vi) a weave of innovations is transforming Big Data into geospatial research, engineering and business values. This review introduces future innovations and a research agenda for cloud computing supporting the transformation of the volume, velocity, variety and veracity into values of Big Data for local to global digital earth science and applications.
Big Data (BD), with their potential to ascertain valued insights for enhanced decision-making process, have recently attracted substantial interest from both academics and practitioners. Big Data Analytics (BDA) is increasingly becoming a trending practice that many organizations are adopting with the purpose of constructing valuable information from BD. The analytics process, including the deployment and use of BDA tools, is seen by organizations as a tool to improve operational efficiency though it has strategic potential, drive new revenue streams and gain competitive advantages over business rivals. However, there are different types of analytic applications to consider. Therefore, prior to hasty use and buying costly BD tools, there is a need for organizations to first understand the BDA landscape. Given the significant nature of the BD and BDA, this paper presents a state-of-the-art review that presents a holistic view of the BD challenges and BDA methods theorized/proposed/employed by organizations to help others understand this landscape with the objective of making robust investment decisions. In doing so, systematically analysing and synthesizing the extant research published on BD and BDA area. More specifically, the authors seek to answer the following two principal questions: Q1 – What are the different types of BD challenges theorized/proposed/confronted by organizations? and Q2 – What are the different types of BDA methods theorized/proposed/employed to overcome BD challenges?. This systematic literature review (SLR) is carried out through observing and understanding the past trends and extant patterns/themes in the BDA research area, evaluating contributions, summarizing knowledge, thereby identifying limitations, implications and potential further research avenues to support the academic community in exploring research themes/patterns. Thus, to trace the implementation of BD strategies, a profiling method is employed to analyze articles (published in English-speaking peer-reviewed journals between 1996 and 2015) extracted from the Scopus database. The analysis presented in this paper has identified relevant BD research studies that have contributed both conceptually and empirically to the expansion and accrual of intellectual wealth to the BDA in technology and organizational resource management discipline.
The amount of data at the global level has grown exponentially. Along with this phenomena, we have a need for a new unit of measure like exabyte, zettabyte, and yottabyte as the last unit measures the amount of data. The growth of data gives a situation where the classic systems for the collection, storage, processing, and visualization of data losing the battle with a large amount, speed, and variety of data that is generated continuously. Many of data that is created by the Internet of Things, IoT (cameras, satellites, cars, GPS navigation, etc.). It is our challenge to come up with new technologies and tools for the management and exploitation of these large amounts of data. Big Data is a hot topic in recent years in IT circles. However, Big Data is recognized in the business world, and increasingly in the public administration. This paper proposes an ontology of big data analytics and examines how to enhance business intelligence through big data analytics as a service by presenting a big data analytics services-oriented architecture. This paper also discusses the interrelationship between business intelligence and big data analytics. The proposed approach in this paper might facilitate the research and development of business analytics, big data analytics, and business intelligence as well as intelligent agents.
One of the buzzwords in the Information Technology is Internet of Things (IoT). The future is In-ternet of Things, which will transform the real world objects into intelligent virtual objects. The IoT aims to unify everything in our world under a common infrastructure, giving us not only control of things around us, but also keeping us informed of the state of the things. In Light of this, present study addresses IoT concepts through systematic review of scholarly research papers, corporate white papers, professional discussions with experts and online databases. Moreover this research article focuses on definitions, geneses, basic requirements, characteristics and aliases of Internet of Things. The main objective of this paper is to provide an overview of Internet of Things, architectures, and vital technologies and their usages in our daily life. However, this manuscript will give good comprehension for the new researchers, who want to do research in this field of Internet of Things (Technological GOD) and facilitate knowledge accumulation in efficiently .
Size is the first, and at times, the only dimension that leaps out at the mention of big data. This paper attempts to offer a broader definition of big data that captures its other unique and defining characteristics. The rapid evolution and adoption of big data by industry has leapfrogged the discourse to popular outlets, forcing the academic press to catch up. Academic journals in numerous disciplines, which will benefit from a relevant discussion of big data, have yet to cover the topic. This paper presents a consolidated description of big data by integrating definitions from practitioners and academics. The paper's primary focus is on the analytic methods used for big data. A particular distinguishing feature of this paper is its focus on analytics related to unstructured data, which constitute 95% of big data. This paper highlights the need to develop appropriate and efficient analytical methods to leverage massive volumes of heterogeneous data in unstructured text, audio, and video formats. This paper also reinforces the need to devise new tools for predictive analytics for structured big data. The statistical methods in practice were devised to infer from sample data. The heterogeneity, noise, and the massive size of structured big data calls for developing computationally efficient algorithms that may avoid big data pitfalls, such as spurious correlation.
- Bhushan Lakhe
Re-architect relational applications to NoSQL, integrate relational database management systems with the Hadoop ecosystem, and transform and migrate relational data to and from Hadoop components. This book covers the best-practice design approaches to re-architecting your relational applications and transforming your relational data to optimize concurrency, security, denormalization, and performance. Winner of IBM's 2012 Gerstner Award for his implementation of big data and data warehouse initiatives and author of Practical Hadoop Security, author Bhushan Lakhe walks you through the entire transition process. First, he lays out the criteria for deciding what blend of re-architecting, migration, and integration between RDBMS and HDFS best meets your transition objectives. Then he demonstrates how to design your transition model. Lakhe proceeds to cover the selection criteria for ETL tools, the implementation steps for migration with SQOOP- and Flume-based data transfers, and transition optimization techniques for tuning partitions, scheduling aggregations, and redesigning ETL. Finally, he assesses the pros and cons of data lakes and Lambda architecture as integrative solutions and illustrates their implementation with real-world case studies. Hadoop/NoSQL solutions do not offer by default certain relational technology features such as role-based access control, locking for concurrent updates, and various tools for measuring and enhancing performance. Practical Hadoop Migration shows how to use open-source tools to emulate such relational functionalities in Hadoop ecosystem components. What You'll Learn • Decide whether you should migrate your relational applications to big data technologies or integrate them • Transition your relational applications to Hadoop/NoSQL platforms in terms of logical design and physical implementation • Discover RDBMS-to-HDFS integration, data transformation, and optimization techniques • Consider when to use Lambda architecture and data lake solutions • Select and implement Hadoop-based components and applications to speed transition, optimize integrated performance, and emulate relational functionalities Who This Book Is For Database developers, database administrators, enterprise architects, Hadoop/NoSQL developers, and IT leaders. Its secondary readership is project and program managers and advanced students of database and management information systems.
- Andrea De Mauro
Although Big Data is a trending buzzword in both academia and the industry, its meaning is still shrouded by much conceptual vagueness. The term is used to describe a wide range of concepts: from the technological ability to store, aggregate, and process data, to the cultural shift that is pervasively invading business and society, both drowning in information overload. The lack of a formal definition has led research to evolve into multiple and inconsistent paths. Furthermore, the existing ambiguity among researchers and practitioners undermines an efficient development of the subject. In this paper we have reviewed the existing literature on Big Data and analyzed its previous definitions in order to pursue two results: first, to provide a summary of the key research areas related to the phenomenon, identifying emerging trends and suggesting opportunities for future development; second, to provide a consensual definition for Big Data, by synthesizing common themes of existing works and patterns in previous definitions.
Posted by: elenorewildaye0198295.blogspot.com
Source: https://www.researchgate.net/publication/336915402_Big_Data_Architectures_A_detailed_and_application_oriented_review
Post a Comment