If you have a short attention span, see the shorter blog post.
If you have a longer attention span, see the complete 12-page paper.
There are two undeniable trends in analytical data management. First, the amount of data that needs to be stored and processed is exploding. This is partly due to the increased automation with which data can be produced (more business processes are becoming digitized), the proliferation of sensors and data-producing devices, Web-scale interactions with customers, and government compliance demands along with strategic corporate initiatives requiring more historical data to be kept online for analysis. It is no longer uncommon to hear of companies claiming to load more than a terabyte of structured data per day into their analytical database system and claiming data warehouses more than a petabyte in size (see the end of this write-up for some links to large data warehouses).
The second trend is what I talked about in my last blog post: the increased desire to perform more and more complex analytics and data mining inside of the DBMS.
I predict that the combination of these two trends will lead to a scalability crisis for the parallel database system industry. This prediction flies in the face of conventional wisdom. If you talk to prominent DBMS researchers, they'll tell you that shared-nothing parallel database systems horizontally scale indefinitely, with near linear scalability. If you talk to a vendor of a shared-nothing MPP DBMS, such as Teradata, Aster Data, Greenplum, ParAccel, and Vertica, they'll tell you the same thing. Unfortunately, they're all wrong. (Well, sort of.)
Parallel database systems scale really well into the tens and even low hundreds of machines. Until recently, this was sufficient for the vast majority of analytical database applications. Even the enormous eBay 6.5 petabyte database (the biggest data warehouse I've seen written about) was implemented on (only) a 96-node Greenplum DBMS. But as I wrote about previously, this implementation allows for only a handful of CPU cycles to be spent processing tuples as they are read off disk. As the second trend kicks in, resulting in an increased amount and complexity of data analysis performed inside the DBMS, this architecture will be entirely unsuitable, and will be replaced by deployments with many more compute nodes at a much larger horizontal scale. Once you add the fact that many argue it is far more efficient, from a hardware-cost and power-utilization perspective, to run an application on many low-cost, low-power machines instead of fewer high-cost, high-power machines (see, e.g., the work by James Hamilton), it will not be at all uncommon to see data warehouse deployments on many thousands of machines (real or virtual) in the future.
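As a rough back-of-envelope illustration of why a scan-bound node has so few cycles to spend per tuple (the disk bandwidth, tuple width, and clock rate below are round-number assumptions I picked, not figures from the eBay deployment):

```python
# Back-of-envelope: CPU cycles available per tuple on a scan-bound node.
# All inputs are illustrative assumptions, not measurements.
disk_bandwidth_bytes_per_s = 100e6   # ~100 MB/s sequential read from one disk
tuple_size_bytes = 100               # assumed average tuple width
cpu_hz = 3e9                         # one ~3 GHz core dedicated to the scan

tuples_per_second = disk_bandwidth_bytes_per_s / tuple_size_bytes
cycles_per_tuple = cpu_hz / tuples_per_second
print(f"{tuples_per_second:,.0f} tuples/s -> ~{cycles_per_tuple:,.0f} cycles per tuple")
# ~1,000,000 tuples/s -> ~3,000 cycles per tuple: enough for selection and simple
# aggregation, but not for complex analytics or data mining on every tuple.
```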
Unfortunately, parallel database systems, as they are implemented today, do not scale well into the realm of many thousands of nodes. There are a variety of reasons for this. First, they all compete with each other on performance. The marketing literature of MPP database systems is littered with wild claims of jaw-dropping performance relative to their competitors. These systems will also implement some amount of fault tolerance, but as soon as performance becomes a tradeoff with fault tolerance (e.g., by implementing frequent mid-query checkpointing), performance will be chosen every time. At the scale of tens to hundreds of nodes, a mid-query failure of one of the nodes is a rare event. At the scale of many thousands of nodes, such events are far more common. Some parallel database systems lose all work that has been done thus far in processing a query when a DBMS node fails; others just lose a lot of work (Aster Data might be the best amongst its competitors along this metric). However, no parallel database system (that I'm aware of) is willing to pay the performance overhead to lose a minimal amount of work upon a node failure.
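To make that concrete, a small sketch (the per-node, per-query failure probability is an illustrative assumption) shows how quickly "rare" becomes "routine" as the cluster grows:

```python
# Probability that at least one node fails during a query, as cluster size grows.
# The per-node, per-query failure probability is an illustrative assumption.
p_node_failure = 0.0001  # chance a given node fails while processing one (long) query

for nodes in (10, 100, 1_000, 10_000):
    p_any_failure = 1 - (1 - p_node_failure) ** nodes
    print(f"{nodes:>6} nodes: P(at least one mid-query failure) = {p_any_failure:.1%}")
# Roughly 0.1% at 10 nodes, 1% at 100, 9.5% at 1,000, and 63% at 10,000 nodes,
# at which point restarting the query (or losing most of its work) is the norm.
```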
Second, while it is possible to get reasonably homogeneous performance across tens to hundreds of nodes, this is nearly impossible across thousands of nodes, even if each node runs on identical hardware or an identical virtual machine. Partial failures that do not take a node down entirely, but leave its hardware performing in a degraded state, become more common at scale. Individual node disk fragmentation and software configuration errors can also cause degraded performance on some nodes. Concurrent queries (or, in some cases, concurrent processes) further reduce the homogeneity of cluster performance. Furthermore, we have seen wild fluctuations in node performance when running on virtual machines in the cloud. Parallel database systems tend to do query planning in advance and will assign each node an amount of work to do based on the expected performance of that node. When running on small numbers of nodes, extreme outliers from expected performance are a rare event, and it is not worth paying the extra performance overhead for runtime task scheduling. At the scale of many thousands of nodes, extreme outliers are far more common, and query latency ends up being approximately equal to the time it takes these slow outliers to finish processing.
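A similarly simple sketch (again, all the numbers are made up) shows why query latency converges on the slowest outlier once the cluster is large and work is not re-assigned at runtime:

```python
# Query latency is the *maximum* of the per-node completion times, so rare slow
# outliers dominate at scale. Distribution parameters are illustrative assumptions.
import random

random.seed(42)
BASE_TIME = 60.0     # seconds a healthy node needs for its partition
P_DEGRADED = 0.01    # 1% of nodes are degraded (failing disk, noisy neighbor, ...)
SLOWDOWN = 5.0       # degraded nodes run 5x slower

def node_time():
    return BASE_TIME * (SLOWDOWN if random.random() < P_DEGRADED else 1.0)

for nodes in (10, 100, 1_000, 10_000):
    latency = max(node_time() for _ in range(nodes))  # no runtime re-scheduling
    print(f"{nodes:>6} nodes: query latency ~ {latency:.0f}s")
# With a handful of nodes the query usually finishes in ~60s; with thousands of
# nodes there is almost always at least one straggler, so latency sits near 300s
# unless work is re-assigned at runtime (as Hadoop's speculative execution does).
```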
Third, many parallel databases have not been tested at the scale of many thousands of nodes, and in my experience, unexpected bugs in these systems start to appear at this scale.
In my opinion the "scalability problem" is one of two reasons why we're starting to see Hadoop encroach on the structured analytical database market traditionally dominated by parallel DBMS vendors (see the Facebook Hadoop deployment as an example). Hadoop simply scales better than any currently available parallel DBMS product. Hadoop gladly pays the performance penalty for runtime task scheduling and excellent fault tolerance in order to yield superior scalability. (The other reason Hadoop is gaining market share in the structured analytical DBMS market is that it is free and open source, and there exists no good free and open source parallel DBMS implementation.)
The problem with Hadoop is that it also gives up some performance in other areas where there is no tradeoff with scalability. Hadoop was not originally designed for structured data analysis, and thus is significantly outperformed by parallel database systems on structured data analysis tasks. Furthermore, it is a relatively young piece of software and has not implemented many of the performance-enhancing techniques developed by the research community over the past few decades, including direct operation on compressed data, materialized views, result caching, and I/O scan sharing.
Ideally, there would exist an analytical database system that achieves the scalability of Hadoop along with the performance of parallel database systems (at least the performance that is not the result of a tradeoff with scalability). And ideally this system would be free and open source.
That's why my students Azza Abouzeid and Kamil Bajda-Pawlikowski developed HadoopDB. It's an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines. In essence, it is a hybrid of MapReduce and parallel DBMS technologies. But unlike Aster Data, Greenplum, Pig, and Hive, it is not a hybrid simply at the language/interface level. It is a hybrid at a deeper, systems implementation level. Also unlike Aster Data and Greenplum, it is free and open source.
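To give a flavor of the idea (the code below is not HadoopDB's actual implementation, which has its own database connector and SQL-to-MapReduce planner; the table, column, and connection names are invented for illustration), here is a Hadoop Streaming style sketch in which each node's local PostgreSQL instance computes a partial aggregate over its own data partition and Hadoop merges the partials:

```python
# Illustrative Hadoop Streaming mapper/reducer for the HadoopDB idea:
# push a partial aggregate into the local PostgreSQL partition, merge in Hadoop.
# Table/column names and connection settings are invented for this sketch.
import sys
import psycopg2

def mapper():
    # Each map task queries the PostgreSQL instance on its own node and emits
    # (group_key, partial_sum) pairs for the local data partition.
    conn = psycopg2.connect(dbname="hadoopdb_chunk", host="localhost")
    cur = conn.cursor()
    cur.execute("SELECT page_url, SUM(ad_revenue) FROM uservisits GROUP BY page_url")
    for page_url, partial_sum in cur:
        print(f"{page_url}\t{partial_sum}")
    conn.close()

def reducer():
    # Hadoop sorts mapper output by key; sum the partial aggregates per key.
    current_key, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0.0
        total += float(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

HadoopDB itself decides how much of a SQL query can be pushed down into the underlying PostgreSQL instances; the sketch above only illustrates the partial-aggregate-then-merge pattern that makes the hybrid work.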
Our paper (which will be presented at the upcoming VLDB conference in the last week of August) shows that HadoopDB achieves fault tolerance and an ability to tolerate wild fluctuations in runtime node performance similar to Hadoop's, while still approaching the performance of commercial parallel database systems (of course, it still gives up some performance due to the above-mentioned tradeoffs).
Although HadoopDB is currently built on top of PostgreSQL, other database systems can theoretically be substituted for PostgreSQL. We have successfully been able to run HadoopDB using MySQL instead, and are currently working on optimizing connectors to open source column-store database systems such as MonetDB and Infobright. We believe that switching from PostgreSQL to a column-store will result in even better performance on analytical workloads.
The initial release of the source code for HadoopDB can be found at http://db.cs.yale.edu/hadoopdb/hadoopdb.html. Although at this point this code is just an academic prototype and some ease-of-use features are yet to be implemented, I hope that this code will nonetheless be useful for your structured data analysis tasks!
Tuesday, August 4, 2009
Yale researchers create database-Hadoop hybrid
Yale University researchers on Monday released an open-source parallel database that they say combines the data-crunching prowess of a relational database with the scalability of next-generation technologies such as Hadoop and MapReduce.
HadoopDB was announced on Monday by Yale computer science professor Daniel J. Abadi on his blog.
Abadi and his students created HadoopDB from components including the open-source PostgreSQL database, the Apache Hadoop distributed data-processing framework, and Hive, the Hadoop-based data warehouse system originally developed at Facebook Inc.
Queries can be submitted either as MapReduce jobs (MapReduce is the programming model invented by Google Inc., and implemented by Hadoop, originally for building Google's index of the World Wide Web) or in conventional SQL.
Similarly, data processing is partly done in Hadoop and partly in "different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines," wrote Abadi.
"In essence, it is a hybrid of MapReduce and parallel DBMS technologies," he continued. But unlike already-developed projects and vendors such as Aster Data, Greenplum or Hive, HadoopDB "is not a hybrid simply at the language/interface level. It is a hybrid at a deeper, systems implementation level."
By combining the best of both approaches, HadoopDB can achieve the fault tolerance of massively parallel data infrastructures such as MapReduce, where a server failure has little effect on the overall grid. And it can perform complex analyses almost as quickly as existing commercial parallel databases, claims Abadi.
The source code for HadoopDB is available now.
Abadi's solution, while experimental, could appeal to Web 2.0 firms and other members of the burgeoning 'NoSQL' movement.
It might eventually also appeal to enterprises looking for less-expensive, more scalable alternatives to Oracle's Database, IBM's DB2 or Microsoft's SQL Server.
Abadi was one of the co-authors of a research paper released in April that found that for most users and applications, relational databases still beat MapReduce and Hadoop.
In an e-mail, Abadi said that his current research doesn't repudiate the previous paper, but comes to the strong conclusion that as databases continue to grow, systems such as HadoopDB will "scale much better than parallel databases."
Though built with PostgreSQL, HadoopDB can use other databases for engines. Abadi's team has already successfully used MySQL, said Abadi, and plan to also try using columnar databases such as Infobright and MonetDB to improve performance on analytical workloads.
"Although at this point this code is just an academic prototype and some ease-of-use features are yet to be implemented, I hope that this code will nonetheless be useful for your structured data analysis tasks!" Abadi said.
The scribe log collection system
In a distributed environment such as Hadoop, viewing or analyzing Hadoop's own logs and the logs of the jobs you have run is often a pain, and what the current Hadoop web monitoring tools provide only goes so far. To address this problem, the Hadoop project has a subproject called Chukwa (http://www.jaso.co.kr/332).
Facebook has also released scribe, a system that provides log collection in distributed environments. scribe works by plugging a custom Appender into a log4x-style logging framework so that log records are forwarded to a designated server; a minimal client sketch follows the links below.
For details, see the following URLs:
http://developers.facebook.com/scribe/
http://www.cloudera.com/blog/2008/10/28/installing-scribe-for-log-collection/
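As a minimal example of what sending a log message to a scribe server looks like (assuming the Thrift-generated Python bindings for scribe are installed; the host, port, and category name below are illustrative):

```python
# Minimal sketch of logging to a scribe server over Thrift.
# Assumes the Thrift-generated `scribe` Python bindings are on the path;
# host, port, and the category name are illustrative.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from scribe import scribe

socket = TSocket.TSocket(host="localhost", port=1463)   # scribe's default port
transport = TTransport.TFramedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(transport, strictRead=False, strictWrite=False)
client = scribe.Client(protocol, protocol)

transport.open()
entry = scribe.LogEntry(category="hadoop_jobs", message="job_200908_0001 completed")
client.Log(messages=[entry])                            # ships the entry to the server
transport.close()
```

The same pattern lets Hadoop job and daemon logs be funneled to one central scribe server instead of being scattered across the cluster.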
Benefits of and obstacles to cloud computing
An article titled 'Culture' Biggest Hurdle To Cloud Computing has been posted.
(http://www.informationweek.com/news/software/hosted/showArticle.jhtml?articleID=218900519)
In a survey of decision makers, the benefits respondents see in private cloud computing break down as follows:
improving efficiency 41%
resource scalability 18%
cutting costs 17%
experimenting with cloud computing 15%
improving IT responsiveness 9%
As the figures show, cloud computing is less about cutting costs than about operating resources efficiently and gaining flexibility through that efficiency; run your resources efficiently and, I think, the cost savings follow naturally.
The obstacles to spreading private cloud computing are as follows:
Organizational culture 37%
complexity of managing 26%
security 21%
upfront costs 8%
Organizational culture within IT comes out as the biggest obstacle, and it seems to be no different in Korea. In Korea's case, I suspect a limited understanding of the technology and a shortage of technical skills would be added to the list as well.
Monday, August 3, 2009
SQL SERVER – Introduction to Cloud Computing
Introduction
“Cloud computing,” to put it simply, means “Internet computing.” The Internet is commonly visualized as a cloud; hence the term “cloud computing” for computation done through the Internet. With cloud computing, users can access database resources via the Internet from anywhere, for as long as they need, without worrying about maintaining or managing the actual resources. In addition, databases in the cloud are highly dynamic and scalable.
Cloud computing is unlike grid computing, utility computing, or autonomic computing; it is a very independent computing platform. The best example of cloud computing is Google Apps, where any application can be accessed through a browser and can be deployed on thousands of computers over the Internet.
Key Characteristics
Cloud computing is cost-effective: initial and recurring expenses are much lower than in traditional computing, and maintenance costs shrink because a third party maintains everything from running the cloud to storing data. The cloud is characterized by platform, location, and device independence, which makes it easy to adopt for businesses of all sizes, small and mid-sized ones in particular. Because data lives on shared networks of computer systems and storage, reliability can be a concern, but the cloud scores well as far as security is concerned: security improves because strong, professionally managed security systems are now easily available and affordable. Yet another important characteristic of the cloud is scalability, which is achieved through server virtualization.
In a nutshell, cloud computing means getting the best performing system with the best value for money.
Cloud Computing Architecture
Cloud computing architecture, just like any other system, is categorized into two main sections: the front end and the back end. The front end is the end user, client, or application (a web browser, for example) that uses the cloud's services. The back end is the network of servers running the programs and data storage systems. The cloud is usually assumed to offer effectively unlimited storage capacity for any software available in the market, and different applications are hosted on their own dedicated server farms.
The cloud has a centralized server administration system. The central servers administer the system, balance client load, adjust to demand, monitor traffic, and avoid congestion. They follow a set of protocols, commonly known as middleware, which allows the networked computers within the cloud to communicate with one another.
Cloud architecture rests on an important assumption, which is mostly true: client demand for resources is not constant. Because of this, the cloud's servers cannot run at full capacity all the time. To avoid that waste, server virtualization is applied: each physical server is virtualized and runs multiple virtual servers hosting the same or different applications. Because one physical machine acts as many, this curtails the need for more physical machines.
Data is the most important part of cloud computing, so data security is the top priority in all of the cloud's data operations. All data is backed up at multiple locations, which multiplies the amount of storage used compared with a regular system, but this redundancy of data is a must-have attribute of cloud computing.
Different forms of Cloud Computing
Google Apps, Salesforce.com, Zoho Office, and various other online applications use cloud computing in the Software-as-a-Service (SaaS) model. These applications are delivered through the browser, and multiple customers can access them from various locations. This model has become the most common form of cloud computing because it is beneficial and practical for both customers and service providers: customers make no upfront investment and can pay as they go and pay as they grow, while service providers can grow easily as their customer base grows.
Amazon.com, Sun, and IBM offer on-demand storage and computing resources. Web services and APIs let developers use the cloud over the Internet and build large-scale, full-featured applications on top of it. The cloud is not limited to providing data storage or raw computing resources; it can also provide managed services or specific application services over the web.
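As a small illustration of what using the cloud through web services and APIs looks like in practice, here is a minimal sketch against Amazon S3 using the boto Python library; the bucket and key names are made up, and it assumes AWS credentials are already configured:

```python
# Minimal sketch: storing and retrieving an object in Amazon S3 via the boto
# library. Bucket and key names are illustrative; assumes AWS credentials are
# configured in the environment or in boto's config file.
import boto

conn = boto.connect_s3()

# Create a bucket and upload an object purely through web-service calls.
bucket = conn.create_bucket("example-analytics-archive")
key = bucket.new_key("reports/2009-08-04.csv")
key.set_contents_from_string("region,revenue\nus-east,1234\n")

# The same object can now be read back from anywhere with the right credentials.
print(key.get_contents_as_string())
```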
Cloud Computing Concerns
Security of confidential data (e.g., Social Security or credit card numbers) is a very important area of concern, because unauthorized access to it can create very big problems. Misuse of data can create big issues; hence, in cloud computing it is very important to know who the data administrators are and the extent of their data access rights. Large organizations dealing with sensitive data often have well-laid-out regulatory compliance policies, but these policies should be verified before the data moves into cloud computing. A cloud computing network may use resources located in another country, or resources that are not fully protected; hence the need for appropriate regulatory compliance policies.
In cloud computing it is very common to store data from multiple customers in one common location, so providers need proper techniques to segregate that data for security and confidentiality; care must be taken to ensure that one customer's data does not affect another customer's data. In addition, cloud computing providers must be equipped with proper disaster recovery policies to deal with any unfortunate event.
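One common way to enforce that kind of segregation (a generic technique, not something the article prescribes) is to scope every query by a tenant identifier; a minimal sketch with an in-memory database and made-up tenant names:

```python
# Minimal sketch of per-tenant data segregation in a shared, multi-tenant store.
# Table contents and tenant names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (tenant_id TEXT, order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("acme", 1, 10.0), ("acme", 2, 25.0), ("globex", 1, 99.0)])

def orders_for(tenant_id):
    # Every query is scoped to a single tenant, so no cross-tenant rows can leak.
    cur = conn.execute(
        "SELECT order_id, amount FROM orders WHERE tenant_id = ?", (tenant_id,))
    return cur.fetchall()

print(orders_for("acme"))    # [(1, 10.0), (2, 25.0)]
print(orders_for("globex"))  # [(1, 99.0)]
```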
Selection of Provider
A good service provider is the key to good service, so it is imperative to select the right one. Make sure the provider is reliable, well reputed for customer service, and has a proven track record in IT-related ventures. The Cloud Computing Incidents Database (CCID) records and monitors verifiable, noteworthy events that affect cloud computing providers; visit the following wiki link to obtain the list of all such events: http://wiki.cloudcommunity.org/wiki/CCID
Relational Database and Cloud Computing
Comparisons are often drawn between relational databases and cloud computing. The two are related, but they should not be confused with each other, and in fact they are not really competing. Some applications do not call for advanced query techniques but do need fast access to data; such scenarios are a good fit for the cloud. In the cloud, however, data is stored across myriad geographic locations, and pulling data from different geographic databases introduces delays, so applications that need to process a huge database with complex queries are still best served by a traditional relational database. The cloud has its limitations: for now it only supports distributed computing, and transactional operations are not currently supported.
Summary
Cloud computing is the next big thing in the arena of computing and storage. There are some concerns about security and availability, but service providers are coming up with various solutions and suggestions in response to customers' concerns. In any case, the cloud is getting bigger and better, and as long as its services are available over the web at a reasonable price and without capital infrastructure investment, it is sure to proliferate and create robust demand in the times to come.
Additional Reads
While writing this article, I really enjoyed reading the Cloud Computing Manifesto (http://wiki.cloudcommunity.org/wiki/Cloud_Computing_Manifesto), a public declaration of principles and intentions for cloud computing. The manifesto suggests 10 principles of cloud computing, namely User centric, Philanthropic, Openness, Transparency, Interoperability, Representation, Discrimination, Evolution, Balance, and Security.
Reference: Pinal Dave (http://blog.SQLAuthority.com), Dotnetslakers