Performance Comparison of AWS Athena and Google BigQuery
Traditional servers provide IT services where machines host the applications. But these enterprise servers come with disadvantages and challenges such as the high cost and the necessity for space in house data center or cloud and database admin for their maintenance. Also, the average life of these servers is about 3-5 years which means you have to replace or upgrade your machine frequently and are constrained to the limitations of the machine.
A few more challenges faced in this traditional approach are:
- Planning capacity for storage and compute.
- Handling different data formats, structured and unstructured (CSV, JSON, Parquet, Avro and so on).
- Learning complex programming models and languages like Map Reduce, Spark, Scala and others.
- Keeping costs under control.
- Managing availability, performance and security.
- Analyzing required data in Amazon S3.
- Having access only to aggregated data sets.
- Managing Hadoop cluster or data warehouse requirements.
But, with the advancements in the technology, we are now introduced to Serverless Architecture, the solution to overcome these constraints and disadvantages. This article describes how Serverless Interactive Query (SIQ) service improves the performance of business analytical query using standard SQL. This service is built on Apache Hive and Google’s query engine Presto.
Apache Kafka or Kinesis streams store high velocity data into a S3 bucket. S3 bucket storage is highly scalable, cheap and can be useful for storing data dump and logs for an extended period of time. The serverless interactive query service can directly query this data and create analytics without traditional NoSQL (Mongo, Cassandra) or relational database (Oracle, MSSQL) platforms to deliver all of those capabilities at a lower cost and higher return.
This approach includes the benefits listed below:
- Complex queries performance.
- Data compression.
- Supports different formats of data.
- Supports standard SQL for queries.
- Minimal infrastructure cost.
- No cluster and data warehouses.
Tools Available for Maximum Query Performance
Nowadays, unstructured databases are using JSON based document and this will grow in the coming years. However, there is a need to query and search these documents with high throughput and low latency. The volume of data is growing exponentially and so is the
As companies expand complexity of data replication setup and ingestion for big data analytics in the future.the usage of stored data, traditional methods prove to be inefficient. Hence, complex query performance is imperative, considering the challenging factors such as volume, variety, complexity and variability of data, delivering rapid query performance and so on.
SIQ services provide you with facilities to query the data stored in different formats with no infrastructure to manage and you can focus on analyzing data to find meaningful insights using familiar SQL. Among such tools and technologies, Athena (Amazon Web Services) and BigQuery (Google) are two of the best data analytics services available in cloud
Amazon Web Services (AWS) Athena
Let us take a look at the features and advantages of Amazon Athena:
- Serverless Analytical Columnar Database based on Facebook’s Presto.
- External Tables (*SV, JSON, ORC, PARQUET files in S3 bucket).
- Store files on S3 bucket.
- Convert to columnar and compressed format using EMR.
- ANSI SQL 2011.
- Priced at USD 5 per TB of scanned data and standard S3 storage and operation costs.
- Cost optimization in converting data into columnar format, partitioning and limiting queried columns
Listed below are the features and advantages of Google BigQuery:
- Serverless Analytical Columnar Database based on Google Dremel.
- Native Tables.
- External Tables (*SV, JSON, AVRO files stored in Google cloud storage bucket).
- File imports.
- Streaming API
- Federated Tables (files in bucket, Bigtable documentation or Google Spreadsheet).
- ANSI SQL 2011.
- Priced at USD 5 per TB of scanned data, including storage and streaming (if used).
- Cost optimization in partitioning and limiting queried columns, 24-hour cache and cold data.
Comparison of AWS Athena and Google BigQuery
Features of AWS Athena:
- Partition of data supported, used to reduce the amount of data scanned by query, thus, reducing costs further.
- Compression and conversion of data to open source columnar format results in greater performance and reduced cost.
- Infrastructure set up is not required.
- Clusters and data warehouses are not needed.
- S3 bucket location path data is only required.
- Schema for the table need to be specified, once the table is created, data analysis can be performed using SQL.
- Read only feature is available, hence, permits only Select (Update & delete cannot be performed).
- Save cost by compress, columnar format and partition in S3 storage.
- Result can be stored in S3 storage (Cost for load).
- Pay per query (data scan).
- DDL query is free.
- AWS QuickSight can be used to view query results and these query results can be downloaded.
- SQL client can connect to Athena server.
- Accessibility to JDBC.
- Spark can be used to convert into PARQUET and ORC format.
- No ETL required.
- No loading of data to server.
- Presto (SQL) and Hive (DDL).
- AVRO support will be launched soon.
When you manage the Business Intelligence (BI) application data, you are controlling the quality of the results. The infrastructure department concerned about quality can make sure that the data is properly prepared, stored, secured in appropriate data platforms and can build standardized systems and scalable self-service reporting. This entire process will take time to stabilize if the data growth is high or the infrastructure team is busy with other projects
Features of Google BigQuery
- Just upload your data and run SQL.
- No cluster deployment, no virtual machines, no setting keys or indexes, and no software.
- Built-in multitenancy
- No need to deploy multiple clusters and duplicate data into each one. Manage permissions on projects and datasets with access control lists. Seamlessly scales with usage.
- Separate storage and computing.
- Storage scales to Petabytes.
- Compute scales with usage, without cluster resizing.
- Thousands of cores are used per query.
- Only pay for what you use, not what you deploy.
- Deployed across multiple data centers by default, with multiple factors of replication to optimize maximum data durability and service uptime.
- Analyze terabytes of data in seconds.
- Stream millions of rows per second for real-time analysis.
Data Formats Support
*SV, JSON, PARQUET/z, ORC/z
External (*SV, JSON, AVRO)/Native
ANSI SQL support
On Any Key
User Defined Function
$5/TB (scanned) plus S3 storage
$5/TB (scanned) less cached data
Few observations: AWS Athena and Google BigQuery
- Tables can be partitioned on any column.
- Regex can be used to define a schema.
- Federated/external tables query is faster than BigQuery
- Partitioned/columnar format is faster and cheaper than BigQuery
- No streaming support.
- Converting data format is not easy.
- Performance issue with large datasets.
- Supports more features for native table along with performance, also querying is fast.
- Converting from non-columnar formats into a native table is easy for columnar queries
- Petabyte scale queries can be performed well.
- Supports UDFs and nested tables.
System-wide Cross-Platform Data Processing Performance Benchmark:
In order to benchmark the resource utilization for executing BI queries, we used the below test environments:
- Commodity Hardware (RAM-8GB, DISK-450GB): Excel document.
- Azure (RAM-28GB, CPU-4, DISK-240GB, MySQL 5.5.38 Community Version): MySQL database instance.
- AWS (Ec2) (Ec2 RAM-8GB, CPU-2, DISK-30GB, PostgreSQL 9.2.18): PostgreSQL database instance.
- AWS (Athena): Configured on AWS.
- AWS (QuickSight): Configured on AWS.
Sample data is inserted in three different formats (Excel, MySQL & PostgreSQL) in AWS S3 storage. Complex query with inner/outer joins, ‘where’ and ‘group by’ clauses are executed on (Excel, PostgreSQL & MySQL) data files from Athena environment.
Test data: We used text data from our internal alert and monitoring tool for monitoring a network with more than 1000+ elements.
Test Data Size:
Format: csv, gz, txt
Community Hardware: Excel document (incident_open.csv) 383MB
MySQL table dump from Azure VM: incident_MQ.txt (1.3GB), incident_open.txt (915MB)
PostgreSQL table compress dump from Amazon VM: incident_pg.txt.gz (172B)
Interacting with AWS Athena:
Web User Interface:
- Run queries and examine results.
- Manage databases and tables.
- Save queries and share across organization for re-use.
- Query history.
- Programmatic way to access AWS Athena.
- SQL Workbench, JetBrains DataGrip, sqlline.
- Your own app.
- Visualize Athena data with charts, pivots and dashboards.
Service Limits for AWS Athena:
- Only one query can be submitted at a time and it supports 5 concurrent queries per account.
- Queries will timeout in 30 minutes.
- Athena supports 100 databases.
- Only 100 tables per database.
- 20k partitions per table is recommended.
You may encounter a limit for Amazon S3 buckets per account, which is 100.
AWS Athena continues to innovate for SIQ services to provide the highest query performance with minimal cost. It processes the hottest data to achieve high performance without the need to restrict the available data. Athena Intelligent Service exploits the full scope of data (structured/non-structured) available by keeping cooler data economically stored on S3 storage. In addition, Athena delivers many other features and capabilities that provide high query performance. These include the industry’s best optimizer, efficient indexes, parallel query processing and several intelligent scan techniques to reduce the amount of data that must read during query processing.
AWS Athena is certainly the next best option to obtain maximized query performance with minimal infrastructure cost. What are your views on AWS Athena and Google BigQuery?
Learn more about Mindtree Amazon Web Services (AWS)