Hadoop and Spark Multi-Node Cluster on Dataproc

Google offers a managed Spark and Hadoop service called Google Cloud Dataproc. It is almost the same as Amazon’s EMR. You can use the Dataproc service to create a Hadoop and Spark cluster in less than two minutes.

  • Set up your Google Cloud account and a default project.
  • Open the Google Cloud Console.
  • Go to the Products and services menu.
  • Scroll down to Dataproc and select Clusters.
  • Click the Create cluster button.
  • Give your cluster a name and choose your region.
  • Choose a machine type for your master node.
  • Select a cluster type.
  • Select the disk size for the master node.
  • Choose the worker (data) node configuration and the number of workers.
  • Select the disk size for each worker.
  • Select an initialization action (optional).
  • Click the Create button.
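
The same steps can also be scripted with the gcloud command-line tool. The following is a minimal sketch, not the only way to do it. It assumes a POSIX shell (for example Cloud Shell or Git Bash) and a cluster named spark-6, inferred from the master host name spark-6-m used later in this post. The region, machine type, disk size and worker count are illustrative assumptions. The script only prints the command unless you set RUN=1.

```shell
#!/bin/sh
# Sketch only: the values below are assumptions, not settings from the console steps.
CLUSTER_NAME="spark-6"        # assumed; the master node is then named spark-6-m
REGION="asia-east1"           # assumed region that contains zone asia-east1-a
ZONE="asia-east1-a"
MACHINE_TYPE="n1-standard-2"  # illustrative machine type
DISK_SIZE="100GB"             # illustrative boot disk size
NUM_WORKERS=2                 # illustrative worker count

# Print the command by default; set RUN=1 to really execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

create_cluster() {
  run gcloud dataproc clusters create "$CLUSTER_NAME" \
    --region="$REGION" \
    --zone="$ZONE" \
    --master-machine-type="$MACHINE_TYPE" \
    --master-boot-disk-size="$DISK_SIZE" \
    --num-workers="$NUM_WORKERS" \
    --worker-machine-type="$MACHINE_TYPE" \
    --worker-boot-disk-size="$DISK_SIZE"
  # An initialization action can be added with --initialization-actions=<Cloud Storage URI>.
}

create_cluster
```

Review the printed command first, then run the script again with RUN=1 to create the cluster. In a Windows command prompt the same gcloud command works when typed on a single line.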

Get initialization actions:

Wait a minute or two while the Dataproc API provisions your cluster. There is nothing to download and nothing to install. Your Spark cluster should be ready after just a few clicks.
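
If you prefer to check from a terminal whether provisioning has finished, a sketch like the following may help. It assumes a POSIX shell, a cluster named spark-6 and the region asia-east1 (both assumptions, adjust them to your setup). By default it only prints the command. With RUN=1 it asks Dataproc for the cluster state, which should read RUNNING once the cluster is ready.

```shell
#!/bin/sh
CLUSTER_NAME="spark-6"   # assumed cluster name
REGION="asia-east1"      # assumed region

# Print the command by default; set RUN=1 to really execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

cluster_state() {
  # status.state is the state field of the cluster resource.
  run gcloud dataproc clusters describe "$CLUSTER_NAME" \
    --region="$REGION" \
    --format="value(status.state)"
}

cluster_state
```

Note that the printed dry-run line drops the quotes around the --format value, so quote it again if you copy the line into a shell.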

How do you access the Spark cluster in the cloud?

  1. Download and install the Cloud SDK: https://dl.google.com/dl/cloudsdk/channels/rapid/GoogleCloudSDKInstaller.exe?hl=vi

  2. Run the following command in a command prompt:

gcloud compute ssh --zone=asia-east1-a --ssh-flag="-D"  --ssh-flag="10000" --ssh-flag="-N" "spark-6-m"

PuTTY then starts automatically to open the tunnel. Authenticate the user, then start Chrome with the following command:

"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" "http://spark-6-m:8088" ^
    --proxy-server="socks5://localhost:10000" --host-resolver-rules="MAP * 0.0.0.0, EXCLUDE localhost" ^
        --user-data-dir=c:\test\spark-6-m

I am starting chrome.exe with my YARN Resource Manager URL. The next option is the proxy server: Chrome should use the SOCKS5 protocol on port 10000 of my local machine, which is the port where we started the SSH tunnel.
The next flag stops Chrome from resolving DNS names locally, so host names such as spark-6-m are resolved through the proxy. Finally, the last option is a path to a directory that does not exist yet. This option makes Chrome start a brand-new session.
That’s it. You can access the Resource Manager in the new browser window.
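
The tunnel can also be checked without a browser. The sketch below assumes that curl is installed, that the SSH tunnel from step 2 is still listening on local port 10000, and that the master node is named spark-6-m. It requests the cluster info endpoint of the YARN Resource Manager REST API through the SOCKS5 proxy. The --socks5-hostname option lets the proxy resolve the host name, which mirrors what the Chrome flags above do. The script only prints the command unless you set RUN=1.

```shell
#!/bin/sh
PROXY="localhost:10000"   # local end of the SSH tunnel
MASTER="spark-6-m"        # master node name used in this post

# Print the command by default; set RUN=1 to really execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

check_resource_manager() {
  # /ws/v1/cluster/info returns general information about the YARN cluster.
  run curl --silent --socks5-hostname "$PROXY" \
    "http://$MASTER:8088/ws/v1/cluster/info"
}

check_resource_manager
```

If the tunnel is working, running the script with RUN=1 should return a short JSON document describing the Resource Manager.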
This walkthrough showed how to get a Spark cluster that you can access over the web and over SSH. Execute your jobs, play with the cluster, and later go back to your Dataproc clusters list and delete it. At the time of writing there was no option to keep it there in a shut-down state. Creating and removing a cluster takes only a few clicks. You can create a new one every day, use it, and then throw it away.
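
The run-then-delete cycle can be sketched from the command line as well. The example below assumes a POSIX shell, a cluster named spark-6 in the region asia-east1 (both assumptions), and the SparkPi example that ships in the Spark examples jar on Dataproc images. It prints both commands by default. Set RUN=1 to submit the job and then delete the cluster.

```shell
#!/bin/sh
CLUSTER_NAME="spark-6"   # assumed cluster name
REGION="asia-east1"      # assumed region

# Print the command by default; set RUN=1 to really execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

submit_spark_pi() {
  # Everything after the bare -- is passed to the job as its arguments.
  run gcloud dataproc jobs submit spark \
    --cluster="$CLUSTER_NAME" \
    --region="$REGION" \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
}

delete_cluster() {
  # gcloud asks for confirmation before it deletes the cluster.
  run gcloud dataproc clusters delete "$CLUSTER_NAME" --region="$REGION"
}

submit_spark_pi
delete_cluster
```

Keeping the delete step in the same script makes it harder to forget a running cluster, which keeps being billed until it is removed.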

Further reading: Spark SQL, DataFrames and Datasets Guide

Source: viblo.asia
