I've been debugging this for hours without success.
I have a standalone Spark cluster and a MinIO server running with Docker Compose, set up following "Adding some MinIO to your standalone Apache Spark cluster" by Vasileios Anagnostopoulos.
My experience so far has been quite different from what that article describes. After fighting through various bugs, I am down to one last problem: the Spark cluster does not pick up the credentials!
I am running everything locally; I am not using an EC2 instance.
I know the AWS SDK looks for credentials in ~/.aws/credentials, in Java system properties, and in environment variables. Since I am not on an EC2 instance, I opted for environment variables.
I have AWS_ACCESS_KEY_ID=theroot and AWS_SECRET_ACCESS_KEY=theroot123 in my docker-compose file for both the master and the worker node.
I have checked inside the containers, and the environment variables are set.
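For reference, a minimal sketch of the relevant docker-compose fragment (the service names and image here are illustrative, not my exact setup):
version: "3"
services:
  spark-master:
    image: bitnami/spark:3.3   # illustrative image; mine is a custom Spark image
    environment:
      - AWS_ACCESS_KEY_ID=theroot
      - AWS_SECRET_ACCESS_KEY=theroot123
  spark-worker:
    image: bitnami/spark:3.3
    environment:
      - AWS_ACCESS_KEY_ID=theroot
      - AWS_SECRET_ACCESS_KEY=theroot123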
I am copying my custom spark-defaults.conf into the container's conf folder. The conf file looks like this:
spark.ui.reverseProxy true
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name directory
spark.hadoop.fs.s3a.committer.staging.tmp.path /tmp/spark_staging
spark.hadoop.fs.s3a.buffer.dir /tmp/spark_local_buf
spark.hadoop.fs.s3a.committer.staging.conflict-mode fail
spark.hadoop.fs.s3a.access.key theroot
spark.hadoop.fs.s3a.secret.key theroot123
spark.hadoop.fs.s3a.endpoint http://my-minio-server:9000
spark.hadoop.fs.s3a.connection.ssl.enabled false
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.attempts.maximum 0
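For completeness: my understanding is that the S3A connector can also be told explicitly which credential provider to use. If I am reading the Hadoop S3A documentation correctly, a line like the following would pin it to the simple access-key/secret-key provider (untested on my side):
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider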
spark-submit command:
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.4 --master spark://127.0.0.1:7077 spark-access-minio.py
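As far as I know, the same S3A settings can also be passed straight on the command line with --conf instead of relying on spark-defaults.conf; a sketch with the same values as above:
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.3.4 \
  --master spark://127.0.0.1:7077 \
  --conf spark.hadoop.fs.s3a.endpoint=http://my-minio-server:9000 \
  --conf spark.hadoop.fs.s3a.access.key=theroot \
  --conf spark.hadoop.fs.s3a.secret.key=theroot123 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  spark-access-minio.py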
Error log:
Caused by: com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))
Caused by: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))
java.nio.file.AccessDeniedException: s3a://mybucket/addresses.csv: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))
What am I missing?
Thanks!
I ended up making it work like this, but I still want to know how to set this up in the conf file.
from pyspark.sql import SparkSession

# Setting the S3A options directly on the builder works,
# even though the same keys in spark-defaults.conf did not.
spark = SparkSession \
    .builder \
    .appName("Test json") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://my-minio-server:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "theroot") \
    .config("spark.hadoop.fs.s3a.secret.key", "theroot123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

log4jLogger = spark.sparkContext._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)

sourceBucket = "mybucket"
inputPath = f"s3a://{sourceBucket}/addresses.csv"
outputPath = f"s3a://{sourceBucket}/output_survey4.csv"

# Note: I originally had .format("s3selectCSV") before .csv(), but the
# trailing .csv() call overrides the format anyway, so plain CSV is what runs.
df = spark.read.option("header", "true").csv(inputPath)
df.write.mode("overwrite").parquet(outputPath)  # writes Parquet despite the .csv suffix in the path
spark.stop()
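As a sanity check that the settings actually reach the driver, something like this (run before spark.stop()) should dump the S3A values the live session resolved, which makes it easy to compare the builder route against the spark-defaults.conf route; a quick sketch using PySpark's JVM bridge:
# Print the S3A settings Spark actually resolved.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.s3a.endpoint"))
print(hadoop_conf.get("fs.s3a.access.key"))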