Updated Mar-2022 Premium Associate-Developer-Apache-Spark Exam Engine pdf - Download Free Updated 179 Questions [Q84-Q100]

Authentic Associate-Developer-Apache-Spark Dumps With 100% Passing Rate Practice Tests Dumps

NEW QUESTION 84
The code block shown below should return a DataFrame with columns transactionId, predError, value, and f from DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__)

  • A. 1. filter
    2. "transactionId", "predError", "value", "f"
  • B. 1. select
    2. "transactionId, predError, value, f"
  • C. 1. select
    2. col(["transactionId", "predError", "value", "f"])
  • D. 1. select
    2. ["transactionId", "predError", "value", "f"]
  • E. 1. where
    2. col("transactionId"), col("predError"), col("value"), col("f")

Answer: D

Explanation:
Explanation
Correct code block:
transactionsDf.select(["transactionId", "predError", "value", "f"])
DataFrame.select() returns the specified columns from the DataFrame. It accepts column names either as separate arguments or as a single list, so passing the list is the correct choice here. The option using col(["transactionId", "predError", "value", "f"]) is invalid, since col() accepts only a single column name, not a list. Likewise, specifying all columns in a single string like "transactionId, predError, value, f" is not valid syntax.
filter and where filter rows based on conditions; they do not control which columns are returned.
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 85
In which order should the code blocks shown below be run in order to read a JSON file from location jsonPath into a DataFrame and return only the rows that do not have value 3 in column productId?
1. importedDf.createOrReplaceTempView("importedDf")
2. spark.sql("SELECT * FROM importedDf WHERE productId != 3")
3. spark.sql("FILTER * FROM importedDf WHERE productId != 3")
4. importedDf = spark.read.option("format", "json").path(jsonPath)
5. importedDf = spark.read.json(jsonPath)

  • A. 5, 1, 3
  • B. 4, 1, 3
  • C. 5, 1, 2
  • D. 4, 1, 2
  • E. 5, 2

Answer: C

Explanation:
Explanation
Correct code block:
importedDf = spark.read.json(jsonPath)
importedDf.createOrReplaceTempView("importedDf")
spark.sql("SELECT * FROM importedDf WHERE productId != 3")
Option 5 is the only correct way listed of reading in a JSON file in PySpark. option("format", "json") is not the correct way to tell Spark's DataFrameReader that you want to read a JSON file; you would use format("json") instead. Also, you pass the path of the JSON file to the DataFrameReader using the load() method, not a path() method.
In order to use a SQL command through the SparkSession spark, you first need to create a temporary view through DataFrame.createOrReplaceTempView().
The SQL statement should start with the SELECT operator. The FILTER operator SQL provides is not the correct one to use here.
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 86
The code block displayed below contains an error. The code block should produce a DataFrame with color as the only column and three rows with color values of red, blue, and green, respectively.
Find the error.
Code block:
spark.createDataFrame([("red",), ("blue",), ("green",)], "color")

  • A. Instead of calling spark.createDataFrame, just DataFrame should be called.
  • B. Instead of color, a data type should be specified.
  • C. The commas in the tuples with the colors should be eliminated.
  • D. The colors red, blue, and green should be expressed as a simple Python list, and not a list of tuples.
  • E. The "color" expression needs to be wrapped in brackets, so it reads ["color"].

Answer: E

Explanation:
Explanation
Correct code block:
spark.createDataFrame([("red",), ("blue",), ("green",)], ["color"])
The createDataFrame syntax is not exactly straightforward, but luckily the documentation (linked below) provides several examples on how to use it. It also shows an example very similar to the code block presented here which should help you answer this question correctly.
More info: pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 87
Which of the following code blocks returns a single-column DataFrame showing the number of words in column supplier of DataFrame itemsDf?
Sample of DataFrame itemsDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

  • A. itemsDf.select(size(split("supplier", " ")))
  • B. spark.select(size(split(col(supplier), " ")))
  • C. itemsDf.split("supplier", " ").count()
  • D. itemsDf.select(word_count("supplier"))
  • E. itemsDf.split("supplier", " ").size()

Answer: A

Explanation:
Explanation
Output of correct code block:
+----------------------------+
|size(split(supplier,  , -1))|
+----------------------------+
|                           3|
|                           1|
|                           3|
+----------------------------+
This question shows a typical use case for the split command: Splitting a string into words. An additional difficulty is that you are asked to count the words. Although it is tempting to use the count method here, the size method (as in: size of an array) is actually the correct one to use. Familiarize yourself with the split and the size methods using the linked documentation below.
More info:
Split method: pyspark.sql.functions.split - PySpark 3.1.2 documentation
Size method: pyspark.sql.functions.size - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 88
Which of the following code blocks returns a DataFrame that has all columns of DataFrame transactionsDf and an additional column predErrorSquared which is the squared value of column predError in DataFrame transactionsDf?

  • A. transactionsDf.withColumnRenamed("predErrorSquared", pow(predError, 2))
  • B. transactionsDf.withColumn("predErrorSquared", pow(col("predError"), lit(2)))
  • C. transactionsDf.withColumn("predError", pow(col("predErrorSquared"), 2))
  • D. transactionsDf.withColumn("predErrorSquared", pow(predError, lit(2)))
  • E. transactionsDf.withColumn("predErrorSquared", "predError"**2)

Answer: B

Explanation:
Explanation
While only one of these code blocks works, the DataFrame API is pretty flexible when it comes to accepting columns into the pow() method. The following code blocks would also work:
transactionsDf.withColumn("predErrorSquared", pow("predError", 2))
transactionsDf.withColumn("predErrorSquared", pow("predError", lit(2)))
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/26.html, https://bit.ly/sparkpracticeexams_import_instructions)

 

NEW QUESTION 89
Which of the following statements about Spark's DataFrames is incorrect?

  • A. Spark's DataFrames are equal to Python's DataFrames.
  • B. Spark's DataFrames are immutable.
  • C. The data in DataFrames may be split into multiple chunks.
  • D. Data in DataFrames is organized into named columns.
  • E. RDDs are at the core of DataFrames.

Answer: A

Explanation:
Explanation
Spark's DataFrames are equal to Python's or R's DataFrames.
No, they are not equal, only similar. A major difference is that Spark's DataFrames are distributed, whereas Python's DataFrames (for example, pandas DataFrames) are not.

 

NEW QUESTION 90
Which of the following describes a difference between Spark's cluster and client execution modes?

  • A. In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.
  • B. In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.
  • C. In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.
  • D. In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.
  • E. In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client mode.

Answer: D

Explanation:
Explanation
In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.
Correct. The idea of Spark's client mode is that workloads can be executed from an edge node, also known as gateway machine, from outside the cluster. The most common way to execute Spark however is in cluster mode, where the driver resides on a worker node.
In practice, in client mode, data transfer between the edge node and the cluster is tightly constrained compared with the data transfer speed between worker nodes inside the cluster. Also, any job that is executed in client mode will fail if the edge node fails. For these reasons, client mode is usually not used in a production environment.
In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client execution mode.
No. In both execution modes, the cluster manager may reside on a worker node, but it does not reside on an edge node in client mode.
In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.
This is incorrect. Only the driver runs on gateway nodes (also known as "edge nodes") in client mode, but not the executor processes.
In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.
No, in client mode, the Spark driver is not co-located with the cluster manager. The whole point of client mode is that the driver is outside the cluster and not associated with the resource that manages the cluster (the machine that runs the cluster manager).
In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.
No, it is exactly the opposite: There are no gateway machines in cluster mode, but in client mode, they host the driver.

 

NEW QUESTION 91
Which of the following statements about reducing out-of-memory errors is incorrect?

  • A. Concatenating multiple string columns into a single column may guard against out-of-memory errors.
  • B. Reducing partition size can help against out-of-memory errors.
  • C. Limiting the amount of data being automatically broadcast in joins can help against out-of-memory errors.
  • D. Decreasing the number of cores available to each executor can help against out-of-memory errors.
  • E. Setting a limit on the maximum size of serialized data returned to the driver may help prevent out-of-memory errors.

Answer: A

Explanation:
Explanation
Concatenating multiple string columns into a single column may guard against out-of-memory errors.
Exactly, this statement is incorrect! Concatenating string columns does not reduce the size of the data, it just structures it in a different way. This has little effect on how Spark processes the data and definitely does not reduce out-of-memory errors.
Reducing partition size can help against out-of-memory errors.
No, this is not incorrect. Reducing partition size is a viable way to aid against out-of-memory errors, since executors need to load partitions into memory before processing them. If the executor does not have enough memory available to do that, it will throw an out-of-memory error. Decreasing partition size can therefore be very helpful for preventing that.
Decreasing the number of cores available to each executor can help against out-of-memory errors.
No, this is not incorrect. To process a partition, this partition needs to be loaded into the memory of an executor. If you imagine that every core in every executor processes a partition, potentially in parallel with other executors, you can imagine that memory on the machine hosting the executors fills up quite quickly. So, memory usage of executors is a concern, especially when multiple partitions are processed at the same time. To strike a balance between performance and memory usage, decreasing the number of cores may help against out-of-memory errors.
Setting a limit on the maximum size of serialized data returned to the driver may help prevent out-of-memory errors.
No, this is not incorrect. When using commands like collect() that trigger the transmission of potentially large amounts of data from the cluster to the driver, the driver may experience out-of-memory errors. One strategy to avoid this is to be careful about using commands like collect() that send back large amounts of data to the driver. Another strategy is setting the parameter spark.driver.maxResultSize. If data to be transmitted to the driver exceeds the threshold specified by the parameter, Spark will abort the job and therefore prevent an out-of-memory error.
Limiting the amount of data being automatically broadcast in joins can help against out-of-memory errors.
Wrong, this is not incorrect. As part of Spark's internal optimization, Spark may choose to speed up operations by broadcasting (usually relatively small) tables to executors. This broadcast is happening from the driver, so all the broadcast tables are loaded into the driver first. If these tables are relatively big, or multiple mid-size tables are being broadcast, this may lead to an out-of-memory error. The maximum table size for which Spark will consider broadcasting is set by the spark.sql.autoBroadcastJoinThreshold parameter.
More info: Configuration - Spark 3.1.2 Documentation and Spark OOM Error - Closeup. Does the following look familiar when... | by Amit Singh Rathore | The Startup | Medium

 

NEW QUESTION 92
Which of the following code blocks performs a join in which the small DataFrame transactionsDf is sent to all executors where it is joined with DataFrame itemsDf on columns storeId and itemId, respectively?

  • A. itemsDf.join(transactionsDf, broadcast(itemsDf.itemId == transactionsDf.storeId))
  • B. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "broadcast")
  • C. itemsDf.join(broadcast(transactionsDf), itemsDf.itemId == transactionsDf.storeId)
  • D. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "right_outer")
  • E. itemsDf.merge(transactionsDf, "itemsDf.itemId == transactionsDf.storeId", "broadcast")

Answer: C

Explanation:
Explanation
The issue with all answers that have "broadcast" as very last argument is that "broadcast" is not a valid join type. While the entry with "right_outer" is a valid statement, it is not a broadcast join. The item where broadcast() is wrapped around the equality condition is not valid code in Spark. broadcast() needs to be wrapped around the name of the small DataFrame that should be broadcast.
More info: Learning Spark, 2nd Edition, Chapter 7
Static notebook | Dynamic notebook: See test 1

 

NEW QUESTION 93
Which of the following describes characteristics of the Spark UI?

  • A. Some of the tabs in the Spark UI are named Jobs, Stages, Storage, DAGs, Executors, and SQL.
  • B. Via the Spark UI, stage execution speed can be modified.
  • C. Via the Spark UI, workloads can be manually distributed across executors.
  • D. There is a place in the Spark UI that shows the property spark.executor.memory.
  • E. The Scheduler tab shows how jobs that are run in parallel by multiple users are distributed across the cluster.

Answer: D

Explanation:
Explanation
There is a place in the Spark UI that shows the property spark.executor.memory.
Correct, you can see Spark properties such as spark.executor.memory in the Environment tab.
Some of the tabs in the Spark UI are named Jobs, Stages, Storage, DAGs, Executors, and SQL.
Wrong - Jobs, Stages, Storage, Executors, and SQL are all tabs in the Spark UI. DAGs can be inspected in the "Jobs" tab in the job details or in the Stages or SQL tab, but are not a separate tab.
Via the Spark UI, workloads can be manually distributed across executors.
No, the Spark UI is meant for inspecting the inner workings of Spark which ultimately helps understand, debug, and optimize Spark transactions.
Via the Spark UI, stage execution speed can be modified.
No, see above.
The Scheduler tab shows how jobs that are run in parallel by multiple users are distributed across the cluster.
No, there is no Scheduler tab.

 

NEW QUESTION 94
Which of the following code blocks returns a DataFrame with a single column in which all items in column attributes of DataFrame itemsDf are listed that contain the letter i?
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+

  • A. itemsDf.select(explode("attributes")).filter("attributes_exploded".contains("i"))
  • B. itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(attributes_exploded.contains("i"))
  • C. itemsDf.select(col("attributes").explode().alias("attributes_exploded")).filter(col("attributes_exploded").co
  • D. itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(col("attributes_exploded").contain
  • E. itemsDf.explode(attributes).alias("attributes_exploded").filter(col("attributes_exploded").contains("i"))

Answer: D

Explanation:
Explanation
Result of correct code block:
+-------------------+
|attributes_exploded|
+-------------------+
|             winter|
|            cooling|
+-------------------+
To solve this question, you need to know about explode(). This operation helps you to split up arrays into single rows. If you did not have a chance to familiarize yourself with this method yet, find more examples in the documentation (link below).
Note that explode() is a method made available through pyspark.sql.functions - it is not available as a method of a DataFrame or a Column, as written in some of the answer options.
More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 95
The code block shown below should return a DataFrame with only columns from DataFrame transactionsDf for which there is a corresponding transactionId in DataFrame itemsDf. DataFrame itemsDf is very small and much smaller than DataFrame transactionsDf. The query should be executed in an optimized way. Choose the answer that correctly fills the blanks in the code block to accomplish this.
__1__.__2__(__3__, __4__, __5__)

  • A. 1. transactionsDf
    2. join
    3. itemsDf
    4. transactionsDf.transactionId==itemsDf.transactionId
    5. "anti"
  • B. 1. transactionsDf
    2. join
    3. broadcast(itemsDf)
    4. transactionsDf.transactionId==itemsDf.transactionId
    5. "outer"
  • C. 1. transactionsDf
    2. join
    3. broadcast(itemsDf)
    4. "transactionId"
    5. "left_semi"
  • D. 1. itemsDf
    2. broadcast
    3. transactionsDf
    4. "transactionId"
    5. "left_semi"
  • E. 1. itemsDf
    2. join
    3. broadcast(transactionsDf)
    4. "transactionId"
    5. "left_semi"

Answer: C

Explanation:
Explanation
Correct code block:
transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")
This question is extremely difficult and exceeds the difficulty of questions in the exam by far.
A first indication of what is asked from you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf - the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.
When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([...]). The DataFrame class has no broadcast() method, so this answer option can be eliminated as well.
Both remaining answer options resolve to transactionsDf.join([...]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, whereas a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.

 

NEW QUESTION 96
The code block displayed below contains one or more errors. The code block should load parquet files at location filePath into a DataFrame, only loading those files that have been modified before 2029-03-20 05:44:46. Spark should enforce a schema according to the schema shown below. Find the error.
Schema:
root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)
Code block:
schema = StructType([
    StructType("itemId", IntegerType(), True),
    StructType("attributes", ArrayType(StringType(), True), True),
    StructType("supplier", StringType(), True)
])

spark.read.options("modifiedBefore", "2029-03-20T05:44:46").schema(schema).load(filePath)

  • A. The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
  • B. Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.
  • C. Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
  • D. Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.
  • E. The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.

Answer: B

Explanation:
Explanation
Correct code block:
schema = StructType([
    StructField("itemId", IntegerType(), True),
    StructField("attributes", ArrayType(StringType(), True), True),
    StructField("supplier", StringType(), True)
])
spark.read.options(modifiedBefore="2029-03-20T05:44:46").schema(schema).parquet(filePath)
This question is more difficult than what you would encounter in the exam. In the exam, for this question type, only one error needs to be identified and not "one or multiple" as in the question.
Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.
Correct! Columns in the schema definition should use the StructField type. Building a schema from pyspark.sql.types, as here using classes like StructType and StructField, is one of multiple ways of expressing a schema in Spark. A StructType always contains a list of StructFields (see documentation linked below). So, nesting StructType and StructType as shown in the question is wrong.
The modification date threshold should be specified by a keyword argument like options(modifiedBefore="2029-03-20T05:44:46") and not two consecutive non-keyword arguments as in the original code block (see documentation linked below).
Spark cannot identify the file format correctly, because either it has to be specified by using the DataFrameReader.format(), as an argument to DataFrameReader.load(), or directly by calling, for example, DataFrameReader.parquet().
Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.
No. If StructField were used for the columns instead of StructType (see above), the third argument would specify whether the column is nullable. The original schema shows that columns should be nullable and this is specified correctly by the third argument being True in the schema in the code block.
It is correct, however, that the modification date threshold is specified incorrectly (see above).
The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
Wrong. The attributes array is specified correctly, following the syntax for ArrayType (see linked documentation below). That Spark cannot identify the file format is correct, see correct answer above. In addition, the DataFrameReader is called correctly through the SparkSession spark.
Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
Incorrect. The object types in the schema definition are indeed wrong (see above), but the DataFrameReader itself is called correctly through the SparkSession spark.
The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
False. The data type of the schema is StructType and an accepted data type for the DataFrameReader.schema() method. It is correct however that the modification date threshold is specified incorrectly (see correct answer above).

 

NEW QUESTION 97
Which of the following code blocks returns a DataFrame where columns predError and productId are removed from DataFrame transactionsDf?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|f   |
+-------------+---------+-----+-------+---------+----+
|1            |3        |4    |25     |1        |null|
|2            |6        |7    |2      |2        |null|
|3            |3        |null |25     |3        |null|
+-------------+---------+-----+-------+---------+----+

  • A. transactionsDf.drop(["predError", "productId", "associateId"])
  • B. transactionsDf.drop("predError", "productId", "associateId")
  • C. transactionsDf.drop(col("predError", "productId"))
  • D. transactionsDf.dropColumns("predError", "productId", "associateId")
  • E. transactionsDf.withColumnRemoved("predError", "productId")

Answer: B

Explanation:
Explanation
The key here is to understand that columns that are passed to DataFrame.drop() are ignored if they do not exist in the DataFrame. So, passing column name associateId to transactionsDf.drop() does not have any effect.
Passing a list to transactionsDf.drop() is not valid. The documentation (link below) shows the call structure as DataFrame.drop(*cols). The * means that all arguments that are passed to DataFrame.drop() are read as columns. However, since a list of columns, for example ["predError", "productId", "associateId"], is not a column, Spark will run into an error.
More info: pyspark.sql.DataFrame.drop - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1

 

NEW QUESTION 98
Which of the following describes the role of tasks in the Spark execution hierarchy?

  • A. Within one task, the slots are the unit of work done for each partition of the data.
  • B. Tasks are the smallest element in the execution hierarchy.
  • C. Tasks are the second-smallest element in the execution hierarchy.
  • D. Tasks with wide dependencies can be grouped into one stage.
  • E. Stages with narrow dependencies can be grouped into one task.

Answer: B

Explanation:
Explanation
Stages with narrow dependencies can be grouped into one task.
Wrong, tasks with narrow dependencies can be grouped into one stage.
Tasks with wide dependencies can be grouped into one stage.
Wrong, since a wide transformation causes a shuffle which always marks the boundary of a stage. So, you cannot bundle multiple tasks that have wide dependencies into a stage.
Tasks are the second-smallest element in the execution hierarchy.
No, they are the smallest element in the execution hierarchy.
Within one task, the slots are the unit of work done for each partition of the data.
No, tasks are the unit of work done per partition. Slots help Spark parallelize work. An executor can have multiple slots which enable it to process multiple tasks in parallel.

 

NEW QUESTION 99
The code block shown below should read all files with the file ending .png in directory path into Spark.
Choose the answer that correctly fills the blanks in the code block to accomplish this.
spark.__1__.__2__(__3__).option(__4__, "*.png").__5__(path)

  • A. 1. read()
    2. format
    3. "binaryFile"
    4. "recursiveFileLookup"
    5. load
  • B. 1. open
    2. as
    3. "binaryFile"
    4. "pathGlobFilter"
    5. load
  • C. 1. read
    2. format
    3. binaryFile
    4. pathGlobFilter
    5. load
  • D. 1. open
    2. format
    3. "image"
    4. "fileType"
    5. open
  • E. 1. read
    2. format
    3. "binaryFile"
    4. "pathGlobFilter"
    5. load

Answer: E

Explanation:
Explanation
Correct code block:
spark.read.format("binaryFile").option("pathGlobFilter", "*.png").load(path)
Spark can deal with binary files, like images. Using the binaryFile format specification in the SparkSession's read API is the way to read in those files. Remember that, to access the read API, you need to start the command with spark.read. The pathGlobFilter option is a great way to filter files by name (and ending). Finally, the path can be specified using the load operator - the open operator shown in one of the answers does not exist.

 

NEW QUESTION 100
......

Verified Pass Associate-Developer-Apache-Spark Exam in First Attempt Guaranteed: https://www.itpassleader.com/Databricks/Associate-Developer-Apache-Spark-dumps-pass-exam.html

Databricks Associate-Developer-Apache-Spark Real Exam Questions Guaranteed Updated Dump from ITPassLeader: https://drive.google.com/open?id=1zcVURZQTA1d-E-TFXge42SZ7UCwI4JIc
