final class DataFrameWriter[T] extends AnyRef
Interface used to write a Dataset to external storage systems (e.g. file systems, key-value stores, etc.). Use Dataset.write to access this.
- Annotations
- @Stable()
- Source
- DataFrameWriter.scala
- Since
1.4.0
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
bucketBy(numBuckets: Int, colName: String, colNames: String*): DataFrameWriter[T]
Buckets the output by the given columns. If specified, the output is laid out on the file system similarly to Hive's bucketing scheme, but it uses a different bucket hash function and is not compatible with Hive's bucketing.
This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
- Annotations
- @varargs()
- Since
2.0
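A minimal sketch of bucketed output, assuming a local SparkSession and a temporary warehouse directory. The table name bucketed_sketch and the column names are hypothetical. The sketch writes through saveAsTable because, in the Spark versions this example assumes, save() rejects bucketed output, and sortBy is only accepted together with bucketBy.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object BucketBySketch {
  // Writes a small bucketed, sorted table and returns the row count read back.
  def run(): Long = {
    val warehouse = Files.createTempDirectory("bucket-sketch").toString
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("bucketBy-sketch")
      .config("spark.sql.warehouse.dir", warehouse)
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

    df.write
      .mode("overwrite")
      .bucketBy(4, "id")              // 4 buckets, hashed on `id`
      .sortBy("id")                   // sort rows within each bucket
      .saveAsTable("bucketed_sketch") // hypothetical table name

    spark.table("bucketed_sketch").count()
  }
}
```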
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
csv(path: String): Unit
Saves the content of the DataFrame in CSV format at the specified path. This is equivalent to: format("csv").save(path)
You can set the following CSV-specific option(s) for writing CSV files:
- sep (default ,): sets a single character as a separator for each field and value.
- quote (default "): sets a single character used for escaping quoted values where the separator can be part of the value. If an empty string is set, it uses \u0000 (null character).
- escape (default \): sets a single character used for escaping quotes inside an already quoted value.
- charToEscapeQuoteEscaping (default escape or \0): sets a single character used for escaping the escape for the quote character. The default value is the escape character when the escape and quote characters are different, and \0 otherwise.
- escapeQuotes (default true): a flag indicating whether values containing quotes should always be enclosed in quotes. The default is to escape all values containing a quote character.
- quoteAll (default false): a flag indicating whether all values should always be enclosed in quotes. The default is to only escape values containing a quote character.
- header (default false): writes the names of columns as the first line.
- nullValue (default empty string): sets the string representation of a null value.
- emptyValue (default ""): sets the string representation of an empty value.
- encoding (not set by default): specifies the encoding (charset) of saved CSV files. If it is not set, the UTF-8 charset will be used.
- compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive short names (none, bzip2, gzip, lz4, snappy and deflate).
- dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at Datetime Patterns. This applies to date type.
- timestampFormat (default yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]): sets the string that indicates a timestamp format. Custom date formats follow the formats at Datetime Patterns. This applies to timestamp type.
- ignoreLeadingWhiteSpace (default true): a flag indicating whether or not leading whitespaces from values being written should be skipped.
- ignoreTrailingWhiteSpace (default true): a flag indicating whether or not trailing whitespaces from values being written should be skipped.
- lineSep (default \n): defines the line separator that should be used for writing. Maximum length is 1 character.
- Since
2.0.0
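The options listed above can be combined on a single writer. The following is a sketch only, assuming a local SparkSession and a temporary output directory; the column names and option values are illustrative.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CsvWriteSketch {
  // Writes two rows as CSV with a few options set and returns the row count read back.
  def run(): Long = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("csv-write-sketch")
      .getOrCreate()
    import spark.implicits._

    val out = Files.createTempDirectory("csv-sketch").resolve("out").toString
    val df = Seq[(String, Option[Int])](("alice", Some(30)), ("bob", None))
      .toDF("name", "age")

    df.write
      .option("header", true)    // write column names as the first line
      .option("sep", ";")        // field separator
      .option("nullValue", "NA") // string written for null values
      .option("quoteAll", true)  // enclose every value in quotes
      .csv(out)

    // Read the data back with matching options to check the round trip.
    spark.read.option("header", true).option("sep", ";").csv(out).count()
  }
}
```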
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
def
format(source: String): DataFrameWriter[T]
Specifies the underlying output data source. Built-in options include "parquet", "json", etc.
- Since
1.4.0
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
insertInto(tableName: String): Unit
Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.
- Since
1.4.0
- Note
Unlike saveAsTable, insertInto ignores the column names and just uses position-based resolution. For example:
    scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
    scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
    scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
    scala> sql("select * from t1").show
    +---+---+
    |  i|  j|
    +---+---+
    |  5|  6|
    |  3|  4|
    |  1|  2|
    +---+---+
Because it inserts data to an existing table, format or options will be ignored.
SaveMode.ErrorIfExists and SaveMode.Ignore behave as SaveMode.Append in insertInto, as insertInto is not a table creating operation.
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
def
jdbc(url: String, table: String, connectionProperties: Properties): Unit
Saves the content of the DataFrame to an external database table via JDBC. If the table already exists in the external database, the behavior of this function depends on the save mode, specified by the mode function (defaults to throwing an exception).
Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
You can set the following JDBC-specific option(s) for storing JDBC:
- truncate (default false): use TRUNCATE TABLE instead of DROP TABLE.
In case of failures, users should turn off the truncate option to use DROP TABLE again. Also, due to the different behavior of TRUNCATE TABLE among DBMSs, it's not always safe to use this. MySQLDialect, DB2Dialect, MsSqlServerDialect, DerbyDialect, and OracleDialect support this, while PostgresDialect and the default JdbcDialect don't. For an unknown or unsupported JdbcDialect, the user option truncate is ignored.
- url
JDBC database url of the form jdbc:subprotocol:subname
- table
Name of the table in the external database.
- connectionProperties
JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. "batchsize" can be used to control the number of rows per insert. "isolationLevel" can be one of "NONE", "READ_COMMITTED", "READ_UNCOMMITTED", "REPEATABLE_READ", or "SERIALIZABLE", corresponding to standard transaction isolation levels defined by JDBC's Connection object, with default of "READ_UNCOMMITTED".
- Since
1.4.0
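The connection properties and the truncate option described above might be wired together as in the sketch below. The property values (batch size, isolation level) are illustrative assumptions, and the url and table arguments must point at a real database with its JDBC driver on the classpath, so the write helper is defined but not invoked here.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

object JdbcWriteSketch {
  // Builds connection properties; the batch size and isolation level are example values.
  def connectionProperties(user: String, password: String): Properties = {
    val props = new Properties()
    props.setProperty("user", user)
    props.setProperty("password", password)
    props.setProperty("batchsize", "1000")                // rows per insert batch
    props.setProperty("isolationLevel", "READ_COMMITTED") // one of the listed levels
    props
  }

  // Overwrites `table`, asking Spark to TRUNCATE instead of DROP where the dialect allows it.
  def write(df: DataFrame, url: String, table: String, props: Properties): Unit =
    df.write
      .mode(SaveMode.Overwrite)
      .option("truncate", true)
      .jdbc(url, table, props)
}
```

Keeping the number of partitions of df modest before calling write limits the number of concurrent connections opened against the database.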
-
def
json(path: String): Unit
Saves the content of the DataFrame in JSON format (JSON Lines text format or newline-delimited JSON) at the specified path. This is equivalent to: format("json").save(path)
You can set the following JSON-specific option(s) for writing JSON files:
- compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive short names (none, bzip2, gzip, lz4, snappy and deflate).
- dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at Datetime Patterns. This applies to date type.
- timestampFormat (default yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]): sets the string that indicates a timestamp format. Custom date formats follow the formats at Datetime Patterns. This applies to timestamp type.
- encoding (not set by default): specifies the encoding (charset) of saved JSON files. If it is not set, the UTF-8 charset will be used.
- lineSep (default \n): defines the line separator that should be used for writing.
- ignoreNullFields (default true): whether to ignore null fields when generating JSON objects.
- Since
1.4.0
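As an illustration of the JSON options above, the sketch below keeps null fields in the output and compresses the files with gzip. It assumes a local SparkSession and a temporary output directory; the data is made up.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object JsonWriteSketch {
  // Writes two rows as gzip-compressed JSON Lines and returns the row count read back.
  def run(): Long = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("json-write-sketch")
      .getOrCreate()
    import spark.implicits._

    val out = Files.createTempDirectory("json-sketch").resolve("out").toString
    val df = Seq[(String, Option[Int])](("alice", Some(30)), ("bob", None))
      .toDF("name", "age")

    df.write
      .option("ignoreNullFields", false) // emit "age":null instead of dropping the field
      .option("compression", "gzip")
      .json(out)

    spark.read.json(out).count()
  }
}
```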
-
def
mode(saveMode: String): DataFrameWriter[T]
Specifies the behavior when data or table already exists. Options include:
- overwrite: overwrite the existing data.
- append: append the data.
- ignore: ignore the operation (i.e. no-op).
- error or errorifexists: default option, throw an exception at runtime.
- Since
1.4.0
-
def
mode(saveMode: SaveMode): DataFrameWriter[T]
Specifies the behavior when data or table already exists. Options include:
- SaveMode.Overwrite: overwrite the existing data.
- SaveMode.Append: append the data.
- SaveMode.Ignore: ignore the operation (i.e. no-op).
- SaveMode.ErrorIfExists: throw an exception at runtime.
When writing to data source v1, the default option is ErrorIfExists. When writing to data source v2, the default option is Append.
- Since
1.4.0
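The effect of the save modes can be observed by writing the same data to one path several times. This is a sketch under the assumption of a local SparkSession and a file-based (Parquet) target in a temporary directory.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

object SaveModeSketch {
  // Returns the row counts observed after an append and after an overwrite.
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("save-mode-sketch")
      .getOrCreate()
    import spark.implicits._

    val out = Files.createTempDirectory("mode-sketch").resolve("out").toString
    val df = Seq(1, 2, 3).toDF("n")

    df.write.parquet(out)                       // path does not exist yet, so no conflict
    df.write.mode(SaveMode.Append).parquet(out) // adds three more rows
    val afterAppend = spark.read.parquet(out).count()

    df.write.mode("overwrite").parquet(out)     // string form of SaveMode.Overwrite
    val afterOverwrite = spark.read.parquet(out).count()

    df.write.mode(SaveMode.Ignore).parquet(out) // data exists, so this is a no-op
    (afterAppend, afterOverwrite)
  }
}
```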
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
option(key: String, value: Double): DataFrameWriter[T]
Adds an output option for the underlying data source.
All options are maintained in a case-insensitive way in terms of key names. If a new option has the same key case-insensitively, it will override the existing option.
- Since
2.0.0
-
def
option(key: String, value: Long): DataFrameWriter[T]
Adds an output option for the underlying data source.
All options are maintained in a case-insensitive way in terms of key names. If a new option has the same key case-insensitively, it will override the existing option.
- Since
2.0.0
-
def
option(key: String, value: Boolean): DataFrameWriter[T]
Adds an output option for the underlying data source.
All options are maintained in a case-insensitive way in terms of key names. If a new option has the same key case-insensitively, it will override the existing option.
- Since
2.0.0
-
def
option(key: String, value: String): DataFrameWriter[T]
Adds an output option for the underlying data source.
All options are maintained in a case-insensitive way in terms of key names. If a new option has the same key case-insensitively, it will override the existing option.
You can set the following option(s):
- timeZone (default session local timezone): sets the string that indicates a time zone ID to be used to format timestamps in the JSON/CSV datasources or partition values. The following formats of timeZone are supported:
  - Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.
  - Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.
  Other short names like 'CST' are not recommended because they can be ambiguous. If it isn't set, the current value of the SQL config spark.sql.session.timeZone is used by default.
- Since
1.4.0
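To see the timeZone option at work, the sketch below writes the same timestamp as JSON twice, once per zone ID, and returns the two rendered lines so they can be compared. It assumes a local SparkSession and a temporary directory; the exact rendering depends on the timestamp format in effect, so no expected strings are shown.

```scala
import java.nio.file.Files
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

object TimeZoneOptionSketch {
  // Returns the JSON line written with timeZone=UTC and with timeZone=America/Los_Angeles.
  def run(): (String, String) = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("timezone-option-sketch")
      .getOrCreate()
    import spark.implicits._

    val base = Files.createTempDirectory("tz-sketch")
    val df = Seq(Timestamp.valueOf("2020-01-01 12:00:00")).toDF("ts")

    // Writes `df` with the given zone ID and reads the single output line back as text.
    def writeAndRead(zone: String): String = {
      val out = base.resolve(zone.replace('/', '_')).toString
      df.coalesce(1).write.option("timeZone", zone).json(out)
      spark.read.text(out).as[String].head()
    }

    (writeAndRead("UTC"), writeAndRead("America/Los_Angeles"))
  }
}
```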
-
def
options(options: Map[String, String]): DataFrameWriter[T]
Adds output options for the underlying data source.
All options are maintained in a case-insensitive way in terms of key names. If a new option has the same key case-insensitively, it will override the existing option.
You can set the following option(s):
- timeZone (default session local timezone): sets the string that indicates a time zone ID to be used to format timestamps in the JSON/CSV datasources or partition values. The following formats of timeZone are supported:
  - Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.
  - Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.
  Other short names like 'CST' are not recommended because they can be ambiguous. If it isn't set, the current value of the SQL config spark.sql.session.timeZone is used by default.
- Since
1.4.0
-
def
options(options: Map[String, String]): DataFrameWriter[T]
(Scala-specific) Adds output options for the underlying data source.
All options are maintained in a case-insensitive way in terms of key names. If a new option has the same key case-insensitively, it will override the existing option.
You can set the following option(s):
- timeZone (default session local timezone): sets the string that indicates a time zone ID to be used to format timestamps in the JSON/CSV datasources or partition values. The following formats of timeZone are supported:
  - Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.
  - Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.
  Other short names like 'CST' are not recommended because they can be ambiguous. If it isn't set, the current value of the SQL config spark.sql.session.timeZone is used by default.
- Since
1.4.0
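The case-insensitive override rule can be sketched by passing a Scala Map and then setting one of its keys again with different casing. This assumes a local SparkSession, a temporary directory, and a Spark version that behaves as documented above; the data and separators are illustrative.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object OptionsMapSketch {
  // Returns the first line (the header) of the written CSV file.
  def run(): String = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("options-map-sketch")
      .getOrCreate()
    import spark.implicits._

    val out = Files.createTempDirectory("options-sketch").resolve("out").toString
    val df = Seq((1, "a")).toDF("id", "label")

    df.coalesce(1).write
      .options(Map("header" -> "true", "sep" -> "|"))
      .option("SEP", ";") // same key up to case, so it replaces the "|" set above
      .csv(out)

    spark.read.text(out).as[String].head()
  }
}
```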
-
def
orc(path: String): Unit
Saves the content of the DataFrame in ORC format at the specified path. This is equivalent to: format("orc").save(path)
You can set the following ORC-specific option(s) for writing ORC files:
- compression (default is the value specified in spark.sql.orc.compression.codec): compression codec to use when saving to file. This can be one of the known case-insensitive short names (none, snappy, zlib, and lzo). This will override orc.compress and spark.sql.orc.compression.codec. If orc.compress is given, it overrides spark.sql.orc.compression.codec.
- Since
1.5.0
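The precedence rule above (compression over orc.compress) might be exercised as follows. This is a sketch assuming a local SparkSession with Spark's built-in ORC support and a temporary output directory; the codec choices are arbitrary.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object OrcWriteSketch {
  // Writes three rows as ORC and returns the row count read back.
  def run(): Long = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("orc-write-sketch")
      .getOrCreate()
    import spark.implicits._

    val out = Files.createTempDirectory("orc-sketch").resolve("out").toString
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

    df.write
      .option("orc.compress", "SNAPPY") // lower precedence
      .option("compression", "zlib")    // documented to override orc.compress
      .orc(out)

    spark.read.orc(out).count()
  }
}
```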
-
def
parquet(path: String): Unit
Saves the content of the DataFrame in Parquet format at the specified path. This is equivalent to: format("parquet").save(path)
You can set the following Parquet-specific option(s) for writing Parquet files:
- compression (default is the value specified in spark.sql.parquet.compression.codec): compression codec to use when saving to file. This can be one of the known case-insensitive short names (none, uncompressed, snappy, gzip, lzo, brotli, lz4, and zstd). This will override spark.sql.parquet.compression.codec.
- Since
1.4.0
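A small sketch of the compression option for Parquet, assuming a local SparkSession and a temporary output directory. It returns the names of the data files so the codec can be checked from the file extension that Spark is expected to add.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object ParquetWriteSketch {
  // Writes three rows with gzip compression and returns the Parquet data file names.
  def run(): List[String] = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("parquet-write-sketch")
      .getOrCreate()
    import spark.implicits._

    val out = Files.createTempDirectory("parquet-sketch").resolve("out").toString
    val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

    // Overrides spark.sql.parquet.compression.codec for this write only.
    df.write.option("compression", "gzip").parquet(out)

    new File(out).listFiles().map(_.getName).filter(_.endsWith(".parquet")).toList
  }
}
```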
-
def
partitionBy(colNames: String*): DataFrameWriter[T]
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
- year=2016/month=01/
- year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
- Annotations
- @varargs()
- Since
1.4.0
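The directory layout described above can be inspected directly after a partitioned write. The sketch assumes a local SparkSession and a temporary output directory; the year, month, and value columns are illustrative.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object PartitionBySketch {
  // Writes a dataset partitioned by year and month and returns the
  // sorted names of the top-level partition directories.
  def run(): List[String] = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("partitionBy-sketch")
      .getOrCreate()
    import spark.implicits._

    val out = Files.createTempDirectory("partition-sketch").resolve("out").toString
    val df = Seq((2016, 1, "a"), (2016, 2, "b"), (2017, 1, "c"))
      .toDF("year", "month", "value")

    // Produces directories of the form year=<y>/month=<m>/ under `out`.
    df.write.partitionBy("year", "month").parquet(out)

    new File(out).listFiles().filter(_.isDirectory).map(_.getName).sorted.toList
  }
}
```

Queries that filter on year or month can then skip whole directories rather than scanning every file.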
-
def
save(): Unit
Saves the content of the DataFrame as the specified table.
- Since
1.4.0
-
def
save(path: String): Unit
Saves the content of the DataFrame at the specified path.
- Since
1.4.0
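Both save variants are typically used at the end of a format/option chain. The sketch below, assuming a local SparkSession and temporary directories, passes the location once as an argument to save(path) and once through the path option before calling save().

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object FormatSaveSketch {
  // Returns the row counts read back from the JSON and Parquet outputs.
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("format-save-sketch")
      .getOrCreate()
    import spark.implicits._

    val base = Files.createTempDirectory("save-sketch")
    val jsonOut = base.resolve("json").toString
    val parquetOut = base.resolve("parquet").toString
    val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

    // Location given as an argument to save(path).
    df.write.format("json").mode("overwrite").save(jsonOut)

    // Location given through the "path" option, then save() without arguments.
    df.write.format("parquet").option("path", parquetOut).save()

    (spark.read.format("json").load(jsonOut).count(),
      spark.read.parquet(parquetOut).count())
  }
}
```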
-
def
saveAsTable(tableName: String): Unit
Saves the content of the DataFrame as the specified table.
If the table already exists, the behavior of this function depends on the save mode, specified by the mode function (defaults to throwing an exception). When mode is Overwrite, the schema of the DataFrame does not need to be the same as that of the existing table.
When mode is Append, if there is an existing table, we will use the format and options of the existing table. The column order in the schema of the DataFrame doesn't need to be the same as that of the existing table. Unlike insertInto, saveAsTable will use the column names to find the correct column positions. For example:
    scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
    scala> Seq((3, 4)).toDF("j", "i").write.mode("append").saveAsTable("t1")
    scala> sql("select * from t1").show
    +---+---+
    |  i|  j|
    +---+---+
    |  1|  2|
    |  4|  3|
    +---+---+
In this method, save mode is used to determine the behavior if the data source table exists in Spark catalog. We will always overwrite the underlying data of data source (e.g. a table in JDBC data source) if the table doesn't exist in Spark catalog, and will always append to the underlying data of data source if the table already exists.
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
- Since
1.4.0
-
def
sortBy(colName: String, colNames: String*): DataFrameWriter[T]
Sorts the output in each bucket by the given columns.
This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
- Annotations
- @varargs()
- Since
2.0
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
text(path: String): Unit
Saves the content of the DataFrame in a text file at the specified path. The DataFrame must have only one column that is of string type. Each row becomes a new line in the output file. For example:
    // Scala:
    df.write.text("/path/to/output")
    // Java:
    df.write().text("/path/to/output")
The text files will be encoded as UTF-8.
You can set the following option(s) for writing text files:
- compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive short names (none, bzip2, gzip, lz4, snappy and deflate).
- lineSep (default \n): defines the line separator that should be used for writing.
- Since
1.6.0
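Because text accepts only a single string column, a multi-column DataFrame has to be collapsed first. One possible way, sketched here with concat_ws under the assumption of a local SparkSession and a temporary output directory, is:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws}

object TextWriteSketch {
  // Collapses two columns into one string column, writes it as text,
  // and returns the first line read back.
  def run(): String = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local mode, for illustration only
      .appName("text-write-sketch")
      .getOrCreate()
    import spark.implicits._

    val out = Files.createTempDirectory("text-sketch").resolve("out").toString
    val df = Seq((1, "a")).toDF("id", "label")

    // Build the single string column that text() requires.
    val lines = df.select(
      concat_ws(",", col("id").cast("string"), col("label")).as("value"))

    lines.coalesce(1).write.option("lineSep", "\n").text(out)

    spark.read.text(out).as[String].head()
  }
}
```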
-
def
toString(): String
- Definition Classes
- AnyRef → Any
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()