Optimizing KryoCoder in beam backend #1955

The current implementation of KryoCoder writes the class name for every object to the output stream (https://github.com/twitter/scalding/blob/b0ba993ac817e6b1e52126e8b1cfb1054cc00dad/scalding-beam/src/main/scala/com/twitter/scalding/beam_backend/KryoCoder.scala#L16).
This was done because Beam can split the stream at an arbitrary point, and if the registration information appears only at the beginning of the stream, decoding the later part of the stream will fail. However, we don't want to write the class name for classes that are already registered.

We can call `setRegistrationRequired(true)` when creating the Instantiator (https://github.com/twitter/scalding/blob/b0ba993ac817e6b1e52126e8b1cfb1054cc00dad/scalding-beam/src/main/scala/com/twitter/scalding/beam_backend/BeamBackend.scala#L22).

Then, in KryoCoder, we can keep a mapping of the classes that have a registration available. We can do a `Try {pool.hasRegistration}` once per class and save the result in a map for future lookups. For classes with a registration we use `kryoPool.toBytesWithoutClass`, and for all others we use `kryoPool.toBytesWithClass`.
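A minimal sketch of the caching idea, to make the proposal concrete. `PoolLike` and `CachingEncoder` are hypothetical names introduced only for illustration; `PoolLike` stands in for the three pool methods mentioned above so the example runs without the real pool. The `Try` is kept on the assumption that the registration lookup may throw for unregistered classes once registration is required. Only the encode-side decision is shown; how the decoder learns which form was written is left open.

```scala
import scala.collection.concurrent.TrieMap
import scala.util.Try

// Hypothetical stand-in for the subset of the Kryo pool used in this sketch.
trait PoolLike {
  def hasRegistration(cls: Class[_]): Boolean
  def toBytesWithClass(obj: AnyRef): Array[Byte]
  def toBytesWithoutClass(obj: AnyRef): Array[Byte]
}

// Hypothetical helper: remembers, per class, whether a registration exists.
final class CachingEncoder(pool: PoolLike) {
  // Thread-safe cache so the registration lookup runs about once per class.
  private val registered = TrieMap.empty[Class[_], Boolean]

  def isRegistered(cls: Class[_]): Boolean =
    registered.getOrElseUpdate(
      cls,
      // Assumption: the lookup may throw when the class is not registered,
      // so a failure is treated as "no registration".
      Try(pool.hasRegistration(cls)).getOrElse(false)
    )

  // Skip the class name for registered classes, write it for all others.
  def encode(obj: AnyRef): Array[Byte] =
    if (isRegistered(obj.getClass)) pool.toBytesWithoutClass(obj)
    else pool.toBytesWithClass(obj)
}
```

A concurrent map is used because a coder instance may be shared across threads; if that is not the case, a plain mutable map would do.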

Is there a better way to achieve this?

