Hi Team,
We are stuck and need your help. We are reading a Parquet file from HDFS using the Alpakka Avro Parquet connector:
override def streamParquetFileAsAvro(filePath: String): Source[GenericRecord, NotUsed] = {
  ugi.doAs(new PrivilegedExceptionAction[Source[GenericRecord, NotUsed]] {
    override def run(): Source[GenericRecord, NotUsed] = {
      try {
        val hadoopPath = new HadoopPath(filePath)
        // Read the Parquet file in parquet-avro compatibility mode.
        hadoopConfig.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, true)
        val inputFile = HadoopInputFile.fromPath(hadoopPath, hadoopConfig)
        val reader = AvroParquetReader.builder[GenericRecord](inputFile).withConf(hadoopConfig).build()
        // Alpakka source that emits one GenericRecord per Parquet row.
        AvroParquetSource(reader)
      } catch {
        case ex: Exception =>
          handleParquetFileReadError(filePath, ex)
          Source.empty[GenericRecord]
      }
    }
  })
}
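For context, the read side can be sanity-checked on its own by simply counting records (just a sketch; it assumes an implicit ActorSystem named system is in scope for materialization):

// Sketch: count the records emitted by the Parquet source, to confirm the
// HDFS read works independently of the S3 upload.
// Assumes an implicit ActorSystem `system` (Akka 2.6+) is in scope.
import akka.stream.scaladsl.Sink
import org.apache.avro.generic.GenericRecord
import scala.concurrent.Future

val recordCount: Future[Long] =
  fileHandler
    .streamParquetFileAsAvro(filePath)
    .runWith(Sink.fold(0L)((count, _: GenericRecord) => count + 1))

recordCount.foreach(n => println(s"Read $n records from $filePath"))(system.dispatcher)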
Now we want to upload it to S3. The multipart upload to the S3 bucket itself works, but the uploaded file comes out corrupted, which we think happens at serialization time:
// Step 2: Read the Parquet file as a Source[GenericRecord, NotUsed]
val parquetSource: Source[GenericRecord, NotUsed] = fileHandler.streamParquetFileAsAvro(filePath)

// Step 3: Flow that converts each GenericRecord to a ByteString
val recordToByteStringFlow: Flow[GenericRecord, ByteString, NotUsed] = Flow[GenericRecord].map(serializeRecord)

// Step 4: Sink that multipart-uploads the ByteStrings to S3
val s3key: String = DataSyncUtils.getRelativePath(filePath)
val s3Sink: Sink[ByteString, Future[MultipartUploadResult]] = S3.multipartUpload("bg0975-cef-ccmedev-data", s3key)

// Step 5: Run the stream
val uploadResult: Future[MultipartUploadResult] = parquetSource
  .via(recordToByteStringFlow)
  .runWith(s3Sink)
private def serializeRecord(record: GenericRecord): ByteString = {
  val writer = new GenericDatumWriter[GenericRecord](record.getSchema)
  val outputStream = new ByteArrayOutputStream()
  val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(outputStream, null)
  writer.write(record, encoder)
  encoder.flush()
  ByteString(outputStream.toByteArray)
}
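As far as we understand, binaryEncoder writes only the raw datum bytes of each record, with no Avro file header, schema, or sync markers (and no Parquet footer either), so the concatenated bytes that reach S3 are not in any file format a reader recognises. For comparison, a self-describing Avro container file would be built roughly like the sketch below (serializeAsAvroContainer is just an illustrative name of ours, and the result would be an .avro object rather than Parquet):

// Sketch only: build one self-describing Avro container file (header with the
// schema, data blocks, sync markers) in memory from a batch of records.
// Note: this produces an Avro file, not a Parquet file.
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import akka.util.ByteString

def serializeAsAvroContainer(records: Seq[GenericRecord], schema: Schema): ByteString = {
  val out = new ByteArrayOutputStream()
  val dataFileWriter = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
  dataFileWriter.create(schema, out)     // writes the Avro file header
  records.foreach(dataFileWriter.append) // appends each record to the current block
  dataFileWriter.close()                 // flushes the final block
  ByteString(out.toByteArray)
}

(For small files the batch could come from something like parquetSource.runWith(Sink.seq); for large files a streaming writer would be needed rather than buffering everything in memory.)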
We believe the per-record serialization is where we are going wrong. It's a very basic use case of uploading a Parquet file to S3, so any pointers would be much appreciated.
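If the goal is just to land the same Parquet file in S3 unchanged, a byte-for-byte copy of the HDFS file would avoid record-level (de)serialization altogether. A minimal sketch, assuming the same hadoopConfig, s3key and implicit ActorSystem as above (and that HDFS access still needs the ugi.doAs wrapping shown earlier):

// Sketch: stream the Parquet file's raw bytes from HDFS straight into the S3
// multipart upload, so the object in S3 is byte-identical to the HDFS file.
import akka.stream.IOResult
import akka.stream.alpakka.s3.MultipartUploadResult
import akka.stream.alpakka.s3.scaladsl.S3
import akka.stream.scaladsl.{Source, StreamConverters}
import akka.util.ByteString
import org.apache.hadoop.fs.{Path => HadoopPath}
import scala.concurrent.Future

val parquetPath = new HadoopPath(filePath)
val fs = parquetPath.getFileSystem(hadoopConfig)

val rawParquetBytes: Source[ByteString, Future[IOResult]] =
  StreamConverters.fromInputStream(() => fs.open(parquetPath))

val copyResult: Future[MultipartUploadResult] =
  rawParquetBytes.runWith(S3.multipartUpload("bg0975-cef-ccmedev-data", s3key))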