Host your maven artifacts on the cloud using CloudStorageMaven

One of the major issues when dealing with large codebases in our teams has to do with artifact sharing and artifact storage.

There are various options out there, such as JFrog, Nexus and Archiva, that provide many features.

I have used them, set them up and configured them, and they certainly provide you with many features. Also, having your own repository installation gives you a lot of flexibility. Furthermore, Docker has made things a lot easier, so setting them up takes almost no time.

Now if you already use a cloud provider such as Amazon, Azure or Google, there is a more lightweight option that is pretty easy to set up. These providers give you cheap and easy access to storage, and that storage can also be used to host your private artifacts or even your public ones.

To do so you need to use a Maven wagon which is capable of communicating with the storage options that your cloud provider has, and this is exactly what the CloudStorageMaven project deals with.

The CloudStorageMaven project provides you with wagons interacting with Amazon S3, Azure Blob Storage and Google Cloud Storage.

If you already use one of these cloud services, hosting your artifacts on them seems like a no-brainer, and these wagons make it a lot easier to do so.

I have compiled some tutorials on how to get started with each one of them.

Happy coding!

Host your maven artifacts using Azure Blob Storage

If you use Microsoft Azure and you use Java for your projects, then Azure Blob Storage is a great place to host your team's artifacts.

It is easy to set up and pretty cheap. Also it is much simpler than setting up one of the existing repository options (JFrog, Nexus, Archiva etc.) if you are not particularly interested in their features.

To get started you need to specify a maven wagon which supports azure blob storage.
We will use the Azure storage wagon.

Let’s get started by creating a maven project

mvn archetype:generate -DgroupId=com.test.apps -DartifactId=AzureWagonTest -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

We are going to add a simple service.

package com.test.apps;

public class HelloService {

    public String sayHello() {

        return "Hello";
    }
}

Then we are going to add the Maven wagon which will upload our binaries to Azure Blob Storage and fetch them from it.

    <build>
        <extensions>
            <extension>
                <groupId>com.gkatzioura.maven.cloud</groupId>
                <artifactId>azure-storage-wagon</artifactId>
                <version>1.0</version>
            </extension>
        </extensions>
    </build>

Then we shall create the azure storage account that will host our artifacts.

Then we shall create a new container called snapshot. This container will contain our snapshot repositories.

We can go through the same process in order to create a release repository.
Be aware that there is no need to create a different container for each repository. You can have several repositories under the same container.

Now that we have set up our storage account in azure we shall set the distribution management on our maven project.

    <distributionManagement>
        <snapshotRepository>
            <id>my-repo-bucket-snapshot</id>
            <url>bs://mavenrepository/snapshot</url>
        </snapshotRepository>
        <repository>
            <id>my-repo-bucket-release</id>
            <url>bs://mavenrepository/release</url>
        </repository>
    </distributionManagement>

From the Maven documentation:

Whereas the repositories element specifies in the POM the location and manner in which Maven may download remote artifacts for use by the current project, distributionManagement specifies where (and how) this project will get to a remote repository when it is deployed. The repository elements will be used for snapshot distribution if the snapshotRepository is not defined.
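
To make the rule from the quote a bit more concrete, here is a minimal Java sketch of how the deploy target ends up being picked. It is only an illustration and not Maven's actual implementation: the class and method names are hypothetical, and it assumes the simplified convention that a version is a snapshot when it ends with -SNAPSHOT. The URLs are the ones from the distributionManagement block above.

```java
import java.util.Optional;

// Hypothetical illustration only; this is not Maven's actual implementation.
final class DeployTargetSketch {

    // Simplified assumption: a version is a snapshot when it ends with -SNAPSHOT.
    static boolean isSnapshot(String version) {
        return version.endsWith("-SNAPSHOT");
    }

    // Picks the URL to deploy to. When no snapshot repository is defined,
    // the release repository is used for snapshots as well.
    static String resolveDeployUrl(String version, String releaseUrl, Optional<String> snapshotUrl) {
        if (isSnapshot(version)) {
            return snapshotUrl.orElse(releaseUrl);
        }
        return releaseUrl;
    }

    public static void main(String[] args) {
        String release = "bs://mavenrepository/release";
        Optional<String> snapshot = Optional.of("bs://mavenrepository/snapshot");

        // Snapshot version with a snapshot repository defined
        System.out.println(resolveDeployUrl("1.0-SNAPSHOT", release, snapshot));
        // Release version
        System.out.println(resolveDeployUrl("1.0", release, snapshot));
        // Snapshot version without a snapshot repository
        System.out.println(resolveDeployUrl("1.0-SNAPSHOT", release, Optional.empty()));
    }
}
```

So as long as the project version is a snapshot version, the deploy should end up in the snapshot repository, and a release version should go to the release one.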

The next step is the most crucial, and it has to do with authenticating to Azure.

What you need is your storage account name and the key of the storage account.
To retrieve both, navigate to Access keys in the Settings section of your storage account.

Then we shall specify our storage account credentials in ~/.m2/settings.xml.

  <servers>
    <server>
      <id>my-repo-bucket-snapshot</id>
      <username>mavenrepository</username>
      <password>eXampLEkeyEMI/K7EXAMP/bPxRfiCYEXAMPLEKEY</password>
    </server>
    <server>
      <id>my-repo-bucket-release</id>
      <username>mavenrepository</username>
      <password>eXampLEkeyEMI/K7EXAMP/bPxRfiCYEXAMPLEKEY</password>
    </server>
  </servers>

Be aware that you have to specify credentials for each repository specified.

And now the easiest part which is deploying.

mvn deploy

Now that your artifact has been deployed, you can use it in another project by specifying your repository and your wagon.

    <repositories>
        <repository>
            <id>my-repo-bucket-snapshot</id>
            <url>bs://mavenrepository/snapshot</url>
        </repository>
        <repository>
            <id>my-repo-bucket-release</id>
            <url>bs://mavenrepository/release</url>
        </repository>
    </repositories>

    <build>
        <extensions>
            <extension>
                <groupId>com.gkatzioura.maven.cloud</groupId>
                <artifactId>azure-storage-wagon</artifactId>
                <version>1.0</version>
            </extension>
        </extensions>
    </build>

That’s it! Next thing you know your artifact will be downloaded by maven through azure blob storage and used as a dependency in your new project.
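
Once the artifact (com.test.apps:AzureWagonTest in our case) is declared as a dependency, the new project just calls the service like any other class. The sketch below shows the idea. GreetingApp is a hypothetical consumer class, and the copy of HelloService (without its package declaration) is included only so that the example compiles on its own; in the real project it would come from the jar downloaded from blob storage.

```java
// Hypothetical consumer class living in the new project.
final class GreetingApp {

    // Builds a greeting on top of the service provided by the dependency.
    static String greet(String name) {
        return new HelloService().sayHello() + ", " + name + "!";
    }

    public static void main(String[] args) {
        System.out.println(greet("Azure"));
    }
}

// Stand-in copy of the class from the deployed artifact, included only so
// that this sketch is self-contained. In the consuming project it would be
// resolved by Maven from the repository instead.
class HelloService {

    public String sayHello() {
        return "Hello";
    }
}
```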

Run WordCount with Scala and Spark on HDInsight

Previously we tried to solve the word count problem with a Scala and Spark approach.
The next step is to deploy our solution to HDInsight using Spark, HDFS and Scala.

We shall provision a Spark cluster.

Since we are going to use HDInsight we can utilize hdfs and therefore use the azure storage.

Then we choose our instance types.

And we are ready to create the Spark cluster.

Our data shall be uploaded to the HDFS file system.
To do so we will upload our text files to the Azure storage account which is integrated with HDFS.

For more information on managing a storage account with azure cli check the official guide. Any text file will work.

azure storage blob upload mytextfile.txt sparkclusterscala  example/data/mytextfile.txt

Since we use HDFS we shall make some changes to the original script.

val text = sc.textFile("wasb:///example/data/mytextfile.txt")
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.collect
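
If the flatMap, map and reduceByKey chain looks a bit dense, the following plain Java sketch mirrors the same logic on an in-memory list of lines. It does not use Spark at all; the class name and the sample sentences are made up for illustration, and a sorted map is used only to keep the printed order stable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain Java illustration of the word count logic; it does not use Spark.
final class WordCountSketch {

    static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
                // flatMap: split every line on a single space into words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // map + reduceByKey: start each word at 1 and sum the values per word
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "to do");
        // Print the pairs in a tuple-like form, one per line
        countWords(lines).forEach((word, count) -> System.out.println("(" + word + "," + count + ")"));
    }
}
```

The difference on the cluster is that Spark spreads the lines across the worker nodes and merges the per-word sums for us, which is the whole point of running it there.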

Then we can upload our scala class to the head node using ssh

scp WordCountscala.scala demo@{your cluster}-ssh.azurehdinsight.net:/home/demo/WordCountscala.scala

Again in order to run the script, things are pretty straightforward.

spark-shell -i WordCountscala.scala

And once the task is done we are presented with the spark prompt. Plus we can now save our results to the hdfs file system.

scala> counts.saveAsTextFile("/wordcount_results")

And do a quick check.

hdfs dfs -ls /wordcount_results/
hdfs dfs -text /wordcount_results/part-00000
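
The lines printed by the last command should be the string form of each (word, count) pair, for example (hello,3). If you want to process them further, a small parser is enough. The sketch below is in plain Java and assumes exactly that line format; ResultLineParser is a hypothetical name and the sample lines are made up.

```java
import java.util.Map;

// Hypothetical helper that parses lines of the assumed form "(word,count)".
final class ResultLineParser {

    // The last comma is used as the separator because the word itself
    // may contain commas.
    static Map.Entry<String, Integer> parse(String line) {
        String trimmed = line.trim();
        if (trimmed.length() < 2 || !trimmed.startsWith("(") || !trimmed.endsWith(")")) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        // Drop the surrounding parentheses
        String body = trimmed.substring(1, trimmed.length() - 1);
        int comma = body.lastIndexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        String word = body.substring(0, comma);
        int count = Integer.parseInt(body.substring(comma + 1));
        return Map.entry(word, count);
    }

    public static void main(String[] args) {
        for (String line : new String[] {"(hello,3)", "(a,b,2)"}) {
            Map.Entry<String, Integer> entry = parse(line);
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```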

Run Scala implemented Storm topologies on HDInsight

Previously we set up a Scala implemented storm topology in order to count words.

What comes next is uploading our topology to HDInsight.

So we shall proceed in creating a Storm cluster on HDInsight.

Then we choose the instance types.

Next step is to upload our jar file to the head node in order to deploy it. We can use scp for this purpose.

scp target/scala-2.12/ScalaStorm-assembly-1.0.jar  {your user}@{your azure endpoint}:/home/demo

Now we can ssh to our storm cluster’s head node and issue the storm command.

storm jar ScalaStorm-assembly-1.0.jar com.gkatzioura.scala.storm.WordCountTopology word-count-stream-scala

Then we can check our topology by navigating to https://{your cluster}.azurehdinsight.net/stormui

Run Scala implemented Hadoop Jobs on HDInsight

Previously we set up a Scala application in order to execute a simple word count on hadoop.

What comes next is uploading our application to HDInsight.

So we shall proceed in creating a Hadoop cluster on HDInsight.

Then we will create the hadoop cluster.

At this step we specify the admin console credentials and the SSH user used to log in to the head node.

Our hadoop cluster will be backed by an azure storage account.

Then it is time to upload our text files to the azure storage account.

For more information on managing a storage account with azure cli check the official guide. Any text file will work.

azure storage blob upload mytext.txt scalahadoopexample  example/data/input.txt

Now we can ssh to our Hadoop node.

First let’s run the examples that come packaged with the HDInsight Hadoop cluster.

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount /example/data/input.txt /example/data/results

Check the results

hdfs dfs -text /example/data/results/part-r-00000

And then we are ready to scp our Scala application to our Hadoop node and run it as a word count job.

hadoop jar ScalaHadoop-assembly-1.0.jar /example/data/input.txt /example/data/results2

And again check the results

hdfs dfs -text /example/data/results2/part-r-00000

That’s it! HDInsight makes it pretty straightforward!